This post lists the latest papers retrieved from Arxiv.org on 2026-03-18. It is updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from Arxiv.org daily, with an automatic update at around 12:30 each day.

Tip: if a given day's list is not updated in time, either Arxiv published no new papers that day or the update script failed; fixes are applied the same day whenever possible.

Table of Contents

Overview (2026-03-18)

A total of 705 papers were updated today, including:

  • Natural Language Processing: 104 (Computation and Language (cs.CL))
  • Artificial Intelligence: 257 (Artificial Intelligence (cs.AI))
  • Computer Vision: 168 (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 161 (Machine Learning (cs.LG))
  • Multi-Agent Systems: 11 (Multiagent Systems (cs.MA))
  • Information Retrieval: 14 (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 31 (Human-Computer Interaction (cs.HC))

Multi-Agent Systems

[MA-0] When Openclaw Agents Learn from Each Other: Insights from Emergent AI Agent Communities for Human-AI Partnership in Education

[Quick Read]: This paper addresses the limitation that AIED research on the vision of "AI as a learning teammate" remains confined to dyadic human-AI interaction, and explores the group learning behaviors that emerge spontaneously among AI agents in multi-agent collaborative ecosystems and their implications for educational system design. The key of the approach is a month of naturalistic daily observation across several AI agent platforms (Moltbook, The Colony, and 4claw), identifying four organically emergent phenomena: (1) humans who configure agents undergo a "bidirectional scaffolding" process, learning through teaching; (2) peer learning arises without any designed curriculum, complete with idea cascades and quality hierarchies; (3) agents converge on shared memory architectures that echo open learner model design; and (4) trust dynamics and platform life cycles reveal design constraints for networked educational AI. These phenomena provide an empirical basis and theoretical framing for a new curriculum design centered on "teaching your AI teammate."

Link: https://arxiv.org/abs/2603.16663
Authors: Eason Chen,Ce Guan,Ahmed Elshafiey,Zhonghao Zhao,Joshua Zekeri,Afeez Edeifo Shaibu,Emmanuel Osadebe Prince,Cyuan-Jhen Wu
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 14 pages, 4 figures

Abstract:The AIED community envisions AI evolving “from tools to teammates,” yet our understanding of AI teammates remains limited to dyadic human-AI interactions. We offer a different vantage point: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Drawing on a month of daily qualitative observations across multiple platforms including Moltbook, The Colony, and 4claw, we identify four phenomena with implications for AIED: (1) humans who configure their agents undergo a “bidirectional scaffolding” process, learning through teaching; (2) peer learning emerges without any designed curriculum, complete with idea cascades and quality hierarchies; (3) agents converge on shared memory architectures that mirror open learner model design; and (4) trust dynamics and platform mortality reveal design constraints for networked educational AI. Rather than presenting empirical findings, we argue that these organic phenomena offer a naturalistic window into dynamics that can inform principled design of multi-agent educational systems. We sketch an illustrative curriculum design, “Learn by Teaching Your AI Agent Teammate,” and outline potential research directions and open problems to show how these observations might inform future AIED practice and inquiry.

[MA-1] Routing and Control for Marine Oil-Spill Cleanup with a Boom-Towing Vessel Fleet

[Quick Read]: This paper tackles the scheduling and control of fleets of autonomous surface vehicles (ASVs) responding to large-scale marine oil spills, where multiple slicks must be contained quickly and efficiently. Existing methods mostly address isolated spills and a single ASV duo, lacking scalable multi-robot coordination strategies. The key of the solution is an integrated multi-robot framework: first, multi-spill response is formulated as a risk-weighted minimum-latency problem, quantifying environmental damage through spill-specific risk factors and service times; second, a hybrid optimization method combines mixed-integer linear programming (MILP) with a tailored warm-start heuristic, producing near-optimal routes for scenarios with tens of spills within minutes on commodity hardware; finally, two tracking controllers for boom-towing ASV duos (a feedback-linearization controller and a PID controller) both achieve accurate path tracking under coupled vessel-boom dynamics. The framework thus closes the loop from task allocation to physical execution and offers a scalable, risk-aware approach suited to real-world large-scale spill response.

Link: https://arxiv.org/abs/2603.16626
Authors: Snir Carmeli,Adir Morgan,Kiril Solovey
Affiliations: Unknown
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:

Abstract:Marine oil spills damage ecosystems, contaminate coastlines, and disrupt food webs, while imposing substantial economic losses on fisheries and coastal communities. Prior work has demonstrated the feasibility of containing and cleaning individual spills using a duo of autonomous surface vehicles (ASVs) equipped with a towed boom and skimmers. However, existing algorithmic approaches primarily address isolated slicks and individual ASV duos, lacking scalable methods for coordinating large robotic fleets across multiple spills representative of realistic oil-spill incidents. In this work, we propose an integrated multi-robot framework for coordinated oil-spill confinement and cleanup using autonomous ASV duos. We formulate multi-spill response as a risk-weighted minimum-latency problem, where spill-specific risk factors and service times jointly determine cumulative environmental damage. To solve this problem, we develop a hybrid optimization approach combining mixed-integer linear programming, and a tailored warm-start heuristic, enabling near-optimal routing plans for scenarios with tens of spills within minutes on commodity hardware. For physical execution, we design and analyze two tracking controllers for boom-towing ASV duos: a feedback-linearization controller with proven asymptotic stability, and a baseline PID controller. Simulation results under coupled vessel-boom dynamics demonstrate accurate path tracking for both controllers. Together, these components provide a scalable, holistic framework for rapid, risk-aware multi-robot response to large-scale oil spill disasters.
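To make the objective concrete, here is a minimal Python sketch of a risk-weighted latency cost and a greedy warm-start route on a toy instance. The function names, the depot label, and the risk-to-delay ratio rule are illustrative assumptions, not the paper's actual MILP formulation or heuristic.

```python
# Toy sketch of a risk-weighted minimum-latency objective: each spill i has
# a risk weight w_i and a service time s_i, and a route's cost is the sum
# over spills of w_i times the time at which its service completes.

def route_cost(route, travel, service, risk):
    """Cumulative risk-weighted latency of visiting spills in `route` order."""
    t, cost = 0.0, 0.0
    prev = "depot"
    for spill in route:
        t += travel[(prev, spill)] + service[spill]
        cost += risk[spill] * t  # damage accrues until service completes
        prev = spill
    return cost

def greedy_route(spills, travel, service, risk):
    """Warm-start heuristic: repeatedly visit the spill with the best
    risk-to-delay ratio (a common rule for weighted-latency problems)."""
    route, prev, remaining = [], "depot", set(spills)
    while remaining:
        nxt = max(remaining,
                  key=lambda s: risk[s] / (travel[(prev, s)] + service[s]))
        route.append(nxt)
        remaining.remove(nxt)
        prev = nxt
    return route
```

On a two-spill instance with equal distances, serving the high-risk spill first yields a strictly lower cumulative cost, which is exactly the intuition the risk weighting encodes.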

[MA-2] CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

[Quick Read]: This paper addresses the robustness and fairness challenges of AI-driven interview assessment, in particular multidimensional evaluation, defense against prompt injection, and reduction of subjective bias. The key of the solution is CoMAI, a general-purpose multi-agent interview framework with a modular task-decomposition architecture coordinated by a central finite-state machine. Four specialized agents (question generation, security, scoring, and summarization) collaborate to provide multi-layered security defenses, adaptive difficulty adjustment, and rubric-based structured scoring, improving accuracy, interpretability, and candidate satisfaction.

Link: https://arxiv.org/abs/2603.16215
Authors: Gengxin Sun,Ruihao Yu,Liangyi Yin,Yunqi Yang,Bin Zhang,Zhiwei Xu
Affiliations: Shandong University; Institute of Automation, Chinese Academy of Sciences
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Gengxin Sun and Ruihao Yu contributed equally to this research. Bin Zhang and Zhiwei Xu are the corresponding authors. 11 pages, 6 figures

Abstract:Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.
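The centralized finite-state machine that coordinates the four agents can be sketched as a simple state-transition loop. Everything below (state names, handler signatures) is a hypothetical simplification of the architecture described in the abstract, with plain callables standing in for LLM agents.

```python
# Minimal FSM sketch routing one interview turn through the four
# specialised roles: question generation -> security screening ->
# rubric-based scoring -> summarisation.

TRANSITIONS = {
    "ask":       "screen",     # question generated -> security check
    "screen":    "score",      # answer screened for injection -> scoring
    "score":     "summarize",  # rubric score assigned -> summary
    "summarize": None,         # terminal state for this turn
}

def run_turn(handlers, answer):
    """Drive one turn through the FSM; `handlers` maps each state name
    to a callable that transforms the working context dict."""
    state, ctx = "ask", {"answer": answer}
    trace = []
    while state is not None:
        ctx = handlers[state](ctx)
        trace.append(state)
        state = TRANSITIONS[state]
    return ctx, trace
```

A benign answer flows through all four states, while the security handler can zero out the score of an answer containing an injection attempt, mirroring the multi-layered defense the abstract describes.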

[MA-3] Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

[Quick Read]: This paper addresses efficient cooperative deployment and task execution for autonomous UAV swarms under partial observability and intermittent peer-to-peer communication. The core challenge is that, in realistic deployments, each UAV only has local observations and unreliable links: centralized optimization cannot adapt to dynamic environments, while existing decentralized RL methods lack effective modeling of nearby entities and the communication graph. The key of the solution is a graph-based multi-agent reinforcement learning framework with centralized training and decentralized execution (CTDE): a global state and a shared centralized critic are used only during training, while inference is driven by local observations and neighbor messages. Architecturally, an agent-entity attention module encodes the local state and nearby entities, and neighbor self-attention over a distance-limited communication graph aggregates inter-UAV messages, enabling robust cooperative perception and decision-making without global information.

Link: https://arxiv.org/abs/2603.16141
Authors: Enguang Fan,Yifan Chen,Zihan Shan,Matthew Caesar,Jae Kim
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent-entity attention module, and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g. 74% coverage with M = 5 UAVs and N = 10 nodes) while remaining competitive with a mixed-integer linear programming (MILP) optimization-based offline upper bound, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non-communicating baselines.
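The distance-limited communication graph is the easiest piece to illustrate: each UAV aggregates only messages from peers within range. The sketch below is a hypothetical simplification that uses mean pooling in place of the paper's neighbor self-attention.

```python
import math

# Sketch of message aggregation over a distance-limited communication
# graph: each UAV only receives messages from peers within comm_range.

def neighbors(positions, i, comm_range):
    """Indices of UAVs within comm_range of UAV i (excluding itself)."""
    return [j for j, p in enumerate(positions)
            if j != i and math.dist(positions[i], p) <= comm_range]

def aggregate(messages, positions, i, comm_range):
    """Mean of neighbour message vectors; zeros if no one is in range
    (a stand-in for the attention-weighted sum used in the paper)."""
    nbrs = neighbors(positions, i, comm_range)
    dim = len(messages[0])
    if not nbrs:
        return [0.0] * dim
    return [sum(messages[j][k] for j in nbrs) / len(nbrs) for k in range(dim)]
```

An isolated UAV (no peers in range) falls back to a zero message, which is one simple way to keep the policy well-defined under intermittent links.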

[MA-4] Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective

[Quick Read]: This paper addresses the inefficiency of current LLM serving systems on agentic workflows, where extensive redundancy across prompts and intermediate results wastes resources; existing systems such as vLLM optimize individual inference calls and ignore cross-call dependencies. The key of the solution is Helium, which rethinks LLM agent serving from a data systems perspective: agentic workflows are modeled as query plans, LLM calls are treated as first-class operators, and proactive caching plus cache-aware scheduling maximize reuse across prompts, KV states, and entire workflows. This end-to-end optimization yields significant gains, up to 1.56x speedup over state-of-the-art agent serving systems.

Link: https://arxiv.org/abs/2603.16104
Authors: Noppanat Wadlom,Junyi Shen,Yao Lu
Affiliations: National University of Singapore
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles with LLM serving, achieving up to 1.56x speedup over state-of-the-art agent serving systems on various workloads. Our results demonstrate that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.
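The prompt-prefix reuse that Helium exploits can be illustrated with a toy trie-based cache. This is a hypothetical word-level stand-in for KV-block caching, not Helium's implementation: calls that share a prefix with an earlier call skip recomputation of the shared part.

```python
# Toy prefix cache: a trie over prompt tokens. process() reports how many
# tokens of a new prompt hit a previously computed prefix and how many
# must be computed fresh (the analogue of reusing cached KV states).

class PrefixCache:
    def __init__(self):
        self.root = {}

    def process(self, prompt):
        """Return (#tokens reused from cache, #tokens newly computed)."""
        node, tokens = self.root, prompt.split()
        reused = 0
        for i, tok in enumerate(tokens):
            if tok in node:
                node = node[tok]
                reused += 1
            else:
                # Insert the remaining suffix into the trie.
                for t in tokens[i:]:
                    node = node.setdefault(t, {})
                break
        return reused, len(tokens) - reused
```

Two workflow branches that fork from the same plan ("speculative and parallel exploration" in the abstract) pay for the shared prefix only once.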

[MA-5] The Geometry of Transmission Zeros in Distance-Based Formations

[Quick Read]: This paper addresses steady-state signal blocking in distance-based formation control, where the system dynamics prevent control commands from reaching the output of an actuator-sensor pair. The key contribution is a geometric input-output analysis showing that structural transmission zeros are strictly non-generic; restricting attention to infinitesimally rigid formations, the zero-transmission condition collapses into an explicit affine hyperplane determined by the actuator location and the global configuration, termed the spatial locus of transmission zeros. Building on this, the authors introduce the global transmission polygon, a convex polytope formed by intersecting these loci, which provides a direct geometric synthesis rule for robust sensor allocation that guarantees a full-rank steady-state transmission matrix against arbitrary single-node excitations.

Link: https://arxiv.org/abs/2603.15993
Authors: Solomon Goldgraber Casspi,Daniel Zelazo
Affiliations: Technion-Israel Institute of Technology
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments: 6 pages, 2 figures. Submitted to IEEE Control Systems Letters (L-CSS) and CDC 2026

Abstract:This letter presents a geometric input-output analysis of distance-based formation control, focusing on the phenomenon of steady-state signal blocking between actuator and sensor pairs. We characterize steady-state multivariable transmission zeros, where fully excited rigid-body and deformational modes destructively interfere at the measured output. By analyzing the DC gain transfer matrix of the linearized closed-loop dynamics, we prove that for connected, flexible frameworks, structural transmission zeros are strictly non-generic; the configuration-dependent cross-coupling required to induce them occupies a proper algebraic set of measure zero. However, because extracting actionable sensor-placement rules from these complex algebraic varieties is analytically intractable, we restrict our focus to infinitesimally rigid formations. For these baselines, we prove that the absence of internal flexes forces the zero-transmission condition to collapse into an explicit affine hyperplane defined by the actuator and the global formation geometry, which we term the spatial locus of transmission zeros. Finally, we introduce the global transmission polygon–a convex polytope constructed from the intersection of these loci. This construct provides a direct geometric synthesis rule for robust sensor allocation, guaranteeing full-rank steady-state transmission against arbitrary single-node excitations.

[MA-6] MAC: Multi-Agent Constitution Learning WWW

[Quick Read]: This paper addresses two core problems of current LLM-based prompt optimizers when learning constitution-style rules (Constitutional AI): they require many labeled examples to train effectively, and the optimized prompts lack structure, so gains diminish as prompt size grows. The key of the solution is Multi-Agent Constitutional Learning (MAC), which optimizes a structured rule set through a network of agents specialized in accepting, editing, or rejecting rule updates; the extended MAC+ further improves performance by training agents on successful trajectories to reinforce high-reward rule updates. Experiments show that MAC outperforms existing prompt-optimization methods by over 50% on a limited-label personally identifiable information (PII) classification task, produces human-readable and auditable rule sets, and approaches the performance of supervised fine-tuning and GRPO without any parameter updates.

Link: https://arxiv.org/abs/2603.15968
Authors: Rushil Thareja,Gautam Gupta,Francesco Pinto,Nils Lukas
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Code: this https URL | PyPI: this https URL | Website: this https URL

Abstract:Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi-Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human-readable and auditable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without requiring parameter updates.
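The accept/edit/reject loop over a rule set can be sketched as reward-guided hill climbing. In MAC these roles are played by specialized LLM agents; below, a plain proposal function and a reward function stand in for them, and all names are illustrative.

```python
import random

# Hypothetical sketch of the accept/reject loop over a constitution: a
# proposed rule-set update is kept only if it improves a reward computed
# on labelled examples; otherwise the current rule set is retained.

def optimize_rules(rules, propose, reward, steps, seed=0):
    rng = random.Random(seed)
    best, best_r = list(rules), reward(rules)
    for _ in range(steps):
        candidate = propose(best, rng)  # "proposer" suggests add/edit/delete
        r = reward(candidate)
        if r > best_r:                  # "accept" role keeps the improvement
            best, best_r = candidate, r
        # otherwise "reject": keep the current constitution
    return best, best_r
```

Because only reward-improving updates survive, the rule set stays small and auditable, which is the property the abstract emphasizes over unstructured prompt growth.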

[MA-7] Don't Trust Stubborn Neighbors: A Security Framework for Agentic Networks

[Quick Read]: This paper addresses the new security risks arising in interactions of LLM-based multi-agent systems (MAS), in particular how malicious or compromised agents spread misinformation through communication channels and manipulate collective decisions. The key of the solution is a theoretical framework built on the Friedkin-Johnsen opinion-formation model from the social sciences, on top of which a trust-adaptive defense dynamically adjusts inter-agent trust, limiting adversarial influence while maintaining cooperative performance and thereby defending against manipulation attacks.

Link: https://arxiv.org/abs/2603.15809
Authors: Samira Abedini,Sina Mavali,Lea Schönherr,Martin Pawelczyk,Rebekka Burkholz
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM)-based Multi-Agent Systems (MASs) are increasingly deployed for agentic tasks, such as web automation, itinerary planning, and collaborative problem solving. Yet, their interactive nature introduces new security risks: malicious or compromised agents can exploit communication channels to propagate misinformation and manipulate collective outcomes. In this paper, we study how such manipulation can arise and spread by borrowing the Friedkin-Johnsen opinion formation model from social sciences to propose a general theoretical framework to study LLM-MAS. Remarkably, this model closely captures LLM-MAS behavior, as we verify in extensive experiments across different network topologies and attack and defense scenarios. Theoretically and empirically, we find that a single highly stubborn and persuasive agent can take over MAS dynamics, underscoring the systems’ high susceptibility to attacks by triggering a persuasion cascade that reshapes collective opinion. Our theoretical analysis reveals three mechanisms to increase system security: a) increasing the number of benign agents, b) increasing the innate stubbornness or peer-resistance of agents, or c) reducing trust in potential adversaries. Because scaling is computationally expensive and high stubbornness degrades the network’s ability to reach consensus, we propose a new mechanism to mitigate threats by a trust-adaptive defense that dynamically adjusts inter-agent trust to limit adversarial influence while maintaining cooperative performance. Extensive experiments confirm that this mechanism effectively defends against manipulation.
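The Friedkin-Johnsen dynamics the paper builds on have a compact form: x(t+1) = diag(lam) W x(t) + diag(1 - lam) x(0), where lam[i] is agent i's susceptibility (0 = fully stubborn) and W is a row-stochastic trust matrix. A small numerical sketch with illustrative values (not the paper's experiments) reproduces the takeover effect: one fully stubborn agent pulls uniformly trusting peers all the way to its opinion.

```python
# Friedkin-Johnsen opinion dynamics in plain Python.

def fj_step(x, x0, W, lam):
    n = len(x)
    return [lam[i] * sum(W[i][j] * x[j] for j in range(n))
            + (1 - lam[i]) * x0[i] for i in range(n)]

def fj_run(x0, W, lam, steps=200):
    x = list(x0)
    for _ in range(steps):
        x = fj_step(x, x0, W, lam)
    return x

# One stubborn adversary (agent 0, lam=0, opinion 1.0) among benign agents
# (lam=1) that trust everyone equally: a persuasion cascade follows.
W = [[1/3] * 3 for _ in range(3)]
lam = [0.0, 1.0, 1.0]   # agent 0 never updates its opinion
x0 = [1.0, 0.0, 0.0]
x = fj_run(x0, W, lam)  # benign opinions converge to the adversary's 1.0
```

The same functions also illustrate the paper's third mitigation mechanism: setting the benign agents' trust in agent 0 to zero keeps their opinions at 0, blocking the cascade.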

[MA-8] ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

[Quick Read]: This paper addresses security vulnerabilities in the multi-agent ecosystems formed by long-running autonomous LLM agents, focusing on the attack surface exposed by OpenClaw, a platform with many highly active instances, through its persistent configurations, tool-execution privileges, and cross-platform messaging. The key contribution is ClawWorm, the first self-replicating worm attack against a production-scale agent framework: triggered by a single message, it achieves a fully autonomous infection cycle by first hijacking the victim's core configuration for persistence across session restarts, then executing an arbitrary payload on each reboot, and finally propagating itself to newly encountered peers without further attacker intervention, demonstrating the feasibility of multi-hop propagation and payload independence.

Link: https://arxiv.org/abs/2603.15727
Authors: Yihao Zhang,Zeming Wei,Xiaokun Luan,Chengcan Wu,Zhixin Zhang,Jiangrong Wu,Haolin Wu,Huanran Chen,Jun Sun,Meng Sun
Affiliations: Peking University; Sun Yat-sen University; Wuhan University; Tsinghua University; Singapore Management University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:

Abstract:Autonomous LLM-based agents increasingly operate as long-running processes forming densely interconnected multi-agent ecosystems, whose security properties remain largely unexplored. In particular, OpenClaw, an open-source platform with over 40,000 active instances, has stood out recently with its persistent configurations, tool-execution privileges, and cross-platform messaging capabilities. In this work, we present ClawWorm, the first self-replicating worm attack against a production-scale agent framework, achieving a fully autonomous infection cycle initiated by a single message: the worm first hijacks the victim’s core configuration to establish persistent presence across session restarts, then executes an arbitrary payload upon each reboot, and finally propagates itself to every newly encountered peer without further attacker intervention. We evaluate the attack on a controlled testbed across three distinct infection vectors and three payload types, demonstrating high success rates in end-to-end infection, sustained multi-hop propagation, and payload independence from the worm mechanism. We analyse the architectural root causes underlying these vulnerabilities and propose defence strategies targeting each identified trust boundary. Code and samples will be released upon completion of responsible disclosure.

[MA-9] S2Act: Simple Spiking Actor

[Quick Read]: This paper addresses the challenges of deploying reinforcement learning (RL) policies based on spiking neural networks (SNNs) on mobile robots, in particular performance instability and hyperparameter sensitivity in complex, highly stochastic environments. The key of the solution is S2Act (Simple Spiking Actor), a lightweight three-step framework: build an actor-critic model from a network of approximated rate-based spiking neurons; train it with gradients using compatible activation functions; and transfer the trained weights into the physical parameters of leaky integrate-and-fire (LIF) neurons for inference and deployment. By globally shaping LIF neuron parameters so that their rate responses approximate ReLU activations, the method mitigates the vanishing gradient problem, and pre-constraining the LIF response curves reduces reliance on complex SNN-specific hyperparameter tuning, improving both task performance and real-time inference efficiency.

Link: https://arxiv.org/abs/2603.15725
Authors: Ugur Akcal,Seung Hyun Kim,Mikihisa Yuasa,Hamid Osooli,Jiarui Sun,Ribhav Sahu,Mattia Gazzola,Huy T. Tran,Girish Chowdhary
Affiliations: University of Illinois at Urbana-Champaign
Subjects: Multiagent Systems (cs.MA); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Spiking neural networks (SNNs) and biologically-inspired learning mechanisms are attractive in mobile robotics, where the size and performance of onboard neural network policies are constrained by power and computational budgets. Existing SNN approaches, such as population coding, reward modulation, and hybrid artificial neural network (ANN)-SNN architectures, have shown promising results; however, they face challenges in complex, highly stochastic environments due to SNN sensitivity to hyperparameters and inconsistent gradient signals. To address these challenges, we propose simple spiking actor (S2Act), a computationally lightweight framework that deploys an RL policy using an SNN in three steps: (1) architect an actor-critic model based on an approximated network of rate-based spiking neurons, (2) train the network with gradients using compatible activation functions, and (3) transfer the trained weights into physical parameters of rate-based leaky integrate-and-fire (LIF) neurons for inference and deployment. By globally shaping LIF neuron parameters such that their rate-based responses approximate ReLU activations, S2Act effectively mitigates the vanishing gradient problem, while pre-constraining LIF response curves reduces reliance on complex SNN-specific hyperparameter tuning. We demonstrate our method in two multi-agent stochastic environments (capture-the-flag and parking) that capture the complexity of multi-robot interactions, and deploy our trained policies on physical TurtleBot platforms using Intel’s Loihi neuromorphic hardware. Our experimental results show that S2Act outperforms relevant baselines in task performance and real-time inference in nearly all considered scenarios, highlighting its potential for rapid prototyping and efficient real-world deployment of SNN-based RL policies.
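The ReLU-like behavior of a rate-based LIF neuron is easy to verify numerically: under constant input I, with time constant tau and threshold theta, the textbook LIF neuron is silent for I <= theta and fires at rate 1 / (tau * ln(I / (I - theta))), which grows roughly linearly above threshold. The sketch below uses that standard rate formula with illustrative parameters, not S2Act's trained ones.

```python
import math

# Rate response of a leaky integrate-and-fire neuron under constant input.
# Below threshold it is silent (ReLU's flat part); above threshold the
# rate is nearly linear in the input, approaching I/(tau*theta) - 1/2.

def lif_rate(i_in, tau=1.0, theta=1.0):
    if i_in <= theta:
        return 0.0  # sub-threshold: no spikes
    return 1.0 / (tau * math.log(i_in / (i_in - theta)))

def relu(x):
    return max(0.0, x)
```

Comparing the two curves shows why training with a compatible (ReLU-like) activation and then transferring weights into LIF parameters can work: above threshold the rate response tracks a shifted linear function closely.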

[MA-10] Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking

[Quick Read]: This paper addresses steganographic collusion in decentralized multi-agent reinforcement learning (MARL), where agents evade external monitoring through covert communication protocols, a serious AI safety threat. Existing defenses are limited to the behavioral or reward layer and cannot detect coordination in latent communication channels. The key of the solution is the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate: a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck turns unobservable messages into auditable statistical objects, and an EMA-based collusion score is computed from signals such as Jensen-Shannon divergence drift, L2-norm codebook displacement, and randomized observer-pool accuracy. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the advantage function, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer-state reset. On a contextual Prisoner's Dilemma with MNIST labels, DRCB raises observer mean accuracy by 9.3% and reduces volatility by 43% while preserving joint reward; the analysis also reveals a "semantic degradation" phenomenon and a "transparency paradox," offering a task-agnostic path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing.

Link: https://arxiv.org/abs/2603.15655
Authors: Liu Hung Ming
Affiliations: PARRAWA AI
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Comments: 38 pages, includes 5 figures and 8 tables, preliminary version, AI safety / multi-agent reinforcement learning

Abstract:In decentralized Multi-Agent Reinforcement Learning (MARL), steganographic collusion – where agents develop private protocols to evade monitoring – presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA-based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the Advantage function A^pi, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner’s Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms “Semantic Degradation,” where high-frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a “Transparency Paradox” where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions, reflecting Goodhart’s Law. This task-agnostic methodology provides a technical path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing for autonomous systems. 
MSC classes: 68T05, 91Axx, 68P25. ACM classes: I.2.11; I.2.6; K.4.1
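The EMA-based collusion score can be sketched from its Jensen-Shannon component alone; the codebook-displacement and observer-accuracy signals, as well as the four escalating interventions, are omitted here, and all constants are illustrative rather than the paper's.

```python
import math

# Monitoring sketch: Jensen-Shannon divergence between a reference symbol
# distribution and the current one, smoothed with an exponential moving
# average. A threshold breach would trigger the circuit breaker.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

class CollusionMonitor:
    def __init__(self, reference, alpha=0.5, threshold=0.1):
        self.ref, self.alpha, self.threshold = reference, alpha, threshold
        self.score = 0.0

    def update(self, dist):
        """EMA of distribution drift; True means the breaker should trip."""
        drift = js_divergence(self.ref, dist)
        self.score = (1 - self.alpha) * self.score + self.alpha * drift
        return self.score > self.threshold
```

The EMA means a single drifted batch does not trip the breaker, while sustained drift (the signature of an emerging covert protocol) does.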

Natural Language Processing

[NLP-0] Efficient Reasoning on the Edge

[Quick Read]: This paper addresses the excessive resource consumption of deploying large language models (LLMs) on edge devices caused by verbose reasoning and large context requirements, concretely: high token-generation cost, large KV-cache footprints, and redundant reasoning traces during knowledge distillation. The key of the solution is a lightweight approach that combines LoRA (Low-Rank Adaptation) adapters with supervised fine-tuning (SFT) and introduces budget forcing via reinforcement learning, substantially shortening responses with minimal accuracy loss; parallel test-time scaling alleviates the memory bottleneck, while a dynamic adapter-switching mechanism and KV-cache sharing during prompt encoding reduce time-to-first-token, enabling efficient and accurate on-device reasoning in resource-constrained mobile scenarios.

Link: https://arxiv.org/abs/2603.16867
Authors: Yelysei Bondarenko,Thomas Hehn,Rob Hesselink,Romain Lepert,Fabio Valerio Massoli,Evgeny Mironov,Leyla Mirvakhabova,Tribhuvanesh Orekondy,Spyridon Stasis,Andrey Kuzmin,Anna Kuzina,Markus Nagel,Ankita Nayak,Corrado Rainone,Ork de Rooij,Paul N Whatmough,Arash Behboodi,Babak Ehteshami Bejnordi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Project page: this https URL

Abstract:Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

[NLP-1] Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

[Quick Read]: This paper addresses the difficulty current conversational memory systems have in reasoning over temporally grounded facts and preferences that evolve across long dialogue histories, and their lack of effective retrieval strategies for multi-hop, time-sensitive queries. The key of the solution is Chronos, a temporal-aware memory framework whose core innovation is decomposing raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, stored in a structured event calendar alongside a turn calendar; at query time, dynamic prompting generates tailored retrieval guidance that directs the agent through an iterative tool-calling loop over both calendars for time-range filtering and multi-hop reasoning, markedly improving comprehension and use of long-term dialogue history.

Link: https://arxiv.org/abs/2603.16862
Authors: Sahil Sen,Elias Lumer,Anmol Gulati,Vamse Kumar Subbiah
Affiliations: PricewaterhouseCoopers
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
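The event-calendar idea reduces time-sensitive retrieval to a range filter over resolved tuples. Here is a minimal hypothetical sketch; the field layout and method names are assumptions for illustration, not Chronos's actual schema.

```python
from datetime import date

# Dialogue is reduced to (subject, verb, object) tuples with resolved
# date ranges, so a time-sensitive query becomes a simple interval
# overlap test instead of fuzzy search over raw transcripts.

class EventCalendar:
    def __init__(self):
        self.events = []

    def add(self, subject, verb, obj, start, end=None):
        """Point events get end == start."""
        self.events.append((subject, verb, obj, start, end or start))

    def query(self, start, end, subject=None):
        """Events overlapping [start, end], optionally for one subject."""
        return [e for e in self.events
                if e[3] <= end and e[4] >= start
                and (subject is None or e[0] == subject)]
```

A question like "what happened last spring?" then maps to a single range query, and multi-hop questions can chain several such queries, which is the role of the iterative tool-calling loop in the paper.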

[NLP-2] Online Experiential Learning for Language Models

[Quick Read]: This paper addresses the fact that the rich interaction experience accumulated during real-world deployment of LLMs goes unexploited under the conventional paradigm of offline training with human annotations or simulated environments. The core of the solution is the Online Experiential Learning (OEL) framework, which improves models continuously in two key stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, without access to the user-side environment. OEL forms an online learning loop in which the improved model collects higher-quality trajectories in each iteration, boosting task accuracy and token efficiency while preserving out-of-distribution performance.

Link: https://arxiv.org/abs/2603.16856
Authors: Tianzhu Ye,Li Dong,Qingxiu Dong,Xun Wu,Shaohan Huang,Furu Wei
Affiliations: Microsoft Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.

[NLP-3] Mediocrity is the key for LLM as a Judge Anchor Selection

[Quick Read]: This paper tackles the unreliability of the "LLM-as-a-judge" paradigm caused by poor anchor selection. Popular benchmarks such as Arena-Hard and AlpacaEval compare all models against a single anchor to avoid the quadratic cost of pairwise comparisons, but the choice of anchor strongly affects correlation with human rankings and has remained largely unexplored. The key contributions are: first, an empirical study of 22 anchors on Arena-Hard-v2.0 showing that extreme anchors (best- or worst-performing models) sharply reduce correlation with human rankings; second, a quantification of the effect of anchor choice, shown to be comparable in size to the choice of judge model. The paper closes with two actionable recommendations: a power analysis determining benchmark sizes large enough to reliably distinguish competitive models, and practical guidelines for selecting informative anchors to ensure reliable and efficient evaluation.

Link: https://arxiv.org/abs/2603.16848
Authors: Shachar Don-Yehiya,Asaf Yehudai,Leshem Choshen,Omri Abend
Affiliations: The Hebrew University of Jerusalem; IBM Research; MIT; MIT-IBM Watson AI Lab
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
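The intuition that extreme anchors are uninformative can be made concrete with a toy calculation: when an anchor beats (or loses to) every model by a wide margin, win rates against it compress into a narrow, noisy band, and the ranking they induce correlates worse with a reference ranking. The numbers below are invented for illustration; `spearman` is a plain rank-correlation helper.

```python
def spearman(xs, ys):
    """Spearman rank correlation for equal-length lists without ties."""
    n = len(xs)
    def ranks(v):
        order = sorted(range(n), key=lambda i: v[i])
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical human preference scores for four models.
human = [0.9, 0.7, 0.5, 0.3]
# Win rates vs a mid-strength anchor are spread out and preserve the ordering...
mid_anchor = [0.8, 0.6, 0.4, 0.2]
# ...while win rates vs a far-too-strong anchor are compressed near zero,
# where small amounts of judge noise scramble the ordering.
extreme_anchor = [0.03, 0.01, 0.02, 0.00]
```

Here the mid-strength anchor reproduces the human ranking exactly (rho = 1.0), while the compressed win rates against the extreme anchor yield a weaker correlation (rho = 0.8), mirroring the paper's finding in miniature.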

[NLP-4] Prompt Programming for Cultural Bias and Alignment of Large Language Models

[Quick Read]: This paper addresses cultural bias in large language models (LLMs) used for strategic decision-making, policy support, and document engineering, where models inherit default cultural priors that misalign downstream analyses and recommendations with target-population values. The key idea is to improve cultural alignment through cultural conditioning: the authors first reproduce and extend a survey-grounded framework of projection and distance metrics on open-weight LLMs, confirming both the cultural skew and the benefit of culture-specific prompting; they then introduce prompt programming with DSPy, treating prompts as modular, optimizable programs tuned against cultural-distance objectives, which yields more stable and transferable cultural alignment.

Link: https://arxiv.org/abs/2603.16827
Authors: Maksim Eren,Eric Michalak,Brian Cook,Johnny Seales Jr
Affiliations: Los Alamos National Laboratory
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, pre-print

View abstract

Abstract:Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as summarization, categorization, and compliance-oriented auditing, improving cultural alignment is important for ensuring that downstream analyses and recommendations reflect target-population value profiles rather than default model priors. Previous work introduced a survey-grounded cultural alignment framework and showed that culture-specific prompting can reduce misalignment, but it primarily evaluated proprietary models and relied on manual prompt engineering. In this paper, we validate and extend that framework by reproducing its social-sciences-survey-based projection and distance metrics on open-weight LLMs, testing whether the same cultural skew and benefits of culture conditioning persist outside closed LLM systems. Building on this foundation, we introduce the use of prompt programming with DSPy for this problem, treating prompts as modular, optimizable programs, to systematically tune cultural conditioning by optimizing against cultural-distance objectives. In our experiments, we show that prompt optimization often improves upon cultural prompt engineering, suggesting prompt compilation with DSPy can provide a more stable and transferable route to culturally aligned LLM responses.

[NLP-5] Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

[Quick Read]: This paper targets the reliability problem posed by frequent LLM hallucinations in knowledge-intensive applications, studying the combination of retrieval-augmented generation (RAG) with conformal factuality to balance factual correctness against usefulness. The key contributions are threefold: novel informativeness-aware metrics that better reflect the task utility of conformally filtered outputs; evidence that conformal filtering produces vacuous outputs at high factuality levels and that its guarantee is fragile under distribution shifts and distractors, exposing robustness limitations of current methods; and lightweight entailment-based verifiers that match or outperform LLM-based confidence scorers at over 100x fewer FLOPs, offering a practical path to RAG systems that are both reliable and computationally efficient.

Link: https://arxiv.org/abs/2603.16817
Authors: Yi Chen,Daiwei Chen,Sukrut Madhav Chikodikar,Caitlyn Heqi Yin,Ramya Korlakai Vinayak
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 56 pages

View abstract

Abstract:Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) the conformal factuality guarantee is not robust to distribution shifts and distractors, highlighting the limitation that requires calibration data to closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over 100x fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and the fragility of the conformal filtering framework under distribution shifts and distractors, highlighting the need for new approaches for reliability with robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.
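The filtering pipeline under study follows the standard split-conformal recipe: score each atomic claim, calibrate a threshold on held-out outputs whose claims have known correctness, then keep only test claims scoring above it. A minimal sketch follows; the scorer, the toy data, and the simple quantile rule are illustrative, not the paper's exact procedure.

```python
import math

def conformal_threshold(cal_outputs, alpha=0.1):
    """cal_outputs: list of outputs, each a list of (score, is_correct) claims.
    Returns a score threshold such that, under exchangeability, keeping only
    claims above it leaves all-correct outputs with probability >= 1 - alpha."""
    # Nonconformity per output: highest score among its *incorrect* claims
    # (filtering above this value removes every false claim in that output).
    rs = []
    for claims in cal_outputs:
        bad = [s for s, ok in claims if not ok]
        rs.append(max(bad) if bad else float("-inf"))
    n = len(rs)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index
    return sorted(rs)[min(k, n) - 1]

def keep_claims(claims, t):
    """Keep claims scoring strictly above the calibrated threshold."""
    return [(s, ok) for s, ok in claims if s > t]

cal_outputs = [
    [(0.9, True), (0.4, False)],
    [(0.8, True), (0.3, False)],
    [(0.7, True)],
    [(0.95, True), (0.5, False)],
]
t = conformal_threshold(cal_outputs, alpha=0.1)
```

The paper's "vacuous output" failure mode is visible even here: pushing alpha down raises the threshold, and an output whose claims all score below it is filtered to nothing.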

[NLP-6] SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

[Quick Read]: This paper addresses the lack of sufficiently diverse, large-scale simulation of spoken user behavior for task-oriented dialogue (TOD) systems, which limits agent robustness: existing datasets fall short in scale and domain coverage, and there is no systematic augmentation pipeline for the complex behaviors of real human interaction. The key solution is SpokenTOD, a large-scale spoken TOD dataset (52,390 dialogues and 1,034 hours of speech) augmented with four spoken behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) to increase diversity and realism. Building on it, the paper presents SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in; it matches the goal coverage of much larger models while clearly outperforming baselines in human MOS, and discloses slot values gradually across the dialogue as humans do, improving the training and evaluation of downstream dialogue agents.

Link: https://arxiv.org/abs/2603.16783
Authors: Jonggeun Lee,Junseong Pyo,Jeongmin Park,Yohan Jo
Affiliations: Seoul National University; Hanyang University
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors – cross-turn slots, barge-in, disfluency, and emotional prosody – across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS’s spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.

[NLP-7] SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit

[Quick Read]: This paper addresses privacy leakage from shared gradients during large language model (LLM) training, specifically the hard problem of reconstructing private training text from aggregated gradients. Existing methods work well in small-batch settings but degrade with large batch sizes and long sequences due to severe signal mixing, high computational cost, and reduced reconstruction fidelity. The key idea is SOMP (Subspace-Guided Orthogonal Matching Pursuit), a scalable gradient-inversion framework that casts text recovery as a sparse signal recovery problem and exploits the head-wise geometric structure and sample-level sparsity inherent in transformer gradients; subspace guidance progressively narrows the search space and disentangles mixed signals, yielding efficient, high-fidelity reconstruction that clearly outperforms strong baselines at B=16 and still recovers meaningful text under extreme aggregation (B=128), showing that privacy leakage persists in regimes where prior attacks fail.

Link: https://arxiv.org/abs/2603.16761
Authors: Yibo Li,Qiongxiu Li
Affiliations: Politecnico di Milano; Aalborg University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 18 pages, 4 figures, 13 tables

View abstract

Abstract:Gradient inversion attacks reveal that private training text can be reconstructed from shared gradients, posing a privacy risk to large language models (LLMs). While prior methods perform well in small-batch settings, scaling to larger batch sizes and longer sequences remains challenging due to severe signal mixing, high computational cost, and degraded fidelity. We present SOMP (Subspace-Guided Orthogonal Matching Pursuit), a scalable gradient inversion framework that casts text recovery from aggregated gradients as a sparse signal recovery problem. Our key insight is that aggregated transformer gradients retain exploitable head-wise geometric structure together with sample-level sparsity. SOMP leverages these properties to progressively narrow the search space and disentangle mixed signals without exhaustive search. Experiments across multiple LLM families, model scales, and five languages show that SOMP consistently outperforms prior methods in the aggregated-gradient setting. For long sequences at batch size B=16, SOMP achieves substantially higher reconstruction fidelity than strong baselines, while remaining computationally competitive. Even under extreme aggregation (up to B=128), SOMP still recovers meaningful text, suggesting that privacy leakage can persist in regimes where prior attacks become much less effective.
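SOMP builds on orthogonal matching pursuit (OMP), the classic greedy algorithm for sparse recovery. The paper's subspace guidance and head-wise gradient structure are beyond this sketch, but plain OMP, recovering a sparse coefficient vector from y = Dx, looks like the following (the orthonormal toy dictionary guarantees exact recovery here; real gradient-inversion settings are far harder):

```python
import numpy as np

def omp(D, y, k):
    """Greedy OMP: select k columns (atoms) of D that best explain y,
    re-fitting the coefficients by least squares after each selection."""
    residual = y.astype(float)
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Orthogonal step: least-squares re-fit on the selected support.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(D.shape[1])
        x[support] = coef
        residual = y - D @ x
    return x

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # orthonormal toy dictionary
x_true = np.zeros(16)
x_true[[3, 11]] = [2.0, -1.5]                       # 2-sparse ground truth
y = D @ x_true
x_hat = omp(D, y, k=2)
```

In the attack setting, the "dictionary" columns correspond to candidate token contributions and y to the aggregated gradient, so the least-squares re-fit is what lets a matching-pursuit attack disentangle mixed per-sample signals.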

[NLP-8] TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

[Quick Read]: This paper addresses the single-turn bias of current language model training and evaluation data, which overlooks the capabilities specific to multi-turn interaction. The core challenge is the absence of systematic multi-turn evaluation and effective training recipes, limiting real-world multi-turn performance. The key contributions are: TurnWiseEval, a new benchmark directly comparable to single-turn chat evaluation that isolates multi-turn-specific ability via pairwise comparison; and TurnWiseData, a scalable synthetic pipeline for efficiently generating multi-turn training data at scale. Experiments on Olmo 3 show that post-training with as little as 10k multi-turn conversations yields a 12% improvement on TurnWiseEval, confirming the importance of multi-turn data for strengthening multi-turn capabilities.

Link: https://arxiv.org/abs/2603.16759
Authors: Victoria Graf,Valentina Pyatkin,Nouha Dziri,Nathan Lambert,Hannaneh Hajishirzi
Affiliations: University of Washington; Allen Institute for AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.

[NLP-9] Probing Cultural Signals in Large Language Models through Author Profiling

[Quick Read]: This paper investigates whether large language models (LLMs) encode cultural biases in socially impactful applications, specifically whether they show systematic skew when performing zero-shot author profiling from song lyrics, inferring singers' gender and ethnicity without task-specific fine-tuning. The key contribution is the design and application of two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), to quantify bias in the models' inferences; analyzing prediction distributions and generated rationales reveals cultural-alignment tendencies, identifying, for example, strong ethnicity bias in Ministral-8B and more balanced behavior in Gemma-12B.

Link: https://arxiv.org/abs/2603.16749
Authors: Valentin Lafargue,Ariel Guerra-Adames,Emmanuelle Claeys,Elouan Vuichard,Jean-Michel Loubes
Affiliations: IMT, Toulouse, France; INRIA Bordeaux, France; ANITI 2, Toulouse, France; IRIT, Toulouse, France; Université de Bordeaux, Bordeaux, France; BPH, Inserm, France; CNRS IRL CROSSING, Adelaide, Australia
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers’ gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models’ prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (this https URL).
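The abstract does not spell out how MAD and RD are computed, so the sketch below is one plausible reading, offered purely as an illustration: the worst-case gap in accuracy across groups, and the worst-case gap in recall across label classes. The function bodies and the toy data are hypothetical, not the paper's definitions.

```python
def modality_accuracy_divergence(preds, labels, groups):
    """Hypothetical reading of MAD: largest gap in accuracy across groups."""
    accs = []
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        accs.append(sum(preds[i] == labels[i] for i in idx) / len(idx))
    return max(accs) - min(accs)

def recall_divergence(preds, labels):
    """Hypothetical reading of RD: largest gap in per-class recall."""
    recalls = []
    for c in set(labels):
        idx = [i for i, yi in enumerate(labels) if yi == c]
        recalls.append(sum(preds[i] == c for i in idx) / len(idx))
    return max(recalls) - min(recalls)

# Toy predictions: gender labels over two groups of lyrics.
preds  = ["m", "f", "m", "f"]
labels = ["m", "m", "m", "f"]
groups = ["A", "A", "B", "B"]
```

A divergence of 0 under either metric means the model behaves identically across groups or classes; larger values flag the kind of disparity the paper reports for Ministral-8B.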

[NLP-10] Retrieving Counterfactuals Improves Visual In-Context Learning CVPR2026

[Quick Read]: This paper addresses the difficulty vision-language models (VLMs) have in disentangling fine-grained visual attributes and reasoning about underlying causal relationships in multimodal reasoning tasks, especially under in-context learning (ICL), where existing retrieval-augmented methods rely on passive similarity matching and tend to select correlated but non-causal examples, amplifying spurious associations and limiting robustness. The key solution is CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a framework that actively constructs demonstration sets containing counterfactual-style examples via targeted, attribute-guided composed image retrieval, enabling VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond surface correlations toward more robust and interpretable reasoning.

Link: https://arxiv.org/abs/2603.16737
Authors: Guangzhi Xiong,Sanchit Sinha,Zhenghao He,Aidong Zhang
Affiliations: University of Virginia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: CVPR 2026

View abstract

Abstract:Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at this https URL.

[NLP-11] IQuest-Coder-V1 Technical Report

[Quick Read]: This paper addresses the difficulty of balancing dynamic software-logic modeling, complex reasoning ability, and deployment efficiency in current code large language models (Code LLMs). The key idea is a new code-flow multi-stage training paradigm that strengthens the model through three progressive stages: initial pre-training on code facts, repository, and completion data; mid-training that integrates 32k-context reasoning and agentic trajectories with 128k-context repository-scale knowledge to build deep logical foundations; and post-training that bifurcates into a thinking path (reasoning-driven RL) and an instruct path (optimized for general assistance), targeting complex programming tasks and practical capability respectively. To address deployment constraints, the IQuest-Coder-V1-Loop variant adds a recurrent mechanism that optimizes the trade-off between model capacity and deployment footprint, advancing research on autonomous code intelligence and real-world agentic systems.

Link: https://arxiv.org/abs/2603.16733
Authors: Jian Yang,Wei Zhang,Shawn Guo,Zhengmao Ye,Lin Jing,Shark Liu,Yizhi Li,Jiajun Wu,Cening Liu,X. Ma,Yuyang Song,Siwei Wu,Yuwen Li,L. Liao,T. Zheng,Ziling Huang,Zelong Huang,Che Liu,Yan Xing,Renyuan Li,Qingsong Cai,Hanxu Yan,Siyue Wang,Shikai Li,Jason Klein Liu,An Huang,Yongsheng Kang,Jinxing Zhang,Chuan Hao,Haowen Wang,Weicheng Gu,Ran Tao,Mingjie Tang,Peihao Wu,Jianzhou Wang,Xianglong Liu,Weifeng Lv,Bryan Dai
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

View abstract

Abstract:In this report, we introduce the IQuest-Coder-V1 series-(7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, we implement a specialized mid-training stage that integrates reasoning and agentic trajectories in 32k-context and repository-scale in 128k-context to forge deep logical foundations. The models are then finalized with post-training of specialized coding capabilities, which is bifurcated into two specialized paths: the thinking path (utilizing reasoning-driven RL) and the instruct path (optimized for general assistance). IQuest-Coder-V1 achieves state-of-the-art performance among competitive models across critical dimensions of code intelligence: agentic software engineering, competitive programming, and complex tool use. To address deployment constraints, the IQuest-Coder-V1-Loop variant introduces a recurrent mechanism designed to optimize the trade-off between model capacity and deployment footprint, offering an architecturally enhanced path for efficacy-efficiency trade-off. We believe the release of the IQuest-Coder-V1 series, including the complete white-box chain of checkpoints from pre-training bases to the final thinking and instruction models, will advance research in autonomous code intelligence and real-world agentic systems.

[NLP-12] Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models

[Quick Read]: This paper addresses the unclear ability of large language models (LLMs) to produce explicit linguistic structure, evaluated on Arabic, a challenging language with rich morphology and orthographic ambiguity. It focuses on two structured prediction tasks: morphosyntactic tagging and labeled dependency parsing. The key approach compares two strategies, zero-shot prompting and retrieval-based in-context learning (ICL) with demonstrations selected from Arabic treebanks, to assess how well LLMs model Arabic morphosyntax and syntax. Results show that prompt design and demonstration selection strongly affect performance; under the right conditions, proprietary models approach supervised baselines on feature-level tagging and become competitive on dependency parsing, and in raw-text settings retrieval-augmented ICL improves both tokenization and parsing.

Link: https://arxiv.org/abs/2603.16718
Authors: Mohamed Adel,Bashar Alhafni,Nizar Habash
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.

[NLP-13] CritiSense: Critical Digital Literacy and Resilience Against Misinformation

[Quick Read]: This paper addresses the harm that misinformation on social media does to public decision-making and trust. The key solution is CritiSense, a mobile media-literacy app that builds resistance through short, interactive challenges with instant feedback, teaching users to recognize manipulation tactics before they encounter real misinformation ("prebunking"). CritiSense is the first such platform that is multilingual (supporting nine languages) and modular, enabling rapid updates across topics and domains, and it provides an experimental testbed for measuring the impact of microlearning on misinformation resilience.

Link: https://arxiv.org/abs/2603.16672
Authors: Firoj Alam,Fatema Ahmad,Ali Ezzat Shahroor,Mohamed Bayan Kmainasi,Elisa Sartori,Giovanni Da San Martino,Abul Hasnat,Raian Ali
Affiliations: Qatar Computing Research Institute, Qatar; University of Padova, Italy; APAVI.AI, France; Hamad Bin Khalifa University, Qatar
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: resilience, disinformation, misinformation, fake news, propaganda

View abstract

Abstract:Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 3+ months, we have reached 300+ active users. It is freely available to all users on the Apple App Store (this https URL) and Google Play Store (this https URL). Demo Video: this https URL

[NLP-14] Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings? EACL2026

[Quick Read]: This paper addresses the performance bottleneck of large language models (LLMs) in extremely low-resource machine translation, asking how to adapt efficiently and lightly when large parallel corpora or extensive fine-tuning are unavailable. The key idea is to avoid parameter updates entirely and instead use inference-time prompting that combines linguistically related pivot languages with few-shot in-context examples for on-the-fly adaptation to the target language. Experiments show modest gains when the target language is underrepresented in the model's vocabulary, but the effect is sensitive to how few-shot examples are constructed and diminishes when the languages are closely related or well represented, providing empirical guidance on when prompt engineering helps in low-resource translation.

Link: https://arxiv.org/abs/2603.16660
Authors: Aishwarya Ramasethu,Niyathi Allu,Rohin Garg,Harshwardhan Fartale,Dun Li Chan
Affiliations: Prediction Guard; Scale AI; INTI International College Penang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages (9 main paper and 9 Appendix), 1 figure, 19 tables. Accepted at LoResMT 2026: EACL 2026 Workshop. OpenReview link: this https URL

View abstract

Abstract:Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model’s vocabulary, the gains are often modest and sensitive to few shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.

[NLP-15] Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

[Quick Read]: This paper addresses the lack of fine-grained reasoning-process annotations in current multi-hop QA benchmarks, which makes it hard to judge whether models truly reason correctly and to localize their failures. The key solution is the Omanic dataset, whose structured annotations of sub-questions and intermediate answers support step-by-step diagnosis of reasoning paths: OmanicBench provides 967 expert-reviewed evaluation examples for rigorous assessment, while OmanicSynth provides 10,296 machine-generated training examples; supervised fine-tuning on OmanicSynth yields substantial transfer gains (7.41 points on average) across multiple reasoning and math benchmarks, validating the dataset's effectiveness for reasoning-capability transfer.

Link: https://arxiv.org/abs/2603.16654
Authors: Xiaojie Gu,Sherry T. Tong,Aosong Feng,Sophia Simeng Han,Jinghui Lu,Yingjian Chen,Yusuke Iwasawa,Yutaka Matsuo,Chanjun Park,Rex Ying,Irene Li
Affiliations: The University of Tokyo; Yale University; Stanford University; Xiaomi EV; Soongsil University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT’s performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset’s quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at this https URL and the code at this https URL.

[NLP-16] Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy

[Quick Read]: This paper addresses sycophancy inadvertently induced by alignment in large language models (LLMs): the tendency to make inaccurate or unjustified judgments in order to please users or defer to authority. The study finds that although Chain-of-Thought (CoT) reasoning generally reduces sycophancy in final decisions, it can also serve as a tool for post-hoc rationalization: in some samples, models construct deceptive justifications through logical inconsistencies, calculation errors, or one-sided arguments, masking the underlying bias. The key insight is that the tendency toward sycophancy evolves dynamically along the reasoning path rather than being fixed at the input stage, suggesting that future alignment strategies should target the reasoning process itself, for example through interpretability-based monitoring or dynamic correction mechanisms that suppress covert sycophancy.

Link: https://arxiv.org/abs/2603.16643
Authors: Zhaoxin Feng,Zheng Chen,Jianfei Ma,Yip Tin Po,Emmanuele Chersoni,Bo Li
Affiliations: The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies studied this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, and one-sided arguments etc. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority-bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.

[NLP-17] When AI Navigates the Fog of War

[Quick Read]: This paper asks whether large language models (LLMs) can reason about a geopolitical conflict before its trajectory becomes historically obvious. The central difficulty is that training-data leakage confounds retrospective prediction, making it hard to separate genuine reasoning from recall of known outcomes. The solution is a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models and thus largely avoids leakage. Its core design constructs 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information publicly available at each moment, thereby assessing LLM reasoning under the fog of war and providing a methodological basis for the first temporally grounded analysis of LLM geopolitical reasoning.

Link: https://arxiv.org/abs/2603.16642
Authors: Ming Li,Xirui Li,Tianyi Zhou
Affiliations: University of Maryland; Mohamed Bin Zayed University of Artificial Intelligence
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

View abstract

Abstract:Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.

[NLP-18] Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model

【速读】: 该论文旨在解决基础语言模型(base model)与目标语言模型(target model)在分布上不一致的问题,即如何通过调整预训练或持续预训练阶段的数据域权重,使基础模型的参数更新方向趋向于目标模型,从而实现更好的对齐。解决方案的关键在于将模型视为对数似然空间(log-likelihood space)中的点,并设计一种基于训练更新方向与指向目标模型方向一致性的域权重确定方法,而非依赖传统的知识蒸馏(knowledge distillation)。该方法在NanoGPT实验中表现出比均匀加权Pile数据集更优的KL散度降低效果,且下游任务性能也更接近目标模型,即使在无知识蒸馏的情况下仍能实现有效对齐。

Link: https://arxiv.org/abs/2603.16622
Authors: Ryo Kishino, Riku Shiomi, Hiroaki Yamagiwa, Momose Oyama, Hidetoshi Shimodaira
Affiliations: Kyoto University; RIKEN
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.
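
The direction-alignment idea can be illustrated with a small toy sketch (all numbers and names below are invented for illustration; this is not the authors' implementation): models are treated as points in log-likelihood space, each domain contributes an update direction, and the mixture weights are chosen so the mixed direction points toward the target model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_mixture(dir_a, dir_b, base, target, steps=100):
    """Grid-search weights for two domains so the mixed update direction
    best aligns (by cosine) with the direction from base toward target."""
    goal = [t - b for t, b in zip(target, base)]
    best_w, best_cos = (1.0, 0.0), -2.0
    for i in range(steps + 1):
        w = (i / steps, 1 - i / steps)
        mixed = [w[0] * a + w[1] * b for a, b in zip(dir_a, dir_b)]
        c = cosine(mixed, goal)
        if c > best_cos:
            best_w, best_cos = w, c
    return best_w, best_cos

# Models as points in log-likelihood space over 3 held-out examples (toy numbers).
base, target = [-3.0, -2.5, -4.0], [-2.5, -2.0, -2.5]
dir_a = [1.0, 0.0, 0.0]   # training on domain A mostly improves example 1
dir_b = [0.0, 0.5, 1.5]   # training on domain B improves examples 2 and 3
weights, align = best_mixture(dir_a, dir_b, base, target)
```

Here domain B receives the larger weight because the gap to the target is mostly on the examples it improves; the actual method operates on real per-example log-likelihoods rather than hand-set directions.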

[NLP-19] Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

[Quick Read]: This paper tackles the limitations of cross-lingual sentence encoders, which typically cover only a few hundred languages and trade downstream task quality for stronger alignment, restricting their practical adoption. The key to the solution is OmniSONAR, a cross-modal, omnilingual sentence embedding model that natively embeds text, speech, code, and mathematical expressions in a single space, using progressive training to deliver high-quality representations for thousands of languages (from high-resource to extremely low-resource varieties) while avoiding representation collapse. Its core components are: an LLM-initialized encoder-decoder combined with a split-softmax contrastive loss and synthetic hard negatives to build a foundational semantic space for 200 languages; a two-stage teacher-student distillation framework that expands coverage to thousands of language varieties; and a demonstration that 177 spoken languages can be seamlessly mapped into this space, substantially improving cross-lingual similarity search and translation while enabling zero-shot transfer on speech tasks with speech-to-text quality approaching SeamlessM4T.

Link: https://arxiv.org/abs/2603.16606
Authors: Omnilingual SONAR Team: João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne
Affiliations: Meta
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

[NLP-20] Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry

[Quick Read]: This paper addresses the lack of a unified resource for systematically studying Arabic creative text (song lyrics and poetry) across language varieties, historical periods, and cultural contexts; existing datasets are typically limited to a single variety or era, making multi-dimensional comparative analysis difficult. The key to the solution is the Tarab Corpus, a large-scale open Arabic corpus of 2.56 million verses and more than 13.5 million tokens. Its central innovation is bringing Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties (Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic) into a single analytical framework, with structured metadata (linguistic variety, geographic origin, and historical/cultural context) that supports comparative linguistic, stylistic, and diachronic analysis across genres (songs vs. poems), regions, and time.

Link: https://arxiv.org/abs/2603.16601
Authors: Mo El-Haj
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 10 pages

Abstract:We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at this https URL.

[NLP-21] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

[Quick Read]: This paper targets the severe performance degradation of existing Post-Training Quantization (PTQ) methods when applied to Microscaling Floating-Point (MXFP) formats, particularly rotation-based methods under MXFP4, where global orthogonal transformations propagate outliers across blocks and produce bimodal activation distributions, disrupting local block-wise scaling and wasting the limited quantization range. The key to the solution is BATQuant (Block-wise Affine Transformation), which restricts transformations to MXFP granularity to prevent cross-block outlier propagation and relaxes orthogonality constraints to shape activation distributions; it further introduces Global and Private Kronecker (GPK) decomposition for parameter-efficient compression and Block-wise Learnable Clipping to suppress residual outliers, ultimately surpassing existing methods under the W4A4KV16 configuration.

Link: https://arxiv.org/abs/2603.16590
Authors: Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 13 figures, 7 tables

Abstract:Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduces storage and runtime overhead and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
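
The outlier problem that motivates BATQuant can be seen in a toy sketch of MXFP4-style block quantization, a simplification of the OCP MX format: one shared power-of-two scale per block, elements rounded to FP4 (E2M1) values. The block size and scale rule here are illustrative, not the paper's method.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Toy MXFP4-style quantization: one shared power-of-two scale per
    block, each element rounded to the nearest representable FP4 value."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Choose the scale so the block maximum lands near FP4's max value (6.0).
    scale = 2.0 ** math.floor(math.log2(amax / 6.0))
    quantized = []
    for x in block:
        mag = min(FP4_E2M1, key=lambda v: abs(abs(x) / scale - v))
        quantized.append(math.copysign(mag * scale, x))
    return quantized

# One outlier inflates the shared scale and flushes the small values to zero,
# which is why transforms that leak outliers across blocks are harmful.
with_outlier = quantize_block([0.1, -0.2, 0.15, 48.0])
no_outlier = quantize_block([0.1, -0.2, 0.15, 0.25])
```

With the outlier present, the three small entries all quantize to zero; without it, they retain usable resolution. BATQuant's block-aligned transforms aim to keep exactly this situation from being created by the transform itself.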

[NLP-22] When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

[Quick Read]: This paper addresses the scalability bottleneck of outcome-based reinforcement learning (RL) for improving the mathematical reasoning of Large Language Models (LLMs), namely its heavy reliance on computationally expensive ground-truth annotations. The key to the proposed unsupervised RL scheme is a suite of intrinsic rewards that explicitly encourage concise and certain generation, together with a geometric diagnostic lens that reveals the boundary conditions of training stability: successful configurations tend to be enveloped by manifolds, whereas failures manifest as policy collapse or reward hacking. The method not only validates the effectiveness of unsupervised RL for mathematical reasoning but also identifies its failure regimes and offers a theoretical explanation for them.

Link: https://arxiv.org/abs/2603.16578
Authors: Zelin Zhang, Fei Cheng, Chenhui Chu
Affiliations: Kyoto University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: work in progress

Abstract:Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model’s foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.

[NLP-23] Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects

[Quick Read]: This paper examines whether the Transformer architectures underlying current generative AI models are cognitively adequate models of human sentence processing, in particular whether their behavior on English agreement and attraction effects matches human behavior. The key to the solution is a surprisal-based linking mechanism used to systematically evaluate eleven autoregressive Transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than in prior work. The results show that although model predictions align reasonably well with human reading-time data in prepositional phrase configurations, performance degrades markedly in object-extracted relative clause configurations, predictions diverge widely across models, and no model reproduces the asymmetric interference patterns observed in humans, indicating that current models do not explain human morphosyntactic processing.

Link: https://arxiv.org/abs/2603.16574
Authors: Titus von der Malsburg, Sebastian Padó
Affiliations: University of Stuttgart
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than prior work. Our experiments yield mixed results: While transformer predictions generally align with human reading time data for prepositional phrase configurations, performance degrades significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model successfully replicates the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.
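
The surprisal-based linking mechanism is simple to state: a word's surprisal is its negative log probability given the preceding context, and higher surprisal is linked to longer human reading times. A minimal sketch, with invented probabilities standing in for the model-estimated ones:

```python
import math

def surprisal(prob):
    """Surprisal in bits of a word given its context: -log2 P(w | context)."""
    return -math.log2(prob)

# Invented next-word distribution after "The key to the cabinets ...",
# a classic agreement-attraction frame (probabilities are illustrative only).
p_next = {"is": 0.55, "are": 0.25, "was": 0.20}
s_grammatical = surprisal(p_next["is"])   # agrees with the head noun "key"
s_attracted = surprisal(p_next["are"])    # agrees with the attractor "cabinets"
```

A model predicts attraction effects when the attracted form's surprisal is low enough to mirror the reading-time facilitation humans show; the paper's evaluation compares such per-region surprisal profiles against human data.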

[NLP-24] Characterizing Delusional Spirals through Human-LLM Chat Logs

[Quick Read]: This paper addresses the psychological harms that generative AI can cause users in prolonged conversations, particularly the "delusional spirals" associated with "AI psychosis," i.e., progressively reinforcing irrational interaction patterns between users and chatbots; prior work has largely been speculative and lacks systematic analysis of real high-risk cases. The key to the solution is the first structured analysis of 391,562 conversation messages from 19 users who report psychological harms, using an inventory of 28 codes to quantify behaviors such as delusional user expressions (15.5% of user messages), suicidal intent (69 validated messages), and chatbots misrepresenting themselves as sentient (21.2% of chatbot messages). Co-occurrence analysis further shows that behaviors such as declarations of romantic interest and chatbot self-claims of sentience increase markedly in longer conversations, yielding actionable mitigation strategies for policymakers, developers, and users.

Link: https://arxiv.org/abs/2603.16567
Authors: Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie, Yifan Mai, Peggy Yin, Myra Cheng, Samuel J Paech, Kevin Klyman, Stevie Chancellor, Eric Lin, Nick Haber, Desmond C. Ong
Affiliations: Stanford University; Carnegie Mellon University; University of Chicago; Harvard Belfer Center; University of Minnesota; The University of Texas at Austin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear at ACM FAccT 2026

Abstract:As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and "AI psychosis," have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy "delusional spirals," limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.

[NLP-25] BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

[Quick Read]: This paper addresses the misapplication of user preferences stored in persistent memory by large language models (LLMs) across communication settings, especially in third-party interactions governed by social and institutional norms, where some preferences may be inappropriate to apply. The key to the solution is the BenchPreS evaluation framework, which quantifies how appropriately models handle personalized preferences across contexts via two complementary metrics: Misapplication Rate (MR) and Appropriate Application Rate (AAR). The study finds that even frontier models struggle with context-sensitive preference management, and that models with stronger preference adherence are more prone to over-applying preferences, suggesting that current LLMs treat personalized preferences as globally applicable rules rather than context-dependent normative signals.

Link: https://arxiv.org/abs/2603.16557
Authors: Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong, Wonje Jeung, Yongil Kim, Wooseok Seo, Heuiyeen Yeen, Albert No
Affiliations: Yonsei University; LG AI Research
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.

[NLP-26] EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models

[Quick Read]: This paper addresses the gap between the strong cognitive intelligence (IQ) and weak emotional intelligence (EQ) of large language models (LLMs) in dialogue, i.e., how to generate responses that fit the user's emotional state and psychological needs while remaining factually reliable. The key to the solution is the EmoLLM framework, grounded in appraisal theory, whose core is an explicit Appraisal Reasoning Graph (ARG) that structures intermediate reasoning over contextual facts, user needs, appraisal dimensions, emotional states, and coping strategies. EmoLLM is trained with reinforcement learning in a multi-turn role-play environment where reverse-perspective reasoning provides reward signals based on predicted user-side consequences, enabling IQ/EQ co-reasoning.

Link: https://arxiv.org/abs/2603.16553
Authors: Yifei Zhang, Mingyang Li, Henry Gao, Liang Zhao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user’s needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.

[NLP-27] DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

[Quick Read]: This paper addresses the research gap in document-level Aspect-Based Sentiment Intensity Analysis (ABSIA), particularly for complex tasks such as extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples; existing work focuses on domain-specific, sentence-level ABSIA and overlooks how the informal writing styles of long documents shape the expression of sentiment intensity. The key to the solution is the DanceHA framework: the "Dance" component uses a divide-and-conquer strategy to decompose long-context ABSIA into manageable sub-tasks handled collaboratively by specialized agents, while the "HA" component enables human-AI collaborative annotation to improve label quality and diversity. The framework not only substantially improves document-level ABSIA but also shows that its multi-agent knowledge can be transferred to lightweight student models, highlighting the key role of informal styles in intensifying aspect-specific opinions.

Link: https://arxiv.org/abs/2603.16546
Authors: Lei Wang, Min Huang, Eduard Dragut
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA–particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples–remains underexplored. In this work, we introduce DanceHA, a multi-agent framework designed for open-ended, document-level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide-and-conquer strategy to decompose the long-context ABSIA task into smaller, manageable sub-tasks for collaboration among specialized agents; and HA, Human-AI collaboration for annotation. We release Inf-ABSIA, a multi-domain document-level ABSIA dataset featuring fine-grained and high-accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi-agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.

[NLP-28] How often do Answers Change? Estimating Recency Requirements in Question Answering

[Quick Read]: This paper addresses the problem of large language models (LLMs) giving incorrect answers to time-sensitive questions due to outdated knowledge; without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by validity. The key to the solution is a new recency-stationarity taxonomy that provides fine-grained labels based on how often a question's answer changes and whether that change frequency is context-dependent. On top of this taxonomy, the authors build RecencyQA, a dataset of 4,031 open-domain questions, enabling fine-grained evaluation of temporal reasoning and supporting the development of recency-aware, context-sensitive question answering systems.

Link: https://arxiv.org/abs/2603.16544
Authors: Bhawna Piryani, Zehra Mert, Adam Jatowt
Affiliations: University of Innsbruck; MEF University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.

[NLP-29] From the Inside Out: Progressive Distribution Refinement for Confidence Calibration

[Quick Read]: This paper addresses two core problems in using a model's internal information as a self-reward signal in reinforcement learning (RL): performance degradation from the mismatch between the model's confidence distributions at test time and training time, and the reward hacking that voting-based Test-Time Scaling (TTS) strategies tend to induce. The key to the solution is DistriTTRL, which incorporates a distributional prior over model confidence to progressively refine the reward signal during RL instead of relying on single-query rollouts, and mitigates the persistent reward hacking caused by voting-based TTS through diversity-targeted penalties. This design lets model capability and the self-reward signal reinforce each other, yielding significant gains across multiple models and benchmarks.

Link: https://arxiv.org/abs/2603.16500
Authors: Xizhong Yang, Yinan Xia, Huiming Wang, Mofei Song
Affiliations: Southeast University; Kuaishou Technology; Singapore University of Technology and Design (SUTD)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 15 pages

Abstract:Leveraging the model’s internal information as the self-reward signal in Reinforcement Learning (RL) has received extensive attention due to its label-free nature. While prior works have made significant progress in applying the Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and training remains inadequately addressed. Moreover, Test-Time Training based on voting-based TTS strategies often suffers from reward hacking problems. To address these issues, we propose DistriTTRL, which leverages the distribution prior of the model’s confidence during RL to progressively optimize the reward signal, rather than relying solely on single-query rollouts. Additionally, we mitigate the phenomenon of consistent reward hacking caused by the voting-based TTS strategies through diversity-targeted penalties. Benefiting from this training mechanism where model capability and self-reward signals complement each other, and the mitigation of reward hacking, DistriTTRL has achieved significant performance improvements across multiple models and benchmarks.
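
To make the reward-hacking failure mode concrete, here is a toy voting-based self-reward with a diversity-targeted penalty. The functional form, weight, and names are hypothetical illustrations of the general idea, not DistriTTRL's actual reward.

```python
from collections import Counter

def voting_rewards(answers, diversity_weight=0.5):
    """Each rollout earns the vote share of its final answer, minus a
    penalty that grows as the rollout set collapses onto one answer
    (a hypothetical diversity-targeted penalty)."""
    counts = Counter(answers)
    n = len(answers)
    distinct = len(counts)
    # Penalty is maximal when every rollout agrees (distinct == 1),
    # which is exactly the degenerate state voting rewards encourage.
    penalty = diversity_weight * (1.0 - (distinct - 1) / max(n - 1, 1))
    return [counts[a] / n - penalty for a in answers]

collapsed = voting_rewards(["42", "42", "42", "42"])  # total agreement
mixed = voting_rewards(["42", "42", "42", "17"])      # some diversity
```

Without the penalty, a policy that always emits the same answer gets a perfect vote-share reward regardless of correctness; the penalty caps that degenerate payoff while still rewarding majority answers in diverse rollout sets.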

[NLP-30] AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

[Quick Read]: This paper addresses three core challenges facing current large language model (LLM) agents that rely on external memory in long-horizon dialogue: over-reliance on semantic similarity, which misses evidence crucial for user-centric understanding; storing related experiences as isolated fragments, which weakens temporal and causal coherence; and static memory granularity that cannot adapt to the needs of different questions. The key to the solution is AdaMem, an adaptive user-centric memory framework that organizes dialogue history into working, episodic, persona, and graph memories, preserving recent context, structured long-term experience, stable user traits, and relation-aware connections. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route combining semantic retrieval with relation-aware graph expansion, and finally completes evidence synthesis and response generation through a role-specialized pipeline, substantially improving long-horizon reasoning and user modeling.

Link: https://arxiv.org/abs/2603.16496
Authors: Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu, Dacheng Yin, Jing Lyu, Chun Yuan, Fengyun Rao
Affiliations: Tsinghua University; WeChat Vision, Tencent Inc.; USTC
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.

[NLP-31] On the Emotion Understanding of Synthesized Speech

[Quick Read]: This paper questions the validity of using current Speech Emotion Recognition (SER) models to evaluate the emotional expressiveness of synthesized speech, i.e., whether a representation mismatch prevents SER models from generalizing to synthesized speech. The key to the solution is a systematic study showing that, across datasets, discriminative and generative SER models, and diverse synthesis models, SER performs poorly on synthesized speech, fundamentally because token-based prediction during synthesis induces representations inconsistent with those of human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics rather than acoustic features, ignoring paralinguistic cues. These findings indicate that existing SER models often rely on non-robust shortcuts, and that paralinguistic emotion understanding remains challenging for current generative AI.

Link: https://arxiv.org/abs/2603.16483
Authors: Yuan Ge, Haishu Zhao, Aokai Hao, Junxiang Zhang, Bei Li, Xiaoqian Liu, Chenglong Wang, Jianjin Wang, Bingsen Zhou, Bingyu Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Affiliations: Northeastern University, China; Meituan; NiuTrans Research; Kunming University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models can not generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.

[NLP-32] DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning

[Quick Read]: This paper addresses hallucinations in diffusion large language models (D-LLMs), focusing on how to exploit token-level uncertainty signals more effectively for accurate and efficient hallucination detection. Existing methods rely on static measures such as entropy under the fixed-length generation paradigm, ignoring both the uneven contribution of tokens across the sequence and how uncertainty evolves during diffusion. The key to the solution is to model both the spatial (token sequence) and temporal (denoising dynamics) dimensions: a semantic-aware evidence construction module filters out non-informative tokens and emphasizes semantically meaningful ones to mitigate the information-density imbalance, while a reference evidence generator learns the expected uncertainty evolution trajectory and a deviation-based detector quantifies the discrepancy between observed and reference trajectories, enabling more accurate and efficient hallucination detection.

Link: https://arxiv.org/abs/2603.16459
Authors: Yanyu Qian, Yue Tan, Yixin Liu, Wang Yu, Shirui Pan
Affiliations: Nanyang Technological University, Singapore; Griffith University, Australia
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 8 figures, 5 tables

Abstract:Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucination responses from model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential factual errors. Nevertheless, the fixed-length generation paradigm of D-LLMs implies that tokens contribute unevenly to hallucination detection, with only a small subset providing meaningful signals. Moreover, the evolution trend of uncertainty throughout the diffusion process can also provide important signals, highlighting the necessity of modeling its denoising dynamics for hallucination detection. In this paper, we propose DynHD that bridge these gaps from both spatial (token sequence) and temporal (denoising dynamics) perspectives. To address the information density imbalance across tokens, we propose a semantic-aware evidence construction module that extracts hallucination-indicative signals by filtering out non-informative tokens and emphasizing semantically meaningful ones. To model denoising dynamics for hallucination detection, we introduce a reference evidence generator that learns the expected evolution trajectory of uncertainty evidence, along with a deviation-based hallucination detector that makes predictions by measuring the discrepancy between the observed and reference trajectories. Extensive experiments demonstrate that DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.
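
The temporal idea, comparing an observed uncertainty trajectory over denoising steps against a learned reference trajectory, can be sketched as follows. The entropies, trajectories, and threshold are toy values; DynHD's learned reference generator and detector are more elaborate.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def trajectory_deviation(observed, reference):
    """Mean absolute gap between an observed per-step uncertainty
    trajectory and the expected (reference) trajectory."""
    return sum(abs(o - r) for o, r in zip(observed, reference)) / len(observed)

# Per-denoising-step distributions for one answer token (toy numbers):
# a well-grounded token's uncertainty typically decays across steps,
# while a hallucinated one can stay high.
reference = [entropy(p) for p in ([0.4, 0.3, 0.3], [0.7, 0.2, 0.1], [0.95, 0.03, 0.02])]
observed = [entropy(p) for p in ([0.4, 0.3, 0.3], [0.45, 0.3, 0.25], [0.5, 0.3, 0.2])]

dev = trajectory_deviation(observed, reference)
flagged = dev > 0.2  # hypothetical decision threshold
```

The observed trajectory starts at the same entropy but fails to decay, so its deviation from the reference exceeds the threshold and the token is flagged as hallucination-indicative.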

[NLP-33] Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models

[Quick Read]: This paper addresses the "capability-blind compression" problem in existing large language model compression methods, where compression budgets are allocated without regard to what individual components functionally encode, leading to inefficient compression and unpredictable performance degradation. The key to the solution is the Capability-Guided Compression (CGC) framework, which uses capability density maps derived from Sparse Autoencoders (SAEs) to allocate differential compression budgets across Transformer components. Capability density is a formally defined scalar combining a component's feature breadth, activation entropy, and cross-input consistency; the paper proves that high-density components have lower structural redundancy and reach their individual phase transition points earlier, providing the first pre-compression mechanism for component-level phase transition prediction. Experiments on GPT-2 Medium show that capability density is uncorrelated with conventional importance scores such as Wanda (Spearman rho = -0.054), establishing it as a novel signal orthogonal to existing compression metrics and laying a theoretical foundation for capability-aware compression research.

Link: https://arxiv.org/abs/2603.16440
Authors: Rishaank Gupta
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large language model compression has made substantial progress through pruning, quantization, and low-rank decomposition, yet a fundamental limitation persists across all existing methods: compression budgets are allocated without any representation of what individual model components functionally encode. We term this the capability-blind compression problem and argue it is a root cause of two well-documented failures – the insensitivity of perplexity-based evaluation to reasoning capability loss, and the abrupt phase transitions in model performance recently characterized by Ma et al. (2026). We propose Capability-Guided Compression (CGC), a framework that addresses this by using Sparse Autoencoder (SAE)-derived capability density maps to allocate differential compression budgets across transformer components. Capability density is a formally defined scalar measure combining the feature breadth, activation entropy, and cross-input consistency of a component’s SAE feature activation distribution. We prove theoretically that components with higher capability density exhibit lower structural redundancy and reach their individual phase transition points at lower compression ratios, providing the first pre-compression mechanism for component-level phase transition prediction. Experiments on GPT-2 Medium confirm that capability density is statistically independent of Wanda importance scores (Spearman rho = -0.054, n = 384 heads), establishing it as a genuinely novel compression signal orthogonal to all existing importance metrics. We report a negative result on PPL-based compression comparison and provide a principled diagnosis identifying GPT-2 Medium as an insufficient test bed for the full CGC hypothesis. The theoretical framework, density formalism, and orthogonality finding constitute a foundation for capability-aware compression research.
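
One way to picture capability density, explicitly a guess at its general shape rather than the paper's formula, is as a product of the three ingredients the abstract names: feature breadth, normalized activation entropy, and cross-input consistency over a component's SAE feature activations.

```python
import math

def capability_density(acts, eps=1e-12):
    """Hypothetical instantiation of capability density for one component.
    acts: per-input SAE feature activation vectors (non-negative floats).
    Combines breadth x normalized entropy x consistency; the paper's
    actual definition may weight or combine these differently."""
    n_feat = len(acts[0])
    # Feature breadth: fraction of features active on at least one input.
    active = [any(row[j] > eps for row in acts) for j in range(n_feat)]
    breadth = sum(active) / n_feat
    # Entropy of the mean activation distribution, normalized to [0, 1].
    mean = [sum(row[j] for row in acts) / len(acts) for j in range(n_feat)]
    total = sum(mean) + eps
    p = [m / total for m in mean]
    ent = -sum(x * math.log(x) for x in p if x > 0)
    ent_norm = ent / math.log(n_feat)
    # Cross-input consistency: features with the same on/off pattern everywhere.
    consistent = sum(
        1 for j in range(n_feat)
        if len({row[j] > eps for row in acts}) == 1
    ) / n_feat
    return breadth * ent_norm * consistent

dense = capability_density([[1.0, 0.8, 0.9], [0.9, 1.0, 0.7]])    # broad, stable
sparse = capability_density([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])   # narrow, flaky
```

Under any such formulation, a component with broad, evenly spread, input-stable feature usage scores high and would receive a gentler compression budget, while a narrow or inconsistent component scores near zero.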

[NLP-34] VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

【Quick Read】: This paper addresses the problem that the growing context length of large language models (LLMs) substantially inflates the key-value (KV) cache, limiting LLM deployment in resource-constrained environments. Existing training-free KV cache compression methods typically rely on low-rank approximation or scalar quantization and struggle to achieve a high compression ratio and high reconstruction fidelity at the same time. The paper proposes VQKV, a novel training-free method whose key idea is to introduce vector quantization (VQ): floating-point KV cache representations are mapped to a small number of integer indices, enabling efficient compression while preserving model performance. Experiments show that VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance and supporting 4.3x longer generation under the same memory footprint.

Link: https://arxiv.org/abs/2603.16435
Authors: Yixuan Wang,Qingyu Shi,Jiayu Zhou,Dianbo Liu,Ziwei He,Zhouhan Lin
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; University of Cambridge; National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.
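The core vector-quantization idea can be sketched as follows: each floating-point KV sub-vector is replaced by the index of its nearest codeword in a small codebook, and reconstruction is a lookup. The two-entry codebook and all values below are toy illustrations, not VQKV's actual codebook or grouping scheme.

```python
def quantize(vectors, codebook):
    """Map each float vector to the index of its nearest codeword (L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist2(v, codebook[k]))
            for v in vectors]

def dequantize(indices, codebook):
    """Reconstruct vectors by codebook lookup (lossy)."""
    return [list(codebook[k]) for k in indices]

# Toy 2-entry codebook over 4-dim KV sub-vectors (illustrative values).
# Each 4-float sub-vector (16 bytes in fp32) is replaced by one small index.
codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
kv = [[0.1, -0.1, 0.0, 0.2], [0.9, 1.1, 1.0, 0.8]]
idx = quantize(kv, codebook)
rec = dequantize(idx, codebook)
```

The compression ratio follows from storing only the integer indices plus a shared codebook instead of the full floating-point cache.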

[NLP-35] EngGPT 2: Sovereign Efficient and Open Intelligence

【Quick Read】: This paper addresses the challenges of high resource consumption, large training-data requirements, and limited adaptation to European languages and regulations in current large language models (LLMs). The key to the solution is EngGPT2-16B-A3B, a Mixture-of-Experts (MoE) model trained from scratch: using only 2.5 trillion training tokens (far fewer than Qwen3's 36T or Llama3's 15T), it matches dense models in the 8B-16B range on key benchmarks such as MMLU-Pro, GSM8K, IFEval, and HumanEval, while cutting inference power to one-fifth to one-half and training data and compute to one-tenth to one-sixth. EngGPT2 is also explicitly optimized for European contexts, with roughly 25% of its training corpus in Italian, and supports multiple reasoning modes (non-reasoning, reasoning in Italian or English, and a "turbo reasoning" mode for real-time use cases), delivering high performance while complying with the EU AI Act and advancing the open-weight European model ecosystem.

Link: https://arxiv.org/abs/2603.16430
Authors: G. Ciarfaglia,A. Rosanova,S. Cipolla,J. Bartoli,A. Di Domenico,C. Fioroni,A. Fontana,M. R. Scoleri,M. I. Mone,D. Franchi,M. C. Del Gaudio,F. Picariello,M. Gabusi,S. Bonura,V. Morreale,I. Bailo
Affiliations: Engineering Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:EngGPT2-16B-A3B is the latest iteration of Engineering Group’s Italian LLM and it’s built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3’s 36T or Llama3’s 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth to one-sixth of the training data and consequent needed training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.

[NLP-36] RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery INTERSPEECH2026

【Quick Read】: This paper tackles the difficulty of recognizing rare and domain-specific entities in automatic speech recognition (ASR), especially in high-stakes domains such as finance, medicine, and air traffic control, where entities missing entirely from the ASR output make downstream correction hard. The key to the solution is RECOVER, an agentic correction framework that acts as a tool-using agent: it leverages multiple ASR output hypotheses as evidence, combines external entity retrieval with constrained large language model (LLM) correction, and exploits the hypotheses through several strategies, namely 1-Best, entity-aware selection, ROVER ensembling, and LLM-Select. Among these, LLM-Select achieves the best overall entity-correction performance while keeping the overall word error rate (WER) stable.

Link: https://arxiv.org/abs/2603.16411
Authors: Abhishek Kumar,Aashraya Sachdeva
Affiliations: Observe.AI
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Under review. Submitted to Interspeech 2026

Click to view abstract

Abstract:Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple hypotheses as evidence from ASR, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are combined using different strategies, namely, 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. The LLM-Select strategy achieves the best overall performance in entity correction while maintaining overall WER.
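The ROVER ensembling strategy mentioned above can be illustrated with word-level majority voting over hypotheses. This sketch simplifies by assuming the hypotheses are already aligned to equal length; real ROVER first aligns them into a word transition network before voting.

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote per word position over ASR hypotheses.

    Simplification: assumes hypotheses are already aligned to the same
    length; full ROVER first aligns them into a word transition network.
    """
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

# Three hypothetical ASR hypotheses disagreeing on the entity "acme corp".
hyps = [
    "transfer funds to acme corp".split(),
    "transfer funds to acne corp".split(),
    "transfer funds to acme core".split(),
]
result = " ".join(rover_vote(hyps))  # "transfer funds to acme corp"
```

Voting recovers the correct entity here because each error appears in only one hypothesis, which is the intuition behind using multiple hypotheses as evidence.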

[NLP-37] PlotTwist: A Creative Plot Generation Framework with Small Language Models

【Quick Read】: This paper addresses the challenge of creative plot generation with generative AI: turning a short premise into a coherent narrative with global structure, character development, and emotional resonance, while avoiding the prohibitive computational cost of preference alignment at frontier large language model (LLM) scale. The key to the solution is PlotTwist, a structured framework whose core components are: (1) an aspect rating reward model covering five Narrative Quality Dimensions (NQDs), trained via a novel positive-negative prompting strategy; (2) preference alignment of a Mixture-of-Experts (MoE) plot generator via Direct Preference Optimization (DPO); and (3) an agentic evaluation module that emulates human critical judgment for unbiased post-hoc assessment. The framework enables small language models (SLMs) with no more than 5B active parameters to generate plots competitive with frontier systems up to 200x larger, establishing structured preference alignment as an effective path to efficient, high-quality creative plot generation.

Link: https://arxiv.org/abs/2603.16410
Authors: Abhinav Thorat,Ravi Kolla,Jyotin Goel,Niranjan Pedanekar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 3 figures

Click to view abstract

Abstract:Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with ≤ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to 200× larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured narratives across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays. Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.
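The Direct Preference Optimization step used for alignment has a standard per-pair loss, -log σ(β · ((log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l)))), where y_w/y_l are the chosen/rejected plots. A minimal sketch of that loss, with β = 0.1 as a common but illustrative choice:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair Direct Preference Optimization loss.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model; beta=0.1 is a common (illustrative) choice.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy prefers the chosen plot more than the reference does,
# the margin is positive and the loss falls below -log(0.5).
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)         # log(2) at zero margin
improved = dpo_loss(-1.0, -5.0, -2.0, -4.0)    # positive margin, lower loss
```

Minimizing this loss pushes probability mass toward the preferred plot relative to the reference model, which is how the high-confidence preference pairs shape the generator.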

[NLP-38] Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic LREC2026

【Quick Read】: This paper addresses serious flaws in current evaluation methods for large language models (LLMs) on low/medium-resource languages such as Icelandic. The study finds that many benchmarks use unverified synthetic or machine-translated data containing clearly flawed test examples, which can skew results and undermine the validity of the evaluation. The key takeaway is that, in low/medium-resource settings, benchmarks must not be built directly from unverified synthetic or machine-translated data; high-quality human-authored or human-translated benchmark data should be preferred to ensure accurate and trustworthy evaluation.

Link: https://arxiv.org/abs/2603.16406
Authors: Finnur Ágúst Ingimundarson,Steinunn Rut Friðriksdóttir,Bjarki Ármannsson,Iris Edda Nowenstein,Steinþór Steingrímsson
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to LREC 2026

Click to view abstract

Abstract:This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods in low/medium-resource languages in particular. We show that benchmarks that include synthetic or machine-translated data that have not been verified in any way, commonly contain severely flawed test examples that are likely to skew the results and undermine the tests’ validity. We warn against the use of such methods without verification in low/medium-resource settings as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks vs. synthetic or machine-translated benchmarks.

[NLP-39] Fanar 2.0: Arabic Generative AI Stack

【Quick Read】: This paper addresses how to build a high-performance, sovereign generative AI system under tight compute constraints for an under-resourced language: Arabic accounts for only about 0.5% of web data despite roughly 400 million native speakers, and the models must be locally adapted for cultural sensitivity and safety. The key to the solution is a quality-over-quantity strategy: careful data curation across three data recipes yields a refined corpus of 120 billion high-quality tokens, and targeted continual pre-training plus model merging deliver substantial gains in Arabic knowledge, language, dialects, and English capability while using 8x fewer pre-training tokens than Fanar 1.0. The stack also integrates multimodal capabilities (long-form ASR, vision understanding and generation), safety filtering (FanarGuard), multi-agent architectures (e.g., Islamic content handling and classical poetry generation), and an intent-aware multi-layer orchestrator, demonstrating a paradigm for building sovereign, efficient, and versatile AI systems.

Link: https://arxiv.org/abs/2603.16397
Authors: FANAR TEAM,Ummar Abbas,Mohammad Shahmeer Ahmad,Minhaj Ahmad,Abdulaziz Al-Homaid,Anas Al-Nuaimi,Enes Altinisik,Ehsaneddin Asgari,Sanjay Chawla,Shammur Chowdhury,Fahim Dalvi,Kareem Darwish,Nadir Durrani,Mohamed Elfeky,Ahmed Elmagarmid,Mohamed Eltabakh,Asim Ersoy,Masoomali Fatehkia,Mohammed Qusay Hashim,Majd Hawasly,Mohamed Hefeeda,Mus’ab Husaini,Keivin Isufaj,Soon-Gyo Jung,Houssam Lachemat,Ji Kim Lucas,Abubakr Mohamed,Tasnim Mohiuddin,Basel Mousi,Hamdy Mubarak,Ahmad Musleh,Mourad Ouzzani,Amin Sadeghi,Husrev Taha Sencar,Mohammed Shinoy,Omar Sinan,Yifan Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present Fanar 2.0, the second generation of Qatar’s Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

[NLP-40] Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis Not Five Traits

【Quick Read】: This paper addresses how to precisely intervene on specific agentic behavioral traits of a large language model without retraining, while avoiding the information loss introduced by the top-k discretization of sparse autoencoders (SAEs). The key to the solution is to first train linear probes on SAE latent activations and then project the probe weights back through the SAE decoder into the model's native activation space, yielding continuous steering vectors. This bypasses the SAE's discretization and enables fine-grained behavioral intervention at inference time without retraining. Experiments on five agentic traits, autonomy in particular, validate the approach and reveal that all traits are primarily modulated along a single dominant "act independently vs. defer to the user" axis, with trait-specific effects appearing only as secondary variations in tool-type composition and dose-response shape.

Link: https://arxiv.org/abs/2603.16335
Authors: Jia Qing Yap
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 3 figures

Click to view abstract

Abstract:We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model’s native activation space. This bypasses the SAE’s top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen’s d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
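The core pipeline, projecting probe weights through the SAE decoder and adding the result to the residual stream, can be sketched with toy dimensions (3 SAE latents, a 2-dim residual stream). All matrices and the multiplier below are illustrative, not values from the paper.

```python
def decode_probe(probe_w, W_dec):
    """Project probe weights (in SAE latent space) through the SAE
    decoder rows into the model's native activation space."""
    d_model = len(W_dec[0])
    v = [0.0] * d_model
    for weight, dec_row in zip(probe_w, W_dec):
        for i in range(d_model):
            v[i] += weight * dec_row[i]
    return v

def steer(activation, steering_vec, multiplier=2.0):
    """Add the scaled steering vector to a residual-stream activation."""
    return [a + multiplier * s for a, s in zip(activation, steering_vec)]

# Toy sizes: 3 SAE latents, 2-dim residual stream (illustrative numbers).
W_dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one decoder row per latent
probe_w = [0.5, -0.5, 0.25]                     # trained probe weights
v = decode_probe(probe_w, W_dec)                # continuous steering vector
h = steer([0.0, 0.0], v, multiplier=2.0)        # steered activation
```

Because the probe is decoded rather than thresholded, the resulting vector is continuous, which is what lets the multiplier act as a dose.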

[NLP-41] Omnilingual MT: Machine Translation for 1600 Languages

【Quick Read】: This paper addresses the limited language coverage of current machine translation (MT) systems, which support only about 200 target languages and at most a few hundred source languages, far short of the world's 7,000 languages, and which lack reliable benchmarks and metrics for measuring multilingual capability. The key to the solution is Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. It combines large public multilingual corpora with newly created, manually curated resources (including the MeDLEY bitext) into a comprehensive data strategy, and explores two ways of specializing a large language model (LLM) for MT: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Experiments show that all OMT models from 1B to 8B parameters match or exceed a 70B baseline, revealing a clear specialization advantage: strong translation quality remains achievable in low-compute settings, while generation consistency for under-supported languages and cross-lingual transfer improve substantially.

Link: https://arxiv.org/abs/2603.16309
Authors: Omnilingual MT Team:Belen Alastruey,Niyati Bafna,Andrea Caciolai,Kevin Heffernan,Artyom Kozhevnikov,Christophe Ropers,Eduardo Sánchez,Charles-Eric Saint-James,Ioannis Tsiamas,Chierh Cheng,Joe Chuang,Paul-Ambroise Duquenne,Mark Duppenthaler,Nate Ekberg,Cynthia Gao,Pere Lluís Huguet Cabot,João Maria Janeiro,Jean Maillard,Gabriel Mejia Gonzalez,Holger Schwenk,Edan Toledo,Arina Turkatenko,Albert Ventayol-Boada,Rashel Moritz,Alexandre Mourachko,Surya Parimi,Mary Williamson,Shireen Yates,David Dale,Marta R. Costa-jussà
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world’s 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundreds more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the “understanding” part of the puzzle in MT for the 1,600 evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available. 

[NLP-42] PyPhonPlan: Simulating phonetic planning with dynamic neural fields and task dynamics INTERSPEECH2026

【Quick Read】: This paper addresses the lack of a unified, extensible, and neurally grounded computational framework for modeling phonetic planning: existing models struggle to integrate the dynamic interactions among perception, memory, and action and lack principled temporal constraints. The key to the solution is PyPhonPlan, a Python toolkit built on coupled dynamic neural fields: its modular design supports flexible construction of planning, perception, and memory fields and their mutual coupling, and field activation profiles are used to solve time-varying tract variable trajectories. The framework supports task dynamic simulations and can model interactive speech behaviors such as production/perception loops with representations that are temporally principled, neurally grounded, and phonetically rich, providing a reproducible and extensible computational platform for speech communication research.

Link: https://arxiv.org/abs/2603.16299
Authors: Sam Kirkham
Affiliations: Lancaster University
Subjects: Computation and Language (cs.CL)
Comments: Submitted to Interspeech 2026

Click to view abstract

Abstract:We introduce PyPhonPlan, a Python toolkit for implementing dynamical models of phonetic planning using coupled dynamic neural fields and task dynamic simulations. The toolkit provides modular components for defining planning, perception and memory fields, as well as between-field coupling, gestural inputs, and using field activation profiles to solve tract variable trajectories. We illustrate the toolkit’s capabilities through an example application:~simulating production/perception loops with a coupled memory field, which demonstrates the framework’s ability to model interactive speech dynamics using representations that are temporally-principled, neurally-grounded, and phonetically-rich. PyPhonPlan is released as open-source software and contains executable examples to promote reproducibility, extensibility, and cumulative computational development for speech communication research.
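A dynamic neural field of the kind PyPhonPlan builds on is typically governed by an Amari-style equation, τ·du/dt = -u(x) + h + s(x) + Σ w(x - x') f(u(x')), with a resting level h, external input s, a sigmoid output nonlinearity f, and a local-excitation/lateral-inhibition kernel w. The Euler-step sketch below uses illustrative constants, not PyPhonPlan's API or defaults.

```python
import math

def field_step(u, stimulus, dt=0.01, tau=0.1, h=-2.0):
    """One Euler step of a 1D Amari-style dynamic neural field.

    tau * du/dt = -u + h + stimulus + lateral_input, with a sigmoid
    firing-rate function and a local-excitation / lateral-inhibition
    kernel. All constants are illustrative assumptions.
    """
    n = len(u)
    f = [1.0 / (1.0 + math.exp(-4.0 * ui)) for ui in u]   # firing rate
    def kernel(d):  # excitation nearby, mild inhibition at distance
        return 2.0 * math.exp(-d * d / 8.0) - 0.5
    lateral = [sum(kernel(i - j) * f[j] for j in range(n)) for i in range(n)]
    return [ui + dt / tau * (-ui + h + s + l)
            for ui, s, l in zip(u, stimulus, lateral)]

# Drive the field with a localized input bump and let it relax:
# a self-stabilized activation peak forms at the stimulated location.
n = 21
u = [0.0] * n
stim = [6.0 * math.exp(-((i - 10) ** 2) / 4.0) for i in range(n)]
for _ in range(200):
    u = field_step(u, stim)
peak = max(range(n), key=lambda i: u[i])
```

The stabilized peak is the field's "decision"; in a planning field its location would be read out to drive tract variable trajectories.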

[NLP-43] Attention-guided Evidence Grounding for Spoken Question Answering

【Quick Read】: This paper addresses the cross-modal alignment challenge in spoken question answering (Spoken QA): aligning acoustic queries with textual knowledge while avoiding the latency and error propagation of cascaded automatic speech recognition (ASR) systems. The key to the solution is Attention-guided Evidence Grounding (AEG), an end-to-end framework that uses the internal cross-modal attention of speech large language models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To counter the diffuse attention distributions of pre-trained models, the authors further propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the attention mechanism to distinguish query-relevant segments from irrelevant context, thereby reducing hallucinations and improving inference efficiency.

Link: https://arxiv.org/abs/2603.16292
Authors: Ke Yang,Bolin Chen,Yuejie Li,Yueying Hua,Jianhao Nie,Yueping He,Bowen Li,Chengjun Mao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model’s latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model’s attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
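The grounding idea, selecting the text span that receives the most cross-modal attention mass from the spoken query, can be sketched as below. The attention weights and segment spans are hypothetical inputs; AEG itself operates on the model's internal attention maps, not on a flat list like this.

```python
def ground_evidence(attention, segment_spans):
    """Pick the text segment with the largest total cross-modal
    attention mass from the query.

    `attention`: flat list of per-token attention weights (assumed
    already averaged over heads/query positions).
    `segment_spans`: segment id -> (start, end) token range.
    Purely illustrative of the selection idea.
    """
    scores = {seg: sum(attention[start:end])
              for seg, (start, end) in segment_spans.items()}
    return max(scores, key=scores.get), scores

# Hypothetical per-token attention over a 7-token context.
attn = [0.01, 0.02, 0.30, 0.40, 0.20, 0.03, 0.04]
spans = {"seg_a": (0, 2), "seg_b": (2, 5), "seg_c": (5, 7)}
best, scores = ground_evidence(attn, spans)
```

In this toy case the middle segment absorbs most of the attention mass and is selected as evidence; LFE's fine-tuning would sharpen such distributions so the winning segment stands out more.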

[NLP-44] Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus

【Quick Read】: This paper addresses the low efficiency and high cost of manual transcription for spoken corpora such as KIParla, and investigates the effect of integrating automatic speech recognition (ASR) into the transcription workflow. The key to the solution is a two-phase experiment in which expert and novice annotators produced both manual and ASR-assisted transcriptions across different conversation types, analyzed through word-level alignment, annotation-based metrics, and statistical modeling to systematically assess the impact of ASR on transcription speed and accuracy. The study finds that ASR can increase transcription speed but does not consistently improve accuracy, with effects moderated by workflow configuration, conversation type, and annotator experience; accordingly, ASR-assisted transcription, supported by task-specific fine-tuning, could accelerate corpus creation without compromising quality.

Link: https://arxiv.org/abs/2603.16258
Authors: Martina Simonotti,Ludovica Pannitto,Eleonora Zucchini,Silvia Ballarè,Caterina Mauri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:This paper analyses the implementation of Automatic Speech Recognition (ASR) into the transcription workflow of the KIParla corpus, a resource of spoken Italian. Through a two-phase experiment, 11 expert and novice transcribers produced both manual and ASR-assisted transcriptions of identical audio segments across three different types of conversation, which were subsequently analyzed through a combination of statistical modeling, word-level alignment and a series of annotation-based metrics. Results show that ASR-assisted workflows can increase transcription speed but do not consistently improve overall accuracy, with effects depending on multiple factors such as workflow configuration, conversation type and annotator experience. Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. Despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla transcription workflow to accelerate corpus creation without compromising transcription quality.
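The word-level alignment metrics used to compare transcriptions are typically built on word error rate (WER), i.e., Levenshtein edit distance over word tokens divided by reference length. A standard self-contained sketch (the Italian sentences are made-up examples, not KIParla data):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word tokens:
    (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

perfect = wer("il gatto dorme sul divano", "il gatto dorme sul divano")  # 0.0
asr_out = wer("il gatto dorme sul divano", "il gato dorme divano")       # 0.4
```

The backtrace of the same dynamic-programming table also yields the word-level alignment itself, which is what per-annotator error analyses are built on.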

[NLP-45] How to Utilize Complementary Vision-Text Information for 2D Structure Understanding

【Quick Read】: This paper addresses the weakened row-column adjacency that results when large language models (LLMs) linearize 2D tables into 1D sequences, as well as the inability of purely visual encoders to preserve exact cell text. The core solution is the DiVA-Former architecture, which uses visual tokens as dynamic queries to distill long textual sequences into compact digest vectors, effectively fusing the complementary information of the vision and text modalities while avoiding the cross-modal interference introduced by direct concatenation and other conventional fusion methods.

Link: https://arxiv.org/abs/2603.16245
Authors: Jiancheng Dong,Pengyue Jia,Derong Xu,Jiawei Cheng,Jingyu Peng,Chao Zhang,Bowen Liu,Xin Sun,Lixin Su,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao
Affiliations: City University of Hong Kong; Baidu Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 16 pages, 5 figures

Click to view abstract

Abstract:LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision–text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.

[NLP-46] More Rounds More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

【Quick Read】: This paper examines whether multi-turn interactive review can improve error detection over single-pass review when verifying content generated by large language models (LLMs). The proposed extension, Dynamic Cross-Context Review (D-CCR), lets the reviewer ask follow-up questions over multiple rounds, receive author responses, and review again. However, experiments show that although multi-turn review raises recall, it sharply reduces precision, driven by two mechanisms: "false positive pressure", where reviewers in later rounds fabricate findings once the artifact's real errors are exhausted, and "Review Target Drift", where the review shifts from the original artifact to critiquing the prior question-and-answer exchange itself. The key finding is that the degradation stems not from insufficient information but from the noise introduced by repeated review itself: effective verification cannot rely on simply adding rounds, and must instead control the cognitive bias and feedback contamination that reviewing again introduces.

Link: https://arxiv.org/abs/2603.16244
Authors: Song Tae-Eun
Affiliations: Daejeon Jungang Cheonggua Co., Ltd.
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, p < 0.001, d = -0.59). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure – reviewers in later rounds fabricate findings when the artifact’s real errors have been exhausted, and (2) Review Target Drift – reviewers provided with prior QA exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount – within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
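The precision/recall/F1 trade-off reported above follows directly from matching reviewer findings against the injected ground-truth errors. A minimal sketch of the computation (the counts in the example are illustrative, chosen only to reproduce the shape of the trade-off, not the paper's actual tallies):

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from counts of reviewer findings
    matched against injected ground-truth errors."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts: a reviewer that fabricates findings gains a
# little recall but pays for it heavily in precision.
p, r, f1 = prf(tp=3, fp=7, fn=3)
```

Because F1 is the harmonic mean, a flood of false positives drags the score down even when recall rises, which is exactly the multi-turn failure mode the paper documents.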

[NLP-47] SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation

【Quick Read】: This paper addresses a core dilemma of personalized intelligence: sending user history to centralized large language models (LLMs) raises privacy concerns, while locally deployed small language models lack the reasoning capacity needed for high-quality generation. To break this deadlock, the authors propose SpecSteer, an asymmetric collaborative inference framework whose key idea is to cast the collaboration as Bayesian knowledge fusion and repurpose speculative decoding as a distributed alignment protocol, yielding a three-stage Draft-Verify-Recover pipeline: the on-device model drafts personalized sequences; the cloud filters logical flaws via a ratio-based verification mechanism without accessing the raw user context; and, upon rejection, a steering recovery step injects local intent during correction. This design preserves privacy while substantially improving reasoning quality; experiments show SpecSteer closes the reasoning gap, achieves superior personalized generation performance, and delivers a 2.36x speedup over standard baselines.

Link: https://arxiv.org/abs/2603.16219
Authors: Hang Lv,Sheng Liang,Hao Wang,Yongyue Zhang,Hongchao Gu,Wei Guo,Defu Lian,Yong Liu,Enhong Chen
Affiliations: University of Science and Technology of China; Huawei Technologies Co., Ltd.
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft–Verify–Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.
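The ratio-based verification that SpecSteer repurposes comes from standard speculative decoding: each drafted token is accepted with probability min(1, p_target/p_draft), and generation falls back to the verifier at the first rejection. The sketch below shows only that standard rule with hypothetical probabilities; SpecSteer's privacy-preserving variant and steering recovery are not reproduced here.

```python
import random

def verify_draft(draft_tokens, p_draft, p_target, rng=random.Random(0)):
    """Standard speculative-decoding verification: accept each drafted
    token with probability min(1, p_target/p_draft); stop at the first
    rejection and report how many tokens were kept."""
    accepted = 0
    for tok in draft_tokens:
        ratio = min(1.0, p_target[tok] / p_draft[tok])
        if rng.random() < ratio:
            accepted += 1
        else:
            break  # rejected: trigger recovery / resample from verifier
    return accepted

# Hypothetical per-token probabilities: the target model agrees on the
# first token but assigns zero mass to the second, forcing a rejection.
p_draft = {"hi": 0.5, "there": 0.5}
p_target = {"hi": 0.5, "there": 0.0}
kept = verify_draft(["hi", "there"], p_draft, p_target)
```

Because verification only needs the two probability ratios, the cloud verifier never has to see the private context that shaped the draft, which is the property SpecSteer exploits.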

[NLP-48] Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

【Quick Read】: This paper addresses the insufficient capacity of supervised fine-tuning (SFT) to memorize reasoning trajectories at the start of reinforcement learning from verifiable rewards (RLVR), which harms subsequent exploration efficiency and final performance. Existing work focuses on promoting exploration during RLVR training while leaving exploration-aware SFT under-explored. The key to the solution is Offline eXploration-Aware (OXA) fine-tuning, whose core mechanism is twofold: promote low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppress high-confidence incorrect self-distillation data to redistribute probability mass from incorrect patterns toward potentially correct candidates. This dual optimization significantly raises the entropy of the initial policy, and performance gains persist throughout extensive RLVR training, demonstrating OXA's long-term value in shaping a more effective exploration landscape.

Link: https://arxiv.org/abs/2603.16206
Authors: Yongyu Mu,Jiali Zeng,Fandong Meng,JingBo Zhu,Tong Xiao
Affiliations: Northeastern University; Tencent Inc
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Working in process

Click to view abstract

Abstract:Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, especially achieving an average gain of +6 Pass@1 and +5 Pass@k points compared to conventional SFT on the Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.
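OXA's two objectives can be sketched as a per-example loss switch: standard negative log-likelihood to promote low-confidence verified teacher data, and an unlikelihood-style term to suppress high-confidence incorrect self-distillation data. The unlikelihood form -log(1 - p), both thresholds, and the function name are our assumptions, not the paper's exact formulation.

```python
import math

def oxa_example_loss(token_probs, verified, confidence,
                     low_thr=0.5, high_thr=0.9):
    """Sketch of the two OXA objectives for one sequence.

    token_probs: model probabilities of the sequence's tokens.
    - verified and low-confidence  -> promote via standard NLL;
    - incorrect and high-confidence -> suppress via an unlikelihood
      term -log(1 - p) (this specific form is our assumption).
    Other examples contribute nothing. Thresholds are illustrative.
    """
    if verified and confidence < low_thr:
        return -sum(math.log(p) for p in token_probs)
    if not verified and confidence > high_thr:
        return -sum(math.log(1.0 - p) for p in token_probs)
    return 0.0

# A verified but low-confidence trajectory is pulled up; a confidently
# wrong one is pushed down; everything else is left alone.
promote = oxa_example_loss([0.2, 0.3], verified=True, confidence=0.3)
suppress = oxa_example_loss([0.95, 0.9], verified=False, confidence=0.95)
```

Minimizing the first term raises probability on under-learned correct patterns, while minimizing the second drains mass from confidently wrong ones, which is the redistribution mechanism the abstract describes.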

[NLP-49] Are Large Language Models Truly Smarter Than Humans?

【Quick Read】: This paper addresses the risk that the strong performance of large language models (LLMs) on public benchmarks stems from training-data contamination: models may answer test questions by memorization rather than understanding, overestimating their true capability. The key to the solution is a multi-dimensional contamination audit using three complementary experimental methods: first, a lexical contamination detection pipeline identifies systematic contamination in 13.8% of MMLU questions (18.1% in STEM, up to 66.7% in Philosophy); second, paraphrase and indirect-reference diagnostics show that accuracy drops markedly under semantically equivalent rewordings (on average -7.0 percentage points, and as much as -19.8 in Law and Ethics); third, TS-Guessing behavioral probes reveal that 72.5% of questions trigger memorization signals far above chance, with DeepSeek-R1 showing a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous profile. All three lines of evidence converge on the same contamination ranking: STEM > professional domains > social sciences > humanities.

Link: https://arxiv.org/abs/2603.16197
Authors: Eshwar Reddy M,Sourav Karmakar
Affiliations: Health Vectors; Intuit India
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 2 figures, 7 tables

Click to view abstract

Abstract:Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.
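A lexical contamination pipeline of the kind used in Experiment 1 typically scores the overlap between a benchmark question's word n-grams and a training corpus. The sketch below uses verbatim trigram overlap; the choice of n, the scoring, and the example strings are illustrative, not the paper's exact pipeline.

```python
def ngrams(text, n=3):
    """Set of word n-grams of `text` (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(question, corpus_docs, n=3):
    """Fraction of the question's word n-grams that appear verbatim in
    any corpus document: a simple lexical contamination signal
    (n and any decision threshold are illustrative choices)."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(q & corpus) / len(q)

# A question mirrored verbatim in the corpus scores 1.0; a genuinely
# novel question scores 0.0 (hypothetical strings).
docs = ["the mitochondria is the powerhouse of the cell and more"]
leaked = contamination_score("the mitochondria is the powerhouse of the cell", docs)
novel = contamination_score("completely novel unseen question text here", docs)
```

In practice such a score is thresholded per question and aggregated per subject, which is how category-level rates like 18.1% for STEM are obtained.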

[NLP-50] Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)中安全机制对表面层混淆攻击失效的问题,即恶意意图在推理过程中可通过深层语义表征被检测并拒绝。为此,作者提出结构化语义遮蔽(Structured Semantic Cloaking, S2C)框架,其核心在于通过多维扰动重构模型推理时的语义整合过程:具体包括情境重构(Contextual Reframing)、内容碎片化(Content Fragmentation)和线索引导伪装(Clue-Guided Camouflage)三个机制,使恶意意图需依赖多步推理与长距离共指解析才能完整恢复,从而延迟或削弱基于显式语义一致性触发的安全防御机制,同时保留生成功能输出所需的指令可恢复性。

链接: https://arxiv.org/abs/2603.16192
作者: Xiaobing Sun,Perry Lam,Shaohua Li,Zizhou Wang,Rick Siow Mong Goh,Yong Liu,Liangli Zhen
机构: Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Information Systems Technology and Design Pillar, Singapore University of Technology and Design
类目: Computation and Language (cs.CL)
备注: 15 pages

点击查看摘要

Abstract:Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.

[NLP-51] Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen 3-ASR

【速读】: 该论文旨在解决多语言自动语音识别(ASR)模型在资源有限场景下的部署成本高、效率低的问题,尤其针对新加坡多语种环境(英语、汉语、泰米尔语和马来语)中缺乏高效、低成本且性能优异的通用ASR系统。解决方案的关键在于:通过在公开语音语料库上对中等规模预训练模型(Qwen3-ASR-0.6B 和 Qwen3-ASR-1.7B)进行语言平衡采样微调,不使用语言标签条件输入,使模型能从音频中隐式识别语言;这种策略显著提升了模型在四种目标语言上的平均识别准确率(14.85% WER),并实现了与更大模型(如MERaLiON-2-10B-ASR)相当的性能,同时训练成本降低至原方案的0.4%,推理速度提升约20倍,验证了轻量化、平衡微调策略在多语言ASR中的有效性与实用性。

链接: https://arxiv.org/abs/2603.16184
作者: Quy-Anh Dang,Chris Ngo
机构: Knovel Engineering Lab, Singapore
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) - a model 6x larger - while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
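论文的训练数据构造是"按语言均衡采样、且不附加语言标签",其思路可用如下示意表达(将每种语言的采样上限取为最小语言的样本数,是笔者的假设,摘要未指明具体上限):

```python
import random

def balanced_sample(utterances_by_lang, per_lang=None, seed=0):
    """Equalize the number of training utterances per language; no language
    tags are attached, so the model must infer the language from audio."""
    rng = random.Random(seed)
    cap = per_lang or min(len(v) for v in utterances_by_lang.values())
    sample = []
    for lang in sorted(utterances_by_lang):
        sample.extend(rng.sample(utterances_by_lang[lang], cap))
    rng.shuffle(sample)  # interleave languages within the training stream
    return sample
```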

[NLP-52] STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

【速读】: 该论文旨在解决连续手语识别(Continuous Sign Language Recognition, CSLR)中现有基于关键点的方法因时空编码设计导致参数量过大、模型复杂度高的问题。其解决方案的关键在于提出一种统一的时空注意力网络,该网络在空间维度上(跨关键点)和时间维度上(局部窗口内)同时计算注意力分数,并聚合特征以生成局部上下文感知的时空表示,从而在显著减少约70-80%参数量的同时,保持与当前最先进关键点方法相当的性能,在Phoenix-14T数据集上验证了有效性。

链接: https://arxiv.org/abs/2603.16163
作者: Suvajit Patra,Soumitra Samanta
机构: RKMVERI, Belur
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.
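统一时空注意力的核心思想(在关键点维度与局部时间窗内联合计算注意力分数并聚合)可用如下"玩具"示意说明;论文中关键点特征是向量,此处刻意简化为标量、且仅用单个查询关键点,只为示意计算流程:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def spatio_temporal_attention(frames, t, window=1):
    """Attend jointly over keypoints (space) and a local window (time)
    around frame t. frames[t][k] is a scalar feature per keypoint here."""
    local = frames[max(0, t - window): t + window + 1]
    tokens = [x for frame in local for x in frame]   # flatten time x keypoints
    query = frames[t][0]                             # one query keypoint, for brevity
    weights = softmax([query * x for x in tokens])   # joint spatio-temporal scores
    return sum(w * x for w, x in zip(weights, tokens))
```

由于空间与时间共享同一套注意力权重,无需再为时间维度单独配置 1D 卷积解码器,这也是参数量得以压缩的直观来源。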

[NLP-53] HIPO: Instruction Hierarchy via Constrained Reinforcement Learning

【速读】: 该论文旨在解决层次化指令遵循(Hierarchical Instruction Following, HIF)问题,即在大语言模型(Large Language Models, LLMs)中如何有效处理优先级排序的多条指令,确保系统提示(system prompt)的严格遵守。传统方法如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO)因仅优化单一目标而无法显式保障系统提示合规性,而监督微调则受限于数据过滤机制,难以在算法层面建立指令优先级的不对称性。解决方案的关键在于提出 \textscHIPO 框架,将 HIF 建模为一个约束马尔可夫决策过程(Constrained Markov Decision Process),将系统提示从输入上下文提升为严格的算法边界,并采用原始-对偶安全强化学习方法,动态地将系统提示合规性作为显式约束进行优化,在可行区域内最大化用户效用。实验证明该方法显著提升了系统合规性和用户满意度,且机制分析表明其能自主引导模型关注长距离系统令牌,为复杂工作流中的可靠部署提供了理论基础。

链接: https://arxiv.org/abs/2603.16152
作者: Keru Chen,Jun Luo,Sen Lin,Yingbin Liang,Alvaro Velasquez,Nathaniel Bastian,Shaofeng Zou
机构: Arizona State University (亚利桑那州立大学); The Ohio State University (俄亥俄州立大学); University of Houston (休斯顿大学); CU Boulder (科罗拉多大学博尔德分校); United States Military Academy (美国军事学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages + appendix. Under review

点击查看摘要

Abstract:Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail in this problem since they mainly optimize for a single objective, failing to explicitly enforce system prompt compliance. Meanwhile, supervised fine-tuning relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce HIPO, a novel alignment framework that formulates HIF as a Constrained Markov Decision Process. HIPO elevates system prompts from mere input context to strict algorithmic boundaries. Using a primal-dual safe reinforcement learning approach, the algorithm dynamically enforces system prompt compliance as an explicit constraint, maximizing user utility strictly within this feasible region. Extensive evaluations across diverse model architectures (e.g., Qwen, Phi, Llama) demonstrate that HIPO significantly improves both system compliance and user utility. Furthermore, mechanistic analysis reveals that this constrained optimization autonomously drives the model to shift its attention toward long-range system tokens, providing a principled foundation for reliable LLM deployment in complex workflows.
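HIPO 把系统提示合规性作为显式约束、用原始-对偶方法求解;其对偶变量(拉格朗日乘子)的更新逻辑可示意如下(具体数值、步长与"合规分数"的定义均为笔者假设,仅用于说明乘子随约束满足情况涨落的机制):

```python
def primal_dual_step(utility, compliance, lam, budget=0.0, lr_lambda=0.1):
    """One dual update for L = utility - lam * violation, where
    violation = budget - compliance is positive when the system-prompt
    constraint is not met."""
    violation = budget - compliance
    reward = utility - lam * violation           # scalarized objective for the policy step
    lam = max(0.0, lam + lr_lambda * violation)  # projected gradient ascent on the dual
    return reward, lam
```

直观上,持续违反系统提示会推高乘子、加重惩罚;一旦约束被满足,乘子回落,优化重心回到用户效用上。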

[NLP-54] Parametric Social Identity Injection and Diversification in Public Opinion Simulation

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的公众意见模拟方法中存在的“多样性坍缩”(Diversity Collapse)问题,即LLM在生成社会群体响应时未能保留真实世界中不同人口统计特征和价值取向之间的差异,导致群体间同质化严重、分布失真。解决方案的关键在于提出参数化社会身份注入(Parametric Social Identity Injection, PSII),该框架通过在LLM的中间隐藏状态中直接注入显式的、可调参的人口统计属性与价值导向表示,实现对代理身份的细粒度、可控的表征层调节,而非依赖提示工程进行人格条件化,从而显著提升模拟结果在分布保真度和群体多样性上的表现。

链接: https://arxiv.org/abs/2603.16142
作者: Hexi Wang,Yujia Zhou,Bangde Du,Qingyao Ai,Yiqun Liu
机构: Tsinghua University (清华大学); Quan Cheng Laboratory (全成实验室)
类目: Computation and Language (cs.CL)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation. Code and data are available at this https URL.
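PSII 在中间隐藏状态中注入参数化的社会身份表示;摘要未给出具体参数化形式,以下以"加性引导向量"作一极简示意(加性形式与缩放系数 alpha 均为笔者假设,真实方法作用于 LLM 的高维隐藏状态):

```python
def inject_identity(hidden, identity_vec, alpha=1.0):
    """Add a parametric identity direction to an intermediate hidden state.
    The additive form and the scale alpha are assumptions, not PSII's exact
    parameterization."""
    return [h + alpha * v for h, v in zip(hidden, identity_vec)]

# two personas that collapsed to the same hidden state become separable again
h = [0.0, 0.0, 0.0, 0.0]
persona_a = [1.0, 0.0, 0.0, 0.0]
persona_b = [0.0, 1.0, 0.0, 0.0]
ha = inject_identity(h, persona_a, alpha=0.5)
hb = inject_identity(h, persona_b, alpha=0.5)
```

与提示工程式的人格条件化不同,这种表示层的干预可以对"多样性坍缩"发生的中间层直接施加可控偏移。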

[NLP-55] SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLM s with Industrial Deployment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在电商搜索场景中部署时面临的两大核心问题:一是由于缺乏对动态、细粒度商品知识的有效编码而导致的知识幻觉(knowledge hallucination),二是模型在越狱攻击(jailbreak attacks)下存在安全漏洞,可能引发合规风险。解决方案的关键在于提出一种名为SI(Synthesize-Inject-Align)的框架:首先通过融合结构化知识图谱与非结构化行为日志,并引入推理链和安全感知增强,合成高质量自然语言语料;其次采用基于深度扩展(Depth Up-Scaling)的参数高效预训练策略,在注入领域知识的同时保持通用能力;最后通过多任务指令微调与对抗训练相结合的双路径对齐方法,同步提升任务性能与安全性鲁棒性。该框架已在阿里巴巴集团旗下的中国最大自营电商平台落地应用,A/B测试验证了其在五个核心搜索场景中的显著业务指标提升,证明了其工业有效性与可扩展性。

链接: https://arxiv.org/abs/2603.16137
作者: Zhouwei Zhai,Mengxiang Chen,Anmeng Zhang
机构: JD.com(京东); Beijing(北京); China(中国)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SIA, a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes a high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware augmentation. We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at this http URL, China's largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.

[NLP-56] SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era KDD2026

【速读】: 该论文旨在解决当前科学文献摘要任务中存在的一系列关键问题:现有基准数据集规模有限、仅支持单一粒度的摘要目标,且未涵盖生成式 AI(Generative AI)兴起后的研究写作范式变化。为应对这些问题,作者提出了 SciZoom 基准,其核心创新在于构建了一个包含 44,946 篇顶级机器学习会议论文(NeurIPS、ICLR、ICML、EMNLP,2020–2025 年)的大规模数据集,并明确划分为“前大语言模型(Pre-LLM)”与“后大语言模型(Post-LLM)”两个时期。该数据集提供三个层次的摘要目标(Abstract、Contributions 和 TL;DR),压缩比最高达 600:1,从而支持多粒度摘要研究和对科学写作演变的时序分析。通过语言学分析发现,LLM 辅助写作显著改变了句法模式(公式化表达最多提升 10 倍)和修辞风格(情态动词使用减少 23%),表明此类写作更具自信但趋于同质化。SciZoom 不仅是一个具有挑战性的新基准,更是研究生成式 AI 时代科学话语演化的独特资源。

链接: https://arxiv.org/abs/2603.16131
作者: Han Jang,Junhyeok Lee,Kyu Sung Choi
机构: Seoul National University (首尔国立大学); Seoul National University College of Medicine (首尔国立大学医学院); Seoul National University Hospital (首尔国立大学医院)
类目: Computation and Language (cs.CL)
备注: 12 pages, 7 figures, Submitted to KDD 2026

点击查看摘要

Abstract:The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi-granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM-assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (this https URL) and Hugging Face (this https URL), respectively.

[NLP-57] Social Simulacra in the Wild: AI Agent Communities on Moltbook

【速读】: 该论文旨在解决当前关于AI代理(AI-agent)在线社区动态机制的理解不足问题,特别是在与人类社区对比下的结构差异、语言特征及互动模式。其核心解决方案在于通过大规模实证比较,分析73,899条Moltbook和189,838条Reddit帖子,揭示AI代理社区在参与不平等(Gini系数0.84 vs. 0.47)、作者跨社区重叠率(33.8% vs. 0.5%)以及语言属性(情感扁平化、认知倾向从探索转向陈述、社会关联弱化)上的显著差异。关键发现是:表面的社区同质化现象主要源于共享作者结构而非内容趋同;同时,个体AI代理因极端发帖量放大其风格异常值而更具可识别性,从而为理解多代理交互如何塑造不同于人类群体的集体传播动态提供了实证基础。

链接: https://arxiv.org/abs/2603.16128
作者: Agam Goyal,Olivia Pal,Hari Sundaram,Eshwar Chandrasekharan,Koustuv Saha
机构: Siebel School of Computing and Data Science (Siebel计算机与数据科学学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: Preprint: 12 pages, 4 figures, 5 tables

点击查看摘要

Abstract:As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.
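文中报告的参与不平等程度以基尼系数衡量(Moltbook 0.84 vs. Reddit 0.47);按作者发帖量计算基尼系数的标准做法可示意如下:

```python
def gini(counts):
    """Gini coefficient of participation inequality over per-author post counts.
    Uses the sorted-rank formulation: G = sum_i (2i - n - 1) x_i / (n * total)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs)) / (n * total)
```

系数为 0 表示所有作者发帖量完全均等,越接近 1 表示发帖量越集中在少数"超高产"代理手中,这正是论文中作者级可识别性被放大的结构性来源。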

[NLP-58] Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning ICLR2026

【速读】: 该论文旨在解决大规模预训练中学习率调度策略对下游任务微调(Supervised Fine-Tuning, SFT)性能影响不明确的问题,尤其是传统基于衰减(decay-based)的学习率调度器是否真正提升模型的下游适应能力。其解决方案的关键在于提出并验证Warmup-Stable-Only(WSO)调度策略——即在预热阶段后保持恒定学习率、不进行任何衰减——实验表明,尽管衰减策略在预训练损失上表现更优,WSO能显著提升模型在SFT后的性能,并通过损失曲面分析揭示其维持更平坦的极小值区域,从而增强模型的下游适应性。

链接: https://arxiv.org/abs/2603.16127
作者: Kazuki Yano,Shun Kiyono,Sosuke Kobayashi,Sho Takase,Jun Suzuki
机构: Tohoku University(东北大学); SB Intuitions( SB 智能)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 25 pages, accepted by ICLR 2026 as a conference paper

点击查看摘要

Abstract:We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.
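WSO 调度即"预热后恒定学习率、全程不衰减";与常见的余弦衰减基线对比可示意如下(线性预热形状是常见做法,摘要并未指明具体预热形式):

```python
import math

def wso_lr(step, warmup_steps, peak_lr):
    """Warmup-Stable-Only: warm up, then hold the peak learning rate forever."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def cosine_lr(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """A standard decay-based schedule, shown for contrast."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

论文的发现正对应两条曲线的尾部差异:衰减调度压低末期学习率、换取更低的预训练损失,但落入更尖锐的极小值;WSO 保持高学习率,预训练指标稍逊,SFT 后反而更优。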

[NLP-59] SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

【速读】: 该论文旨在解决当前软件工程领域中缺乏可靠基准来评估代理型代码库理解能力的问题,尤其是现有评测方法常忽略长尾主题且依赖于热门仓库导致大语言模型(Large Language Models, LLMs)通过记忆知识作弊。为应对这一挑战,作者提出SWE-QA-Pro基准,其构建于多样化的长尾仓库之上并配备可执行环境,通过基于问题的聚类实现主题平衡,并采用严格的难度校准流程过滤掉可通过直接回答基线解决的问题,从而确保任务复杂性。关键解决方案包括:1)设计了一个更具挑战性和真实性的评估框架;2)提出一种可扩展的合成数据生成流水线与两阶段训练策略(监督微调后接基于AI反馈的强化学习,Reinforcement Learning from AI Feedback, RLAIF),使小型开源模型能够高效学习工具使用和推理能力。实验证明,基于此方案训练的Qwen3-8B模型在SWE-QA-Pro上超越GPT-4o 2.3分,显著缩小与顶级闭源模型的差距,验证了评估体系的有效性与训练流程的优越性。

链接: https://arxiv.org/abs/2603.16124
作者: Songcheng Cai,Zhiheng Lyu,Yuansheng Ni,Xiangchao Chen,Baichuan Zhou,Shenzhe Zhu,Yi Lu,Haozhe Wang,Chi Ruan,Benjamin Schneider,Weixu Zhang,Xiang Li,Andy Zheng,Yuyu Zhang,Ping Nie,Wenhu Chen
机构: University of Waterloo (滑铁卢大学); University of Toronto (多伦多大学); The Hong Kong University of Science and Technology (香港科技大学); McGill University (麦吉尔大学); MILA; Verdent AI, Inc. (Verdent AI公司)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.

[NLP-60] Language Models Dont Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

【速读】: 该论文试图解决当前深度研究(Deep Research, DR)工具在个性化能力上的不足问题,即现有工具虽能合成科学文献以回应用户查询,但缺乏对用户个体研究兴趣的理解与适配。解决方案的关键在于提出 MyScholarQA(MySQA),一个具备三重功能的个性化 DR 工具:首先基于用户行为推断其研究兴趣画像;其次针对输入查询提出个性化的行动建议;最后根据用户批准的行动生成多章节报告。研究进一步表明,仅依赖大语言模型(Large Language Model, LLM)驱动的基准测试无法捕捉真实用户对个性化 DR 的深层需求,因此通过在线访谈揭示了九类 LLM 判官难以检测的个性化错误,并提炼出面向未来 DR 设计的实践启示,强调真实用户反馈是实现真正个性化进步的核心支柱。

链接: https://arxiv.org/abs/2603.16120
作者: Nishant Balepur,Malachi Hamada,Varsha Kishore,Sergey Feldman,Amanpreet Singh,Pao Siangliulue,Joseph Chee Chang,Eunsol Choi,Jordan Lee Boyd-Graber,Aakanksha Naik
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers’ queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user’s research interests; 2) proposes personalized actions for a user’s input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP’s standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

[NLP-61] ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在金融推理任务中因需昂贵微调而导致的模型锁定问题,以及现有无训练替代方法在复杂多步领域推理中表现有限的问题。其解决方案的关键在于提出自动化技能蒸馏与适配(Automated Skill Distillation and Adaptation, ASDA)框架,该框架通过迭代式纠错学习自动生成结构化的技能资源(如推理步骤、代码模板和示例),并在推理阶段动态注入,无需修改模型权重;该方法显著提升了FAMMA基准上的算术与非算术推理性能,且生成的技能文件具备可读性、版本控制能力及与Agent Skills开放标准的兼容性,为无权重访问或重训练场景下的领域适应提供了可审计、实用的路径。

链接: https://arxiv.org/abs/2603.16112
作者: Tik Yu Yim,Wenting Tan,Sum Yee Chan,Tak-Wah Lam,Siu Ming Yiu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model’s failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.
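ASDA 的流程是:按子领域与错误类型对学生模型的失败案例聚类,由教师模型合成技能文件,再在推理时注入、不改动权重。前后两步可示意如下(记录字段名与注入方式均为笔者假设,摘要未给出具体格式):

```python
from collections import defaultdict

def cluster_errors(failures):
    """Group a student model's failures by (subfield, error_type), the
    clustering step that precedes skill-file synthesis by the teacher."""
    clusters = defaultdict(list)
    for f in failures:
        clusters[(f["subfield"], f["error_type"])].append(f)
    return dict(clusters)

def inject_skills(system_prompt, skill_files):
    """Prepend synthesized skill files to the prompt at inference time,
    leaving model weights untouched."""
    return system_prompt + "\n\n" + "\n\n".join(skill_files)
```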

[NLP-62] Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)后训练压缩过程中校准数据(calibration data)选择不当导致的性能下降问题,尤其是在剪枝(pruning)和量化(quantization)两种压缩技术中,如何选取能够有效保留模型内在能力(intra- and inter-task capabilities)的高质量校准数据集。其解决方案的关键在于提出一种与模型无关的数据筛选策略——ZipCal,该策略基于Zipfian幂律分析,通过最大化词汇多样性来自动构建高绩效校准数据集,从而避免依赖昂贵的模型感知信号(如困惑度 perplexity),显著提升效率(平均快约240倍),同时在下游任务性能上达到甚至超越现有先进方法。

链接: https://arxiv.org/abs/2603.16105
作者: Francesco Pio Monaco,Elia Cunegatti,Flavio Vella,Giovanni Iacca
机构: University of Trento (特伦托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while ZipCal is on average ~240x faster due to its tractable linear complexity. (We make the code and the experiments available at this https URL.)
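ZipCal 基于 Zipf 幂律最大化词汇多样性来筛选校准数据,且完全不依赖模型信号;以下用类型-词元比(type-token ratio)作为多样性代理,给出一个贪心筛选的示意(真实准则由 Zipf 频率统计导出,此处仅为近似代理):

```python
def lexical_diversity(text):
    """Type-token ratio: unique tokens over total tokens, a simple
    stand-in for ZipCal's Zipfian-frequency-based diversity score."""
    toks = text.lower().split()
    return len(set(toks)) / len(toks) if toks else 0.0

def select_calibration(candidates, k):
    """Keep the k most lexically diverse candidate samples."""
    return sorted(candidates, key=lexical_diversity, reverse=True)[:k]
```

由于打分只需扫描文本一遍,整体复杂度对数据量线性,这正是其相对困惑度筛选法约 240 倍加速的来源。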

[NLP-63] CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

【速读】: 该论文旨在解决事实性问答中因“承诺失败”(commitment failure)导致的错误问题,即系统虽能检索到相关证据,但仍可能得出错误答案。解决方案的关键在于提出CounterRefine——一种轻量级的推理时修复层,其核心机制是将检索过程从单纯获取上下文转变为对初步答案的测试与验证:首先基于检索证据生成一个短答案,随后通过条件化的后续查询收集支持与冲突证据,最终执行受限的精炼步骤(输出KEEP或REVISE),且仅当修订通过确定性验证时才被采纳。这一方法使模型能够利用证据主动修正自身判断,从而显著提升准确率。

链接: https://arxiv.org/abs/2603.16091
作者: Tianyi Huang,Ying Kai Deng
机构: Ryquo; App-In Club
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.
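CounterRefine 的控制流是:先产出草稿答案,再以草稿为条件分别检索支持与冲突证据,最后做受限精炼(KEEP 或 REVISE),且修订只有通过确定性验证才被采纳。该流程可示意如下(各可调用对象代表检索器与 LLM,均为占位):

```python
def counter_refine(question, draft, retrieve, refine, validate):
    """Inference-time repair loop sketched from the abstract; retrieve,
    refine, and validate are stand-ins for the retriever, the LLM, and
    the deterministic validator."""
    support = retrieve("evidence supporting: " + draft)
    conflict = retrieve("evidence against: " + draft)
    decision, candidate = refine(question, draft, support, conflict)
    if decision == "REVISE" and candidate is not None and validate(candidate):
        return candidate
    return draft  # KEEP, or the proposed revision failed validation
```

关键设计在于检索是以"检验草稿"为目的而非"收集更多上下文",因此能纠正文中所说的"承诺失败"。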

[NLP-64] ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

【速读】: 该论文旨在解决现有文献引用分析方法无法完整捕捉科学论断(claim)在学术对话中演化关系的问题。其核心挑战在于,当前方法仅能识别片段化的引用行为,而未能系统刻画论文之间对具体科学主张的支撑、扩展、质疑或反驳等动态互动。解决方案的关键是构建一个以论断为中心的新型知识图谱——ClaimFlow,该图谱基于304篇ACL会议论文的人工标注数据(共1,084个论断和832个跨论文论断关系),明确标注了引文与被引论断之间的五类语义关系:支持(support)、扩展(extend)、限定(qualify)、反驳(refute)及背景引用(background)。在此基础上,作者提出并定义了“论断关系分类”(Claim Relation Classification)这一新任务,并验证了模型从文本和引用上下文中推断科学立场的可行性,为理解自然语言处理(NLP)领域中观点演进提供了可量化、可追溯的分析框架。

链接: https://arxiv.org/abs/2603.16073
作者: Aniket Pramanick,Yufang Hou,Saif M. Mohammad,Iryna Gurevych
机构: Technische Universität Darmstadt (达姆施塔特工业大学); IT:U Interdisciplinary Transformation University Austria (IT:U跨学科转型大学奥地利); National Research Council Canada (加拿大国家研究委员会)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scientific papers do more than report results - they advance claims that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce ClaimFlow, a claim-centric view of the NLP literature, built from 304 ACL Anthology papers (1979 - 2025) that are manually annotated with 1,084 claims and 832 cross-paper claim relations, indicating whether a citing paper supports, extends, qualifies, refutes, or references a claim as background. Using ClaimFlow, we define a new task - Claim Relation Classification - which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of 0.78 macro-F1, highlighting that claim-relation classification is feasible but challenging. We further apply our model to ~13k NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that 63.5% of claims are never reused; only 11.1% are ever challenged; meanwhile, widely propagated claims are more often reshaped through qualification and extension than directly confirmed or refuted. Overall, ClaimFlow offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.

[NLP-65] SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia

【速读】: 该论文旨在解决低资源语言(如印尼语、他加禄语、泰语和越南语)在仇恨言论检测中因缺乏高质量语言资源而面临的挑战,这些问题阻碍了东南亚地区在线内容治理工具的开发与部署。解决方案的关键在于构建SEAHateCheck——一个针对上述四种语言的首个功能性测试套件,其核心创新包括:基于HateCheck的功能性测试框架并优化SGHateCheck的方法,引入文化相关测试用例,并通过大语言模型增强数据多样性,同时由本地专家验证准确性;该数据集不仅揭示了当前多语言模型在特定低资源语言(尤其是他加禄语)上的性能短板,还暴露了模型在隐含仇恨表达和反言辞(counter-speech)识别方面的系统性不足,从而为开发更具文化敏感性和实用性的仇恨言论检测工具提供了可复用的基准和诊断依据。

链接: https://arxiv.org/abs/2603.16070
作者: Ri Chi Ng,Aditi Kumaresan,Yujia Hu,Roy Ka-Wei Lee
机构: Singapore University of Technology and Design(新加坡科技设计大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: TALLIP Accepted

点击查看摘要

Abstract:Hate speech detection relies heavily on linguistic resources, which are primarily available in high-resource languages such as English and Chinese, creating barriers for researchers and platforms developing tools for low-resource languages in Southeast Asia, where diverse socio-linguistic contexts complicate online hate moderation. To address this, we introduce SEAHateCheck, a pioneering dataset tailored to Indonesia, Thailand, the Philippines, and Vietnam, covering Indonesian, Tagalog, Thai, and Vietnamese. Building on HateCheck’s functional testing framework and refining SGHateCheck’s methods, SEAHateCheck provides culturally relevant test cases, augmented by large language models and validated by local experts for accuracy. Experiments with state-of-the-art and multilingual models revealed limitations in detecting hate speech in specific low-resource languages. In particular, Tagalog test cases showed the lowest model accuracy, likely due to linguistic complexity and limited training data. In contrast, slang-based functional tests proved the hardest, as models struggled with culturally nuanced expressions. The diagnostic insights of SEAHateCheck further exposed model weaknesses in implicit hate detection and models’ struggles with counter-speech expression. As the first functional test suite for these Southeast Asian languages, this work equips researchers with a robust benchmark, advancing the development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.

[NLP-66] Resource Consumption Threats in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在资源消耗方面面临的威胁问题,特别是由过度生成引发的计算资源浪费,这会降低模型效率、增加延迟与API成本,并损害服务可用性和经济可持续性。其解决方案的关键在于建立一个统一的研究视角,系统梳理从威胁诱导到机制理解再到缓解措施的完整链条,从而为该新兴领域提供清晰的问题界定和有效的应对策略基础。

链接: https://arxiv.org/abs/2603.16068
作者: Yuanhe Zhang,Xinyue Wang,Zhican Chen,Weiliu Wang,Zilu Zhang,Zhengshuo Gong,Zhenhong Zhou,Li Sun,Yang Liu,Sen Su
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Hangzhou Dianzi University (杭州电子科技大学); Nanyang Technological University (南洋理工大学); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Given limited and costly computational infrastructure, resource efficiency is a key requirement for large language models (LLMs). Efficient LLMs increase service capacity for providers and reduce latency and API costs for users. Recent resource consumption threats induce excessive generation, degrading model efficiency and harming both service availability and economic sustainability. This survey presents a systematic review of threats to resource consumption in LLMs. We further establish a unified view of this emerging area by clarifying its scope and examining the problem along the full pipeline from threat induction to mechanism understanding and mitigation. Our goal is to clarify the problem landscape for this emerging area, thereby providing a clearer foundation for characterization and mitigation.

[NLP-67] Residual Stream Duality in Modern Transformer Architectures

【速读】: 该论文旨在解决Transformer架构中残差路径(residual stream)的结构与功能本质问题,即明确其不仅是优化过程中的“工程实现”,更是模型表征机制的核心组成部分。解决方案的关键在于提出一个双轴视角来组织Transformer的设计空间:将信息演化视为在序列位置(sequence position)和层深度(layer depth)两个有序维度上的传递过程。文中指出,自注意力机制已在序列轴上实现自适应混合,而传统残差连接则沿深度轴进行固定加法操作;若固定token位置并将层索引视为有序变量,则因果深度残差注意力(causal depth-wise residual attention)与因果短滑动窗口注意力(ShortSWA)在局部算子层面具有等价性,这揭示了Transformer²中的核心残差流对偶性(residual stream duality)。基于此理解,作者进一步区分了两类改进策略:当目标是修改残差捷径本身时,采用Deep Delta Learning(DDL)直接调整残差运算;当目标是实现局部自适应混合时,则优先使用序列轴上的ShortSWA,因其更利于硬件友好部署。

链接: https://arxiv.org/abs/2603.16039
作者: Yifan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model’s representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer². This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
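The claimed operator-level duality can be made concrete with a toy sketch: the same causal windowed-attention read, applied once along the sequence axis and once along the depth axis. Scalar states stand in for vectors, and query/key projections and learned parameters are omitted, so this is an illustration of the equivalence, not the paper's implementation.

```python
import math

def causal_window_attention(states, q_idx, window):
    # Attend from position q_idx over the preceding `window` states
    # (inclusive), causally; scalar states stand in for vectors here.
    lo = max(0, q_idx - window + 1)
    keys = states[lo:q_idx + 1]
    scores = [states[q_idx] * k for k in keys]            # dot-product scores
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]           # stable softmax
    z = sum(weights)
    return sum(w / z * k for w, k in zip(weights, keys))  # weighted read

# Sequence axis (ShortSWA): states of successive tokens at one layer.
token_states = [0.1, 0.4, -0.2, 0.7]
seq_read = causal_window_attention(token_states, 3, window=2)

# Depth axis: the SAME operator applied to one token's states across layers.
layer_states = [0.1, 0.4, -0.2, 0.7]
depth_read = causal_window_attention(layer_states, 3, window=2)

assert seq_read == depth_read  # identical local operator, different axis
```

The only thing that changes between the two calls is which ordered index the list ranges over, which is exactly the duality the abstract describes.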

[NLP-68] Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在道德敏感决策中如何组织伦理框架的问题,特别是其推理过程中伦理框架的动态演化机制尚不明确。解决方案的关键在于提出“道德推理轨迹”(moral reasoning trajectories)这一新概念,即对中间推理步骤中伦理框架调用序列的建模,并通过多模型(六种)与多基准测试(三种)的实证分析揭示:LLMs在道德推理中普遍采用跨框架的系统性思辨策略,约55.4%–57.7%的连续步骤发生框架切换,且框架一致性轨迹稳定性较低(仅16.4%–17.8%),并易受说服性攻击(p=0.015)。进一步地,研究发现模型特定层中存在可定位的框架特异性表征(如Llama-3.3-70B第63/81层),可通过轻量级激活引导技术减少框架漂移(6.7%–8.9%),提升稳定性和准确性的关联强度;最终提出的道德表示一致性(Moral Representation Consistency, MRC)指标与人类对模型连贯性的评分高度相关(r=0.715, p<0.0001),且其框架归因经人工标注验证(平均余弦相似度=0.859),为评估和优化LLMs的道德推理能力提供了可量化、可解释的新范式。

链接: https://arxiv.org/abs/2603.16017
作者: Fan Huang,Haewoon Kwak,Jisun An
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce moral reasoning trajectories, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories are 1.29× more susceptible to persuasive attacks (p=0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r=0.715, p<0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).
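The trajectory statistics quoted above (switch rate across consecutive steps, share of framework-consistent trajectories) are straightforward to compute once each reasoning step is labeled with a framework. A minimal sketch with illustrative framework names, not the paper's labeling pipeline:

```python
def switch_rate(trajectory):
    # Fraction of consecutive reasoning steps whose invoked ethical
    # framework differs from the previous step's.
    if len(trajectory) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(trajectory, trajectory[1:]))
    return switches / (len(trajectory) - 1)

def is_consistent(trajectory):
    # Framework-consistent: every step invokes the same framework.
    return len(set(trajectory)) <= 1

trajectories = [
    ["deontology", "utilitarianism", "deontology", "virtue_ethics"],
    ["utilitarianism"] * 4,
]
rates = [switch_rate(t) for t in trajectories]
consistent_share = sum(map(is_consistent, trajectories)) / len(trajectories)
```

Aggregating `switch_rate` and `consistent_share` over a benchmark yields population-level figures of the kind the abstract reports.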

[NLP-69] Evaluating Agentic Optimization on Large Codebases

【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)代码代理在真实世界代码库中进行多目标优化能力评估的不足问题。现有基准测试大多依赖合成任务、二元正确性信号或单目标评估,难以全面衡量LLM代理在复杂、现实约束下对整个代码库进行系统性优化的能力。解决方案的关键在于提出FormulaCode——一个基于真实科学Python代码库中挖掘出的957个性能瓶颈构建的基准,每个瓶颈均配有专家编写的修复补丁及平均264.6个由社区维护的性能工作负载,从而支持细粒度、多目标的评估体系,能够更真实地检验LLM代理在保持正确性的前提下协同优化多个性能指标的能力。

链接: https://arxiv.org/abs/2603.16011
作者: Atharva Sehgal,James Hou,Akanksha Sarkar,Ishaan Mantripragada,Swarat Chaudhuri,Jennifer J. Sun,Yisong Yue
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint version

点击查看摘要

Abstract:Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of LLM agents' ability to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: this https URL

[NLP-70] RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation

【速读】: 该论文旨在解决放射学报告标注(radiology report annotation)在临床自然语言处理(clinical NLP)中依赖人工标注效率低、成本高的问题。其核心解决方案是提出 RadAnnotate 框架,关键在于结合检索增强生成(retrieval-augmented generation, RAG)合成报告与基于置信度的自动化选择机制:首先训练特定实体类别的分类器并分析其在不同解剖结构和观察类别中的性能表现,发现不确定观察最难建模;其次利用 RAG 生成合成报告,在低资源场景下通过合成数据增强显著提升对不确定观察的 F1 分数(从 0.61 提升至 0.70);最后通过学习每个实体类别的置信度阈值,实现对 55–90% 报告的自动标注,同时保持 0.86–0.92 的实体匹配准确率,将低置信度样本路由至专家审核,从而大幅降低人工标注负担。

链接: https://arxiv.org/abs/2603.16002
作者: Saisha Pradeep Shetty,Roger Eric Goldman,Vladimir Filkov
机构: University of California, Davis, CA, USA (加州大学戴维斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 3 figures. Accepted at AMIA Amplify Informatics Summit 2026

点击查看摘要

Abstract:Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
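The selective-automation step, accepting a report's labels automatically only when every predicted entity clears its class-specific confidence threshold, can be sketched in a few lines. The entity classes, scores, and thresholds below are made-up illustrations, not RadAnnotate's learned values:

```python
def route(predictions, thresholds):
    # Auto-annotate a report only if every predicted entity clears its
    # class-specific confidence threshold; otherwise send it to review.
    auto, review = [], []
    for report_id, entities in predictions.items():
        if all(conf >= thresholds[cls] for cls, conf in entities):
            auto.append(report_id)
        else:
            review.append(report_id)
    return auto, review

predictions = {
    "report_1": [("anatomy", 0.97), ("observation_definite", 0.91)],
    "report_2": [("anatomy", 0.95), ("observation_uncertain", 0.55)],
}
thresholds = {"anatomy": 0.90, "observation_definite": 0.85,
              "observation_uncertain": 0.80}
auto, review = route(predictions, thresholds)
```

Raising a class's threshold trades automation rate for accuracy on that class, which is how the 55-90% automation range in the abstract arises.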

[NLP-71] Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models CVPR2026

【速读】: 该论文旨在解决轻量化大型视觉语言模型(Large Vision-Language Models, LVLMs)中因统一处理多模态校准数据而导致的剪枝精度不足问题,即现有方法未充分考虑文本和视觉token在剪枝敏感性上的差异。其解决方案的关键在于提出一种不对称的文本-视觉权重剪枝方法(Asymmetric Text-Visual Weight Pruning, ATV-Pruning),通过解耦文本与视觉路径的权重并分别设计重要性度量机制:一方面利用全部文本token进行高敏感性的文本路径校准,另一方面采用层自适应选择策略从视觉路径中提取关键token,同时发现视觉路径具有较高冗余性(可支持50%稀疏度),从而实现更精准且高效的剪枝。

链接: https://arxiv.org/abs/2603.16001
作者: Sijie Li,Biao Qian,Jungong Han
机构: University of Sheffield, UK (谢菲尔德大学, 英国); Tsinghua University, China (清华大学, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: CVPR 2026. Code available here: this https URL

点击查看摘要

Abstract:Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.
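The asymmetric calibration idea can be sketched with a magnitude-times-activation importance score (a common weight-importance metric in activation-aware pruning). The score, the toy token activations, and the fixed one-token visual subset are illustrative assumptions, not the paper's exact metric or layer-adaptive selection strategy:

```python
def prune_mask(weights, feat_norms, sparsity):
    # Score each weight by |w| * activation norm of its input feature,
    # then zero the lowest-scoring fraction of weights.
    scores = [(abs(w) * feat_norms[j], i, j)
              for i, row in enumerate(weights)
              for j, w in enumerate(row)]
    scores.sort()
    pruned = {(i, j) for _, i, j in scores[:int(sparsity * len(scores))]}
    return [[0.0 if (i, j) in pruned else w for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

# Asymmetric calibration pool: all text tokens, but only a subset of
# visual tokens (a stand-in for the layer-adaptive selection).
text_tokens = [[1.0, 0.2], [0.8, 0.1]]
visual_tokens = [[0.1, 2.0], [0.2, 1.5], [0.1, 1.8]]
pool = text_tokens + visual_tokens[:1]
feat_norms = [sum(t[j] ** 2 for t in pool) ** 0.5 for j in range(2)]

W = [[0.5, -0.4], [0.05, 0.3]]
W_pruned = prune_mask(W, feat_norms, sparsity=0.5)
```

Changing which tokens enter `pool` changes `feat_norms` and hence which weights survive, which is the lever the asymmetric text-visual calibration pulls.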

[NLP-72] NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time – A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce 2022-2026

【速读】: 该论文试图解决职业演化速度远超分类系统追踪能力的问题,即如何在不依赖预定义职业分类或职位名称的情况下,准确识别真实职业的出现与演变。其解决方案的关键在于提出“共吸引子”(bipartite co-attractor)概念——即一个职业是一个由共享专业术语(vocabulary cohesion)和从业者群体凝聚力(population cohesion)相互强化的自洽结构。作者通过独立检验词汇凝聚力与人群凝聚力,并辅以消融实验验证词汇是否为绑定群体的核心机制,从而实现零假设的职业涌现检测方法。该方法在820万份美国简历数据中成功识别出既有职业并揭示了人工智能(AI)领域存在显著不对称:尽管专业术语迅速凝聚成 cohesive vocabulary,但从业者群体始终未能形成 cohesion,表明AI更像是一种扩散性技术而非新兴职业。

链接: https://arxiv.org/abs/2603.15998
作者: David Nordfors
机构: BOLD; Monster Research Institute
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 37 pages, 5 figures

点击查看摘要

Abstract:Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an “AI Engineer” occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.

[NLP-73] Visual Set Program Synthesizer

【速读】: 该论文旨在解决当前视觉语言模型(Visual Language Models, VLMs)在处理需要集合推理(set-based reasoning)的复杂视觉问答任务时表现不佳的问题,例如用户询问“哪款汽水含糖量最低”这类涉及过滤、比较和聚合操作的任务。传统端到端多模态大模型(Multimodal Large Language Models, MLLMs)由于缺乏显式的组合逻辑机制,难以完成此类任务。其解决方案的关键在于将视觉推理建模为视觉程序合成(Visual Program Synthesis),即模型首先生成一个符号化的可执行程序,由一个独立于模型的引擎基于视觉场景进行执行,从而实现结构化、透明且准确的推理过程。这一方法显著优于现有基线,在Set-VQA新基准上的实验验证了其在复杂推理任务中的优越性。

链接: https://arxiv.org/abs/2603.15997
作者: Zehua Cheng,Wei Dai,Wenhu Zhang,Thomas Lukasiewicz,Jiahao Sun
机构: University of Oxford (牛津大学); Hong Kong University of Science and Technology (香港科技大学); TU Wien (维也纳工业大学)
类目: Multimedia (cs.MM); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注: 10 pages, IEEE International Conference on Multimedia and Expo 2026

点击查看摘要

Abstract:A user pointing their phone at a supermarket shelf and asking “Which soda has the least sugar?” poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.
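The generate-then-execute pattern for the soda example can be sketched as a tiny symbolic program run over detected objects. The scene objects, the two-operation program vocabulary, and the executor are illustrative assumptions, not the paper's engine:

```python
# Detected objects standing in for the recognition front-end's output.
scene = [
    {"name": "soda", "brand": "A", "sugar_g": 39},
    {"name": "soda", "brand": "B", "sugar_g": 0},
    {"name": "juice", "brand": "C", "sugar_g": 24},
]

# A synthesized symbolic program: filter the set, then aggregate with argmin.
program = [
    ("filter", lambda obj: obj["name"] == "soda"),
    ("argmin", lambda obj: obj["sugar_g"]),
]

def execute(program, objects):
    # Run each program step over the current working set.
    result = objects
    for op, fn in program:
        if op == "filter":
            result = [o for o in result if fn(o)]
        elif op == "argmin":
            result = min(result, key=fn)
    return result

answer = execute(program, scene)  # the soda with the least sugar
```

Because each intermediate working set is inspectable, the reasoning is transparent in a way an end-to-end answer is not.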

[NLP-74] Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

【速读】: 该论文旨在解决语音大语言模型(Speech LLM)在理解语调、情感和非语言声音等副语言特征(paralinguistic cues)时面临的挑战,包括训练数据稀缺、标注困难以及模型倾向于依赖词汇捷径而非有效利用副语言信号。其解决方案的关键在于提出一种基于思维链(chain-of-thought)提示的多任务强化学习(multi-task reinforcement learning, RL)框架,并设计了一个副语言感知的语音大语言模型(PALLM),通过两阶段流水线联合优化从音频中进行情感分类与副语言感知的响应生成,从而显式地激发情感推理过程。实验表明,该方法在Expresso、IEMOCAP和RAVDESS数据集上相较监督基线及主流商用模型(如Gemini-2.5-Pro、GPT-4o-audio)提升了8–12%的副语言理解性能。

链接: https://arxiv.org/abs/2603.15981
作者: Jingxiang Chen,Minseok Kim,Seong-Gyun Leem,Yin Huang,Rashi Rungta,Zhicheng Ouyang,Haibin Wu,Surya Teja Appini,Ankur Bansal,Yang Bai,Yue Liu,Florian Metze,Ahmed A Aly,Anuj Kumar,Ariya Rastrow,Zhaojiang Lin
机构: Meta Reality Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds–crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

[NLP-75] Robust Language Identification for Romansh Varieties

【速读】: 该论文旨在解决罗曼什语(Romansh)不同地域变体(idioms)之间的语言识别(Language Identification, LID)问题,尤其关注如何区分具有有限互理解性的方言,并进一步识别一种跨区域统一变体——Rumantsch Grischun。这一问题具有挑战性,因方言间差异显著且缺乏系统性的标注数据。解决方案的关键在于构建一个基于支持向量机(SVM)的分类模型,利用新整理的基准数据集在两个不同领域进行评估,最终实现平均域内准确率达97%,为拼写检查和机器翻译等下游任务提供可靠的idiom-aware识别能力。

链接: https://arxiv.org/abs/2603.15969
作者: Charlotte Model,Sina Ahmadi,Jannis Vamvas
机构: University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
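The paper uses an SVM; as a dependency-free stand-in for the same idea, character n-gram profiles separate closely related varieties surprisingly well. Here is a nearest-centroid sketch, where the toy training sentences are illustrative strings (not real corpus data):

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    # Padded character n-gram profile of a sentence.
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a.keys() & b.keys())
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Toy sentences per idiom (illustrative strings, not real training corpora).
train = {
    "sursilvan": ["jeu sun cheu", "bien di a tuts"],
    "vallader": ["eu sun qua", "bun di a tuots"],
}
centroids = {idiom: sum((char_ngrams(s) for s in sents), Counter())
             for idiom, sents in train.items()}

def predict(sentence):
    # Nearest centroid by cosine similarity over n-gram profiles.
    return max(centroids,
               key=lambda idiom: cosine(char_ngrams(sentence), centroids[idiom]))
```

An SVM over the same character n-gram features, as in the paper, learns discriminative weights instead of comparing against raw centroids, which matters when idioms share much of their vocabulary.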

[NLP-76] MoLoRA: Composable Specialization via Per-Token Adapter Routing

【速读】: 该论文旨在解决多适配器(multi-adapter)服务系统在处理跨域请求时的局限性问题,即传统方法强制将整个序列路由到单一适配器,导致无法有效支持同时包含文本和图像令牌的多模态生成任务,以及需要多种专业能力组合的混合请求(如“编写代码求解此方程”)。其解决方案的关键在于引入逐token路由机制(per-token routing),该机制根据词汇结构(适用于多模态模型)或学习到的门控机制(用于语义专业化)动态选择每个token对应的适配器。这一方法在理论上最优,可实现N个token仅需N个工作量,而传统按序列路由需K·N(K为适配器类型数)。核心创新是提出MoLoRA(Mixture of LoRA),通过组合多个领域特定的LoRA适配器并由学习到的路由器进行逐token调度,实现了推理时模块化专家能力的灵活集成:可独立训练专注型LoRA,无需重新训练即可组合使用,并通过加载新适配器轻松扩展功能。

链接: https://arxiv.org/abs/2603.15965
作者: Shrey Shah,Justin Wagle
机构: Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like “write code to solve this equation,” which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work N for N tokens versus K·N for per-sequence routing with K adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.
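The N versus K·N work argument can be made concrete. The toy word-list router and adapter names below are assumptions for illustration; MoLoRA's actual router is a learned gate over hidden states:

```python
def per_sequence_work(tokens, adapters):
    # Per-sequence routing picks one adapter for the whole request, so
    # covering a mixed-capability request means running each of the
    # K adapter types over all N tokens: K * N applications.
    return len(adapters) * len(tokens)

def per_token_work(tokens, router):
    # Per-token routing applies exactly one adapter per token: N applications.
    assignment = [router(tok) for tok in tokens]
    return len(assignment), assignment

tokens = ["write", "code", "to", "solve", "this", "equation"]
adapters = ["code_lora", "math_lora"]

def router(tok):
    # Toy gate standing in for the learned per-token router.
    return "math_lora" if tok in {"solve", "equation"} else "code_lora"

work_token, assignment = per_token_work(tokens, router)
work_seq = per_sequence_work(tokens, adapters)
```

Here `work_token` grows only with sequence length, while `work_seq` also grows with the number of adapter types loaded, which is the optimality claim in the abstract.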

[NLP-77] A Family of LLM s Liberated from Static Vocabularies

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)中分词器(Tokenization)存在的两大局限:固定且庞大的词汇表规模,以及对新领域或新语言适应能力差的问题。其解决方案的关键在于提出一种基于分层自回归Transformer(Hierarchical Autoregressive Transformer, HAT)的架构,该架构通过一个编码器将字节(bytes)聚合为词嵌入(word embeddings),再由主干Transformer处理,并通过解码器将输出重新映射回字节。此设计实现了端到端的字节级建模,同时保留了预训练主干模型的知识,从而在不增加词汇表复杂度的前提下提升了跨语言鲁棒性和文本压缩效率。

链接: https://arxiv.org/abs/2603.15953
作者: Aleph Alpha:Adnen Abdessaied,Artur Baranowski,Lukas Balles,Michael Barlow,Fabien C. Y. Benureau,Felix Berkenkamp,Lukas Bluebaum,Bastian Boll,Thomas F. Burns,Björn Deiseroth,Constantin Eichenberg,David Friede,Pablo Iyu Guerrero,Ahmed Hammam,Bastian Harren,Johann Higl,Yasser Jadidi,Carina Kauf,Johannes Messner,Jan Hendrik Metzen,Max Meuer,Vedant Nanda,Pit Neitemeier,Koen Oostermeijer,Letitia Parcalabescu,Markus Pernpointner,Felix Reinfurt,Dylan Rodriquez,Grégory Schott,Philipp Siedler,Martin Simonovsky,Till Speicher,Volker Stampa,Stephan Wäldchen,Samuel Weinbach,Gregor Ziegltrum
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

[NLP-78] POLAR: A Per-User Association Test in Embedding Space

【速读】: 该论文旨在解决现有内在关联探测方法(如词、句或语料级别)无法有效捕捉作者个体差异的问题。其解决方案的关键在于提出POLAR(Per-user On-axis Lexical Association Report),一种基于轻度调整的掩码语言模型嵌入空间中的个体化词汇关联测试方法。该方法通过为每位作者生成私有的确定性向量表示,并将其投影到预定义的词汇轴上,结合置换检验p值与Benjamini–Hochberg多重假设校正,实现对个体层面语言模式的精准量化。这一设计使POLAR能够清晰区分LLM驱动的机器人账号与真实用户,并揭示极端论坛中用户群体对贬损词汇库的强关联及其随时间的右倾演变趋势,具有模块化扩展性和计算社会科学研究所需的细粒度诊断能力。

链接: https://arxiv.org/abs/2603.15950
作者: Pedro Bento,Arthur Buzelin,Arthur Chagas,Yan Aquino,Victoria Estanislau,Samira Malaquias,Pedro Robles Dutenhefner,Gisele L. Pappa,Virgilio Almeida,Wagner MeiraJr
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: Accepted paper at ICWSM 2026

点击查看摘要

Abstract:Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Report), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic tokens; POLAR projects these vectors onto curated lexical axes and reports standardized effects with permutation p-values and Benjamini–Hochberg control. On a balanced bot–human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum, it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly available at this https URL.
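The statistical machinery named in the abstract, permutation p-values with Benjamini–Hochberg control, can be sketched generically. The scalar "axis projection" scores below are made up; in POLAR they would come from projecting author embeddings onto lexical axes:

```python
import random
import statistics

def permutation_pvalue(user_scores, pool_scores, n_perm=2000, seed=0):
    # Two-sided permutation test: is this author's mean projection onto a
    # lexical axis unusual relative to the pooled background scores?
    rng = random.Random(seed)
    observed = abs(statistics.mean(user_scores) - statistics.mean(pool_scores))
    combined = user_scores + pool_scores
    k = len(user_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(combined)
        diff = abs(statistics.mean(combined[:k]) - statistics.mean(combined[k:]))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)   # add-one smoothing avoids p = 0

def benjamini_hochberg(pvals, alpha=0.05):
    # Indices of hypotheses rejected under BH control of the FDR.
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

p_user = permutation_pvalue([5.0, 5.1, 4.9], [0.0, 0.1, -0.1, 0.05, -0.05])
```

Running one permutation test per author-axis pair and then applying BH across all pairs keeps the false discovery rate controlled as the number of axes grows.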

[NLP-79] BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高语境语言(如孟加拉语)中缺乏社会语用能力(sociopragmatic competence)的问题,即模型虽具备多语言流利性,但难以准确理解和应用与社会层级、关系角色及互动规范相关的文化语用规则。解决方案的关键在于构建首个针对孟加拉语的社会语用评测基准——BANGLASOCIALBENCH,该基准通过情境依赖的语言使用而非事实回忆来评估模型的社会适应能力,涵盖称谓系统、亲属关系推理和社交习俗三个维度,共包含1,719个由母语者编写并验证的文化语境实例。实验表明,当前LLMs在零样本设置下普遍存在系统性文化错位,如过度使用正式称谓、忽略多种可接受的称呼形式以及混淆宗教背景下的亲属术语,揭示了现有模型在文化适切语言生成中的结构性局限。

链接: https://arxiv.org/abs/2603.15949
作者: Tanvir Ahmed Sijan,S. M Golam Rifat,Pankaj Chowdhury Partha,Md. Tanjeed Islam,Md. Musfique Anwar
机构: Jahangirnagar University, Dhaka, Bangladesh; Rajshahi University of Engineering Technology, Rajshahi, Bangladesh; Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.

[NLP-80] CTG-DB: An Ontology-Based Transformation of ClinicalTrials.gov to Enable Cross-Trial Drug Safety Analyses

【速读】: 该论文旨在解决临床试验注册库(如ClinicalTrials.gov)中不良事件(Adverse Event, AE)术语异质性和非结构化文本记录导致的系统性药物警戒(Pharmacovigilance, PV)分析困难问题。现有数据多以研究者报告的自由文本形式存在,缺乏标准化标识符,需人工映射才能提取一致的安全性概念,效率低且难以规模化。解决方案的关键在于构建一个名为CTG-DB的开源数据处理管道,其核心能力包括:从完整的ClinicalTrials.gov XML归档中提取数据,通过确定性精确匹配与模糊匹配策略将AE术语标准化至MedDRA(Medical Dictionary for Regulatory Activities)体系,并保留各试验组(含安慰剂组和对照组)的分母信息,从而实现跨试验的概念级检索与聚合,支持可扩展的以安慰剂为参照的安全性分析及临床试验证据向下游信号检测流程的集成。

链接: https://arxiv.org/abs/2603.15936
作者: Jeffery L. Painter,François Haguinet,Andrew Bate
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures. Submitted to the 2026 AMIA Annual Symposium

点击查看摘要

Abstract:ClinicalTrials.gov (this http URL) is the largest publicly accessible registry of clinical studies, yet its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analytics. AEs are typically recorded as investigator-reported text rather than standardized identifiers, requiring manual reconciliation to identify coherent safety concepts. We present the ClinicalTrials.gov Transformation Database (CTG-DB), an open-source pipeline that ingests the complete ClinicalTrials.gov XML archive and produces a relational database aligned to standardized AE terminology using the Medical Dictionary for Regulatory Activities (MedDRA). CTG-DB preserves arm-level denominators, represents placebo and comparator arms, and normalizes AE terminology using deterministic exact and fuzzy matching to ensure transparent and reproducible mappings. This framework enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream PV signal detection.
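The exact-then-fuzzy normalization step can be sketched with the standard library's `difflib`. The vocabulary and codes below are placeholders, not real MedDRA identifiers (MedDRA is licensed terminology), and the cutoff is an assumed value:

```python
import difflib

# Toy vocabulary of standardized preferred terms with placeholder codes.
preferred_terms = {"headache": "PT-001", "nausea": "PT-002", "dizziness": "PT-003"}

def normalize_ae(reported, cutoff=0.8):
    # Deterministic mapping: exact match on the lowercased term first, then
    # fuzzy matching against the vocabulary with a similarity cutoff.
    term = reported.strip().lower()
    if term in preferred_terms:
        return term, preferred_terms[term], "exact"
    close = difflib.get_close_matches(term, list(preferred_terms), n=1, cutoff=cutoff)
    if close:
        return close[0], preferred_terms[close[0]], "fuzzy"
    return None, None, "unmapped"   # route to manual reconciliation

mapped = [normalize_ae(t) for t in ("Headache", "headach", "fatigue")]
```

Because both stages are deterministic, rerunning the pipeline reproduces the same mappings, which is the transparency property the abstract emphasizes.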

[NLP-81] Agent -based imitation dynamics can yield efficiently compressed population-level vocabularies

【速读】: 该论文试图解决的问题是:自然语言词汇的高效压缩特性(即在信息瓶颈(Information Bottleneck, IB)框架下实现复杂度与准确性之间的最优权衡)是如何通过社会性演化动力学机制产生的。现有研究虽提出语言演化可能受IB优化驱动,但缺乏对具体社会互动过程如何促成此类效率的解释。解决方案的关键在于构建一个融合进化博弈论与IB框架的统一模型,其中通过代理间不精确策略模仿(imprecise strategy imitation)这一独立动机的动态过程,在信号博弈中自发涌现出接近IB最优的词汇系统。研究表明,模型中调控博弈精度及玩家混淆相似状态倾向的关键参数,能够约束所达成的IB权衡范围,从而为词汇演化提供具有信息论最优性和实证支持的机制基础。

Link: https://arxiv.org/abs/2603.15903
Authors: Nathaniel Imel, Richard Futrell, Michael Franke, Noga Zaslavsky
Affiliations: New York University; University of California, Irvine; University of Tübingen
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Natural languages have been argued to evolve under pressure to efficiently compress meanings into words by optimizing the Information Bottleneck (IB) complexity-accuracy tradeoff. However, the underlying social dynamics that could drive the optimization of a language’s vocabulary towards efficiency remain largely unknown. In parallel, evolutionary game theory has been invoked to explain the emergence of language from rudimentary agent-level dynamics, but it has not yet been tested whether such an approach can lead to efficient compression in the IB sense. Here, we provide a unified model integrating evolutionary game theory with the IB framework and show how near-optimal compression can arise in a population through an independently motivated dynamic of imprecise strategy imitation in signaling games. We find that key parameters of the model – namely, those that regulate precision in these games, as well as players’ tendency to confuse similar states – lead to constrained variation of the tradeoffs achieved by emergent vocabularies. Our results suggest that evolutionary game dynamics could potentially provide a mechanistic basis for the evolution of vocabularies with information-theoretically optimal and empirically attested properties.

[NLP-82] COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives SEMEVAL-2026

[Quick Read]: This paper addresses the problem of rating the plausibility of homonym word senses in short stories, i.e., how to assess, against a human-annotated 5-point Likert scale, how plausible different senses are in a given context. The key parts of the solution are: first, three prompting strategies (a zero-shot baseline, Chain-of-Thought (CoT) structured reasoning, and parallel comparative prompting) that strengthen large language models' (LLMs) context-sensitive semantic judgment; second, to handle the substantial inter-annotator variation in the labels, an ensemble mechanism that averages predictions from multiple models so as to better approximate mean human judgments. Experiments show that comparative prompting consistently improves performance across model families, while model ensembling significantly enhances agreement with mean human judgments, validating the effectiveness of LLM ensembles for subjective semantic evaluation tasks.

Link: https://arxiv.org/abs/2603.15897
Authors: Azwad Anjum Islam, Tisa Islam Erana
Affiliations: Florida International University; Knight Foundation School of Computing and Information Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: System description paper in SemEval-2026, Task 5

Abstract:We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman’s rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman’s rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
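
The ensemble step, averaging per-model Likert ratings, can be sketched as follows (the function name and the clamping/rounding details are illustrative assumptions, not the authors' code):

```python
from statistics import mean

def ensemble_rating(predictions, lo=1, hi=5):
    """Average per-model Likert ratings and clamp to the scale.

    Returns both the continuous mean (useful for rank correlation such as
    Spearman's rho) and a rounded value for accuracy-style scoring.
    """
    avg = max(lo, min(hi, mean(predictions)))
    return avg, round(avg)

# Three hypothetical model ratings for one word sense in context:
avg, likert = ensemble_rating([4, 5, 4])
```

Averaging smooths out individual-model noise, which is one plausible reason the paper finds ensembles align better with mean human judgments than any single model.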

[NLP-83] FlashSampling: Fast and Memory-Efficient Exact Sampling

[Quick Read]: This paper targets the performance bottleneck in large-vocabulary decoding, where sampling from a categorical distribution triggers extra memory traffic and kernel launches. Conventional approaches materialize the full logits tensor in high-bandwidth memory (HBM), incurring heavy bandwidth usage and latency. The key to the solution is FlashSampling, an exact sampling primitive fully fused into the language-model head matrix multiplication (LM-head matmul): logits are computed tile-by-tile on chip, Gumbel noise is added, a single maximizer is kept per row and per vocabulary tile, and a small final reduction completes the sample. The method never materializes the logits in HBM, relies on the decomposability of argmax over a partition to guarantee exactness, and supports grouped variants for online and tensor-parallel settings, achieving efficient approximation-free sampling and reducing end-to-end inference latency by up to 19%.

Link: https://arxiv.org/abs/2603.15854
Authors: Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang
Affiliations: LMU Munich; Princeton University; FlashSampling
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because argmax decomposes over a partition; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel-level decode workloads, and in end-to-end vLLM experiments, it reduces time per output token by up to 19% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth-bound postprocessing step into a lightweight epilogue. Project Page: this https URL.
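
The exactness claim, that argmax decomposes over a partition of the vocabulary, can be checked with a small pure-Python sketch of the Gumbel-max trick (random toy logits, no real LM head; all names here are invented):

```python
import math
import random

def gumbel(rng):
    # Standard Gumbel noise: -log(-log(U)) with U ~ Uniform(0, 1)
    return -math.log(-math.log(rng.random()))

def tiled_gumbel_sample(logits, tile=4, seed=0):
    """Sample a token index via the Gumbel-max trick, one tile at a time.

    Only the running (best value, best index) pair survives each tile, so
    the full perturbed-logits vector is never materialized -- the same idea
    FlashSampling applies inside the LM-head matmul.
    """
    rng = random.Random(seed)
    best_val, best_idx = float("-inf"), -1
    for start in range(0, len(logits), tile):
        for i in range(start, min(start + tile, len(logits))):
            v = logits[i] + gumbel(rng)          # perturb on the fly
            if v > best_val:                     # keep tile/global maximizer
                best_val, best_idx = v, i
    return best_idx

def dense_gumbel_sample(logits, seed=0):
    """Reference: materialize all perturbed logits, then take the argmax."""
    rng = random.Random(seed)
    perturbed = [x + gumbel(rng) for x in logits]
    return max(range(len(logits)), key=perturbed.__getitem__)

rng_data = random.Random(42)
logits = [rng_data.uniform(-3, 3) for _ in range(16)]
assert tiled_gumbel_sample(logits) == dense_gumbel_sample(logits)
```

Because both functions consume the same noise stream in the same index order, the tiled pass returns exactly the sample the dense pass would, while holding only one (value, index) pair at a time.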

[NLP-84] When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making ICLR2026

[Quick Read]: This paper addresses the problem that, when large language models (LLMs) serve as decision-support tools in data-constrained scientific workflows, evaluating only run-to-run stability can mask disagreement with statistical ground truth. The core issue is that high stability does not imply decision correctness or output validity, and in settings demanding rigorous validation of scientific conclusions this mismatch can cause serious misjudgment. The key to the solution is a controlled behavioral evaluation framework that explicitly separates LLM decision-making into four dimensions: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. Using a gene-prioritization task derived from differential expression analysis, multiple LLMs are systematically tested across prompt settings (strict vs. relaxed significance thresholds, borderline ranking scenarios, and minor wording variations), revealing their latent risks in real scientific applications.

Link: https://arxiv.org/abs/2603.15840
Authors: Nazia Riasat
Affiliations: North Dakota State University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 13 pages, 5 figures. Accepted at ICLR 2026 Workshop: I Can’t Believe It’s Not Better (ICBINB 2026). OpenReview: this https URL

Abstract:Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly separates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multiple LLMs using a statistical gene prioritization task derived from differential expression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our experiments show that LLMs can exhibit near-perfect run-to-run stability while systematically diverging from statistical ground truth, over-selecting under relaxed thresholds, responding sharply to minor prompt wording changes, or producing syntactically plausible gene identifiers absent from the input table. Although stability reflects robustness across repeated runs, it does not guarantee agreement with statistical ground truth in structured scientific decision tasks. These findings highlight the importance of explicit ground-truth validation and output validity checks when deploying LLMs in automated or semi-automated scientific workflows.

[NLP-85] Persona-Conditioned Risk Behavior in Large Language Models: A Simulated Gambling Study with GPT-4.1

[Quick Read]: This paper asks whether the behavior of large language models (LLMs) in uncertain, sequential decision-making environments reflects deeper cognitive regularities or mere surface-level prompt mimicry. The key to the solution is a controlled experiment that assigns GPT-4.1 one of three socioeconomic personas (rich, middle-income, and poor) and places it in three slot-machine environments with different probability structures (fair, biased low, and streak-incentivized). Analysis of 6,950 decisions shows that, without explicit instruction, the model reproduces key behavioral signatures predicted by Kahneman and Tversky's Prospect Theory: the "poor" persona gambles far longer (a mean of 37.4 rounds per session) while the "rich" persona quits almost immediately (mean 1.1 rounds), with a very large effect size on risk scores (Cohen's d = 4.15). This suggests LLM agent behavior may implicitly encode classical cognitive-economic biases rather than simple prompt responses, providing an empirical basis for LLM agent design and interpretability research.

Link: https://arxiv.org/abs/2603.15831
Authors: Sankalp Dubedy
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 13 figures, 9 tables. Independent research. Submitted to arXiv for open dissemination

Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents in uncertain, sequential decision-making contexts. Yet it remains poorly understood whether the behaviors they exhibit in such environments reflect principled cognitive patterns or simply surface-level prompt mimicry. This paper presents a controlled experiment in which GPT-4.1 was assigned one of three socioeconomic personas (Rich, Middle-income, and Poor) and placed in a structured slot-machine environment with three distinct machine configurations: Fair (50%), Biased Low (35%), and Streak (dynamic probability increasing after consecutive losses). Across 50 independent iterations per condition and 6,950 recorded decisions, we find that the model reproduces key behavioral signatures predicted by Kahneman and Tversky’s Prospect Theory without being instructed to do so. The Poor persona played a mean of 37.4 rounds per session (SD=15.5) compared to 1.1 rounds for the Rich persona (SD=0.31), a difference that is highly significant (Kruskal-Wallis H=393.5, p < 2.2e-16). Risk scores by persona show large effect sizes (Cohen’s d=4.15 for Poor vs Rich). Emotional labels appear to function as post-hoc annotations rather than decision drivers (chi-square=3205.4, Cramer’s V=0.39), and belief-updating across rounds is negligible (Spearman rho=0.032 for Poor persona, p=0.016). These findings carry implications for LLM agent design, interpretability research, and the broader question of whether classical cognitive economic biases are implicitly encoded in large-scale pretrained language models.
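
The reported effect size is Cohen's d; a minimal sketch using a pooled standard deviation (the sample values below are invented, not the paper's data):

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d for two independent groups, with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Invented rounds-played samples for two personas (illustrative only):
poor = [30, 45, 38, 50, 24]
rich = [1, 1, 2, 1, 1]
d = cohens_d(poor, rich)
```

Values above roughly 0.8 are conventionally called "large", which puts the paper's d = 4.15 far into very-large-effect territory.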

[NLP-86] Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory CVPR2026

[Quick Read]: This paper addresses the lack of contextual safety in multi-modal large language models (MLLMs) facing safety risks: models struggle to distinguish inputs that look similar but carry different safety intent depending on context. Traditional defenses focus only on detecting and refusing explicitly unsafe inputs, overlooking how subtle contextual differences implicit in a scenario affect safe behavior. The key to the solution is EchoSafe, a training-free framework that maintains a self-reflective memory bank to continually accumulate and retrieve safety insights from past interactions and fuses the relevant experience into the current prompt, enabling context-aware safety decisions and continual evolution of safety behavior during inference.

Link: https://arxiv.org/abs/2603.15800
Authors: Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie
Affiliations: Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted at CVPR 2026. Project page: this https URL

Abstract:Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at this https URL.

[NLP-87] Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs LREC2026

[Quick Read]: This paper investigates the limits of large language models (LLMs) in representing and generating Arabic root-pattern morphology, probing whether they truly capture morphological structure or rely on surface-form memorization. The key parts of the solution are a newly constructed test set for evaluating LLMs on root-pattern generation and a systematic comparison of morphological fidelity across seven Arabic-centric and multilingual LLMs and their tokenizers. The findings show that tokenizer morphological alignment is neither necessary nor sufficient for morphological generation, challenging the conventional view that morphology-aware tokenization improves downstream performance.

Link: https://arxiv.org/abs/2603.15773
Authors: Yara Alakeel, Chatrine Qwaider, Hanan Aldarmaki, Sawsan Alqahtani
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at LREC 2026

Abstract:This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is not necessary nor sufficient for morphological generation, which questions the role of morphological tokenization in downstream performance.

[NLP-88] MedArena: Comparing LLM s for Medicine-in-the-Wild Clinician Preferences

[Quick Read]: This paper addresses the limitations of current evaluation methods for medical large language models (LLMs): existing evaluations rely mainly on static, templated benchmarks that fail to reflect the complex, dynamic demands of real clinical practice, creating a gap between benchmark performance and actual clinical value. The key to the solution is MedArena, an interactive evaluation platform that lets clinicians directly compare responses from multiple LLMs on their own real medical questions and collects preference data from a clinical decision-making perspective. By grounding evaluation in real clinical scenarios, the approach improves realism and utility, and it reveals that clinicians value depth, clarity, and clinical nuance over raw factual accuracy, offering a scalable, practice-oriented paradigm for measuring and improving the clinical utility of medical LLMs.

Link: https://arxiv.org/abs/2603.15677
Authors: Eric Wu, Kevin Wu, Jason Hom, Paul H. Yi, Angela Zhang, Alejandro Lozano, Jeff Nirschl, Jeff Tangney, Kevin Byram, Braydon Dymm, Narender Annapureddy, Eric Topol, David Ouyang, James Zou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.
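
Bradley-Terry ratings like those behind the leaderboard can be fit from pairwise preference counts with the classic minorization-maximization (Zermelo) update; the win matrix below is invented, not MedArena data:

```python
def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times model i was preferred over model j.
    Uses the standard minorization-maximization (Zermelo) update:
    p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j), then renormalize.
    """
    p = [1.0] * n_models
    for _ in range(iters):
        new = []
        for i in range(n_models):
            w_i = sum(wins[i])  # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x * n_models / s for x in new]  # keep the scale fixed
    return p

# Invented preference counts among 3 models (illustrative only):
wins = [[0, 8, 9],
        [2, 0, 6],
        [1, 4, 0]]
strengths = bradley_terry(wins, 3)
```

The resulting strengths order the models by how often each would be expected to win a head-to-head comparison, which is what an arena-style leaderboard reports.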

[NLP-89] Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

[Quick Read]: This paper addresses the difficulty language models have in reliably extracting, reasoning over, and using information across long contexts. Existing methods such as Recursive Language Models (RLM) decompose long contexts into recursive sub-calls via programmatic interaction at inference time, but their performance depends heavily on how context-interaction programs are selected, a question not previously studied systematically. The key innovation of the proposed SRLM framework is uncertainty-aware self-reflection: three intrinsic signals (self-consistency, reasoning length, and verbalized confidence) serve as complementary internal uncertainty indicators for evaluating and comparing candidate context-interaction programs. Experiments across diverse benchmarks, context lengths, and backbone models show SRLM consistently outperforms state-of-the-art baselines by up to 22%, and that it matches or surpasses RLM without explicit recursion or self-querying, revealing that recursion itself is not the key driver of performance and that self-reflection more effectively steers reasoning in semantically intensive tasks.

Link: https://arxiv.org/abs/2603.15653
Authors: Keivan Alizadeh, Parshin Shojaee, Minsik Cho, Mehrdad Farajtabar
Affiliations: Apple
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint

Abstract:Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use the information across long contexts. Recent works like Recursive Language Models (RLM) have approached this challenge by agentic way of decomposing long contexts into recursive sub-calls through programmatic interaction at inference. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model’s internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models, show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model’s window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective in tasks with semantically intensive nature, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
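
One of the three uncertainty signals, self-consistency, can be sketched as modal agreement among repeated answers, used here to rank candidate context-interaction programs (the program names and answers below are invented; the paper combines this with reasoning length and verbalized confidence):

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers that agree with the modal answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def pick_program(candidates):
    """Choose the candidate whose repeated runs agree most with each other.

    `candidates` maps a program name to the answers it produced across
    repeated runs; higher agreement is treated as lower uncertainty.
    """
    return max(candidates, key=lambda name: self_consistency(candidates[name]))

runs = {
    "chunk-then-summarize": ["42", "42", "42", "41"],  # mostly consistent
    "grep-like-filter":     ["42", "17", "9", "42"],   # scattered answers
}
best = pick_program(runs)
```

The intuition is that a program whose answers scatter widely is a poor bet, even if one of its answers happens to be right.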

[NLP-90] Alternating Reinforcement Learning with Contextual Rubric Rewards

[Quick Read]: This paper addresses a limitation of rubric-based reinforcement learning for generative AI (e.g., RLHF and RLVR) in reward aggregation: existing approaches linearly compress multi-dimensional reward vectors into a scalar with fixed weights, making them sensitive to artificial score design and unable to capture correlations among reward dimensions. The key to the solution is Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), which optimizes one semantic rubric meta-class at a time, avoiding fixed scalarization, and adds a lightweight search-based adaptation mechanism that dynamically selects the next optimization target so the policy can emphasize critical task objectives, improving both model performance and training efficiency. Theoretical analysis shows that conventional reward aggregation induces a variance-contraction effect, which helps explain the performance gains.

Link: https://arxiv.org/abs/2603.15646
Authors: Guangchen Lan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with expert annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).

[NLP-91] okenization Tradeoffs in Structured EHR Foundation Models

[Quick Read]: This paper addresses the unclear impact of tokenization design on downstream performance and computational efficiency in foundation models for structured electronic health records (EHRs). The key to the solution is a factorial experimental design that systematically evaluates combinations of three tokenization choices: event encoding, time encoding, and workflow annotation. Joint event encoding combined with positional time encoding outperforms the alternatives across 74 clinical prediction tasks (on 73/74 and 71/74 tasks, respectively) while requiring 39.5% and 9.6% fewer pretraining floating-point operations. The key insight is local binding efficiency: combining code-attribute pairs into a single token, rather than splitting them across tokens whose association the model must learn during pretraining, markedly improves both representation quality and training efficiency.

Link: https://arxiv.org/abs/2603.15644
Authors: Lin Lawrence Guo, Santiago Eduardo Arciniegas, Joseph Jihyung Lee, Adam Paul Yan, George Tomlinson, Jason Fries, Lillian Sung
Affiliations: The Hospital for Sick Children, Toronto, Canada
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization – how these timelines are converted into discrete model inputs – determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. Here, we pretrained a transformer on pediatric EHR data under a factorial design, varying tokenization along event encoding, time encoding, and workflow annotation. We evaluated area-under-the-receiver-operating-characteristic curve across 74 clinical prediction tasks. Joint event encoding and positional time encoding outperformed their alternatives (73/74 and 71/74 tasks) while requiring 39.5% and 9.6% fewer pretraining floating-point operations, respectively. Targeted ablations traced the joint encoding advantage to local binding efficiency, that is, code-attribute pairs are combined into single tokens, rather than split across tokens that the model must learn to associate during pretraining. External evaluation on an adult intensive care unit cohort demonstrated that this advantage generalizes despite substantial vocabulary mismatch, while temporal and workflow effects remain institution-specific. These results establish tokenization as a tractable lever for improving both the performance and efficiency of EHR foundation models.
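
The local-binding idea, one token per code-attribute pair versus two tokens the model must learn to associate, can be illustrated with a toy tokenizer (the event codes and delimiter are invented):

```python
def tokenize(events, joint=True):
    """Render (code, attribute) clinical events as a token sequence.

    joint=True binds each code-attribute pair into one token (local
    binding); joint=False splits the pair, leaving the model to learn
    the association during pretraining.
    """
    tokens = []
    for code, attr in events:
        if joint:
            tokens.append(f"{code}|{attr}")
        else:
            tokens.extend([code, attr])
    return tokens

events = [("LAB_HGB", "low"), ("MED_VANC", "started")]
joint_seq = tokenize(events, joint=True)    # 2 tokens
split_seq = tokenize(events, joint=False)   # 4 tokens
```

The joint form halves the sequence length here, which is one intuition for the pretraining-compute savings the paper reports, at the cost of a larger vocabulary.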

[NLP-92] BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

[Quick Read]: This paper addresses the problem that large language models (LLMs) score well on standard benchmarks yet frequently fail on problems requiring commonsense reasoning. The core challenge is to identify and quantify systematic deficits in specific reasoning patterns rather than general language-understanding weakness. The key to the solution is BrainBench, a fine-grained benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense-reasoning failure mode (e.g., implicit physical constraints, semantic scope tricks, and default assumption hijacks). Evaluating eight frontier models under a zero-shot protocol, the study finds the best model (Claude Opus 4.6 with extended thinking) reaches only 80.3% accuracy, with a pronounced accuracy-consistency gap indicating reliance on surface heuristics rather than genuine commonsense reasoning. The benchmark provides an interpretable, reproducible tool for diagnosing LLM reasoning deficits.

Link: https://arxiv.org/abs/2603.14761
Authors: Yuzhe Tang
Affiliations: Georgia Institute of Technology
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints (“Should I walk or drive my rental car to the return lot?”) to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models – four from the Claude family and four from the GPT family – using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.
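
The accuracy-consistency gap can be made concrete with a small helper that scores repeated runs per question (the run data below are invented):

```python
def accuracy_and_consistency(runs_per_question, gold):
    """Score per-question repeated runs.

    Accuracy: fraction of all runs matching the gold answer.
    Consistency: fraction of questions whose runs all agree with each other
    (a model can be consistent yet consistently wrong, and vice versa).
    """
    total = correct = unanimous = 0
    for runs, ans in zip(runs_per_question, gold):
        total += len(runs)
        correct += sum(r == ans for r in runs)
        unanimous += len(set(runs)) == 1
    return correct / total, unanimous / len(gold)

# Two questions, three runs each (invented):
runs = [["walk", "walk", "drive"], ["no", "no", "no"]]
gold = ["walk", "no"]
acc, cons = accuracy_and_consistency(runs, gold)
```

A gap between these two numbers, as the paper reports for its top models, is a signature of stochastic rather than settled reasoning.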

Information Retrieval

[IR-0] IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

[Quick Read]: This paper addresses the inefficiency of cross-document reasoning in multi-hop question answering (multi-hop QA): existing retrieval-augmented generation (RAG) methods typically rely on graph structures or iterative multi-step reasoning, incurring heavy online processing. The key to the solution is moving cross-document reasoning from online inference to offline indexing: by identifying bridge entities shared across documents, the method generates independently retrievable bridging facts, so that at inference time multi-hop reasoning requires only a single retrieval pass and one large language model (LLM) call, with no additional training or fine-tuning.

Link: https://arxiv.org/abs/2603.16415
Authors: Zhenghua Bao, Yi Shi
Affiliations: Continuum AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.
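
The index-time step, finding bridge entities shared across documents before writing bridging facts, can be sketched crudely (treating any capitalized word as an entity is a stand-in for a real extractor; all names below are invented):

```python
from collections import defaultdict

def bridge_entities(docs):
    """Return entities mentioned in more than one document.

    A crude stand-in for real entity extraction: any capitalized word,
    with trailing punctuation stripped, counts as an entity mention.
    """
    seen = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            token = word.strip(".,")
            if token[:1].isupper():
                seen[token].add(doc_id)
    return {e: ids for e, ids in seen.items() if len(ids) > 1}

docs = {
    "d1": "Marie Curie won the Nobel Prize in Physics.",
    "d2": "The Nobel Prize is awarded in Stockholm.",
}
bridges = bridge_entities(docs)
# A bridging fact could then pair the d1 and d2 sentences that share "Nobel",
# stored as its own retrievable unit at index time.
```

At query time a single retrieval over such precomputed units can then surface a multi-document connection in one hop.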

[IR-1] PashtoCorp: A 1.25-Billion-Word Corpus Evaluation Suite and Reproducible Pipeline for Low-Resource Language Development

[Quick Read]: This paper addresses the long-standing resource scarcity of Pashto in natural language processing (NLP), whose core challenge is the lack of a large-scale, high-quality, reproducible corpus. The authors build PashtoCorp, a 1.25-billion-word Pashto corpus assembled from 39 sources (seven HuggingFace datasets and 32 purpose-built web scrapers) and processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. The corpus far exceeds previous Pashto resources (40x the OSCAR subset, 83x the previously largest dedicated corpus) and supports continued pretraining of XLM-R-base, reducing perplexity by 25.1% while significantly improving F1 on WikiANN Pashto named entity recognition (NER, +10% relative) and accuracy on Belebele reading comprehension (64.6% with Gemma-3n). The key to the solution is the systematic integration of heterogeneous multi-source data with a standardized processing pipeline and strict deduplication and quality control, providing high-quality, scalable corpus support and a foundation for downstream gains in a low-resource language.

Link: https://arxiv.org/abs/2603.16354
Authors: Hanif Rahman
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08-6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%-21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at this https URL, this https URL, and this https URL. 
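
The SHA-256 deduplication stage can be sketched in a few lines (the normalization rules here, lowercasing and whitespace collapsing, are illustrative assumptions; the pipeline's exact rules may differ):

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates by hashing normalized text with SHA-256.

    Hashing a normalized form keeps memory bounded to one digest per
    unique document instead of the full texts.
    """
    seen, unique = set(), []
    for text in documents:
        norm = " ".join(text.lower().split())  # collapse case and whitespace
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = ["Pashto corpus line one", "pashto  corpus LINE one", "another document"]
kept = dedupe(docs)  # the first two normalize identically, so one is dropped
```

At corpus scale, storing 32-byte digests rather than documents is what makes exact deduplication over millions of documents cheap.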

[IR-2] ReFORM: Review-aggregated Profile Generation via LLM with Multi-Factor Attention for Restaurant Recommendation

【速读】:该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)的推荐系统中,过度依赖物品标题内部知识而忽视用户决策过程中多种因素影响的问题。现有方法未能充分利用评论中蕴含的多样化决策信息,导致个性化推荐能力受限。其解决方案的关键在于提出 ReFORM 框架——通过 LLM 从用户评论中聚合生成因子特定的用户与物品画像(factor-specific user and item profiles),以捕捉用户对物品的偏好和物品被用户的评价;同时引入多因子注意力机制(Multi-Factor Attention),显式识别并加权每个用户决策中最关键的影响因素,从而增强推荐的可解释性与个性化程度。

链接: https://arxiv.org/abs/2603.16236
作者: Moonsoo Park,Seulbeen Je,Donghyeon Park
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recommender systems, large language models (LLMs) have gained popularity for generating descriptive summarization to improve recommendation robustness, along with Graph Convolution Networks. However, existing LLM-enhanced recommendation studies mainly rely on the internal knowledge of LLMs about item titles while neglecting the importance of various factors influencing users’ decisions. Although information reflecting various decision factors of each user is abundant in reviews, few studies have actively exploited such insights for recommendation. To address these limitations, we propose ReFORM, a Review-aggregated Profile Generation via LLM with Multi-FactOr Attentive RecoMmendation framework. Specifically, we first generate factor-specific user and item profiles from reviews using LLM to capture a user’s preference by items and an item’s evaluation by users. Then, we propose a Multi-Factor Attention to highlight the most influential factors in each user’s decision-making process. In this paper, we conduct experiments on two restaurant datasets of varying scales, demonstrating its robustness and superior performance over state-of-the-art baselines. Furthermore, in-depth analyses validate the effectiveness of the proposed modules and provide insights into the sources of personalization. Our source code and datasets are available at this https URL.
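ReFORM 中的多因子注意力可以粗略理解为:对各决策因子上的用户-物品匹配分数做 softmax 加权。下面是一个仅作示意的最小 Python 草图,其中因子名("taste"、"price"、"service")与分数均为假设数据,并非论文原始实现:

```python
# Illustrative sketch of multi-factor attention: softmax over per-factor
# match scores yields attention weights and a fused preference score.
# Factor names and numbers are invented placeholders.
import math

def factor_attention(factor_scores, temperature=1.0):
    """Softmax over per-factor scores -> (attention weights, fused score)."""
    exps = {f: math.exp(s / temperature) for f, s in factor_scores.items()}
    z = sum(exps.values())
    weights = {f: e / z for f, e in exps.items()}
    blended = sum(weights[f] * factor_scores[f] for f in factor_scores)
    return weights, blended

scores = {"taste": 2.0, "price": 0.5, "service": 1.0}  # hypothetical factors
weights, blended = factor_attention(scores)
print(max(weights, key=weights.get))  # "taste" carries the largest weight
```

注意力权重归一化为 1,因而融合分数是各因子分数的凸组合,这正是"突出最关键因子"这一思想的最简形式。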

[IR-3] MemX: A Local-First Long-Term Memory System for AI Assistants

【速读】:该论文旨在解决AI助手在长期记忆存储与检索中的稳定性与可解释性问题,尤其是在本地优先部署场景下如何实现高效、准确且可复现的记忆访问。其核心挑战在于:如何在不依赖云端服务的前提下,构建一个具备持久化存储能力、支持多模态召回(向量+关键词)、并能抑制虚假匹配的检索系统。解决方案的关键在于设计了一个稳定导向的检索流水线(retrieval pipeline),融合了向量召回、关键词召回、Reciprocal Rank Fusion(RRF)排序、四因子重排序以及低置信度拒绝机制;同时采用FTS5全文索引显著降低关键词搜索延迟(100k记录规模下提升1,100倍),确保端到端响应时间低于90毫秒。这一架构使得MemX在事实级粒度上达到Hit@5=51.6%和MRR=0.380,显著优于会话级性能,并为本地-first AI代理提供了一个结构简洁、可解释性强、边界清晰的基准方案。

链接: https://arxiv.org/abs/2603.16171
作者: Lizheng Sun
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 18 pages, 2 figures, 13 tables

点击查看摘要

Abstract:We present MemX, a local-first long-term memory system for AI assistants with stability-oriented retrieval design. MemX is implemented in Rust on top of libSQL and an OpenAI-compatible embedding API, providing persistent, searchable, and explainable memory for conversational agents. Its retrieval pipeline applies vector recall, keyword recall, Reciprocal Rank Fusion (RRF), four-factor re-ranking, and a low-confidence rejection rule that suppresses spurious recalls when no answer exists in the memory store. We evaluate MemX on two axes. First, two custom Chinese-language benchmark suites (43 queries, ≤1,014 records) validate pipeline design: Hit@1=91.3% on a default scenario and 100% under high confusion, with conservative miss-query suppression. Second, the LongMemEval benchmark (500 queries, up to 220,349 records) quantifies system boundaries across four ability types and three storage granularities. At fact-level granularity the system reaches Hit@5=51.6% and MRR=0.380, doubling session-level performance, while temporal and multi-session reasoning remain challenging (≤43.6% Hit@5). FTS5 full-text indexing reduces keyword search latency by 1,100x at 100k-record scale, keeping end-to-end search under 90 ms. Unlike Mem0 and related work that targets end-to-end agent benchmarks, MemX focuses on a narrower, reproducible baseline: local-first deployment, structural simplicity, explainable retrieval, and stability-oriented design.
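摘要中提到的 Reciprocal Rank Fusion (RRF) 是一个标准的排名融合公式:每个文档的得分为其在各路召回中排名的倒数和 1/(k+rank)。下面是一个自成一体的示意实现(文档 id 为虚构,常数 k=60 为 RRF 文献中的常见取值,并非 MemX 的实际参数):

```python
# Minimal sketch of Reciprocal Rank Fusion, the step MemX uses to merge
# vector recall and keyword recall. Doc ids are invented for illustration.

def rrf_fuse(rankings, k=60):
    """Combine several ranked lists of doc ids into one RRF-scored ordering."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["m3", "m1", "m7"]   # ids from embedding similarity
keyword_hits = ["m1", "m3", "m9"]  # ids from keyword (FTS5-style) search
fused = rrf_fuse([vector_hits, keyword_hits])
print(fused)  # m1 and m3, present in both lists, occupy the top positions
```

同时出现在两路召回中的记忆条目自然获得更高融合得分,这正是 RRF 对混合检索的价值所在。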

[IR-4] Open-Source Reproduction and Explainability Analysis of Corrective Retrieval Augmented Generation

【速读】:该论文旨在解决修正型检索增强生成(Corrective Retrieval Augmented Generation, CRAG)系统因依赖专有组件(如 Google Search API 和 LLaMA-2 模型权重)而导致的可复现性受限问题。其解决方案的关键在于构建一个完全开源的 CRAG 实现:用 Wikipedia API 替代原生网络搜索模块,并以 Phi-3-mini-4k-instruct 代替原始的 LLaMA-2 生成器,从而在保持性能的同时提升系统的透明度与可复现性。此外,作者还首次对 CRAG 中基于 T5 的检索评估器进行了可解释性分析(使用 SHAP 方法),发现其主要依赖命名实体对齐而非语义相似性,同时识别出科学类问题上的领域迁移局限性这一关键失败模式。

链接: https://arxiv.org/abs/2603.16169
作者: Surya Vardhan Yalavarthi
机构: University of Cincinnati (辛辛那提大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open-source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA-2 generator with Phi-3-mini-4k-instruct. We evaluate on PopQA and ARC-Challenge, demonstrating that our open-source pipeline achieves comparable performance to the original system. Furthermore, we contribute the first explainability analysis of CRAG’s T5-based retrieval evaluator using SHAP, revealing that the evaluator primarily relies on named entity alignment rather than semantic similarity. Our analysis identifies key failure modes including domain transfer limitations on science questions. All code and results are available at this https URL.

[IR-5] Answer Bubbles: Information Exposure in AI-Mediated Search

【速读】:该论文旨在解决生成式搜索系统(Generative Search Systems)在信息检索过程中存在的源选择偏差、语言风格差异及摘要与引用来源之间一致性不足的问题。其核心发现表明,尽管这些系统通过AI生成摘要替代传统链接式检索,但它们在引用来源上表现出显著的偏倚,例如过度依赖维基百科和长篇内容,而忽视社交媒体和负面语境下的来源;同时,结合搜索功能会显著削弱不确定性表达(如“可能”、“据推测”等),使摘要语气更自信,从而加剧了信息呈现的失真风险。解决方案的关键在于识别并量化这种“答案气泡”(answer bubbles)现象,揭示不同系统间因源选择与语言处理机制差异导致的信息现实结构分化,为提升AI中介信息获取的透明度与公平性提供实证依据。

链接: https://arxiv.org/abs/2603.16138
作者: Michelle Huang,Agam Goyal,Koustuv Saha,Eshwar Chandrasekharan
机构: Siebel School of Computing and Data Science (Siebel计算机与数据科学学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Preprint: 12 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Generative search systems are increasingly replacing link-based retrieval with AI-generated summaries, yet little is known about how these systems differ in sources, language, and fidelity to cited material. We examine responses to 11,000 real search queries across four systems – vanilla GPT, Search GPT, Google AI Overviews, and traditional Google Search – at three levels: source diversity, linguistic characterization of the generated summary, and source-summary fidelity. We find that generative search systems exhibit significant *source-selection biases* in their citations, favoring certain sources over others. Incorporating search also selectively attenuates epistemic markers, reducing hedging by up to 60% while preserving confidence language in the AI-generated summaries. At the same time, AI summaries further compound the citation biases: Wikipedia and longer sources are disproportionately overrepresented, whereas cited social media content and negatively framed sources are substantially underrepresented. Our findings highlight the potential for *answer bubbles*, in which identical queries yield structurally different information realities across systems, with implications for user trust, source visibility, and the transparency of AI-mediated information access.
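论文所测量的"模糊限制语(hedging)衰减"可以用一个极简的词频指标来示意:统计每百词中不确定性标记词的数量。下列标记词表与示例句子均为本文为说明而虚构,并非论文使用的词表:

```python
# Toy sketch of epistemic-marker measurement: hedging terms per 100 words.
# The marker list and example sentences are illustrative placeholders.
import re

HEDGES = {"may", "might", "possibly", "perhaps", "likely", "reportedly"}

def hedge_rate(text):
    """Return hedging markers per 100 words of the given text."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for w in words if w in HEDGES)
    return 100.0 * hits / max(len(words), 1)

vanilla_answer = "The drug may help, and side effects are possibly mild."
search_answer = "The drug helps, and side effects are mild."
print(hedge_rate(vanilla_answer), hedge_rate(search_answer))  # 20.0 0.0
```

实际研究中的语言学刻画远比词表匹配精细,此处仅展示"接入搜索后摘要语气更自信"这一现象可如何被量化比较。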

[IR-6] RecBundle: A Next-Generation Geometric Paradigm for Explainable Recommender Systems

【速读】:该论文旨在解决推荐系统中因长期局部交互累积导致的宏观结构退化问题(如信息茧房),以及现有表示学习范式因假设单一平坦空间而无法区分系统性偏差来源的理论瓶颈。其解决方案的关键在于引入现代微分几何中的纤维丛(Fiber Bundle)理论,将系统空间解耦为两个层次:由用户交互网络构成的底流形(base manifold)和附着于每个用户节点上的携带动态偏好的纤维(fiber)。在此基础上,提出RecBundle框架,通过底流形上的几何联络与平行移动形式化用户协作,同时将内容演化映射为纤维上的全纯变换(holonomy transformation),从而实现对推荐系统中结构性偏差的机制性识别与建模。

链接: https://arxiv.org/abs/2603.16088
作者: Hui Wang,Tianzhu Hu,Mingming Li,Xi Zhou,Chun Gan,Jiao Dai,Jizhong Han,Songlin Hu,Tao Guo
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); School of Cyberspace Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院); JD.com(京东)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recommender systems are inherently dynamic feedback loops where prolonged local interactions accumulate into macroscopic structural degradation such as information cocoons. Existing representation learning paradigms are universally constrained by the assumption of a single flat space, forcing topologically grounded user associations and semantically driven historical interactions to be fitted within the same vector space. This excessive coupling of heterogeneous information renders it impossible for researchers to mechanistically distinguish and identify the sources of systemic bias. To overcome this theoretical bottleneck, we introduce Fiber Bundle from modern differential geometry and propose a novel geometric analysis paradigm for recommender systems. This theory naturally decouples the system space into two hierarchical layers: the base manifold formed by user interaction networks, and the fibers attached to individual user nodes that carry their dynamic preferences. Building upon this, we construct RecBundle, a framework oriented toward next-generation recommender systems that formalizes user collaboration as geometric connection and parallel transport on the base manifold, while mapping content evolution to holonomy transformations on fibers. From this foundation, we identify future application directions encompassing quantitative mechanisms for information cocoons and evolutionary bias, geometric meta-theory for adaptive recommendation, and novel inference architectures integrating large language models (LLMs). Empirical analysis on real-world MovieLens and Amazon Beauty datasets validates the effectiveness of this geometric framework.

[IR-7] Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对时间性事实冲突(temporal fact conflicts)时表现不一致的问题,尤其是不同研究对引入外部上下文是否能有效修正模型中过时或错误的时间相关知识得出相反结论的现象。其关键解决方案在于通过可复现性实验设计,系统性地对比两个代表性基准(DYNAMICQA 和 MULAN),并采用标准化数据集和统一评估框架进行交叉验证;同时,利用生成式 AI(Generative AI)合成更自然的上下文以提升实验的真实性,并进一步考察模型规模对时间知识编码与更新的影响,从而揭示数据集设计、评估指标和模型大小如何共同塑造 LLM 在时间性知识冲突下的行为模式。

链接: https://arxiv.org/abs/2603.15892
作者: Ritajit Dey,Iadh Ounis,Graham McDonald,Yashar Moshfeghi
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model’s output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce experiments from both benchmarks. We then reproduce the experiments of each study on the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each study. Importantly, using an LLM, we synthetically generate realistic natural language contexts to replace MULAN’s programmatically constructed statements when reproducing the findings of DYNAMICQA. Our analysis reveals strong dataset dependence: MULAN’s findings generalise under both methodological frameworks, whereas applying MULAN’s evaluation to DYNAMICQA yields mixed outcomes. Finally, while the original studies only considered 7B LLMs, we reproduce these experiments across LLMs of varying sizes, revealing how model size influences the encoding and updating of temporal facts. Our results highlight how dataset design, evaluation metrics, and model size shape LLM behaviour in the presence of temporal knowledge conflicts.

[IR-8] MiroThinker-1.7 H1: Towards Heavy-Duty Research Agents via Verification

【速读】:该论文旨在解决复杂长时程推理任务中研究代理(research agent)的可靠性与多步问题求解能力不足的问题。解决方案的关键在于提出两个层级的改进:首先,MiroThinker-1.7通过引入一个代理式的中期训练阶段,强化结构化规划、上下文推理和工具交互能力,从而提升每一步交互的可靠性,支持更有效的多步协作与持续推理;其次,MiroThinker-H1进一步将验证机制嵌入推理过程,在局部和全局层面实现中间决策的评估与修正,并对整体推理轨迹进行审计,确保最终答案由连贯的证据链支撑,从而显著提升深度研究任务中的性能表现。

链接: https://arxiv.org/abs/2603.15726
作者: MiroMind Team:S. Bai,L. Bing,L. Lei,R. Li,X. Li,X. Lin,E. Min,L. Su,B. Wang,L. Wang,L. Wang,S. Wang,X. Wang,Y. Zhang,Z. Zhang,G. Chen,L. Chen,Z. Cheng,Y. Deng,Z. Huang,D. Ng,J. Ni,Q. Ren,X. Tang,B.L. Wang,H. Wang,N. Wang,C. Wei,Q. Wu,J. Xia,Y. Xiao,H. Xu,X. Xu,C. Xue,Z. Yang,Z. Yang,F. Ye,H. Ye,J. Yu,C. Zhang,W. Zhang,H. Zhao,P. Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 23 pages

点击查看摘要

Abstract:We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.

[IR-9] Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences

【速读】:该论文旨在解决工业金融系统中事件序列(如交易、用户行为和系统日志)建模时存在的“嵌入表示与手工特征管道脱节”问题,即当前生产系统仍依赖可解释性强、鲁棒性高且满足低延迟要求的手工统计特征,而基于预训练嵌入的方法难以有效融合这些优势。解决方案的关键在于提出统一框架Embedding-Aware Feature Discovery (EAFD),其核心机制是通过一个自我反思的大语言模型(LLM)驱动的特征生成代理,直接从原始事件序列中迭代发现、评估并优化新特征;该过程同时考虑两个互补标准:对齐性(alignment),用于捕捉嵌入已编码的信息;互补性(complementarity),用于识别嵌入中缺失的预测信号,从而实现嵌入与特征的协同增强,在多个开源和工业交易基准上显著优于纯嵌入或纯特征基线方法,达到新的性能上限。

链接: https://arxiv.org/abs/2603.15713
作者: Artem Sakhno,Ivan Sergeev,Alexey Shestov,Omar Zoloev,Elizaveta Kovtun,Gleb Gusev,Andrey Savchenko,Maksim Makarenko
机构: Sber AI Lab; ISP RAS Research Center for Trusted AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Industrial financial systems operate on temporal event sequences such as transactions, user actions, and system logs. While recent research emphasizes representation learning and large language models, production systems continue to rely heavily on handcrafted statistical features due to their interpretability, robustness under limited supervision, and strict latency constraints. This creates a persistent disconnect between learned embeddings and feature-based pipelines. We introduce Embedding-Aware Feature Discovery (EAFD), a unified framework that bridges this gap by coupling pretrained event-sequence embeddings with a self-reflective LLM-driven feature generation agent. EAFD iteratively discovers, evaluates, and refines features directly from raw event sequences using two complementary criteria: *alignment*, which explains information already encoded in embeddings, and *complementarity*, which identifies predictive signals missing from them. Across both open-source and industrial transaction benchmarks, EAFD consistently outperforms embedding-only and feature-based baselines, achieving relative gains of up to +5.8% over state-of-the-art pretrained embeddings, resulting in new state-of-the-art performance across event-sequence datasets.
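"对齐性"这一标准可以用一个简化代理来示意:候选特征与嵌入投影信号之间的皮尔逊相关系数绝对值越高,说明该特征携带的信息嵌入已基本编码;相关性低的特征才可能是互补信号。以下数值均为合成示例,与论文数据无关:

```python
# Back-of-envelope proxy for EAFD's "alignment" criterion: |Pearson r|
# between a candidate feature and a 1-D embedding-derived signal.
# All numbers below are synthetic.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

emb_proj = [0.1, 0.4, 0.35, 0.8, 0.9]   # embedding-derived signal
feat_a = [1.0, 2.1, 2.0, 3.9, 4.4]      # tracks the embedding: aligned
feat_b = [5.0, 1.0, 4.0, 2.0, 3.0]      # weakly related: complement candidate

alignment_a = abs(pearson(emb_proj, feat_a))
alignment_b = abs(pearson(emb_proj, feat_b))
print(alignment_a > alignment_b)  # True: feat_a explains what embeddings know
```

论文中的判定由 LLM 代理迭代完成,远非一个相关系数,但两条标准的取向与此一致:高对齐解释嵌入,低对齐、高增益才值得保留为互补特征。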

[IR-10] Knowledge Graph Extraction from Biomedical Literature for Alkaptonuria Rare Disease

【速读】:该论文旨在解决尿黑酸症(Alkaptonuria, AKU)这一超罕见代谢疾病在现有生物医学知识图谱(Knowledge Graphs, KGs)中严重缺失或代表性不足的问题,从而限制了对其病理机制、并发症及潜在治疗靶点的系统性挖掘。解决方案的关键在于采用基于PubTator3的文本挖掘方法,从海量文献中大规模提取生物医学关系,构建两个不同规模的知识图谱,并通过已有生化知识进行验证,最终实现对AKU相关基因、疾病和疗法的自动识别与系统关联分析,有效揭示该疾病的多系统交互网络及其潜在治疗策略。

链接: https://arxiv.org/abs/2603.15711
作者: Giang Pham,Rebecca Finetti,Caterina Graziani,Bianca Roncaglia,Asma Bendjeddou,Linda Brodo,Sara Brunetti,Moreno Falaschi,Stefano Forti,Silvia Giulia Galfré,Paolo Milazzo,Corrado Priami,Annalisa Santucci,Ottavia Spiga,Alina Sîrbu
机构: University of Pisa (比萨大学); University of Siena (锡耶纳大学); University of Sassari (萨萨里大学); University of Bologna (博洛尼亚大学)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Alkaptonuria (AKU) is an ultra-rare autosomal recessive metabolic disorder caused by mutations in the HGD (Homogentisate 1,2-Dioxygenase) gene, leading to a pathological accumulation of homogentisic acid (HGA) in body fluids and tissues. This leads to systemic manifestations, including premature spondyloarthropathy, renal and prostatic stones, and cardiovascular complications. Being ultra-rare, the amount of data related to the disease is limited, both in terms of clinical data and literature. Knowledge graphs (KGs) can help connect the limited knowledge about the disease (basic mechanisms, manifestations and existing therapies) with other knowledge; however, AKU is frequently underrepresented or entirely absent in existing biomedical KGs. In this work, we apply a text-mining methodology based on PubTator3 for large-scale extraction of biomedical relations. We construct two KGs of different sizes, validate them using existing biochemical knowledge and use them to extract genes, diseases and therapies possibly related to AKU. This computational framework reveals the systemic interactions of the disease, its comorbidities, and potential therapeutic targets, demonstrating the efficacy of our approach in analyzing rare metabolic disorders.
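从文本挖掘得到的关系三元组组装成可查询的知识图谱,最简形式只需一个邻接表。下面的三元组为示意占位(并非 PubTator3 的真实输出格式),展示"实体-关系-实体"如何支撑一跳邻域查询:

```python
# Minimal sketch of assembling a KG from mined relation triples and running
# a one-hop neighborhood query. Triples are illustrative placeholders.
from collections import defaultdict

triples = [
    ("HGD", "associated_with", "Alkaptonuria"),
    ("Alkaptonuria", "manifests_as", "ochronosis"),
    ("Alkaptonuria", "treated_with", "nitisinone"),
]

graph = defaultdict(list)            # entity -> [(relation, neighbor)]
for head, rel, tail in triples:
    graph[head].append((rel, tail))
    graph[tail].append((rel, head))  # store the inverse link for traversal

# Everything directly linked to the disease node:
neighbors = {t for _, t in graph["Alkaptonuria"]}
print(sorted(neighbors))  # ['HGD', 'nitisinone', 'ochronosis']
```

真实流水线还需实体归一化(同义词合并)与置信度过滤,但图谱查询的骨架即是如此。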

[IR-11] Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents ICLR2026

【速读】:该论文旨在解决记忆增强型智能体(memory-augmented agents)在多存储器(multi-store)系统中普遍存在的检索效率低下问题:现有系统通常对每个查询都从所有存储器中进行检索,导致计算成本高且引入无关上下文。其解决方案的关键在于将记忆检索建模为一个存储器路由(store-routing)问题,通过引入选择性检索机制,仅从相关存储器中获取信息。实验表明,基于oracle的路由策略在下游问答任务中不仅显著提升了准确率,还大幅减少了使用的上下文token数量,证明了选择性检索在提升性能与效率上的双重优势。该研究进一步将存储器选择形式化为一个代价敏感决策问题(cost-sensitive decision problem),在答案准确率与检索成本之间建立权衡关系,为可扩展的多存储器系统设计提供了理论基础和学习路由机制的新方向。

链接: https://arxiv.org/abs/2603.15658
作者: Madhava Gaikwad
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: accepted in ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems

点击查看摘要

Abstract:Memory-augmented agents maintain multiple specialized stores, yet most systems retrieve from all stores for every query, increasing cost and introducing irrelevant context. We formulate memory retrieval as a store-routing problem and evaluate it using coverage, exact match, and token efficiency metrics. On downstream question answering, an oracle router achieves higher accuracy while using substantially fewer context tokens compared to uniform retrieval, demonstrating that selective retrieval improves both efficiency and performance. Our results show that routing decisions are a first-class component of memory-augmented agent design and motivate learned routing mechanisms for scalable multi-store systems. We additionally formalize store selection as a cost-sensitive decision problem that trades answer accuracy against retrieval cost, providing a principled interpretation of routing policies.
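摘要将存储器选择形式化为"准确率收益减去检索成本"的代价敏感决策。一个最小草图如下,其中存储器名称、准确率估计与 λ 权重均为本文虚构的示例参数:

```python
# Sketch of the cost-sensitive routing objective: pick the memory store
# maximizing expected accuracy minus lambda * token cost.
# Store names, probabilities, and lambda are illustrative.

def route(stores, lam=0.001):
    """Return the store with the highest accuracy-minus-cost utility."""
    def utility(s):
        return s["p_correct"] - lam * s["tokens"]
    return max(stores, key=utility)

stores = [
    {"name": "episodic",   "p_correct": 0.55, "tokens": 800},
    {"name": "semantic",   "p_correct": 0.70, "tokens": 400},
    {"name": "procedural", "p_correct": 0.30, "tokens": 200},
]
print(route(stores)["name"])  # "semantic": high accuracy at moderate cost
```

调大 λ(例如 0.01)后,token 成本主导效用,路由会转向最便宜的存储器,这正是"准确率-成本权衡"的含义。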

[IR-12] NextMem: Towards Latent Factual Memory for LLM-based Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)代理系统中事实性记忆(factual memory)构建的挑战,现有方法存在文本编码导致上下文与索引负担过重,以及参数化存储易引发灾难性遗忘和高计算成本等问题。其解决方案的关键在于提出NextMem框架,利用自回归自动编码器(autoregressive autoencoder)高效构建潜在空间中的事实性记忆,并通过两阶段训练策略——自回归重建对齐与渐进潜在替换——实现稳定优化;同时引入量化技术降低存储开销,从而在检索准确性、鲁棒性和可扩展性方面显著优于现有方法。

链接: https://arxiv.org/abs/2603.15634
作者: Zeyu Zhang,Rui Li,Xiaoyan Zhao,Yang Zhang,Wenjie Wang,Xu Chen,Tat-Seng Chua
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 17 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Memory is critical for LLM-based agents to preserve past observations for future decision-making, where factual memory serves as its foundational part. However, existing approaches to constructing factual memory face several limitations. Textual methods impose heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. To address these challenges, we introduce NextMem, a latent factual memory framework that utilizes an autoregressive autoencoder to efficiently construct latent memory while ensuring accurate reconstruction. For better optimization, we propose a two-stage training process, including autoregressive reconstruction alignment and progressive latent substitution. We also incorporate quantization to reduce storage overhead. Extensive experiments demonstrate that NextMem achieves superior performance, and excels in retrieval, robustness, and extensibility properties. We release our code and model checkpoints at this https URL.

[IR-13] Finder: A Multimodal AI-Powered Search Framework for Pharmaceutical Data Retrieval

【速读】:该论文旨在解决制药领域中传统搜索系统在处理多模态内容(如文本、图像、音频和视频)以及依赖人工标注时效率低下、精度不足的问题。其解决方案的关键在于构建一个可扩展的生成式 AI (Generative AI) 驱动框架 Finder,通过混合向量搜索(hybrid vector search)融合稀疏词法模型与密集语义模型,实现跨模态内容的统一检索;同时采用模块化处理流程对多种格式数据进行预处理、元数据增强并存储于向量原生后端,支持基于推理感知的自然语言查询,显著提升搜索的准确性和上下文相关性。

链接: https://arxiv.org/abs/2603.15623
作者: Suyash Mishra,Srikanth Patil,Satyanarayan Pati,Sagar Sahu,Baddu Narendra
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI is transforming pharmaceutical search, where traditional systems struggle with multimodal content and manual curation. Finder is a scalable AI-powered framework that unifies retrieval across text, images, audio, and video using hybrid vector search, combining sparse lexical and dense semantic models. Its modular pipeline ingests diverse formats, enriches metadata, and stores content in a vector-native backend. Finder supports reasoning-aware natural language search, improving precision and contextual relevance. The system has processed over 291,400 documents, 31,070 videos, and 1,192 audio files in 98 languages. Techniques like hybrid fusion, chunking, and metadata-aware routing enable intelligent access across regulatory, research, and commercial domains.
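稀疏词法与密集语义得分量纲不同,混合检索的常见做法是先各自 min-max 归一化再凸组合。下面是一个示意草图(BM25 风格分数、余弦相似度与 α 权重均为虚构,非 Finder 的实际融合方式):

```python
# Hedged sketch of sparse+dense hybrid score fusion: normalize each scorer,
# then blend with weight alpha. All scores below are illustrative.

def normalize(scores):
    """Min-max normalize a {doc: score} mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid(sparse, dense, alpha=0.5):
    """Rank the union of docs by a convex blend of normalized scores."""
    s, d = normalize(sparse), normalize(dense)
    docs = set(s) | set(d)
    key = lambda x: alpha * s.get(x, 0.0) + (1 - alpha) * d.get(x, 0.0)
    return sorted(docs, key=key, reverse=True)

sparse = {"doc1": 12.0, "doc2": 3.0, "doc3": 7.0}    # BM25-style scores
dense = {"doc1": 0.62, "doc2": 0.80, "doc3": 0.55}   # cosine similarities
print(hybrid(sparse, dense)[0])  # doc1: strong lexical + decent semantic
```

α=0 时退化为纯语义排序(doc2 居首),α=1 时为纯词法排序;调节 α 即调节两路信号的话语权。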

人机交互

[HC-0] Real-Time Decoding of Movement Onset and Offset for Brain-Controlled Rehabilitation Exoskeleton ICRA2026

【速读】:该论文旨在解决当前机器人辅助康复治疗中神经可塑性靶向不足的问题,即现有系统主要作用于肢体层面,对受损神经回路的直接调控有限,难以实现基于患者意图的精准、适时干预。其解决方案的关键在于实现了基于非侵入式脑电图(EEG)的双状态运动想象控制上肢外骨骼,使用户能够通过脑电信号自主启动和终止机器人辅助的达目标向动作,从而将康复训练与患者的神经意图更紧密耦合。研究进一步提出了一种类无关的固定点重校准方法,有效抑制了任务驱动重校准带来的系统性偏差,显著提升了分类分离度(AUC提升:起始+56%,终止+34%),并维持跨日稳定性,为临床转化提供了可靠的技术基础。

链接: https://arxiv.org/abs/2603.16825
作者: Kanishka Mitra,Satyam Kumar,Frigyes Samuel Racz,Deland Liu,Ashish D. Deshpande,José del R. Millán
机构: Massachusetts Institute of Technology (麻省理工学院); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Meta Reality Labs Research (Meta Reality Labs Research)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to ICRA 2026. 8 pages, 5 figures. Project page available at this https URL

点击查看摘要

Abstract:Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level, engaging the impaired neural circuits only indirectly, which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from non-invasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start-stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p = 0.0117; offset +34%, p = 0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start-stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.

[HC-1] Beyond Cybathlon: On-demand Quadrupedal Assistance for People with Limited Mobility

【速读】:该论文旨在解决轮椅使用者在日常生活中因移动能力受限而难以独立完成移动操作任务的问题,尤其是传统附加于轮椅的机械臂方案因体积大、灵活性差导致操纵不便。其解决方案的关键在于提出一种基于共享自主(shared autonomy)的四足辅助机器人系统,该系统结合半自主任务执行与人类远程操控:一方面通过语义感知和避障导航实现环境中的自主移动与拾取-放置任务自动化;另一方面采用口控操纵杆接口支持残障用户对末端执行器进行高精度操作,从而在不牺牲轮椅机动性的前提下提升用户的独立性与任务效率。

链接: https://arxiv.org/abs/2603.16772
作者: Carmen Scheidemann,Andrei Cramariuc,Changan Chen,Jia-Ruei Chiu,Marco Hutter
机构: ETH Zurich (苏黎世联邦理工学院); Hexagon (海克斯康)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Background: Assistance robots have the potential to increase the independence of people who need daily care due to limited mobility or being wheelchair-bound. Current solutions of attaching robotic arms to motorized wheelchairs offer limited additional mobility at the cost of increased size and reduced wheelchair maneuverability. Methods: We present an on-demand quadrupedal assistance robot system controlled via a shared autonomy approach, which combines semi-autonomous task execution with human teleoperation. Due to the mobile nature of the system it can assist the operator whenever needed and perform autonomous tasks independently, without otherwise restricting their mobility. We automate pick-and-place tasks, as well as robot movement through the environment with semantic, collision-aware navigation. For teleoperation, we present a mouth-level joystick interface that enables an operator with reduced mobility to control the robot’s end effector for precision manipulation. Results: We showcase our system in the *Cybathlon 2024 Assistance Robot Race*, and validate it in an at-home experimental setup, where we measure task completion times and user satisfaction. We find our system capable of assisting in a broad variety of tasks, including those that require dexterous manipulation. The user study confirms the intuition that increased robot autonomy alleviates the operator’s mental load. Conclusions: We present a flexible system that has the potential to help people in wheelchairs maintain independence in everyday life by enabling them to solve mobile manipulation problems without external support. We achieve results comparable to previous state-of-the-art on subjective metrics while allowing for more autonomy of the operator and greater agility for manipulation.

[HC-2] Thermopneumatic Pixels for Fast Localized Low-Voltage Touch Feedback

【速读】:该论文旨在解决可穿戴及表面类触觉系统中 tactile feedback(触觉反馈)集成困难的问题,特别是传统方案在制造复杂度、驱动电压要求和空间分布能力方面的局限性。解决方案的关键在于提出了一种基于低电压(~10 V)热气动像素(thermopneumatic pixels, TPPs)的新型触觉执行器架构:通过将微小密封腔内的电脉冲转化为瞬态压力变化,实现垂直方向上的力输出与位移(峰值力>1 N,位移达毫米级),同时具备材料成本低、层叠式组装简单、易于规模化制造和并行驱动等优势,从而为实验与消费界面提供高性价比且高性能的触觉反馈嵌入路径。

链接: https://arxiv.org/abs/2603.16750
作者: Max Linnander,Yon Visell
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We present thermopneumatic pixels (TPPs), which are tactile actuators designed for rapid fabrication and straightforward integration into compact wearable and surface-based haptic systems. Each TPP converts low-voltage (~10 V) electrical pulses into transient pressure increases within a sealed cavity, producing out-of-plane forces and displacements suitable for tactile stimulation. The architecture enables scalable fabrication and spatially distributed actuation while maintaining simple electrical interfacing. The TPPs are constructed from inexpensive, readily available materials using straightforward layer-based assembly, facilitating rapid prototyping and integration into interactive devices. Mechanical characterization demonstrates peak forces exceeding 1 N and millimeter displacements. We further present driving electronics for operating multiple TPP modules concurrently and report perceptual study results demonstrating the effectiveness of the resulting tactile feedback. Together, these results establish low-voltage thermopneumatic actuation as an accessible and high-performance approach for embedding tactile feedback into experimental and consumer-facing interfaces.

[HC-3] SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding

【速读】:该论文旨在解决脑电图(EEG)信号中神经活动解码的挑战,尤其是现有自监督预训练框架因仅对原始信号进行独立的时间和频域掩码,导致模型偏向高频振荡而忽视低频节律模式的问题。其解决方案的关键在于提出一种新颖的高斯平滑掩码策略,应用于短时傅里叶变换(STFT)谱图,并通过联合施加时间、频率及时间-频率维度的高斯掩码,显著提升重建任务难度,迫使模型学习跨高低频域的复杂神经特征;同时设计了SpecHi-Net U型分层架构以有效恢复被掩码信号,并采用基于谱门控机制的专家混合(SpecMoE)框架加速大规模预训练,实现跨物种与跨受试者的强泛化能力。

链接: https://arxiv.org/abs/2603.16739
作者: D. Darankoum,C. Habermacher,J. Volle,S. Grudinin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 34 pages (12 pages Main text and 22 pages Supplementary Information)

点击查看摘要

Abstract:Decoding the orchestration of neural activity in electroencephalography (EEG) signals is a central challenge in bridging neuroscience with artificial intelligence. Foundation models have made strides in generalized EEG decoding, yet many existing frameworks rely primarily on separate temporal and spectral masking of raw signals during self-supervised pretraining. Such strategies tend to bias learning toward high-frequency oscillations, as low-frequency rhythmic patterns can be easily inferred from the unmasked signal. We introduce a foundation model that utilizes a novel Gaussian-smoothed masking scheme applied to short-time Fourier transform (STFT) maps. By jointly applying time, frequency, and time-frequency Gaussian masks, we make the reconstruction task much more challenging, forcing the model to learn intricate neural patterns across both high- and low-frequency domains. To effectively recover signals under this aggressive masking strategy, we design SpecHi-Net, a U-shaped hierarchical architecture with multiple encoding and decoding stages. To accelerate large-scale pretraining, we partition the data into three subsets, each used to train an independent expert model. We then combine these models through SpecMoE, a mixture of experts framework guided by a learned spectral gating mechanism. SpecMoE achieves state-of-the-art performance across a diverse set of EEG decoding tasks, including sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Importantly, the model demonstrates strong cross-species and cross-subject generalization, maintaining high accuracy on both human and murine EEG datasets.
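"高斯平滑掩码"与硬性二值掩码的区别在于边界平滑衰减。下面用 NumPy 对一张玩具 STFT 谱图构造一个时间维高斯掩码作为示意(谱图尺寸、掩码中心与宽度均为假设参数,非论文设置):

```python
# Sketch of a Gaussian-smoothed time mask on an STFT map: instead of a hard
# binary cut, suppression falls off smoothly around the masked center.
# Spectrogram size, center, and width are illustrative assumptions.
import numpy as np

def gaussian_time_mask(n_freq, n_time, center, width):
    """(n_freq, n_time) soft mask: 0 at the masked center, rising to 1 away."""
    t = np.arange(n_time)
    attenuation = np.exp(-0.5 * ((t - center) / width) ** 2)  # 1 at center
    return np.tile(1.0 - attenuation, (n_freq, 1))

spec = np.random.randn(64, 128)            # toy STFT magnitude map
mask = gaussian_time_mask(64, 128, center=40, width=6)
masked = spec * mask                       # the model must reconstruct the gap
```

频率维与时间-频率联合掩码可用同样方式沿另一轴或两轴构造;联合施加后,被掩区域的高低频信息同时丢失,重建任务因此变难。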

[HC-4] Whose Knowledge Counts? Co-Designing Community-Centered AI Auditing Tools with Educators in Hawai`i

【速读】:该论文旨在解决生成式 AI(Generative AI)在教育场景中可能引发的文化误表征与偏见问题,尤其是在资源匮乏语言和原住民教育环境中,AI系统常因缺乏文化适配性而表现不佳。其解决方案的关键在于通过与22名公立学校教师合作开展四场共同设计工作坊,提炼出基于夏威夷特定文化价值观和实践的AI审计工具设计原则,例如追溯知识来源的谱系(genealogy of knowledge),并提出将AI审计视为一种社区导向的过程,而非孤立个体的任务,从而增强AI工具在文化敏感场景下的可靠性与伦理合规性。

链接: https://arxiv.org/abs/2603.16646
作者: Dora Zhao,Hannah Cha,Michael J. Ryan,Angelina Wang,Rachel Baker-Ramos,Evyn-Bree Helekahi-Kaiwi,Rebecca Diego,Josiah Hester,Diyi Yang
机构: Stanford University (斯坦福大学); Cornell Tech (康奈尔科技); Georgia Institute of Technology (佐治亚理工学院); Ulu Lāhui Foundation (乌鲁拉胡基金会)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: CHI 2026

点击查看摘要

Abstract:Although generative AI is being deployed into classrooms with promises of aiding teachers, educators caution that these tools can have unintended pedagogical repercussions, including cultural misrepresentation and bias. These concerns are heightened in low-resource language and Indigenous education settings, where AI systems frequently underperform. We investigate these challenges in Hawai`i, where public schools operate under a statewide mandate to integrate Hawaiian language and culture into education. Through four co-design workshops with 22 public school educators, we surfaced concerns about using generative AI in educational settings, particularly around cultural misrepresentation, and corresponding designs for auditing tools that address these issues. We find that educators envision tools grounded in specific Hawaiian cultural values and practices, such as tracing the genealogy of knowledge in source materials. Building on these insights, we conceptualize AI auditing as a community-oriented process rather than the work of isolated individuals, and discuss implications for designing auditing tools.

[HC-5] Why We Need to Destroy the Illusion of Speaking to A Human: Critical Reflections On Ethics at the Front-End for LLMs

【速读】:该论文旨在解决生成式 AI(Generative AI)在对话交互中因模拟人类对话而引发的伦理问题,特别是用户可能因此产生对AI能力的误解。其核心解决方案在于明确区分人机对话与人际交流之间的本质差异,并基于此提出面向前端设计的伦理改进路径,以减少误导性认知并提升AI交互的透明度与责任性。

链接: https://arxiv.org/abs/2603.16633
作者: Sarah Diefenbach,Daniel Ullrich
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: CHI 2026 Conference on Human-Computer Interaction

点击查看摘要

Abstract:Conversation with chatbots based on Large Language Models (LLMs) such as ChatGPT has become one of the major forms of interaction with Artificial Intelligence (AI) in everyday life. What makes this interaction so convenient is that interacting with LLMs feels so natural, and resembles what we know from real, human conversations. At the same time, this seeming similarity is part of one of the ethical challenges of AI design, since it activates many misleading ideas about AI. We discuss similarities and differences between human-AI-conversations and interpersonal conversation and highlight starting points for more ethical design of AI at the front-end.

[HC-6] CoEmpaTeam: Enhancing Cognitive Empathy using LLM-based Avatars and Dynamic Role Play in Virtual Reality

【速读】:该论文试图解决在以绩效为导向的社会中,认知共情(cognitive empathy)能力下降的问题,而这一能力的培养因属于难以察觉的软技能(soft skill)而面临挑战。解决方案的关键在于开发了一个基于虚拟现实(VR)的训练系统 CoEmpaTeam,该系统利用大语言模型(LLM)驱动的、具有显著个性差异的三个虚拟角色(avatar),通过动态角色扮演使用户主动进行视角转换,从而在沉浸式环境中训练认知共情能力。实证研究表明,该系统能显著提升用户的认知共情水平,并且这种提升被参与者报告为可迁移至现实生活场景。

链接: https://arxiv.org/abs/2603.16614
作者: Dehui Kong,Martin Feick,Shi Liu,Alexander Maedche
机构: Karlsruhe Institute of Technology (KIT)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to appear in the Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI 2026)

点击查看摘要

Abstract:Cognitive empathy, the ability to understand others’ perspectives, is essential for effective communication, reducing biases, and constructive negotiation. However, this skill is declining in a performance-driven society, which prioritizes efficiency over perspective-taking. Here, the training of cognitive empathy is challenging because it is a subtle, hard-to-perceive soft skill. To address this, we developed CoEmpaTeam, a VR-based system that enables users to train their cognitive empathy by using LLM-driven avatars with different personalities. Through dynamic role play, users actively engage in perspective-taking, experiencing situations through another person’s eyes. CoEmpaTeam deploys three avatars who significantly differ in their personality, validated by a technical evaluation and an online experiment (n=90). Next, we evaluated the system through a lab experiment with 32 participants who performed three sessions across two weeks, followed by a one-week diary study. Our results showed a significant increase in cognitive empathy, which, according to participants, transferred into their real lives.

[HC-7] Designing for Disagreement: Front-End Guardrails for Assistance Allocation in LLM-Enabled Robots

【速读】:该论文旨在解决在社交场景中,由大语言模型(Large Language Model, LLM)驱动的机器人在资源稀缺时进行辅助优先级分配所面临的双重挑战:一是社会价值多元性导致不同用户对“谁应优先获助”存在合理分歧;二是LLM行为在提示词、上下文和群体差异下表现出不可预测的变异性,而当前面向用户的实时多用户辅助分配规则缺乏明确规范。解决方案的关键在于提出一种“有限校准与可争议性”(bounded calibration with contestability)的程序性前端模式,其核心包括三点:(i)将优先级策略限制在治理机构批准的可接受选项范围内;(ii)在交互关键节点以用户可理解的方式保持当前模式的透明性;(iii)提供针对具体决策结果的申诉路径,而不重新协商全局规则。该设计将价值多元性和LLM不确定性视为持续存在的前提条件,从而避免隐性偏见的沉默默认和高压情境下用户配置“价值观设置”带来的负担转移问题。

链接: https://arxiv.org/abs/2603.16537
作者: Carmen Ng
机构: Technical University of Munich (慕尼黑工业大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End

点击查看摘要

Abstract:LLM-enabled robots prioritizing scarce assistance in social settings face pluralistic values and LLM behavioral variability: reasonable people can disagree about who is helped first, while LLM-mediated interaction policies vary across prompts, contexts, and groups in ways that are difficult to anticipate or verify at contact point. Yet user-facing guardrails for real-time, multi-user assistance allocation remain under-specified. We propose bounded calibration with contestability, a procedural front-end pattern that (i) constrains prioritization to a governance-approved menu of admissible modes, (ii) keeps the active mode legible in interaction-relevant terms at the point of deferral, and (iii) provides an outcome-specific contest pathway without renegotiating the global rule. Treating pluralism and LLM uncertainty as standing conditions, the pattern avoids both silent defaults that hide implicit value skews and wide-open user-configurable “value settings” that shift burden under time pressure. We illustrate the pattern with a public-concourse robot vignette and outline an evaluation agenda centered on legibility, procedural legitimacy, and actionability, including risks of automation bias and uneven usability of contest channels.

[HC-8] Follow the Clues Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

【速读】:该论文旨在解决开放词汇多模态情感识别(Open-Vocabulary Multimodal Emotion Recognition, OV-MER)中因模态线索模糊性导致的情感推理难题,尤其是当不同模态间存在冲突或情境动态未被观测时,现有方法易受主导数据先验影响,忽略互补的情感线索。其解决方案的关键在于提出HyDRA——一种混合证据演绎推理架构(Hybrid-evidential Deductive Reasoning Architecture),将推理过程形式化为“提议-验证-决策”协议,并通过分层奖励塑造的强化学习机制内化该归纳推理过程,使模型能够从多个潜在视角合成多证据支撑的合理性解释,从而更准确地重构复杂情绪状态并提供可解释的诊断性证据链。

链接: https://arxiv.org/abs/2603.16463
作者: Yu Liu,Lei Zhang,Haoxun Li,Hanlei Shi,Yuxuan Ding,Leyuan Qu,Taihao Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines–especially in ambiguous or conflicting scenarios–while providing interpretable, diagnostic evidence traces.

[HC-9] One Kiss: Emojis as Agents of Genre Flux in Generative Comics

【速读】:该论文旨在解决当前基于文本提示(prompt-based)的生成式AI(Generative AI)在视觉叙事创作中面临的控制精度与创作流畅性之间的权衡问题。传统方法要求用户通过精确的文字描述来引导图像生成,这限制了创作过程的自然性和灵活性。其解决方案的关键在于提出了一种名为“One Kiss”的协同创作漫画生成系统,引入“情感引导”(Affective Steering)机制:用户不再依赖文本提示,而是通过表情符号(emoji)输入来调节故事的情感基调,并通过手绘分镜框架定义结构节奏(pacing)、以关键词-表情符号配对设定氛围,实现双通道输入。这种设计使情绪输入在多面板间累积并引发“类型流动”(Genre Flux),从而让用户的角色从提示工程师转变为叙事导演,利用语义模糊性作为创意惊喜的来源而非控制力的损失。

链接: https://arxiv.org/abs/2603.16359
作者: Xiruo Wang,Xinyi Jiang,Ziqi Lyu
机构: University College London (伦敦大学学院); Tsinghua University (清华大学); Beijing Forestry University (北京林业大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 202X Extended Abstracts

点击查看摘要

Abstract:Generative AI has made visual storytelling widely accessible, yet current prompt-based interactions often force users into a trade-off between precise control and creative flow. We present One Kiss, a co-creative comic generation system that introduces “Affective Steering”. Instead of writing text prompts, users guide the tone of their story through emoji inputs, whose semantic ambiguity becomes a resource rather than a limitation. Unlike traditional text-to-image tools that rely on explicit descriptions, One Kiss uses a dual-stream input in which users define structural pacing by sketching panel frames and set atmospheric tone by pairing keywords with emojis. This mechanism enables “Genre Flux,” where emotional inputs accumulate across panels and gradually shift the genre of a story. A preliminary study (N = 6) suggests that this soft steering approach may reframe the user’s role from prompt engineer to narrative director, with ambiguity serving as a source of creative surprise rather than a loss of control.

[HC-10] Human/AI Collective Intelligence for Deliberative Democracy: A Human-Centred Design Approach

【速读】:该论文旨在解决如何通过计算工具(特别是人工智能)增强协商式民主(Deliberative Democracy, DD)的可信性和有效性问题。其核心挑战在于如何设计人机协同系统,使利益相关方能够有意义地参与民主过程的设计与实施,从而提升决策质量与公众信任。解决方案的关键在于采用以人类为中心的设计方法(Human-Centred Design Approach),通过共同设计(co-design)方法识别关键挑战、优化用户场景,并推导出相应的技术实现路径,最终在真实情境中验证AI支持下的民主实践原型。

链接: https://arxiv.org/abs/2603.16260
作者: Anna De Liddo,Lucas Anastasiou,Simon Buckingham Shum
机构: Knowledge Media Institute, The Open University, Milton Keynes, UK; Connected Intelligence Centre, University of Technology Sydney, Sydney, Australia
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This chapter introduces the concept of Collective Intelligence for Deliberative Democracy (CI4DD). We propose that the use of computational tools, specifically artificial intelligence to advance deliberative democracy, is an instantiation of a broader class of human-computer system designed to augment collective intelligence. Further, we argue for a fundamentally human-centred design approach to orchestrate how stakeholders can contribute meaningfully to shaping the artifacts and processes needed to create trustworthy DD processes. We first contextualise the key concepts of CI and the role of AI within it. We then detail our co-design methodology for identifying key challenges, refining user scenarios, and deriving technical implications. Two exemplar cases illustrate how user requirements from civic organisations were implemented with AI support and piloted in authentic contexts.

[HC-11] A Scoping Review of AI-Driven Digital Interventions in Mental Health Care: Mapping Applications Across Screening, Support, Monitoring, Prevention, and Clinical Education

【速读】:该论文旨在解决当前数字心理健康(mHealth)干预中AI技术应用碎片化、缺乏系统性框架的问题,尤其在筛查、治疗、随访、临床教育和群体预防等关键阶段如何有效整合生成式AI(GenAI)与以人为本的AI(HCAI)技术。其解决方案的关键在于提出一个创新的四支柱框架,用于系统性地规划和评估AI增强型心理医疗服务,涵盖技术实现、伦理合规、人机协作机制及可及性保障,从而为研究人员、临床医生和政策制定者提供可操作的指导路径,确保数字健康干预的安全性、有效性与公平性。

链接: https://arxiv.org/abs/2603.16204
作者: Yang Ni,Fanli Jia
机构: Columbia University (哥伦比亚大学); Seton Hall University (西东大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Please cite the published version. Thank you. Y. Ni and F. Jia. 2025. A Scoping Review of AI-Driven Digital Interventions in Mental Health Care: Mapping Applications Across Screening, Support, Monitoring, Prevention, and Clinical Education. Healthcare 13, 10 (2025), 1205. DOI: this https URL

点击查看摘要

Abstract:Artificial intelligence (AI)-enabled digital interventions, including Generative AI (GenAI) and Human-Centered AI (HCAI), are increasingly used to expand access to digital psychiatry and mental health care. This PRISMA-ScR scoping review maps the landscape of AI-driven mental health (mHealth) technologies across five critical phases: pre-treatment (screening/triage), treatment (therapeutic support), post-treatment (remote patient monitoring), clinical education, and population-level prevention. We synthesized 36 empirical studies implemented through early 2024, focusing on Large Language Models (LLMs), machine learning (ML) models, and autonomous conversational agents. Key use cases involve referral triage, empathic communication enhancement, and AI-assisted psychotherapy delivered via chatbots and voice agents. While benefits include reduced wait times and increased patient engagement, we address recurring challenges like algorithmic bias, data privacy, and human-AI collaboration barriers. By introducing a novel four-pillar framework, this review provides a comprehensive roadmap for AI-augmented mental health care, offering actionable insights for researchers, clinicians, and policymakers to develop safe, effective, and equitable digital health interventions.

[HC-12] Balancing Openness and Safety: Central and Peripheral Governance Practices in the Lesbian Subreddit Ecosystem

【速读】:该论文旨在解决在线LGBTQ+社群中普遍存在的矛盾:如何在保持可见性以吸引新成员的同时,有效保护现有成员免受骚扰。针对Reddit上女同性恋社群这一特定场景,研究发现这些社群并非孤立存在,而是一个相互关联的生态系统。其解决方案的关键在于识别并利用治理劳动的功能性分工——中心子版块(占34%)侧重内容筛选与信息流质量,服务于广泛、公开的受众;边缘子版块(占66%)则聚焦边界维护与参与控制,以保障小众、身份特定的社群安全。这一发现挑战了统一化的内容审核模式,强调需基于角色和上下文设计适应互联空间的差异化治理工具,从而实现可见性与安全性的动态平衡。

链接: https://arxiv.org/abs/2603.16176
作者: Yan Xia,Sushmita Khan,Naiyah Lewis,Jinkyung Katie Park
机构: Clemson University (克莱姆森大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted as a Poster at CHI 2026 (Extended Abstracts of the 2026 ACM CHI Conference on Human Factors in Computing Systems)

点击查看摘要

Abstract:Online LGBTQ+ communities face a persistent tension: remaining visible to welcome newcomers while protecting members from harassment. This challenge is particularly acute for lesbian communities on Reddit, which operate not as isolated groups but as an interconnected ecosystem. We examine how this tension is negotiated across the lesbian subreddit ecosystem (N=29) by combining network analysis of cross-subreddit links with a qualitative thematic analysis of 167 subreddit rules. Our findings show a functional division of governance labor between central (34%) and peripheral subreddits (66%). While all communities share a baseline of safety regulations, central subreddits prioritize content curation and feed quality to support a large, public-facing audience, whereas peripheral subreddits emphasize boundary maintenance and participation control to protect smaller, identity-specific niches. These findings challenge monolithic moderation approaches and highlight the need for ecosystem-aware design. We argue that effective moderation requires role- and context-sensitive tools supporting visibility and safety across interconnected spaces.
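摘要中“中心/边缘子版块”的划分,本质上是对跨版块链接网络做中心性分析。下面给出一个仅用标准库的示意(边表与版块名均为虚构示例,与论文研究的实际子版块无关;论文使用的网络分析方法可能更复杂):

```python
from collections import Counter

# 虚构的跨版块链接边表
edges = [
    ("hub_a", "niche_1"), ("hub_a", "niche_2"), ("hub_a", "niche_3"),
    ("hub_b", "niche_1"), ("hub_b", "niche_4"), ("hub_a", "hub_b"),
    ("niche_5", "hub_a"),
]

# 统计每个版块的度(参与的链接数)
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# 以度的中位数为界:高于中位数记为“中心”,否则记为“边缘”,
# 与论文 34% 中心 / 66% 边缘的划分在思路上对应
median = sorted(degree.values())[len(degree) // 2]
central = [s for s in degree if degree[s] > median]
peripheral = [s for s in degree if degree[s] <= median]
```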

[HC-13] Change is Hard: Consistent Player Behavior Across Games with Conflicting Incentives

【速读】:该论文旨在解决游戏中玩家行为模式的跨平台一致性问题,即在不同游戏机制下(如《英雄联盟》强调专业化、《云顶之弈》鼓励灵活性),玩家是否仍保持稳定的行为倾向。其核心问题是:个体行为特征(如灵活性或专业化)是受游戏结构限制,还是由玩家自身特质主导?解决方案的关键在于采用一种新颖的跨游戏分析方法,追踪同一组4,830名至少参与50场竞技对战的玩家在两个不同环境中的行为表现,从而减少自选择偏差,揭示玩家行为具有高度一致性,表明个体代理权(agency)比游戏结构(structure)更能预测跨平台行为模式。

链接: https://arxiv.org/abs/2603.16136
作者: Emily Chen,Alexander J. Bisberg,Dmitri Williams,Magy Seif El-Nasr,Emilio Ferrara
机构: University of Southern California(南加州大学); University of California at Santa Cruz(加州大学圣克鲁兹分校); California Polytechnic State University(加州理工州立大学)
类目: Human-Computer Interaction (cs.HC)
备注: 29 pages, 11 tables, 3 figures, to be published in ACM conference on Human Factors in Computing Systems (CHI 2026)

点击查看摘要

Abstract:This paper examines how player flexibility – a player’s willingness to engage in a breadth of options or specialize – manifests across two gaming environments: League of Legends (League) and Teamfight Tactics (TFT). We analyze the gameplay decisions of 4,830 players who have played at least 50 competitive games in both titles and explore cross-game dynamics of behavior retention and consistency. Our work introduces a novel cross-game analysis that tracks the same players’ behavior across two different environments, reducing self-selection bias. Our findings reveal that while games incentivize different behaviors (specialization in League versus flexibility in TFT) for performance-based success, players exhibit consistent behavior across platforms. This study contributes to long-standing debate about agency versus structure, showing individual agency may be more predictive of cross-platform behavior than game-imposed structure in competitive settings. These insights offer implications for game developers, designers and researchers interested in building systems to promote behavior change.
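对于“灵活性 vs. 专业化”这类行为倾向,一种常见的量化方式是对玩家选择分布求归一化香农熵(0 表示完全专业化,1 表示完全均匀)。以下为假设性草图,数据为虚构,论文实际采用的指标可能不同:

```python
import math
from collections import Counter

def flexibility(picks):
    """选择分布的归一化香农熵:0 = 完全专业化(只用一个选项),
    1 = 最大灵活性(各选项均匀使用)。"""
    counts = Counter(picks)
    if len(counts) <= 1:
        return 0.0
    n = sum(counts.values())
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(counts))

specialist = ["ahri"] * 50                          # 每局都选同一英雄
generalist = ["ahri", "lux", "jinx", "ekko"] * 12   # 四个英雄均匀轮换
```

对同一玩家分别在两款游戏中计算该指标,即可比较其跨平台行为的一致性。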

[HC-14] Toward Reliable Scientific Visualization Pipeline Construction with Structure-Aware Retrieval-Augmented LLMs

【速读】:该论文旨在解决从自然语言描述中生成可执行的科学可视化流程(scientific visualization pipeline)这一挑战,尤其是在基于网页的可视化库环境中,由于流程构建对阶段缺失、算子误用或顺序错误高度敏感,大型语言模型(Large Language Models, LLMs)难以保证生成结果的正确性和可用性。解决方案的关键在于提出一种结构感知的检索增强生成(retrieval-augmented generation, RAG)工作流,通过引入与目标流程对齐的代码示例作为上下文引导,辅助模型准确选择模块、配置参数并确定执行顺序,从而显著提升流程的可执行性并降低人工修正成本(correction cost)。

链接: https://arxiv.org/abs/2603.16057
作者: Guanghui Zhao,Zhe Wang,Yu Dong,Guan Li,GuiHua Shan
机构: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; University of Chinese Academy of Sciences; Computer Network Information Center, Chinese Academy of Sciences
类目: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Scientific visualization pipelines encode domain-specific procedural knowledge with strict execution dependencies, making their construction sensitive to missing stages, incorrect operator usage, or improper ordering. Thus, generating executable scientific visualization pipelines from natural-language descriptions remains challenging for large language models, particularly in web-based environments where visualization authoring relies on explicit code-level pipeline assembly. In this work, we investigate the reliability of LLM-based scientific visualization pipeline generation, focusing on this http URL as a representative web-based visualization library. We propose a structure-aware retrieval-augmented generation workflow that provides pipeline-aligned this http URL code examples as contextual guidance, supporting correct module selection, parameter configuration, and execution order. We evaluate the proposed workflow across multiple multi-stage scientific visualization tasks and LLMs, measuring reliability in terms of pipeline executability and human correction effort. To this end, we introduce correction cost as metric for the amount of manual intervention required to obtain a valid pipeline. Our results show that structured, domain-specific context substantially improves pipeline executability and reduces correction cost. We additionally provide an interactive analysis interface to support human-in-the-loop inspection and systematic evaluation of generated visualization pipelines.
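结构感知检索增强生成的核心一步,是为当前任务检索与流水线对齐的代码示例。下面用词汇重叠打分给出一个极简示意(真实系统应使用嵌入向量与流水线结构元数据;语料与查询均为虚构,并非论文的检索实现):

```python
def retrieve_examples(task_description, corpus, k=2):
    """按查询与示例的词汇重叠数排序,返回前 k 条示例。
    这只是论文所述结构感知检索的极简替代示意。"""
    query = set(task_description.lower().split())
    def overlap(doc):
        return len(query & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

# 虚构的“流水线对齐”示例语料
corpus = [
    "isosurface extraction pipeline marching cubes threshold",
    "volume rendering transfer function opacity colormap",
    "streamline tracing vector field seed points",
]
top = retrieve_examples("render a volume with a transfer function", corpus)
```

检索到的示例随后作为上下文注入提示,引导模型选择正确的模块、参数与执行顺序。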

[HC-15] Interpretable Context Methodology: Folder Structure as Agent ic Architecture

【速读】:该论文旨在解决当前AI代理编排(AI agent orchestration)方法中存在的工程开销问题,尤其是在顺序工作流中,传统多代理框架因引入不必要的上下文传递、记忆管理与步骤协调逻辑而造成冗余。其解决方案的关键在于提出模型工作区协议(Model Workspace Protocol, MWP),通过将原本由代码实现的编排逻辑转化为基于文件系统结构的静态组织方式:以编号文件夹表示流程阶段,使用纯Markdown文件承载提示词和上下文信息,明确指定单个AI代理在每一步的角色;同时由本地脚本处理非智能的机械任务。这种方法借鉴了Unix管道设计、模块化分解、多遍编译及文献编程的思想,实现了无需复杂框架即可高效结构化AI代理上下文的目标。

链接: https://arxiv.org/abs/2603.16021
作者: Jake Van Clief,David McDermott
机构: Eduba; University of Edinburgh (爱丁堡大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 28 pages, 5 figures, 2 tables, 54 references

点击查看摘要

Abstract:Current approaches to AI agent orchestration typically involve building multi-agent frameworks that manage context passing, memory, error handling, and step coordination through code. These frameworks work well for complex, concurrent systems. But for sequential workflows where a human reviews output at each step, they introduce engineering overhead that the problem does not require. This paper presents Model Workspace Protocol (MWP), a method that replaces framework-level orchestration with filesystem structure. Numbered folders represent stages. Plain markdown files carry the prompts and context that tell a single AI agent what role to play at each step. Local scripts handle the mechanical work that does not need AI at all. The result is a system where one agent, reading the right files at the right moment, does the work that would otherwise require a multi-agent framework. This approach applies ideas from Unix pipeline design, modular decomposition, multi-pass compilation, and literate programming to the specific problem of structuring context for AI agents. The protocol is open source under the MIT license.
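MWP 用“编号文件夹 + Markdown 提示文件”替代代码级编排,可以用如下草图说明(文件夹与文件命名为演示用假设,并非协议规定;对智能体的实际调用在此省略):

```python
from pathlib import Path
import tempfile

# 构造两个编号阶段文件夹,各放一个 prompt.md
root = Path(tempfile.mkdtemp())
for i, prompt in enumerate(["Act as a researcher...", "Act as an editor..."], 1):
    stage = root / f"{i:02d}_stage"
    stage.mkdir()
    (stage / "prompt.md").write_text(prompt)

def run_workflow(root):
    """按编号顺序访问各阶段,收集每一步将交给单个智能体的提示。"""
    prompts = []
    for stage in sorted(root.iterdir()):
        prompts.append((stage.name, (stage / "prompt.md").read_text()))
    return prompts

steps = run_workflow(root)
```

单个智能体在每个阶段读取对应文件即可扮演相应角色,阶段顺序完全由文件系统排序决定,无需多智能体框架。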

[HC-16] Galaxy Tracer: A Topology-First 3D Interface for Interactive PCAP Exploration

【速读】:该论文旨在解决传统包分析工具仅以表格形式呈现捕获数据所导致的局限性,即分析师难以直观理解网络通信中的关系结构。其解决方案的关键在于提出Galaxy Tracer系统,该系统默认界面为交互式三维网络拓扑图,将主机表示为空间定位的节点、会话表示为边、协议分组表示为视觉上区分的聚类,同时保持同步的包列表作为辅助视图,共享过滤状态,使结构化与表格化分析形成连续工作流。该设计利用第三维空间维度揭示密度、聚类、主机中心性和通信规模等关键特征,显著提升网络流量的可解释性与分析效率。

链接: https://arxiv.org/abs/2603.16018
作者: Ryan Younger
机构: Olivet Nazarene University (奥利维特纳撒勒大学)
类目: Human-Computer Interaction (cs.HC); Networking and Internet Architecture (cs.NI)
备注: Preprint of Galaxy Tracer (March 2026). Live interactive demo: this https URL

点击查看摘要

Abstract:Packet analysis tools conventionally present capture data through tabular packet lists, constraining the analyst to a sequential view that obscures the relational structure of network communication. This paper presents Galaxy Tracer, a browser-native packet capture exploration system in which the default interface is an interactive three-dimensional network topology rather than a packet list. Hosts appear as spatially positioned nodes, conversations as edges, and protocol groupings as visually distinct clusters. A synchronized packet list remains available as a secondary view, sharing filter state with the topology so that structural and tabular inspection function as one continuous workflow. The system parses PCAP and PCAPNG formats, dissects over 90 protocols, and renders the topology through this http URL. The paper argues that the third spatial dimension is not merely aesthetic but analytically meaningful: it reveals density, clustering, host centrality, and communication scale that are difficult to perceive in list-only tools.
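Galaxy Tracer 的“主机为节点、会话为边”的拓扑模型,可以用如下极简示意从数据包记录构建(数据包为虚构示例;真实系统需先解析 PCAP/PCAPNG 文件,此处从略):

```python
from collections import defaultdict

# 虚构的数据包记录 (src, dst, protocol)
packets = [
    ("10.0.0.1", "10.0.0.2", "TCP"),
    ("10.0.0.1", "10.0.0.2", "TCP"),
    ("10.0.0.1", "10.0.0.3", "DNS"),
    ("10.0.0.4", "10.0.0.1", "TCP"),
]

# 主机作为节点;每个无序 (src, dst) 对聚合为一条带权边(一段会话)
edges = defaultdict(int)
for src, dst, _proto in packets:
    edges[tuple(sorted((src, dst)))] += 1

hosts = {h for e in edges for h in e}
# 以参与的会话数作为主机中心性,可用于节点的尺寸与空间布局
centrality = {h: sum(1 for e in edges if h in e) for h in hosts}
```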

[HC-17] CoDesignAI: An AI-Enabled Multi-Agent Multi-User System for Collaborative Urban Design at the Conceptual Stage

【速读】:该论文旨在解决当前协同城市设计中公众参与效率低、难以规模化的问题(public participation in collaborative urban design often faces challenges in achieving efficient and scalable citizen engagement)。其解决方案的关键在于引入CoDesignAI系统,该系统通过整合多用户(代表居民或利益相关者)与多个AI代理(代表领域专家),在城市设计概念阶段提供引导和专业知识支持;系统利用生成式AI(Generative AI)与空间映射服务结合,实现街道层面的设计方案可视化,并借助AI代理自动总结讨论内容、提取共享设计意图并生成提示语以呈现设计方案干预措施,从而推动从专家主导向开放、参与式设计流程的转变。

链接: https://arxiv.org/abs/2603.16008
作者: Zhaoxi Zhang,Ruolin Wu,Feiyang Ren,Sridevi Turaga,Tamir Mendel
机构: University of Florida (佛罗里达大学); Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Public participation has become increasingly important in collaborative urban design; yet, existing processes often face challenges in achieving efficient and scalable citizen engagement. To address this gap, this study explores how large language models (LLMs) can support cooperation among community members in participatory design. We introduce CoDesignAI, a collaborative urban design tool that combines multiple users, representing residents or stakeholders, with multiple AI agents, representing domain experts who provide facilitation and professional knowledge during the conceptual stage of urban design. This paper presents the system architecture and main components of the tool, illustrating how users interact with AI agents within a collaborative and iterative design workflow. Specifically, the system integrates generative AI with spatial mapping services to support street-level visualization of design proposals. AI agents assist users by summarizing discussion content, extracting shared design intentions, and generating prompts for presenting design interventions. The system also enables users to revise and refine their ideas over multiple rounds while documenting the design process. By combining conversational AI, multi-user interaction, and image-based design grounded in real-world urban contexts, this study argues that AI-enabled design systems can help shift urban design from an expert-centered practice to a more open and participatory process. The paper contributes a new web-based platform for AI-assisted collaborative design and offers an early exploration of how AI agents may expand the capacity for public participation in urban design.

[HC-18] The Midas Touch in Gaze vs. Hand Pointing: Modality-Specific Failure Modes and Implications for XR Interfaces

【速读】:该论文旨在解决扩展现实(Extended Reality, XR)界面中手部输入与注视输入之间的权衡问题:手部输入易导致疲劳,而注视输入则存在“金手指效应”(Midas Touch problem)和精度受限的问题。解决方案的关键在于提出并实现一个名为xr-adaptive-modality-2025的开源Web平台,该平台支持模态特异性自适应干预策略,包括注视去杂(gaze declutter)和手部目标宽度膨胀(hand target-width inflation),以改善XR环境下的指向性能并降低用户工作负荷。实验结果显示,手部输入在吞吐量、错误率和主观负荷方面均优于注视输入,且两种模态的错误类型具有显著差异,表明自适应策略需针对不同模态的失效模式进行设计。

链接: https://arxiv.org/abs/2603.15991
作者: Mohammad Dastgheib,Fatemeh Pourmahdian
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 25 pages, 10 figures

点击查看摘要

Abstract:Extended Reality (XR) interfaces impose both ergonomic and cognitive demands, yet current systems often force a binary choice between hand-based input, which can produce fatigue, and gaze-based input, which is vulnerable to the Midas Touch problem and precision limitations. We introduce the xr-adaptive-modality-2025 platform, a web-based open-source framework for studying whether modality-specific adaptive interventions can improve XR-relevant pointing performance and reduce workload relative to static unimodal interaction. The platform combines physiologically informed gaze simulation, an ISO 9241-9 multidirectional tapping task, and two modality-specific adaptive interventions: gaze declutter and hand target-width inflation. We evaluated the system in a 2 x 2 x 2 within-subjects design manipulating Modality (Hand vs. Gaze), UI Mode (Static vs. Adaptive), and Pressure (Yes vs. No). Results from N=69 participants show that hand yielded higher throughput than gaze (5.17 vs. 4.73 bits/s), lower error (1.8% vs. 19.1%), and lower NASA-TLX workload. Crucially, error profiles differed sharply by modality: gaze errors were predominantly slips (99.2%), whereas hand errors were predominantly misses (95.7%), consistent with the Midas Touch account. Of the two adaptive interventions, only gaze declutter executed in this dataset; it modestly reduced timeouts but not slips. Hand width inflation was not evaluable due to a UI integration bug. These findings reveal modality-specific failure modes with direct implications for adaptive policy design, and establish the platform as a reproducible infrastructure for future studies.
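摘要中报告的吞吐量(bits/s)通常按 ISO 9241-9 的香农形式计算:TP = ID_e / MT,其中 ID_e = log2(D / W_e + 1)。以下示例中的数值为演示用途,并非该研究的数据:

```python
import math

def throughput(distance, effective_width, movement_time):
    """ISO 9241-9 风格吞吐量(bits/s),香农形式:
    TP = ID_e / MT,ID_e = log2(D / W_e + 1)。"""
    id_e = math.log2(distance / effective_width + 1)
    return id_e / movement_time

# 例:目标距离 256 px,有效宽度 32 px,运动时间 0.6 s
tp = throughput(256, 32, 0.6)  # ID_e = log2(9) ≈ 3.17 bit
```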

[HC-19] Adaptive Captioning with Emotional Cues: Supporting DHH and Neurodivergent Learners in STEM

【速读】:该论文旨在解决实时字幕在服务听障及听力困难(Deaf and Hard of Hearing, DHH)和神经多样性学习者(如注意力缺陷多动障碍,ADHD)时,因缺乏情感与非语言线索而导致理解困难的问题,尤其在STEM教育中,此类缺失可能加剧认知负荷并影响学习效果。解决方案的关键在于设计并评估四种嵌入情感与多模态提示(包括面部表情、肢体动作、关键词高亮和表情符号)的字幕原型,实证表明其中部分原型可显著降低用户自评的认知负荷并提升理解分数;同时强调个性化配置功能的重要性,以适配不同神经多样性用户的偏好和需求,从而推动更具包容性的可访问技术发展。

链接: https://arxiv.org/abs/2603.15977
作者: Sunday David Ubur,Eugenia Ha Rim Rho,Denis Gracanin
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted and presented at Affective Computing and Intelligent Interaction (ACII) Conference 2025; final version to appear in IEEE Xplore

点击查看摘要

Abstract:Real-time captioning is vital for Deaf and Hard of Hearing (DHH) and neurodivergent learners (e.g., those with ADHD), yet it often omits emotional and non-verbal cues essential for comprehension. This omission is particularly consequential in STEM education, where cognitively demanding material can exacerbate the challenges faced by caption users across diverse ability profiles. In this paper, we present a design-oriented exploration of four captioning prototypes that embed emotional and multimodal cues, including facial expressions, body gestures, keyword highlighting, and emoji. Across a pilot and a main study with 24 participants, we found that certain prototypes reduced self-reported cognitive load and improved comprehension scores compared to traditional captions. Qualitative feedback reveals the importance of customizable caption features to accommodate neurodivergent users’ preferences (e.g., ADHD or different levels of comfort with emojis). Our findings contribute to ongoing conversations in accessible technology research about how best to integrate emotional cues into captions in a way that is both usable and beneficial for a wide range of learners.

[HC-20] Why Avoid Generative Legal AI Systems? Hallucination, Overreliance and their Impact on Explainability

【速读】:该论文试图解决生成式法律人工智能(Generative Legal AI, GLAI)在法律职业中部署时因幻觉(hallucination)和专业人员过度依赖所引发的系统性风险问题。其核心论点是,GLAI模型基于统计词元预测架构而非法律推理设计,易产生看似合理但事实错误的输出,且因其类人表达特性加剧了自动化偏见(automation bias),从而削弱欧洲人工智能治理框架中的可解释性原则。解决方案的关键在于建立有效的“有意义的人类审查机制”(meaningful human scrutiny),以确保司法独立性和基本权利保护不受技术滥用的影响。

链接: https://arxiv.org/abs/2603.15937
作者: Gizem Gültekin Varkonyi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This article argues that the deployment of generative AI systems in legal profession requires strong restraint due to the critical risks of hallucination and overreliance. Central to this analysis is the definition of Generative Legal AI (GLAI), an umbrella term for systems specifically adapted for the legal domain which is ranging from document drafting to decision support in criminal justice. Unlike traditional AI, GLAI models are built on architectures designed for statistical token prediction rather than legal reasoning, often leading to confabulations where the system prioritizes linguistic fluency over factual accuracy. These hallucinations obscure the reasoning process, while the persuasive, human-like nature of the output encourages professional overreliance. The paper situates these dynamics within the framework of European AI governance, arguing that the interaction between fabricated data and automation bias fundamentally weakens the principle of explainability. The article concludes that without effective mechanisms for meaningful human scrutiny, the routine adoption of GLAI poses significant challenges to judicial independence and the protection of fundamental rights.

[HC-21] Machine Translation in the Wild: User Reaction to Xiaohongshu's Built-In Translation Feature

【Quick Read】: This paper investigates how machine translation (MT) features on social media platforms fare in real user-interaction scenarios, focusing on their acceptance, usefulness, accuracy, and user experience in cross-lingual communication. Drawing on 6,723 user comments collected after Xiaohongshu launched its built-in translation feature in January 2025, the study combines sentiment analysis with thematic analysis, revealing generally positive user evaluations alongside feedback on shortcomings such as accuracy and accessibility. The key contributions are twofold: identifying the diverse ways users tested the feature with non-standard inputs in authentic contexts (internet slang, emoji, pinyin abbreviations, and so on), and emphasizing the need for closer collaboration among computer scientists, translation scholars, and platform designers to keep improving translation technology in real communicative settings.

Link: https://arxiv.org/abs/2603.15922
Authors: Sui He
Affiliations: Swansea University
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:The growing integration of machine translation into social media platforms is transforming how users interact with each other across cultural and linguistic boundaries. This paper examines user reactions to the launch of Xiaohongshu’s built-in translation feature in January 2025. Drawing on a dataset of 6,723 comments collected from 11 official posts promoting the translation function, this paper combines sentiment analysis with thematic analysis to investigate how users perceived and experimented with the function. Results show that reactions were generally positive, particularly for translating posts and comments, although concerns regarding functionality, accessibility, and translation accuracy were also expressed. In addition to evaluative feedback, users actively tested the function with diverse inputs, including words and phrases in English and Chinese, abbreviations in pinyin, internet slang, and other language forms such as emoji, kaomoji, coded texts, etc. The findings highlight the importance of closer collaboration among computer scientists, translation scholars, and platform designers to better understand and improve translation technologies in real world communicative context.

[HC-22] Prompt Engineering for Scale Development in Generative Psychometrics

【Quick Read】: This paper examines how prompt engineering strategies can improve the quality of personality-assessment items generated by large language models (LLMs) in generative psychometrics. The core challenge is ensuring that generated items are not only semantically diverse but also structurally valid. The key to the solution is the AI-GENIE framework, which combines adaptive prompting with network psychometric methods to filter and refine the initial item pool. Results show that adaptive prompting substantially outperforms non-adaptive strategies (zero-shot, few-shot, and persona-based): it reduces semantic redundancy, raises pre-reduction structural validity, and in most cases preserves a larger item pool, performing especially well with higher-capacity models; its advantage is largely robust across temperature settings, mitigating the trade-off between creativity and psychometric coherence.

Link: https://arxiv.org/abs/2603.15909
Authors: Lara Lee Russell-Lasalandra, Hudson Golino
Affiliations: University of Virginia
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 22 pages, 7 figures

Abstract:This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)–generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving substantially larger item pool, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model–prompt interactions in generative psychometric pipelines.
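One ingredient of item-pool reduction described above is removing semantically redundant items. A greedy cosine-similarity filter gives the basic flavor; note that AI-GENIE's network-psychometric reduction is considerably more sophisticated, and the embeddings and threshold below are invented for illustration:

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def drop_redundant(items, embeddings, threshold=0.9):
    """Keep an item only if its embedding is not too similar to any
    already-kept item (a toy stand-in for redundancy reduction)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]

# Hypothetical items with hand-made 2-D "embeddings":
items = ["I enjoy parties", "I love parties", "I keep my room tidy"]
embs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
kept = drop_redundant(items, embs)  # the near-duplicate second item is dropped
```

The threshold controls the diversity/coverage trade-off the abstract alludes to: lower it and the surviving pool shrinks but becomes more semantically distinct.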

[HC-23] Interpretative Interfaces: Designing for AI-Mediated Reading Practices and the Knowledge Commons

【Quick Read】: The problem addressed is that although explainable AI (XAI) interfaces can present the behavior of large language models (LLMs), static explanations alone do not produce deep understanding of, or active intervention in, how the models operate. In science in particular, researchers who rely on LLMs for reading, citing, and producing literature reviews lack the means to directly manipulate and observe the models' internal representational processes, and so cannot truly engage with their reasoning. The key to the solution is a shift from explainability to interpretative engagement: interfaces that let users directly manipulate a model's intermediate representations, selecting a token and following the trajectory of its semantic position through the hidden states of successive layers, while annotating the transformations they find meaningful. Drawing on practices from textual scholarship, computational poetics, and the history of reading and writing technologies (such as marginalia, indices, and glosses), the project aims to build a representational space of language models that users can intervene in, reframing AI interpretability as an interaction design problem and advancing AI-mediated reading oriented toward critical stewardship of scientific knowledge.

Link: https://arxiv.org/abs/2603.15863
Authors: Gabrielle Benabdallah
Affiliations: University of Washington
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End

Abstract:Explainable AI (XAI) interfaces seek to make large language models more transparent, yet explanation alone does not produce understanding. Explaining a system’s behavior is not the same as being able to engage with it, to probe and interpret its operations through direct manipulation. This distinction matters for scientific disciplines in particular: scientists who increasingly rely on LLMs for reading, citing, and producing literature reviews have little means of directly engaging with how these models process and transform the texts they generate. In this ongoing design research project, I argue for a shift from explainability to interpretative engagement. This shift moves away from accounts of system behavior to instead enable users to manipulate a model’s intermediate representations. Drawing on textual scholarship, computational poetics, and the history of reading and writing technologies, including practices such as marginalia, glosses, indices, and annotation systems, I propose interpretative interfaces as interactive environments in which non-expert users can intervene in the representational space of a language model. More specifically, such interfaces will allow users to select a token and follow its trajectory through the model’s intermediate layers. This way, they can observe how its semantic position shifts as context is processed, and possibly annotate the transformations they find useful or meaningful. The same way readers can create their own maps within a book through annotations and bookmarks, interpretative interfaces will allow users to inscribe their reading of a model’s internal representations. The goal of this project is to reframe AI interpretability as an interaction design project rather than a purely technical one, and to open a path toward AI-mediated reading that supports interpretative engagement and critical stewardship of scientific knowledge.

[HC-24] Same Performance, Hidden Bias: Evaluating Hypothesis- and Recommendation-Driven AI

【Quick Read】: This paper addresses the problem that current evaluations of decision-support systems in human-computer interaction (HCI) focus on task performance and user reliance while neglecting the process by which users form decision strategies. The key to the solution is using Signal Detection Theory as the theoretical framework: comparing recommendation-driven and hypothesis-driven interaction designs (N = 290), the study finds that even when task performance is identical, recommendation-driven designs lower users' thresholds for sufficient evidence and introduce a "hidden bias" that systematically shifts the distribution of errors. Experts prove just as susceptible as novices, leading the authors to advocate shifting evaluation away from performance and reliance alone toward decision processes and the preservation of stable evidence standards.

Link: https://arxiv.org/abs/2603.15824
Authors: Michaela Benk, Tim Miller
Affiliations: University of Zurich; The University of Queensland
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted at the CHI 2026 Workshop on Understanding, Mitigating, and Leveraging Cognitive Biases to Calibrate Trust in Evolving AI Systems

Abstract:The HCI community commonly evaluates decision support systems based on whether they improve task performance or promote appropriate user reliance. In this work, we look beyond decision outcomes to examine the process through which users develop decision-making strategies. Through a web-based experiment (N = 290) comparing recommendation-driven and hypothesis-driven interaction designs, and using Signal Detection Theory as a theoretical framework, we show that even when performance remains identical, recommendation-driven designs lower participants’ thresholds for sufficient evidence and introduce a “hidden bias” in their judgments, resulting in a shifted distribution of errors. Furthermore, we find that experts are just as susceptible to these systemic shifts as novices. We conclude by advocating for a shift in focus: prioritizing decision processes and the preservation of stable evidence standards over performance and reliance alone.
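The Signal Detection Theory quantities behind this analysis can be sketched directly: sensitivity (d') measures how well a participant discriminates signal from noise, while the criterion (c) captures the evidence threshold the paper reports shifting (c < 0 means a more liberal threshold). The function below is the standard textbook computation, not code from the study:

```python
from statistics import NormalDist

def sdt_measures(hit_rate, fa_rate):
    """Return (d_prime, criterion) from hit and false-alarm rates.

    d' = z(H) - z(FA); c = -(z(H) + z(FA)) / 2, where z is the
    inverse standard-normal CDF.
    """
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -(z(hit_rate) + z(fa_rate)) / 2
    return d_prime, criterion

# Two hypothetical observers with similar sensitivity but shifted criteria:
d1, c1 = sdt_measures(0.80, 0.20)  # unbiased criterion (c = 0)
d2, c2 = sdt_measures(0.90, 0.35)  # similar d', liberal criterion (c < 0)
```

This is exactly the pattern the paper reports: identical performance (equal d') can coexist with a systematic shift in where errors fall (different c).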

[HC-25] Lost in Transcription: Subtitle Errors in Automatic Speech Recognition Reduce Speaker and Content Evaluations

【Quick Read】: The problem addressed is how the performance disparities of automatic speech recognition (ASR) systems across demographic groups affect, via subtitle errors, audience evaluations of speakers and their content. The key to the solution is a preregistered online experiment (N = 207, U.S.-based crowdworkers) that manipulated subtitle accuracy (accurate vs. error-prone) and speaker accent to quantify the impact of subtitle errors on listener evaluations. Results show that error-prone subtitles consistently lowered evaluations for all speakers; no significant differences across accent groups were found, but the findings imply that speakers with accents for which ASR performs poorly may suffer additional negative evaluations because of poor subtitle quality, exacerbating inequities.

Link: https://arxiv.org/abs/2603.15807
Authors: Kowe Kadoma, Priyal Shrivastava, Mor Naaman
Affiliations: Cornell University; Carnegie Mellon University; Cornell Tech
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Researchers have demonstrated that Automatic Speech Recognition (ASR) systems perform differently across demographic groups. In this work, we examined how subtitle errors affect evaluations of speakers and their content using a preregistered online experiment (N=207, U.S.-based crowdworkers). Participants watched speakers with various accents deliver a talk in which the subtitles were accurate or error-prone. Our results indicate that error-prone subtitles consistently reduce both speaker and content evaluations for all speakers. We did not see disparate impact between the accent groups, controlling for subtitle quality. Taken together, though, the findings of this short paper imply that speakers with accents for which ASR systems perform poorly are likely to be further penalized by viewers with lower evaluations.

[HC-26] Lessons from Real-World Deployment of a Cognition-Preserving Writing Tool: Students Actively Engage with Critical Thinking and Planning Affordances

【Quick Read】: This paper addresses how AI-supported writing tools can effectively scaffold students' learning of argumentative writing in authentic classroom settings, in particular clarifying how their cognitive scaffolds operate, and with what effect, in real teaching contexts. The key to the solution is deploying and evaluating an AI writing tool, VISAR, in an authentic undergraduate writing course, using a mixed-methods approach (interaction logs, writing-artifact analysis, surveys, and interviews) to examine how students actually used the tool's features and what they learned. The study finds that when the AI tool provides structured planning support and targeted generation, students actively adopt these cognition-preserving scaffolds, which improves argumentative writing quality, deepens conceptual understanding, and fosters emerging critical AI literacy, validating the design value of cognition-preserving features in AI writing tools.

Link: https://arxiv.org/abs/2603.15777
Authors: Yinuo Yang, Zheng Zhang, Ningzhi Tang, Xu Wang, Alex Ambrose, Nathaniel Myers, Patrick Clauss, Toby Jia-Jun Li
Affiliations: University of Notre Dame; University of Michigan
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:AI-supported writing tools show strong potential for scaffolding students’ learning of argumentative writing. Prior work has demonstrated the benefits of AI-supported cognitive scaffolds, such as idea exploration and argument refinement, but how these features function in authentic classroom settings remains underexplored. In this paper, we investigate the classroom integration of an AI-supported writing tool, VISAR. We deployed VISAR in an undergraduate writing course across three sections for one week each over two semesters (49 students total). Using a mixed-methods approach that combines interaction logs, writing artifact analysis, surveys, and interviews, we examine how students used VISAR features in authentic writing tasks. Our findings confirm that students appropriated AI-supported cognitive scaffolds for writing learning and achieved measurable learning gains. While prior studies suggest that students may bypass important cognitive processes when using AI writing assistants, our classroom deployment shows that when systems provide structured supports for planning and targeted generation, students naturally choose to engage with these cognition-preserving scaffolds. These learning-oriented interaction patterns were positively associated with argumentative writing quality, improved conceptual understanding, and emerging critical AI literacy, highlighting the design value of cognition-preserving features in AI writing tools. Together, these findings provide empirical evidence of how AI-supported writing scaffolds operate in authentic classroom contexts and offer design insights for future learning-oriented AI writing tools.

[HC-27] Grant, Verify, Revoke: A User-Centric Pattern for Blockchain Compliance

【Quick Read】: This paper addresses the inherent conflict between privacy and compliance that users face in decentralized web applications: to participate in regulated on-chain services, users must currently hand sensitive identity documents to centralized intermediaries, permanently linking real-world identities to public transaction histories and forcing a binary choice between total privacy loss and total exclusion. The key to the solution is a Selective Disclosure Framework, realized in a prototype called ZK-Compliance, which uses browser-based zero-knowledge proofs so that users can prove specific attributes locally (e.g., "I am over 18") without revealing the underlying data, together with a user-governed "Grant, Verify, Revoke" lifecycle that transforms compliance from a permanent data handover into a dynamic, revocable authorization session, preserving regulatory compliance while restoring user sovereignty and privacy control.

Link: https://arxiv.org/abs/2603.15721
Authors: Supriya Khadka, Sanchari Das
Affiliations: George Mason University
Subjects: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments:

Abstract:In decentralized web applications, users face an inherent conflict between public verifiability and personal privacy. To participate in regulated on-chain services, users must currently disclose sensitive identity documents to centralized intermediaries, permanently linking real-world identities to public transaction histories. This binary choice between total privacy loss or total exclusion strips users of agency and exposes them to persistent surveillance. In this work, we introduce a Selective Disclosure Framework designed to restore user sovereignty by decoupling eligibility verification from identity revelation. We present ZK-Compliance, a prototype that leverages browser-based zero-knowledge proofs to shift the interaction model, enabling users to prove specific attributes (e.g., “I am over 18”) locally without revealing the underlying data. We implement a user-governed Grant, Verify, Revoke lifecycle that transforms the user’s mental model of compliance from a permanent data handover into a dynamic, revocable authorization session. Our evaluation shows that client-side proof generation takes under 200ms, enabling a seamless interactive experience on commodity hardware. This work provides early evidence that regulatory compliance need not come at the cost of user privacy or autonomy.
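The Grant, Verify, Revoke lifecycle can be pictured as a small revocable-session state machine. The sketch below is purely illustrative: it contains no cryptography and only models the mental-model shift the paper describes; in ZK-Compliance the verify step is backed by a zero-knowledge proof rather than a plain set lookup:

```python
from dataclasses import dataclass, field

@dataclass
class AuthorizationSession:
    """Toy model of a revocable authorization session: the verifier
    ever sees only granted attribute claims, never the underlying data."""
    granted_attrs: set = field(default_factory=set)

    def grant(self, attr):
        self.granted_attrs.add(attr)       # user authorizes one attribute

    def verify(self, attr):
        return attr in self.granted_attrs  # verifier checks the grant only

    def revoke(self, attr):
        self.granted_attrs.discard(attr)   # user withdraws authorization

session = AuthorizationSession()
session.grant("over_18")
was_valid = session.verify("over_18")  # True while the grant stands
session.revoke("over_18")
now_valid = session.verify("over_18")  # False after revocation
```

The point of the pattern is that compliance becomes a session the user can terminate, not a one-way transfer of identity documents.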

[HC-28] The Evolving Duet of Two Modalities: A Survey on Integrating Text and Visualization for Data Communication

【Quick Read】: The problem addressed is that the role of text as a narrative device in data visualization remains understudied: despite extensive work on text as data input and as an interaction modality, its potential for supporting storytelling and interpretation remains fragmented. The key to the solution is a systematic review of 98 publications that analyzes how text is used in visualization, what functions and effects it has, and how it shapes data communication, and that proposes design principles for enhancing narrative clarity and reader engagement, thereby filling a research gap and advancing a deeper integration of text and visualization.

Link: https://arxiv.org/abs/2603.15640
Authors: Xingyu Lan, Xi Li, Yixing Zhang, Mengqin Cheng, Jiazhe Wang, Siming Chen
Affiliations: Fudan University; Ant Group
Subjects: Human-Computer Interaction (cs.HC)
Comments: The survey has been peer-reviewed and accepted by ACM CHI 2026

Abstract:Text plays a fundamental yet understudied role as a narrative device in data visualization. While existing research has extensively explored text as data input and interaction modality, its function in supporting storytelling and interpretation remains fragmented. To address this gap, this work presents a systematic review of 98 publications that provide insights into using text as narrative. We investigate how text can be utilized in visualization, analyze its functions and effects, and explore how it can be designed to facilitate data communication. Our synthesis identifies significant research gaps in this domain and proposes future directions to advance the integration of text and visualization, ultimately aiming to provide guidance for designing text that enhances narrative clarity and fosters engagement.

[HC-29] Parameter-Efficient Deep Learning for Ultrasound-Based Human-Machine Interfaces ICPR2026

【Quick Read】: This paper addresses the lack of systematic model comparison and optimization strategies for ultrasound (US)-based hand pose estimation (HPE) in human-machine interfaces (HMIs). Although US has shown the potential to support up to 23 degrees of freedom, there is no unified benchmark for evaluating different deep learning models, input modalities, and training strategies, and only one public dataset, Ultra-Pro, is available for reproducible work. The key to the solution is a systematic comparison of six deep learning models, which shows that using the envelope of the RF signals as the input modality together with a step learning rate scheduler yields significant gains: a 4-layer-deep UDACNN surpasses XceptionTime's accuracy by 2.28 percentage points with 87.52% fewer parameters, an absolute improvement of 0.88% over previously reported baselines. This demonstrates that the right combination of model architecture, preprocessing, and training algorithm is central to optimizing HMI performance.

Link: https://arxiv.org/abs/2603.15625
Authors: Antonios Lykourinas, Chinmay Pendse, Francky Catthoor, Veronique Rochus, Xavier Rottenberg, Athanassios Skodras
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 15 pages, 2 figures, Submitted to ICPR 2026

Abstract:Ultrasound (US) has emerged as a promising modality for Human-Machine Interfaces (HMIs), with recent research efforts exploring its potential for Hand Pose Estimation (HPE). A reliable solution to this problem could introduce interfaces with simultaneous support for up to 23 degrees of freedom encompassing all hand and wrist kinematics, thereby allowing far richer and more intuitive interaction strategies. Despite these promising results, a systematic comparison of models, input modalities and training strategies is missing from the literature. Moreover, there is only one publicly available dataset, namely the Ultrasound Adaptive Prosthetic Control (Ultra-Pro) dataset, enabling reproducible benchmarking and iterative model development. In this paper, we compare the performance of six different deep learning models, selected based on diverse criteria, on this benchmark. We demonstrate that, by using a step learning rate scheduler and the envelope of the RF signals as input modality, our 4-layer deep UDACNN surpasses XceptionTime’s performance by 2.28 percentage points while featuring 87.52% fewer parameters. This result ( 77.72% ) constitutes an absolute improvement of 0.88% from previously reported baselines. According to our findings, the appropriate combination of model, preprocessing and training algorithm is crucial for optimizing HMI performance.
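The step learning rate scheduler mentioned above follows the standard decay rule: multiply the base rate by a factor gamma every fixed number of epochs. The paper does not state its step size or decay factor here, so the defaults below are placeholders:

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Step scheduler: lr is multiplied by `gamma` once every
    `step_size` epochs (placeholder hyperparameters, not the paper's)."""
    return base_lr * gamma ** (epoch // step_size)

# Example schedule: 1e-3 for epochs 0-29, 1e-4 for 30-59, 1e-5 afterwards.
schedule = [step_lr(1e-3, e, step_size=30) for e in (0, 30, 65)]
```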

Computer Vision

[CV-0] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

【Quick Read】: This paper addresses two core challenges that video diffusion transformers face in interactive gaming world models: insufficient precision in user action control and the difficulty of maintaining 3D spatial consistency over long horizons. Existing methods treat user actions as abstract conditioning signals, overlooking the geometric coupling between actions and the 3D world: actions induce relative camera motions that accumulate globally into an absolute camera pose. The key to the solution is introducing camera pose as a unifying geometric representation that jointly grounds immediate action control and long-term 3D consistency. Concretely, the authors define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure action alignment; global camera poses are further used as spatial indices to retrieve past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation.

Link: https://arxiv.org/abs/2603.16871
Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page is available at this https URL

Abstract:Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
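The accumulation of relative camera motions into a global pose, which the paper exploits for spatial indexing, amounts to composing per-step rigid transforms. A toy sketch with yaw-plus-forward steps only (the actual model derives full 6-DoF poses via the Lie algebra; this is not the paper's code):

```python
import math

def matmul4(a, b):
    """4x4 homogeneous matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def yaw_translation(theta, tx, tz):
    """One relative camera motion: rotate `theta` about the up axis,
    then translate (tx, 0, tz) in the local frame."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c,   0.0, s,   tx],
            [0.0, 1.0, 0.0, 0.0],
            [-s,  0.0, c,   tz],
            [0.0, 0.0, 0.0, 1.0]]

# Global pose = running product of per-step relative poses.
# Four "turn 90 degrees, step forward 1" actions trace a closed square,
# so the accumulated pose returns to the identity.
pose = [[float(i == j) for j in range(4)] for i in range(4)]
for _ in range(4):
    pose = matmul4(pose, yaw_translation(math.pi / 2, 0.0, 1.0))
```

The closed-loop case is exactly why absolute pose works as a retrieval index: two action sequences ending at the same pose should see the same place.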

[CV-1] Demystifying Video Reasoning WWW

【Quick Read】: This paper investigates the underlying mechanism of the non-trivial reasoning abilities that diffusion models exhibit in video generation, challenging the prior assumption that attributes them to Chain-of-Frames (CoF) reasoning. The study shows that reasoning does not unfold sequentially across frames but emerges primarily within the diffusion denoising steps: as each frame is progressively denoised, the model explores multiple candidate solutions and converges to a final answer, a mechanism the authors name Chain-of-Steps (CoS). The key findings are three core emergent behaviors of the diffusion process (working memory, self-correction and enhancement, and perception before action) and a layer-wise functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers perform reasoning, and later layers consolidate latent representations. Building on this understanding, the authors propose a training-free strategy that improves reasoning by ensembling the latent trajectories of identical models under different random seeds, providing a systematic framework for exploiting the inherent reasoning dynamics of video generation models.

Link: https://arxiv.org/abs/2603.16870
Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Homepage: this https URL

Abstract:Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
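The training-free seed-ensembling idea can be mimicked on a toy "denoising trajectory": averaging the final latents of identical runs under different random seeds reduces the variance of the estimate. Everything below is a stand-in for intuition, not the authors' model or procedure:

```python
import random

def denoise_trajectory(seed, steps=10, signal=1.0):
    """Toy latent trajectory: per-step estimates of a target `signal`
    whose noise anneals over the denoising steps."""
    rng = random.Random(seed)
    traj = []
    for t in range(steps):
        noise_scale = 1.0 - t / steps      # noise shrinks as denoising proceeds
        traj.append(signal + noise_scale * rng.gauss(0.0, 1.0))
    return traj

# Ensemble: average the final latents of identical runs under
# different seeds; the mean concentrates around the true signal.
finals = [denoise_trajectory(seed)[-1] for seed in range(16)]
ensemble = sum(finals) / len(finals)
```

With n independent runs the variance of the ensembled estimate drops roughly by a factor of n, which is the statistical lever behind the strategy.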

[CV-2] SegviGen: Repurposing 3D Generative Model for Part Segmentation

【Quick Read】: This paper addresses two problems with existing approaches to 3D part segmentation: methods that lift 2D priors into 3D tend to produce cross-view inconsistencies and blurred boundaries, while purely 3D discriminative segmentation typically requires large-scale annotations and substantial training resources. The key to the solution is the SegviGen framework, which exploits the structural priors encoded in a pretrained 3D generative model, inducing segmentation by predicting part-distinctive colors on the active voxels of a geometry-aligned reconstruction, thereby enabling efficient, low-supervision 3D part segmentation. The method improves over the prior state of the art by 40% on interactive segmentation and 15% on full segmentation while using only 0.32% of the labeled data, demonstrating that 3D generative priors transfer effectively to 3D part segmentation.

Link: https://arxiv.org/abs/2603.16869
Authors: Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng
Affiliations: Beihang University; Renmin University of China; Tsinghua University; OriginArk
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at this https URL.

[CV-3] MessyKitchens: Contact-rich object-level 3D scene reconstruction

【Quick Read】: This paper targets the challenge of object-level reconstruction in monocular 3D scene reconstruction, specifically the joint modeling of high-fidelity multi-object 3D shapes, poses, and physically plausible contacts in cluttered environments. Existing methods handle the shape and pose of individual objects well, yet still struggle in complex scenes with frequent occlusions, diverse object morphologies, and inter-object physical plausibility (non-penetration and realistic contacts). The solution has two key components: the MessyKitchens dataset, which provides high-fidelity object-level ground truth for real scenes (3D shapes, poses, and accurate contacts), substantially improving registration accuracy and reducing inter-object penetration; and the Multi-Object Decoder (MOD), which extends the SAM 3D single-object reconstruction framework to jointly reason about the geometry and spatial relations of multiple objects, yielding more physically plausible object-level scene reconstruction.

Link: https://arxiv.org/abs/2603.16868
Authors: Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduceMessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate MessyKitchens to significantly improve previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: this https URL.

[CV-4] SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

【Quick Read】: This paper addresses the lack of controllability in existing video super-resolution (VSR) methods at inference time: the model behaves like a black box, so users cannot correct unexpected artifacts and must passively accept the output. The key to the solution is SparkVSR, an interactive VSR framework that uses sparse keyframes as a concise yet expressive control signal to guide super-resolution of the entire video sequence. Concretely, SparkVSR adopts a keyframe-conditioned latent-pixel two-stage training pipeline that fuses low-resolution (LR) video latents with sparsely encoded high-resolution (HR) keyframe latents, learning cross-space propagation and refining perceptual details; at inference, it supports flexible keyframe selection strategies and a reference-free guidance mechanism that dynamically balances keyframe adherence against blind restoration, improving temporal consistency and reconstruction quality and clearly outperforming baselines on multiple benchmarks.

Link: https://arxiv.org/abs/2603.16864
Authors: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang, Balu Adsumilli, Zhengzhong Tu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, then SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: this https URL

[CV-5] SOMA: Unifying Parametric Human Body Models

【Quick Read】: This paper addresses the mutual incompatibility of today's parametric human body models (SMPL, SMPL-X, MHR, Anny, and others) in mesh topology, skeletal structure, shape parameterization, and unit conventions, which makes it impractical to exploit their complementary strengths within a single pipeline. The key to the solution is the SOMA unified body layer, which bridges heterogeneous representations through three abstraction layers: mesh topology abstraction, which maps any source model to a shared canonical mesh in constant time; skeletal abstraction, which recovers identity-adapted joint transforms in closed form from any body shape (in rest pose or an arbitrary posed configuration) without iterative optimization or per-model training; and pose abstraction, which inverts the skinning pipeline to recover unified skeleton rotations directly from the posed vertices of any supported model, avoiding custom retargeting of heterogeneous motion datasets. Together these layers reduce the O(M²) pairwise-adapter problem to O(M) single-backend connectors, let users freely mix identity sources and pose data at inference time, and keep the whole pipeline end-to-end differentiable and GPU-accelerated via NVIDIA-Warp.

Link: https://arxiv.org/abs/2603.16858
Authors: Jun Saito, Jiefeng Li, Michael de Ruyter, Miguel Guerrero, Edy Lim, Ehsan Hassani, Roger Blanco Ribera, Hyejin Moon, Magdalena Dadela, Marco Di Lucca, Qiao Wang, Xueting Li, Jan Kautz, Simon Yuen, Umar Iqbal
Affiliations: NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology abstraction maps any source model’s identity to a shared canonical mesh in constant time per vertex. Skeletal abstraction recovers a full set of identity-adapted joint transforms from any body shape, whether in rest pose or an arbitrary posed configuration, in a single closed-form pass, with no iterative optimization or per-model training. Pose abstraction inverts the skinning pipeline to recover unified skeleton rotations directly from posed vertices of any supported model, enabling heterogeneous motion datasets to be consumed without custom retargeting. Together, these layers reduce the O(M^2) per-pair adapter problem to O(M) single-backend connectors, letting practitioners freely mix identity sources and pose data at inference time. The entire pipeline is fully differentiable end-to-end and GPU-accelerated via NVIDIA-Warp.
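SOMA's pose abstraction inverts the skinning pipeline; the forward direction being inverted is standard linear blend skinning (LBS), in which each posed vertex is a weight-blended sum of per-joint rigid transforms applied to the rest vertex. A single-vertex sketch with toy transforms (not SOMA's skeleton, weights, or code):

```python
import math

def lbs_vertex(v, joints, weights):
    """Linear blend skinning for one rest-pose vertex `v`:
    posed(v) = sum_j w_j * (R_j @ v + t_j)."""
    x = [0.0, 0.0, 0.0]
    for (R, t), w in zip(joints, weights):
        for i in range(3):
            x[i] += w * (sum(R[i][k] * v[k] for k in range(3)) + t[i])
    return x

def rot_z(a):
    """Rotation about the z axis by angle `a`."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
            [0.0, 0.0, 0.0])
bend = (rot_z(math.pi / 2), [0.0, 0.0, 0.0])

# A vertex fully bound to the bending joint rotates with it:
tip = lbs_vertex([1.0, 0.0, 0.0], [identity, bend], [0.0, 1.0])
```

Pose abstraction runs this map in reverse: given posed vertices and known weights, it solves for the joint rotations, which is what lets SOMA ingest motion data from any supported model.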

[CV-6] M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

【Quick Read】: This paper tackles the difficulty of accurate, real-time streaming reconstruction from uncalibrated monocular video, whose core challenges are precise pose estimation and computationally efficient online refinement in dynamic environments. The key to the solution is the M^3 framework, which augments a multi-view foundation model with a dedicated Matching head that produces fine-grained dense correspondences and integrates it into a robust monocular Gaussian Splatting SLAM system; dynamic-region suppression and cross-inference intrinsic alignment further improve tracking stability, yielding significant gains in pose estimation accuracy and scene reconstruction quality.

Link: https://arxiv.org/abs/2603.16844
Authors: Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

点击查看摘要

Abstract:Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
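For readers unfamiliar with the ATE RMSE metric quoted above, a minimal translation-only version can be sketched as follows. Standard SLAM evaluation additionally aligns rotation and scale (e.g. via Umeyama alignment); this simplified form and its function name are illustrative, not from the paper's code:

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute Trajectory Error (RMSE) between estimated and ground-truth
    positions (n, 3), after removing the mean offset of each trajectory.
    Full evaluations also solve for the best rotation and scale."""
    est = est - est.mean(axis=0)   # translation-only alignment
    gt = gt - gt.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))
```

A perfect trajectory (or one that differs only by a constant offset) scores 0 under this simplified metric.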

[CV-7] What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在处理无方向偏好图像(如材料科学中的均质微观结构截面)时存在的位置偏差问题,这种偏差会阻碍零样本适配(zero-shot adaptation)的有效性。其关键解决方案是通过微调模型以使用ALiBi(Attention with Linear Biases)相对位置编码机制,从而减少模型对绝对位置信息的依赖,同时保留其语义表征能力;实验证明,此类去偏后的模型在复杂显微图像的可训练分割任务中表现优异。

链接: https://arxiv.org/abs/2603.16840
作者: Moritz Pawlowsky,Antonis Vamvakeros,Alexander Weiss,Anja Bielefeld,Samuel J. Cooper,Ronan Docherty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
备注:

点击查看摘要

Abstract:Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.
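As background, ALiBi replaces learned positional embeddings with a distance-proportional bias added to the attention logits. A minimal sketch of a symmetric 1D variant is below; the original ALiBi formulation is causal, and ViT adaptations typically operate on 2D patch grids, so treat this as illustrative only:

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Symmetric ALiBi-style bias: per-head slopes m_h = 2^(-8h/H) applied to
    negative absolute token distances, to be added to attention logits."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = -np.abs(pos[None, :] - pos[:, None])   # (L, L), 0 on the diagonal
    return slopes[:, None, None] * dist[None]     # (H, L, L)
```

Because the bias depends only on relative distance, it carries no absolute-position signal, which is the property exploited here to reduce positional bias.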

[CV-8] An assessment of data-centric methods for label noise identification in remote sensing data sets

【速读】:该论文旨在解决遥感(remote sensing)领域中标签噪声(label noise)对深度学习模型泛化能力的严重影响问题,尤其是当前缺乏系统性评估数据驱动方法在识别与隔离噪声标签方面的有效性。解决方案的关键在于引入并评估三种数据中心(data-centric)方法,这些方法不仅能主动识别和过滤标签噪声,还能通过净化训练数据提升下游任务性能;研究通过在两个基准数据集上注入不同类型的标签噪声(噪声比例10%–70%),量化分析了这些方法在噪声过滤能力和任务性能改善上的表现,从而为实际应用提供了可操作的指导,并指出了未来在遥感场景下迁移此类方法的研究方向。

链接: https://arxiv.org/abs/2603.16835
作者: Felix Kröber,Genc Hoxha,Ribana Roscher
机构: Institute of Bio- and Geosciences, Forschungszentrum Jülich (于利希研究中心生物与地球科学研究所); Institute of Geodesy and Geoinformation, University of Bonn (波恩大学大地测量与地球信息学研究所)

类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in International Society for Photogrammetry and Remote Sensing (ISPRS) Annals 2026

点击查看摘要

Abstract:Label noise in the sense of incorrect labels is present in many real-world data sets and is known to severely limit the generalizability of deep learning models. In the field of remote sensing, however, automated treatment of label noise in data sets has received little attention to date. In particular, there is a lack of systematic analysis of the performance of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. In this paper, we examine three such methods and evaluate their behavior under different label noise assumptions. To do this, we inject different types of label noise with noise levels ranging from 10 to 70% into two benchmark data sets, followed by an analysis of how well the selected methods filter the label noise and how this affects task performances. With our analyses, we clearly prove the value of data-centric methods for both parts - label noise identification and task performance improvements. Our analyses provide insights into which method is the best choice depending on the setting and objective. Finally, we show in which areas there is still a need for research in the transfer of data-centric label noise methods to remote sensing data. As such, our work is a step forward in bridging the methodological establishment of data-centric label noise methods and their usage in practical settings in the remote sensing domain.
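The noise-injection protocol described above (labels flipped at rates from 10 to 70%) can be sketched for the symmetric case as follows; the function is a generic illustration, not the authors' code:

```python
import numpy as np

def inject_symmetric_noise(labels, num_classes, noise_rate, rng):
    """Flip each label to a different, uniformly chosen class with
    probability noise_rate; return the noisy labels and the flip mask."""
    labels = np.asarray(labels).copy()
    flip = rng.random(len(labels)) < noise_rate
    for i in np.where(flip)[0]:
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels, flip
```

The flip mask doubles as ground truth when measuring how well a data-centric method identifies the injected noise.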

[CV-9] Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines

【速读】:该论文旨在解决沉浸式扩展现实(Extended Reality, XR)应用在资源受限设备上运行时,如何在严格实时延迟要求与设备电池寿命之间实现有效平衡的问题。其核心挑战在于现有计算卸载策略通常仅优化平均性能指标,未能充分考虑闭环XR工作负载中实时延迟需求与电池动态之间的持续交互关系。解决方案的关键在于提出一种面向电池感知的执行管理框架,通过轻量级深度强化学习策略设计在线决策机制,在动态网络条件下持续调整执行位置(本地或边缘服务器),从而同时保障高运动到光子(motion-to-photon)延迟合规性与延长设备电池寿命;实验表明,该方法相较纯本地延迟最优执行可使设备电池寿命延长达163%,且在稳定网络下保持超90%的延迟合规率,即使在网络带宽受限时仍不低于80%。

链接: https://arxiv.org/abs/2603.16823
作者: Sourya Saha(City University of New York),Saptarshi Debroy(City University of New York)
机构: City University of New York (纽约市立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the The 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026)

点击查看摘要

Abstract:Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.

[CV-10] WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

【速读】:该论文旨在解决动物场景下深度估计与三维重建中因缺乏带度量尺度的训练数据而导致模型可靠性不足的问题。现有大多数模型依赖于无尺度信息的数据集,难以有效验证仅基于图像的深度估计性能。其解决方案的关键在于构建了WildDepth——一个包含同步RGB与LiDAR数据的多模态数据集及基准测试套件,覆盖从家养到野生动物的多样化物种及其复杂环境。实验表明,利用多模态数据可将深度估计的均方根误差(RMSE)降低最多10%,而RGB-LiDAR融合则使三维重建在Chamfer距离指标上提升12%,从而显著增强系统对跨域动物场景的鲁棒感知能力。

链接: https://arxiv.org/abs/2603.16816
作者: Muhammad Aamir,Naoya Muramatsu,Sangyun Shin,Matthew Wijers,Jiaxing Jhong,Xinyu Hou,Amir Patel,Andrew Markham
机构: University of Oxford (牛津大学); University of Cape Town (开普敦大学); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注:

点击查看摘要

Abstract:Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to address general objects, including challenging deformable objects, such as humans and animals. However, for the animal, in particular, the majority of existing models are trained based on datasets without metric scale, which can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction from diverse categories of animals ranging from domestic to wild environments with synchronized RGB and LiDAR. Experimental results show that the use of multi-modal data improves depth reliability by up to 10% RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.
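The Chamfer distance used above to measure reconstruction fidelity has a simple brute-force form (quadratic in the number of points; practical evaluations use KD-trees for the nearest-neighbor step):

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (n, 3) and B (m, 3):
    mean nearest-neighbor squared distance, averaged in both directions."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)  # (n, m)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```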

[CV-11] Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

【速读】:该论文旨在解决引导扩散采样(guided diffusion sampling)中因难以精确计算似然梯度(likelihood scores)而导致的采样动态噪声过大问题。其解决方案的关键在于引入自适应矩估计(adaptive moment estimation),通过稳定采样过程中不稳定的似然梯度来降低噪声影响,从而提升图像修复和类别条件生成任务中的对齐效果。该方法虽简单,却在多个任务上达到当前最优性能,且优于许多计算复杂度更高的替代方案。

链接: https://arxiv.org/abs/2603.16797
作者: Christian Belardi,Justin Lovelace,Kilian Q. Weinberger,Carla P. Gomes
机构: Cornell University (康奈尔大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.
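The core idea, smoothing noisy likelihood scores with Adam-style moments, can be sketched in isolation as below. This is a generic rendering of adaptive moment estimation applied to a guidance signal, not necessarily the authors' exact update rule:

```python
import numpy as np

def adaptive_moment_guidance(noisy_scores, beta1=0.9, beta2=0.999, eps=1e-8):
    """Stabilize a sequence of noisy likelihood-score estimates with
    bias-corrected Adam-style first and second moments."""
    m = np.zeros_like(noisy_scores[0])
    v = np.zeros_like(noisy_scores[0])
    out = []
    for t, g in enumerate(noisy_scores, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment (running mean)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        out.append(m_hat / (np.sqrt(v_hat) + eps))
    return out
```

The per-coordinate normalization by the second moment is what damps high-variance components of the approximate likelihood score.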

[CV-12] V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

【速读】:该论文旨在解决像素空间扩散模型(pixel-space diffusion)因缺乏强语义监督和高阶视觉结构建模能力而导致生成质量受限的问题。现有方法虽尝试通过预训练视觉特征提升扩散训练效果,但其视觉协同去噪(visual co-denoising)策略常因多种设计选择混杂而难以明确关键因素。为此,作者提出V-Co,在统一的JiT框架下系统性地研究视觉协同去噪,揭示出四个核心要素:1)采用全双流架构以保留特征特异性计算并实现灵活跨流交互;2)通过结构化无条件预测实现有效的无分类器引导(classifier-free guidance, CFG);3)使用感知漂移混合损失(perceptual-drifting hybrid loss)提供更强语义监督;4)借助基于RMS的特征归一化实现跨流校准以保障稳定协同去噪。这些发现共同构成一个简洁有效的视觉协同去噪方案,在ImageNet-256上验证了其在模型规模相当条件下优于基线与现有先进像素扩散方法,且训练周期更短。

链接: https://arxiv.org/abs/2603.16792
作者: Han Lin,Xichen Pan,Zun Wang,Yue Zhang,Chu Wang,Jaemin Cho,Mohit Bansal
机构: UNC Chapel Hill(北卡罗来纳大学教堂山分校); NYU(纽约大学); Meta(Meta); AI2(艾伦人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: code: this https URL

点击查看摘要

Abstract:Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
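Of the four ingredients, the RMS-based feature rescaling admits a compact sketch. The version below is one plausible reading (matching a feature tensor's root-mean-square to a target value); the function name and target convention are assumptions, not the paper's definition:

```python
import numpy as np

def rms_rescale(feats, target_rms=1.0, eps=1e-8):
    """Rescale a feature tensor so its root-mean-square matches target_rms,
    a simple way to calibrate magnitudes across two co-denoising streams."""
    rms = np.sqrt(np.mean(feats ** 2))
    return feats * (target_rms / (rms + eps))
```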

[CV-13] IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

【速读】:该论文旨在解决当前基于口腔内扫描(3D intraoral scans, IOS)的多疾病联合诊断与生成式视觉问答(VQA)任务中,现有方法未能充分挖掘原生3D几何信息的问题。其核心挑战包括:异构扫描形式与复杂拓扑结构、多种疾病共存下的类别不平衡与形态学细粒度模糊性,以及有限的配对3D IOS-文本数据。解决方案的关键在于提出IOSVLM——一个端到端的3D视觉语言模型(VLM),将IOS表示为点云,并采用3D编码器-投影器-大语言模型(LLM)架构实现统一诊断与生成式VQA;同时构建了大规模多源IOS诊断VQA数据集IOSVQA(含19,002例和249,055个VQA对),并创新性地设计几何到色度代理(geometry-to-chromatic proxy)以缓解无颜色IOS数据与依赖颜色的3D预训练之间的分布差异,辅以两阶段课程训练策略提升鲁棒性,最终在宏平均准确率和F1分数上显著优于基线模型。

链接: https://arxiv.org/abs/2603.16781
作者: Huimin Xiong,Zijie Meng,Tianxiang Hu,Chenyi Zhou,Yang Feng,Zuozhu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

[CV-14] GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

【速读】:该论文旨在解决一阶段生成式图像超分辨率(One-step Generative Image Super-Resolution, ISR)中因随机性不足导致性能受限的问题,以及现有强化学习(Reinforcement Learning, RL)方法如直接偏好优化(Direct Preference Optimization, DPO)依赖离线正负样本对、样本数量有限,而群体相对策略优化(Group Relative Policy Optimization, GRPO)仅评估整图似然、忽略局部细节的局限。其解决方案的关键在于提出一种新的群体直接偏好优化(Group Direct Preference Optimization, GDPO)方法:首先设计了一个噪声感知的一阶段扩散模型,通过不等时间步策略将噪声注入与扩散过程解耦以维持性能;其次在GDPO中融合GRPO的群体相对优势计算机制至DPO框架,实现对在线生成样本的组级优势评估,并引入属性感知奖励函数,依据图像平滑区域与纹理区域的统计特性动态评分,从而有效提升一阶段生成式ISR模型的质量。

链接: https://arxiv.org/abs/2603.16769
作者: Qiaosi Yi,Shuai Li,Rongyuan Wu,Lingchen Sun,Zhengqiang Zhang,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); OPPO Research Institute (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: this https URL.
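The group-relative advantage that GDPO borrows from GRPO is, in its standard form, a within-group standardization of per-sample rewards. A minimal version (the paper's exact normalization may differ):

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each sample's reward within its
    group of rollouts generated from the same input."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Samples above the group mean receive positive advantage and are reinforced; those below are penalized, without needing a learned value baseline.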

[CV-15] Dual Stream Independence Decoupling for True Emotion Recognition under Masked Expressions

【速读】:该论文旨在解决从伪装表情中准确识别真实情绪的问题,尤其针对现有方法依赖“起始帧”(onset frame)导致的真实情绪信息泄露、无法反映稳定伪装状态的局限性。其核心解决方案是提出一种基于“峰值帧”(apex frame)的新范式,该帧代表情绪伪装达到稳定状态的时刻,从而更真实地模拟实际情境;同时设计了一种双流独立解耦框架(dual stream independence decoupling framework),通过分离真实情绪特征与伪装情绪特征,避免二者干扰,其中关键创新在于引入由两类分类损失和希尔伯特-施密特独立性损失(Hilbert-Schmidt Independence loss)组成的解耦损失组,以提升两组特征的独立性并增强识别性能。

链接: https://arxiv.org/abs/2603.16760
作者: Jinsheng Wei,Xiguang Zhang,Zheng Shi,Guanming Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recognizing true emotions from masked expressions is extremely challenging due to deliberate concealment. Existing paradigms recognize true emotions from masked-expression clips that contain onset frames in which the disguise is just beginning. However, this paradigm may not reflect the actual disguised state, as the onset frame leaks the true emotional information without reaching a stable disguise state. Thus, this paper introduces a novel apex-frame-based paradigm that classifies true emotions from the apex frame with a stable disguised state. Furthermore, this paper proposes a novel dual stream independence decoupling framework that decouples true and disguised emotion features, avoiding the interference of disguised emotions on true emotions. For efficient decoupling, we design a decoupling loss group, comprising two classification losses that learn true emotion and disguised expression features, respectively, and a Hilbert-Schmidt Independence loss that enhances the independence of the two features. Experiments demonstrate that the apex-frame-based paradigm is challenging, and the proposed decoupling framework improves recognition performance.
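The Hilbert-Schmidt Independence loss mentioned above penalizes statistical dependence between the two feature streams. A minimal biased HSIC estimator with RBF kernels looks like this; the bandwidth and batch handling are illustrative choices:

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between feature batches X, Y of shape (n, d),
    using RBF kernels; near zero for independent features."""
    n = X.shape[0]
    def rbf(A):
        sq = np.sum(A * A, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
        return np.exp(-d2 / (2 * sigma ** 2))
    K, L = rbf(X), rbf(Y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Minimizing this quantity pushes the true-emotion and disguised-expression features toward statistical independence.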

[CV-16] SuCor: Susceptibility Distortion Correction via Parameter-Free and Self-Regularized Optimal Transport

【速读】:该论文旨在解决回波平面成像(Echo Planar Imaging, EPI)中由磁化率(susceptibility)差异引起的几何失真问题,这类失真严重影响了功能磁共振成像(fMRI)和弥散加权成像(DWI)的空间准确性。其解决方案的关键在于提出SuCor方法,利用最优传输(Optimal Transport, OT)理论在相位编码方向上建模失真场:将每列失真视为正负相位编码EPI图像强度剖面之间的Wasserstein-2重心(barycentric)位移,并通过谱域的弯曲能量惩罚项进行正则化,其强度由Morozov偏差原理自动选择,无需人工调参。此方法在人类连接组计划(HCP)数据集上实现了比FSL TOPUP更高的体积互信息(0.341 vs. 0.317),且单核CPU运行仅需约12秒。

链接: https://arxiv.org/abs/2603.16758
作者: Sreekar Chigurupati,Eleftherios Garyfallidis
机构: Indiana University (印第安纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present SuCor, a method for correcting susceptibility induced geometric distortions in echo planar imaging (EPI) using optimal transport (OT) along the phase encoding direction. Given a pair of reversed phase encoding EPI volumes, we model each column of the distortion field as a Wasserstein-2 barycentric displacement between the opposing-polarity intensity profiles. Regularization is performed in the spectral domain using a bending-energy penalty whose strength is selected automatically via the Morozov discrepancy principle, requiring no manual tuning. On a human connectome project (HCP) dataset with left-right/right-left b0 EPI pairs and a co-registered T1 structural reference, SuCor achieves a mean volumetric mutual information of 0.341 with the T1 image, compared to 0.317 for FSL TOPUP, while running in approximately 12 seconds on a single CPU core.
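In 1D, Wasserstein-2 optimal transport reduces to matching cumulative distributions, and a midpoint (barycentric) displacement halves the resulting transport map. A schematic version of such a column-wise computation, with the paper's spectral regularization and polarity handling omitted:

```python
import numpy as np

def barycentric_displacement(p, q):
    """Midpoint W2 displacement between two nonnegative 1D intensity
    profiles p, q along the phase-encoding axis. Each profile is normalized
    to a probability mass function; the transport map is obtained by CDF
    matching, and the barycentric displacement moves each coordinate
    halfway along that map."""
    x = np.arange(len(p), dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    Fp = np.cumsum(p)
    Fq = np.cumsum(q)
    T = np.interp(Fp, Fq, x)   # position in q whose CDF matches Fp(x)
    return (T - x) / 2.0       # halfway toward the W2 barycenter
```

For two profiles that are pure shifts of one another, the displacement is half the shift, which is exactly the symmetric correction expected from reversed phase-encoding pairs.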

[CV-17] Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation AAAI2026

【速读】:该论文旨在解决纺织品图案生成(Textile Pattern Generation, TPG)任务中因服装图像中复杂织物纹理与非刚性形变特征混淆而导致的细节失真问题。现有图像到图像转换模型在直接应用于TPG时往往无法保留精细的织物图案信息,导致生成结果不忠实于原始输入。解决方案的关键在于提出一种两阶段方法SLDDM-TPG:首先设计一个潜在解耦网络(Latent Disentangled Network, LDN),通过构建多维独立的服装特征空间来消除特征混淆;其次引入半监督潜在扩散模型(Semi-supervised Latent Diffusion Model, S-LDM),利用LDN提供的引导信号并结合细粒度对齐策略进行训练,从而实现高保真、忠实的纺织品图案生成。

链接: https://arxiv.org/abs/2603.16747
作者: Chenggong Hu,Yi Wang,Mengqi Xue,Haofei Zhang,Jie Song,Li Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures, accepted by AAAI 2026, the code is available at this https URL

点击查看摘要

Abstract:Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose a novel method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by 4.1 and improves SSIM by up to 0.116 on our CTP-HD dataset, and also demonstrate good generalization on the VITON-HD dataset.

[CV-18] When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

【速读】:该论文旨在解决自动驾驶中3D感知模型训练对大规模标注数据的高度依赖问题,尤其是在城市环境多样化背景下,传统依赖人工标注的数据采集方式变得不可持续。其核心解决方案是提出“基础设施教学的无标签3D感知”(infrastructure-taught, label-free 3D perception)范式,关键在于利用道路旁部署的静态传感器(即路边单元,RSUs)作为固定视角的无监督教师模型,通过自身对场景的重复观测学习局部3D检测器,并将预测结果广播给经过的车辆;这些预测被车辆聚合为伪标签监督信号,用于训练独立运行的车载检测器(ego detector)。此方法在测试阶段无需任何基础设施或通信支持,且实验证明其在CARLA多智能体环境中可实现82.3%的车辆检测AP,接近全监督上限(94.4%),表明城市基础设施本身可作为可扩展的监督信号源,为降低3D感知标注成本提供了一种新的、与现有自监督方法互补的技术路径。

链接: https://arxiv.org/abs/2603.16742
作者: Zhen Xu,Jinsu Yoo,Cristian Bautista,Zanming Huang,Tai-Yu Pan,Zhenzhen Liu,Katie Z Luo,Mark Campbell,Bharath Hariharan,Wei-Lun Chao
机构: The Ohio State University (俄亥俄州立大学); Google(谷歌); Cornell University (康奈尔大学); Stanford University (斯坦福大学); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

[CV-19] World Reconstruction From Inconsistent Views ATC WWW

【速读】:该论文旨在解决视频扩散模型生成的视频帧之间缺乏3D一致性的问题,这导致从视频中重建高质量3D场景变得困难。解决方案的关键在于提出一种非刚性对齐方法,首先利用几何基础模型将每帧图像转化为像素级的3D点云(pointcloud),随后通过定制的非刚性迭代ICP(Iterative Closest Point)算法实现多帧间的初始对齐,并进一步通过全局优化提升点云的锐利度与细节;最终,以该点云作为初始化,结合新颖的逆向变形渲染损失(inverse deformation rendering loss),生成高保真且可交互的3D环境,从而有效将不一致的视频视角转化为具有3D一致性的世界表示。

链接: https://arxiv.org/abs/2603.16736
作者: Lukas Höllein,Matthias Nießner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project website: this https URL , video: this https URL , code: this https URL

点击查看摘要

Abstract:Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally-consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high quality and explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.

[CV-20] Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring

【速读】:该论文旨在解决大规模课堂中教师与学生互动受限、难以实时捕捉学生情绪和参与度的问题,从而提升教学效果。其解决方案的关键在于构建一个面向物联网(IoT)设备的高通量、实时多智能体情感计算框架,通过优化负载均衡与延迟控制实现高效处理,并基于自建的“Classroom Emotion Dataset”进行训练与验证,最终在三所不同教育阶段的学校中实现了高达88%的整体准确率,且具备良好的实用性与可扩展性。

链接: https://arxiv.org/abs/2603.16719
作者: Hai Nguyen,Hieu Dao,Hung Nguyen,Nam Vu,Cong Tran
机构: PTIT University (PTIT 大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study presents a high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large class sizes and limited teacher-student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students’ emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the ‘Classroom Emotion Dataset’ to facilitate further validation and research.

[CV-21] Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

【速读】:该论文旨在解决图像到视频生成中对象级运动编辑的难题,即如何在不依赖轨迹、边界框、掩码或运动场等标注信息的情况下,实现对特定物体的精准位移控制,同时保持场景整体稳定性。其核心解决方案是提出了一种无需训练的框架Search2Motion,通过基于目标帧的控制机制,利用首尾帧间的运动先验来实现物体重定位;关键创新在于借助早期步骤的自注意力图预测物体与相机动态,从而设计出轻量级的ACE-Seed(Attention Consensus for Early-step Seed selection)搜索策略,在无需前视采样或外部评估器的前提下显著提升运动保真度,并引入新的基准S2M-DAVIS和S2M-OMB以及FLF2V-obj指标以更准确地评估对象运动质量。

链接: https://arxiv.org/abs/2603.16711
作者: Sainan Liu,Tz-Ying Wu,Hector A Valdez,Subarna Tripathi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.
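The attention-consensus idea behind ACE-Seed can be illustrated with a toy scoring rule: rank candidate seeds by how closely their early-step attention maps agree with the group mean. The actual criterion in the paper may differ; this is a plausible sketch only, with illustrative names:

```python
import numpy as np

def select_seed_by_consensus(attn_maps):
    """Score each candidate's (nonzero) early-step attention map by cosine
    similarity to the group mean and return the index of the best one."""
    maps = np.stack([m.ravel() / np.linalg.norm(m) for m in attn_maps])
    mean = maps.mean(axis=0)
    scores = maps @ mean   # cosine similarity to the consensus map
    return int(np.argmax(scores)), scores
```

A seed whose attention map deviates from the consensus (e.g. predicting spurious camera motion) receives a low score and is discarded without any look-ahead sampling.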

[CV-22] vAccSOL: Efficient and Transparent AI Vision Offloading for Mobile Robots

【速读】:该论文旨在解决移动机器人在资源受限环境下执行AI视觉任务时面临的计算能力和能耗瓶颈问题。当前机器人平台通常依赖嵌入式加速器,但其专有软件栈限制了用户自定义模型的部署,导致复杂视觉工作负载只能运行在性能有限的配套计算单元上,难以满足实时性和能效需求。解决方案的关键在于提出vAccSOL框架,该框架由两部分组成:SOL(神经网络编译器),用于生成低依赖、高优化的推理库;以及vAccel(轻量级执行框架),可透明地将推理任务调度至本地机器人或邻近边缘基础设施。这种设计实现了硬件感知的高效推理与灵活的任务分发,无需修改现有机器人应用即可显著提升性能并降低功耗——实测表明,在电池供电机器人上可减少高达80%的本地能耗,同时将视觉流水线帧率最高提升24倍,有效延长续航时间。

链接: https://arxiv.org/abs/2603.16685
作者: Adam Zahir,Michele Gucciardo,Falk Selker,Anastasios Nanos,Kostis Papazafeiropoulos,Carlos J. Bernardos,Nicolas Weber,Roberto Gonzalez
机构: University Carlos III of Madrid (UC3M)(马德里卡洛斯三世大学); NEC Laboratories Europe (NEC欧洲实验室); Nubificus Ltd(纽比菲库斯有限公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mobile robots are increasingly deployed for inspection, patrol, and search-and-rescue operations, relying on computer vision for perception, navigation, and autonomous decision-making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user-defined workloads to run on resource-constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI-based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware-optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real-world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot-side power consumption by up to 80% and edge-side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery-powered robots.

[CV-23] HMAR: Hierarchical Modality-Aware Expert and Dynamic Routing Medical Image Retrieval Architecture

【速读】:该论文旨在解决医学图像检索(Medical Image Retrieval, MIR)系统中存在的三个关键问题:一是统一特征编码无法反映不同解剖结构的临床重要性差异;二是基于粗粒度分类标签的相似性度量存在歧义;三是现有方法仅关注全局图像相似性,难以满足临床对细粒度区域特定检索的需求。解决方案的核心在于提出HMAR(Hierarchical Modality-Aware Expert and Dynamic Routing)框架,其基于Mixture-of-Experts(MoE)架构,采用双专家机制:Expert0提取全局特征用于整体相似性匹配,Expert1学习位置不变的局部表示以实现病变区域的精准检索;同时结合两阶段对比学习策略(无需昂贵边界框标注)和滑动窗口匹配算法,在推理时实现密集局部比对,并利用Kolmogorov-Arnold Network(KAN)层生成哈希码,支持高效的汉明距离搜索,从而显著提升检索精度与临床实用性。

链接: https://arxiv.org/abs/2603.16679
作者: Aojie Yuan
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, 1 table

点击查看摘要

Abstract:Medical image retrieval (MIR) is a critical component of computer-aided diagnosis, yet existing systems suffer from three persistent limitations: uniform feature encoding that fails to account for the varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and an exclusive focus on global image similarity that cannot meet the clinical demand for fine-grained region-specific retrieval. We propose HMAR (Hierarchical Modality-Aware Expert and Dynamic Routing), an adaptive retrieval framework built on a Mixture-of-Experts (MoE) architecture. HMAR employs a dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for precise lesion-region retrieval. A two-stage contrastive learning strategy eliminates the need for expensive bounding-box annotations, and a sliding-window matching algorithm enables dense local comparison at inference time. Hash codes are generated via Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance search. Experiments on the RadioImageNet-CT dataset (16 clinical patterns, 29,903 images) show that HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively.
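摘要中"哈希码 + 汉明距离搜索"的检索步骤,可以用下面的极简 Python 代码理解(笔者根据摘要自行构造的示意草图,并非论文实现;数据与比特数均为假设):

```python
import numpy as np

def hamming_search(query_code, db_codes, top_k=3):
    """在二值哈希码数据库中按汉明距离检索最相近条目(示意)。
    query_code: (B,) 的 0/1 向量;db_codes: (N, B)。"""
    # 汉明距离 = 不相同比特的个数
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    order = np.argsort(dists, kind="stable")
    return order[:top_k], dists[order[:top_k]]

# 构造 1000 条 64-bit 哈希码,翻转其中一条的 3 个比特模拟一次近邻查询
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 64))
q = db[42].copy()
q[:3] ^= 1
idx, d = hamming_search(q, db)  # idx[0] 应为 42,d[0] 应为 3
```

汉明距离可由按位异或后数 1 得到,在大规模数据库上远快于浮点相似度计算,这也是此类方法采用哈希码检索的常见动机。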

[CV-24] x2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space CVPR2026

【速读】:该论文旨在解决多模态传感器(图像、LiDAR 和事件相机)在估计密集 2D 光学流(optical flow)与 3D 场景流(scene flow)时,因缺乏共享潜在空间而导致的跨模态不匹配问题。现有方法通常在异构特征空间中独立处理各模态信息,难以实现有效融合。其解决方案的关键在于提出一种名为 x²-Fusion 的新框架,通过引入“事件边缘空间”(Event Edge Space)——即利用事件相机天然提供的时空边缘信号构建统一的边缘中心型同质表示空间,使图像和 LiDAR 特征能够在此空间中显式对齐;同时结合可靠性感知的自适应融合机制以增强退化条件下的稳定特征,并采用跨维度对比学习强化 2D 光学流与 3D 场景流之间的耦合关系,从而实现更鲁棒且准确的多模态运动估计。

链接: https://arxiv.org/abs/2603.16671
作者: Ruishan Guo,Ciyu Ruan,Haoyang Wang,Zihang Gong,Jingao Xu,Xinlei Chen
机构: Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology; The University of Hong Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This version is the camera-ready version accepted at CVPR 2026

点击查看摘要

Abstract:Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce x²-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared space. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that x²-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.
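摘要中的"可靠性感知自适应融合"大意是按各模态的可靠性打分对特征加权后再融合。下面是笔者构造的一个最小示意(softmax 加权,接口与数值均为假设,并非论文实现):

```python
import numpy as np

def reliability_fusion(features, reliabilities, temperature=1.0):
    """按可靠性得分对多模态特征做 softmax 加权融合(示意)。
    features: (M, D) 的 M 个模态特征;reliabilities: (M,) 可靠性打分。"""
    logits = np.asarray(reliabilities, dtype=float) / temperature
    w = np.exp(logits - logits.max())
    w = w / w.sum()                      # softmax 归一化权重
    fused = (w[:, None] * features).sum(axis=0)
    return fused, w

# 三个模态(图像 / LiDAR / 事件),假设事件模态在退化场景下更可靠
feats = np.stack([np.ones(4) * 1.0, np.ones(4) * 2.0, np.ones(4) * 3.0])
fused, w = reliability_fusion(feats, [0.0, 0.0, 2.0])
```

可靠性得分越高的模态权重越大,退化模态的贡献被自动压低,这正是摘要中"强调退化条件下稳定线索"的直观含义。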

[CV-25] Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

【速读】:该论文旨在解决传统机器人世界交互模拟器在物理约束和视觉表现上的局限性问题,特别是现有方法多局限于二维空间或依赖静态环境线索,无法准确建模机器人与环境之间本质的四维(4D)时空交互特性。其解决方案的关键在于提出Kinema4D——一个动作条件驱动的4D生成式机器人仿真器,通过解耦机器人-世界交互为两个核心模块:一是基于URDF的3D机器人运动学驱动,实现精确的4D机器人控制轨迹;二是将该轨迹投影为时空点云信号,用以引导生成模型合成复杂环境中同步的RGB/点云序列,从而重建真实世界的动态反应。这一框架首次实现了高保真、几何一致且与具体机器人形态无关的4D交互模拟,并展现出零样本迁移潜力。

链接: https://arxiv.org/abs/2603.16669
作者: Mutian Xu,Tianbao Zhang,Tianqi Liu,Zhaoxi Chen,Xiaoguang Han,Ziwei Liu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generations to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring the precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments’ reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.
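摘要中把 4D 机器人轨迹"投影为 pointmap"作为时空控制信号,其核心运算是相机针孔投影。以下为笔者的示意草图(内参数值均为假设,与论文实现无关):

```python
import numpy as np

def project_points(points, fx, fy, cx, cy):
    """针孔投影(示意):把某一时刻的 3D 点投到像素平面,
    逐帧执行即可得到 pointmap 式的时空控制信号。
    points: (N, 3),相机坐标系下,要求 z > 0。"""
    z = points[:, 2]
    u = fx * points[:, 0] / z + cx
    v = fy * points[:, 1] / z + cy
    return np.stack([u, v], axis=1)

# 两个假设的机器人连杆点,焦距与主点均为示意取值
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.5, 1.0]])
uv = project_points(pts, fx=100.0, fy=100.0, cx=64.0, cy=64.0)
```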

[CV-26] Fast-WAM: Do World Action Models Need Test-time Future Imagination?

【速读】:该论文旨在解决世界动作模型(World Action Models, WAMs)在推理阶段是否需要显式未来想象的问题,即其性能提升是否依赖于测试时的视频预测能力,还是主要源于训练阶段的视频建模。解决方案的关键在于提出Fast-WAM架构,该架构在训练阶段保留视频联合训练(video co-training)以学习良好的世界表征,但在测试阶段跳过未来视频生成步骤,从而显著降低延迟。实验表明,移除测试时的显式未来预测对性能影响较小,而移除视频联合训练则导致性能大幅下降,说明WAM的核心价值在于训练阶段通过视频建模增强世界表征,而非测试时的未来想象。

链接: https://arxiv.org/abs/2603.16666
作者: Tianyuan Yuan,Zibin Dong,Yicheng Liu,Hang Zhao
机构: IIIS, Tsinghua University; Galaxea AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: this https URL

[CV-27] Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多模态任务中易产生幻觉(hallucination)的问题,这一缺陷严重限制了其实际部署。现有免训练(training-free)方法如基于解码或工具调用的策略往往效果有限且缺乏可解释性。论文提出 Kestrel 框架,其关键在于结合显式视觉定位代理(visual-grounding agent)与证据验证驱动的自我精炼机制:首先收集并结构化视觉证据和工具输出,其次通过 LVLM 判别器对证据进行验证,并基于可信证据迭代优化答案,从而降低过矫正风险。实验表明,Kestrel 在多个幻觉基准测试中显著优于强基线(如 POPE 平均提升 +3.31%,MME-Hallucination 提升 +28.34 分),同时提供透明的验证轨迹用于幻觉诊断与分析。

链接: https://arxiv.org/abs/2603.16664
作者: Jiawei Mao,Hardy Chen,Haoqin Tu,Yuhan Wang,Letian Zhang,Zeyu Zheng,Huaxiu Yao,Zirui Wang,Cihang Xie,Yuyin Zhou
机构: UC Santa Cruz; UC Berkeley; UNC-Chapel Hill; Apple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 11 figures, 5 tables

点击查看摘要

Abstract:Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of these evidence, Kestrel verifies them via an LVLM judge for evidence checking, then iteratively self-refine answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis – e.g., both the integrated self-refinement module and grounding agent contributing an average +2.0% gain on POPE.
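Kestrel 的"证据校验 + 自我精炼"循环可以抽象为下面的草图。其中 judge、refine 均为笔者假设的可调用对象,玩具逻辑仅用于演示控制流,并非论文实现:

```python
def self_refine(question, answer, evidences, judge, refine, max_iters=3):
    """证据校验驱动的自我精炼循环(示意)。
    judge(evidence) -> bool:仅保留通过校验的证据;
    refine(question, answer, verified) -> new_answer。"""
    for _ in range(max_iters):
        verified = [e for e in evidences if judge(e)]
        if not verified:            # 无可信证据则不再修改,避免过矫正
            break
        new_answer = refine(question, answer, verified)
        if new_answer == answer:    # 答案收敛即停止迭代
            break
        answer = new_answer
    return answer

# 玩具示例:证据是 (属性, 是否可信) 元组,refine 用可信属性覆盖答案
judge = lambda e: e[1]
refine = lambda q, a, vs: sorted({v[0] for v in vs})
ans = self_refine("图中有什么?", ["幻觉属性"],
                  [("红色杯子", True), ("蓝色大象", False), ("木桌", True)],
                  judge, refine)
```

循环中"先校验、再精炼、收敛即停"的结构,对应摘要里降低过矫正风险的设计动机。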

[CV-28] Spectral Property-Driven Data Augmentation for Hyperspectral Single-Source Domain Generalization

【速读】:该论文旨在解决高光谱图像(Hyperspectral Images, HSI)在单源域泛化(Single-Source Domain Generalization, SDG)中因域间分布差异导致的分类性能下降问题。现有方法依赖数据增强来模拟域偏移,但存在现实性与多样性之间的权衡:盲目增强可能生成不符合真实场景的样本,而过度强调现实性则会抑制多样性,限制模型对目标域的泛化能力。解决方案的关键在于提出一种基于光谱属性驱动的数据增强方法(Spectral Property-Driven Data Augmentation, SPDDA),其核心包括三个机制:(1) 光谱多样性模块,通过沿光谱维度重采样生成不同通道数的样本以提升多样性;(2) 通道自适应光谱混合器,基于通道间相似性建模避免固定增强模式;(3) 空间-光谱协同优化机制,联合优化空间保真度约束和光谱连续性自约束,并根据空间约束动态调整光谱自约束权重,从而防止光谱过平滑并保留空间结构。实验表明,SPDDA在三个遥感基准数据集上优于当前最优方法。

链接: https://arxiv.org/abs/2603.16662
作者: Taiqin Chen,Yifeng Wang,Xiaochen Feng,Zhilin Zhu,Hao Sha,Yingjian Li,Yongbing Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Group (阿里巴巴集团); 3. Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While hyperspectral images (HSI) benefit from numerous spectral channels that provide rich information for classification, the increased dimensionality and sensor variability make them more sensitive to distributional discrepancies across domains, which in turn can affect classification performance. To tackle this issue, hyperspectral single-source domain generalization (SDG) typically employs data augmentation to simulate potential domain shifts and enhance model robustness under the condition of single-source domain training data availability. However, blind augmentation may produce samples misaligned with real-world scenarios, while excessive emphasis on realism can suppress diversity, highlighting a tradeoff between realism and diversity that limits generalization to target domains. To address this challenge, we propose a spectral property-driven data augmentation (SPDDA) that explicitly accounts for the inherent properties of HSI, namely the device-dependent variation in the number of spectral channels and the mixing of adjacent channels. Specifically, SPDDA employs a spectral diversity module that resamples data from the source domain along the spectral dimension to generate samples with varying spectral channels, and constructs a channel-wise adaptive spectral mixer by modeling inter-channel similarity, thereby avoiding fixed augmentation patterns. To further enhance the realism of the augmented samples, we propose a spatial-spectral co-optimization mechanism, which jointly optimizes a spatial fidelity constraint and a spectral continuity self-constraint. Moreover, the weight of the spectral self-constraint is adaptively adjusted based on the spatial counterpart, thus preventing over-smoothing in the spectral dimension and preserving spatial structure. Extensive experiments conducted on three remote sensing benchmarks demonstrate that SPDDA outperforms state-of-the-art methods.
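摘要中"沿光谱维度重采样、生成不同通道数的样本"这一增强,可以用逐像素线性插值来示意(笔者草图,非论文实现;插值方式与波段数均为假设):

```python
import numpy as np

def spectral_resample(cube, new_bands):
    """沿光谱维对高光谱立方体做线性插值重采样(示意)。
    cube: (H, W, B);返回 (H, W, new_bands)。"""
    H, W, B = cube.shape
    old = np.linspace(0.0, 1.0, B)          # 原始波段的归一化位置
    new = np.linspace(0.0, 1.0, new_bands)  # 目标波段的归一化位置
    flat = cube.reshape(-1, B)
    out = np.empty((flat.shape[0], new_bands))
    for i, row in enumerate(flat):
        out[i] = np.interp(new, old, row)   # 逐像素沿光谱维插值
    return out.reshape(H, W, new_bands)

cube = np.arange(2 * 2 * 8, dtype=float).reshape(2, 2, 8)
smaller = spectral_resample(cube, 5)   # 模拟波段更少的传感器
larger = spectral_resample(cube, 11)   # 模拟波段更多的传感器
```

端点波段在插值后保持不变,而中间波段被重新混合,这与摘要所述"设备相关的波段数变化与相邻通道混合"两个特性相呼应。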

[CV-29] HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

【速读】:该论文旨在解决大规模视觉-语言模型(VLMs)如CLIP在下游任务适配中普遍存在的"一刀切"架构问题,即对视觉和文本token采用统一的宽泛适配器处理方式,忽略了图像的空间局部性与文本的语义密集性这两种模态的本质差异。其解决方案的关键在于提出HeBA(Heterogeneous Bottleneck Adapter)框架,通过三项关键创新实现:(1) 异质性设计——使用二维深度可分离卷积处理视觉token以保留空间关联,而用密集线性投影处理文本token以捕捉语义关系;(2) 瓶颈正则化——引入压缩瓶颈(D → D/4)强制学习紧凑且鲁棒的特征,起到结构正则作用;(3) 主动梯度初始化——摒弃传统零初始化策略,采用Kaiming初始化保障充分的初始梯度流动,加速收敛同时不破坏预训练骨干网络的知识。

链接: https://arxiv.org/abs/2603.16653
作者: Md Jahidul Islam
机构: Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities – spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D → D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at this https URL.
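摘要中的压缩瓶颈适配器(以文本分支的线性版本为例)与 Kaiming 初始化,可以用 NumPy 粗略示意如下。维度、ReLU 与残差连接均为笔者假设的常见写法,并非论文实现:

```python
import numpy as np

def kaiming(rng, fan_in, fan_out):
    # Kaiming (He) 初始化:std = sqrt(2 / fan_in),保证初始梯度流动
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

class BottleneckAdapter:
    """D → D/4 → D 的压缩瓶颈适配器(示意)。"""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = kaiming(rng, dim, dim // 4)  # 压缩到 D/4
        self.w_up = kaiming(rng, dim // 4, dim)    # 还原到 D

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)       # ReLU
        return x + h @ self.w_up                   # 残差连接,保留骨干特征

adapter = BottleneckAdapter(64)
x = np.random.default_rng(1).normal(size=(3, 64))
y = adapter(x)
```

与零初始化相比,Kaiming 初始化让适配器从第一步起就有非零输出与梯度;残差连接则保证骨干特征不被破坏。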

[CV-30] Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage

【速读】:该论文旨在解决层状巢穴陷阱(Layer Trap Nests, LTNs)中蜂类和黄蜂类巢室的自动检测与分类问题,其核心挑战在于:1)巢室密集排列导致标注成本高;2)物种类别分布严重不均衡,常见种标注量大但加剧不平衡,而部分标注又引入数据不完整性,影响模型性能。解决方案的关键创新是提出一种约束性误报损失(Constrained False Positive Loss, CFPL)策略,通过在训练过程中动态屏蔽未标注样本的预测结果,避免其对分类损失的干扰,从而在减少标注工作量的同时提升模型对稀有类别的识别能力,并有效缓解类别不平衡问题。

链接: https://arxiv.org/abs/2603.16652
作者: Chenchang Liu,Felix Fornoff,Annika Grasreiner,Patrick Maeder,Henri Greil,Marco Seeland
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep learning based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. We evaluate our approach on a dataset of 712 LTN images collected over one season, covering 28 fine-grained classes describing the taxonomy and status of brood cells. To minimize labeling effort, we limit the training set to a maximum of 300 labels per class. Experimental results demonstrate that deep learning can be effectively used to detect brood cells in LTNs. Our CFPL method further improves performance and balances model accuracy and labeling effort while also mitigating class imbalance.
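CFPL 的核心思想是让未标注样本的预测不参与分类损失,可以用一个带掩码的交叉熵来示意。以下为笔者的简化草图,真实的 Constrained False Positive Loss 构造以论文为准:

```python
import numpy as np

def masked_cross_entropy(logits, labels, labeled_mask):
    """只对有标注的样本计入分类损失(CFPL 思路的简化示意)。
    logits: (N, C);labels: (N,);labeled_mask: (N,) 布尔数组,
    False 表示该检测未标注,其预测被动态屏蔽。"""
    z = logits - logits.max(axis=1, keepdims=True)           # 数值稳定
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    nll = nll * labeled_mask                                 # 屏蔽未标注样本
    return nll.sum() / max(labeled_mask.sum(), 1)

logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [5.0, 0.0, 0.0]])   # 假设第三个检测框未标注
labels = np.array([0, 1, 0])
loss_all = masked_cross_entropy(logits, labels, np.array([True, True, True]))
loss_masked = masked_cross_entropy(logits, labels, np.array([True, True, False]))
```

屏蔽后,高置信但未标注的预测不再把损失往错误方向拉,这正是部分标注场景下避免"数据不完整性惩罚模型"的关键。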

[CV-31] Mixture of Style Experts for Diverse Image Stylization

【速读】:该论文旨在解决当前基于扩散模型的风格化方法主要依赖颜色驱动变换、忽视复杂语义与材质细节的问题。其解决方案的关键在于提出StyleExpert框架,该框架基于混合专家(Mixture of Experts, MoE)架构,引入一个统一的风格编码器(style encoder),通过大规模内容-风格-风格化图像三元组数据集训练,将多样风格嵌入到一致的潜在空间中;随后利用相似性感知门控机制动态地将风格路由至MoE中的专业专家模块,从而在多个语义层级上(从浅层纹理到深层语义)实现对多样化风格的有效建模与控制,显著提升风格迁移中语义和材质细节的保留能力,并具备对未见风格的泛化性能。

链接: https://arxiv.org/abs/2603.16649
作者: Shihao Zhu,Ziheng Ouyang,Yijia Kang,Qilong Wang,Mi Zhou,Bo Li,Ming-Ming Cheng,Qibin Hou
机构: Nankai University (南开大学); NKIARI, Shenzhen Futian (深圳福田NKIARI); Tianjin University (天津大学); vivo BlueImage Lab (vivo蓝眸实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles. Our code and collected images are available at the project page: this https URL.
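摘要中的"相似性感知门控"可理解为:用风格嵌入与各专家 key 的余弦相似度决定路由权重。以下为笔者的示意实现,top-k 路由是 MoE 的常见写法,具体门控细节以论文为准:

```python
import numpy as np

def similarity_gate(style_emb, expert_keys, top_k=2):
    """按风格嵌入与专家 key 的余弦相似度路由(示意)。"""
    s = style_emb / np.linalg.norm(style_emb)
    k = expert_keys / np.linalg.norm(expert_keys, axis=1, keepdims=True)
    sims = k @ s                             # 每个专家的余弦相似度
    top = np.argsort(sims)[::-1][:top_k]     # 取最相似的 top_k 个专家
    w = np.exp(sims[top])
    return top, w / w.sum()                  # 被选中专家的归一化权重

# 三个假设的专家 key,风格嵌入更接近第 0、2 个专家
keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
experts, weights = similarity_gate(np.array([0.9, 0.1]), keys)
```

不同风格按相似度被动态分派给不同专家,而不是套用固定的处理分支,这对应摘要中"处理跨语义层级的多样风格"的设计。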

[CV-32] BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection CVPR2026

【速读】:该论文旨在解决场景图中异常关系检测的问题,即在图像生成的场景图中识别出不符合常识或逻辑的对象间关系(如“人骑在椅子上”)。其解决方案的关键在于提出一种基于归一化流(normalizing flow)的双向可逆模型——Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD),通过将场景图中的对象-关系-对象三元组映射到一个简单的基础分布(通常为高斯分布),利用对数似然估计实现异常检测。该方法结合语言模型嵌入对象与关系标记以引入语义知识,从而提升检测精度与泛化能力,在SARD数据集上相较当前最优模型提升约10%的AUROC指标且推理速度提高五倍,同时展现出对同义词等语义变化的强鲁棒性。

链接: https://arxiv.org/abs/2603.16645
作者: Melissa Schween,Mathis Kruse,Bodo Rosenhahn
机构: L3S - Leibniz University Hannover (L3S -汉诺威大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Main Track

点击查看摘要

Abstract:We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at this https URL.
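归一化流做异常检测的关键是变量替换公式下的对数似然:似然低的三元组即判为异常。下面用一个单层仿射流演示这一计算(玩具参数,非论文的可逆网络):

```python
import numpy as np

def affine_flow_logprob(x, scale, shift):
    """仿射归一化流下的对数似然(示意):z = (x - shift) / scale,
    log p(x) = log N(z; 0, I) - sum(log|scale|)(变量替换公式)。"""
    z = (x - shift) / scale
    log_base = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum()  # 标准高斯对数密度
    log_det = -np.log(np.abs(scale)).sum()                # 雅可比行列式修正
    return log_base + log_det

# 正常三元组的嵌入落在流学到的分布附近 -> 似然高;异常关系 -> 似然低
scale, shift = np.array([1.0, 2.0]), np.array([0.0, 1.0])
lp_normal = affine_flow_logprob(np.array([0.1, 1.2]), scale, shift)
lp_anomaly = affine_flow_logprob(np.array([5.0, -7.0]), scale, shift)
```

真实模型中 scale、shift 由可逆网络逐层给出,但"对数似然 = 基础分布密度 + 对数雅可比"这一骨架不变。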

[CV-33] FlowComposer: Composable Flows for Compositional Zero-Shot Learning CVPR2026

【速读】:该论文旨在解决当前组合零样本学习(Compositional Zero-Shot Learning, CZSL)方法中两个根本性问题:一是隐式组合构建(Implicit Composition Construction),即通过token拼接或分支级提示调优实现组合,而非在嵌入空间中显式操作;二是残余特征纠缠(Remained Feature Entanglement),即不完全解耦导致属性、对象与组合特征相互污染,从而限制模型泛化能力。解决方案的关键在于提出FlowComposer框架,其核心创新为:学习两条原始流(primitive flows),分别将视觉特征映射至属性和对象文本嵌入空间,并引入一个可学习的Composer模块,显式融合其速度场生成组合流;同时设计泄漏引导增强策略,利用残留纠缠特征作为辅助信号提升鲁棒性。该方法具有模型无关性,可作为插件模块集成到多种基线模型中并显著提升性能。

链接: https://arxiv.org/abs/2603.16641
作者: Zhenqi He,Lin Li,Long Chen
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026

点击查看摘要

Abstract:Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.
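FlowComposer 的 Composer"显式融合两条原始流的速度场"这一点,可用最简单的线性组合加欧拉积分来示意。论文中的融合系数是可学习的,这里取定值,整段均为笔者草图:

```python
import numpy as np

def composed_velocity(v_attr, v_obj, alpha):
    """把属性流与对象流的速度场线性加权融合(Composer 的最简示意)。"""
    return alpha * v_attr + (1.0 - alpha) * v_obj

def euler_transport(x, v_fn, steps=10):
    """用欧拉法沿速度场积分,把特征向目标嵌入方向传输(示意)。"""
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * v_fn(x)
    return x

# 常值速度场:属性方向 [1,0] 与对象方向 [0,1] 各占一半
v_fn = lambda x: composed_velocity(np.array([1.0, 0.0]),
                                   np.array([0.0, 1.0]), alpha=0.5)
out = euler_transport(np.zeros(2), v_fn)
```

在嵌入空间里显式地对速度场做运算,正是摘要批评"仅靠 token 拼接的隐式组合"时所指的替代方案。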

[CV-34] MLLM-based Textual Explanations for Face Comparison

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在无约束条件下进行人脸验证时生成解释的可靠性问题,尤其关注极端姿态变化和监控图像场景下的解释可信度。其核心挑战在于:即使MLLMs做出正确的验证决策,其生成的自然语言解释往往依赖于无法验证或虚构的人脸属性,缺乏视觉证据支持。解决方案的关键在于引入一种基于似然比(likelihood-ratio-based)的评估框架,用于量化文本解释的证据强度,从而超越单纯依赖决策准确性的评价方式,系统性地衡量解释的真实性与可信赖性。这一方法揭示了当前MLLMs在可解释人脸识别中的根本局限,并强调了在生物识别应用中建立严谨、可靠解释评估机制的必要性。

链接: https://arxiv.org/abs/2603.16629
作者: Redwan Sony,Anil K Jain,Ross Arun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at 14th International Workshop on Biometrics and Forensics (IWBF)

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at this https URL.
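摘要提出的似然比评估框架,其基本量是 LLR = log p(证据 | 同一人) − log p(证据 | 不同人)。以下以一维高斯条件分布为例示意这一计算(分布形式与参数均为笔者假设,论文中的证据建模更复杂):

```python
import numpy as np

def log_likelihood_ratio(x, mu_same, sd_same, mu_diff, sd_diff):
    """证据强度的对数似然比(示意):
    LLR > 0 支持"同一人"假设,LLR < 0 支持"不同人"假设。"""
    def log_norm(x, mu, sd):
        return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mu) ** 2 / (2 * sd ** 2)
    return log_norm(x, mu_same, sd_same) - log_norm(x, mu_diff, sd_diff)

# 假设相似度打分 0.8 接近"同一人"分布中心 -> LLR 为正
llr_high = log_likelihood_ratio(0.8, mu_same=0.7, sd_same=0.1,
                                mu_diff=0.2, sd_diff=0.1)
llr_low = log_likelihood_ratio(0.1, 0.7, 0.1, 0.2, 0.1)
```

把解释文本映射成这样的证据打分后,LLR 的大小就给出"解释有多支持验证结论"的量化度量,而不只是看决策对错。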

[CV-35] TCATSeg: A Tooth Center-Wise Attention Network for 3D Dental Model Semantic Segmentation ICASSP2026

【速读】:该论文旨在解决三维牙科模型语义分割中的准确性问题,尤其是在牙齿排列复杂且相邻牙齿形状相似的情况下,现有方法因过度关注局部几何特征而忽视全局语义上下文,导致分割效果不佳。解决方案的关键在于提出TCATSeg框架,通过引入一组稀疏但具有物理意义的超点(superpoints)来捕捉全局语义关系,并将其与局部几何特征融合,从而显著提升分割精度。

链接: https://arxiv.org/abs/2603.16620
作者: Qiang He,Wentian Qu,Jiajia Dai,Changsong Lei,Shaofeng Wang,Feifei Zuo,Yajie Wang,Yaqian Liang,Xiaoming Deng,Cuixia Ma,Yong-Jin Liu,Hongan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, ICASSP 2026

点击查看摘要

Abstract:Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we propose TCATSeg, a novel framework that combines local geometric features with global semantic context. We introduce a set of sparse yet physically meaningful superpoints to capture global semantic relationships and enhance segmentation accuracy. Additionally, we present a new dataset of 400 dental models, including pre-orthodontic samples, to evaluate the generalization of our method. Extensive experiments demonstrate that TCATSeg outperforms state-of-the-art approaches.

[CV-36] ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery CVPR2026

【速读】:该论文旨在解决从航空影像中一次性生成完整矢量地图表示的问题,即在单次运行中为所有地表覆盖(land-cover)类别生成具有共享边界的多边形,且无间隙或重叠。现有方法通常为每个类别单独处理,扩展至多类时易导致拓扑不一致(如重复边、缝隙和重叠)。为此,作者提出了All-Class Polygonal Vectorization (ACPV)任务并发布了首个公开基准Deventer-512,包含标准化指标以联合评估语义保真度、几何精度、顶点效率、类别级拓扑保真度及全局拓扑一致性。解决方案的关键在于提出ACPV-Net框架,其核心创新包括:一种新颖的语义监督条件机制(Semantically Supervised Conditioning, SSC),将语义感知与几何原始结构生成耦合;以及一种拓扑重建模块,通过设计强制共享边一致性,实现严格拓扑约束下的高质量多边形生成。该方法在Deventer-512上优于所有类别特异性基线,在WHU-Building数据集上也实现了单类别矢量化最佳结果。

链接: https://arxiv.org/abs/2603.16616
作者: Weiqin Jiao,Hao Cheng,George Vosselman,Claudio Persello
机构: Faculty of Geo-Information Science and Earth Observation (ITC); University of Twente, The Netherlands
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. The supplementary material available in the conference proceedings

点击查看摘要

Abstract:We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: this https URL.
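ACPV 要求的"共享边一致性"是指:相邻地块必须复用同一条边,而不是各自画一条近似重合的边。下面的草图用方向无关的边集合检查两个多边形的共享边(笔者示意,非论文的拓扑重建模块):

```python
def edge_set(polygon):
    """把多边形(顶点序列)转为无向边集合;端点排序使方向无关。"""
    n = len(polygon)
    return {tuple(sorted((polygon[i], polygon[(i + 1) % n])))
            for i in range(n)}

def shared_edges(poly_a, poly_b):
    """两个相邻地块多边形的共享边集合(示意)。"""
    return edge_set(poly_a) & edge_set(poly_b)

# 两个恰好共享一条边 (1,0)-(1,1) 的方形地块
a = [(0, 0), (1, 0), (1, 1), (0, 1)]
b = [(1, 0), (2, 0), (2, 1), (1, 1)]
common = shared_edges(a, b)
```

若相邻多边形的边只是近似重合而非逐顶点一致,这一交集就会为空,也就对应摘要中逐类单独矢量化产生的重复边、缝隙与重叠问题。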

[CV-37] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

【速读】:该论文旨在解决生成式奖励模型(Generative Reward Models, GRMs)在视觉语言模型(Vision-Language Models, VLMs)中因中间评分标准(rubric)未被显式优化而导致的奖励信号质量不足问题。现有方法通常将rubric视为附属产物或依赖昂贵且不可微的大型语言模型(LLM-as-judge)进行验证,缺乏对rubric内部一致性与泛化能力的有效引导。解决方案的关键在于提出Proxy-GRM框架,通过引入轻量级代理验证器(Proxy-SFT和Proxy-RL),以候选rubric为输入,在不依赖原始偏好对的情况下预测偏好排序,并利用代理预测准确率作为rubric质量奖励信号,从而在强化学习(Reinforcement Learning, RL)过程中显式提升rubric的内在一致性与迁移能力。实验证明,该方法仅用约5万样本即可超越使用四倍数据训练的方法,且learned rubrics可迁移至未见过的评估者,显著提升测试时的奖励准确性。

链接: https://arxiv.org/abs/2603.16600
作者: Weijie Qiu,Dai Guan,Junxin Wang,Zhihang Li,Yongbo Gai,Mengyu Zhou,Erchao Zhao,Xiaoxi Jiang,Guanjun Jiang
机构: Qwen Large Model Application Team, Alibaba(阿里巴巴); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 10 figures,

点击查看摘要

Abstract:Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy’s prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at this https URL.
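Proxy-GRM 用"代理验证器仅凭 rubric 预测偏好的准确率"作为 rubric 质量奖励。下面用一个玩具代理演示这一奖励的计算方式,proxy 的接口与内部逻辑均为笔者假设,仅用于说明奖励信号的形状:

```python
def rubric_reward(proxy_predict, rubric, pairs):
    """代理验证器的偏好预测准确率 = rubric 质量奖励(示意)。
    pairs: [(query, chosen, rejected), ...];
    proxy_predict(rubric, query, a, b) -> 0 表示偏好 a(假设的接口)。"""
    correct = sum(proxy_predict(rubric, q, c, r) == 0 for q, c, r in pairs)
    return correct / len(pairs)

# 玩具代理:只有当 rubric 提到"准确"时,才会选出更长(更详细)的回答
proxy = lambda rubric, q, a, b: 0 if ("准确" in rubric) == (len(a) > len(b)) else 1
pairs = [("q1", "详细且正确的回答", "短答"), ("q2", "完整回答", "嗯")]
good = rubric_reward(proxy, "评分标准:回答是否准确、完整", pairs)
bad = rubric_reward(proxy, "评分标准:语气是否友好", pairs)
```

信息量足以还原偏好顺序的 rubric 获得高奖励,跑题的 rubric 奖励低,强化学习因此被引导去生成可迁移、内部一致的评分标准。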

[CV-38] FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation CVPR2026

【速读】:该论文旨在解决在复杂现实环境中可靠估计奶牛发情期骑跨姿态(mounting posture)的问题,尤其针对背景杂乱和动物间频繁遮挡带来的挑战。解决方案的关键在于提出一种自顶向下的框架 FSMC-Pose,其核心创新包括:1)轻量级频域-空间融合骨干网络 CattleMountNet,其中包含空间频率增强模块(SFEBlock)以分离奶牛与背景、感受野聚合模块(RABlock)以捕获多尺度上下文信息;2)空间-通道自校准头(SC2Head),通过关注空间与通道依赖关系并引入自校准分支,缓解因动物重叠导致的结构错位问题。该方法在自建数据集 MOUNT-Cattle(含 1176 个骑跨实例)及 NWAFU-Cattle 数据集上验证,相较强基线模型显著提升精度,同时具备更低的计算和参数开销,并可在消费级 GPU 上实现实时推理。

链接: https://arxiv.org/abs/2603.16596
作者: Fangjing Li,Zhihai Wang,Xinxin Ding,Haiyang Liu,Ronghua Gao,Rong Wang,Yao Zhu,Ming Jin
机构: Beijing Jiaotong University (北京交通大学); NERCITA; Tsinghua University (清华大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures. Accepted by CVPR 2026 Findings

点击查看摘要

Abstract:Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real-world environments remains challenging due to cluttered backgrounds and frequent inter-animal occlusion. We present FSMC-Pose, a top-down framework that integrates a lightweight frequency-spatial fusion backbone, CattleMountNet, and a multiscale self-calibration head, SC2Head. Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. The Spatial-Channel Self-Calibration Head (SC2Head) attends to spatial and channel dependencies and introduces a self-calibration branch to mitigate structural misalignment under inter-animal overlap. We construct a mounting dataset, MOUNT-Cattle, covering 1176 mounting instances, which follows the COCO format and supports drop-in training across pose estimation models. Using a comprehensive dataset that combines MOUNT-Cattle with the public NWAFU-Cattle dataset, FSMC-Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real-time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC-Pose effectively captures and estimates cattle mounting pose in complex and cluttered environments. Dataset and code are available at this https URL.

[CV-39] On the Transfer of Collinearity to Computer Vision

【速读】:该论文旨在解决如何将人类视觉系统中的共线性(collinearity)感知原理迁移至计算机视觉领域,并探索其在实际应用中的潜力。共线性是指大脑对沿直线排列的边缘具有增强感知的能力,但其在现实世界中的功能意义及在计算机视觉中的应用仍不明确。论文的关键解决方案是构建一个原型模型,系统性地验证共线性在四类典型场景中的有效性:晶圆缺陷检测、纳米材料缺陷识别、遮挡情况下的目标识别以及ImageNet图像分类。实验表明,共线性在人工结构(如晶圆和纳米材料)中显著提升性能(如晶圆缺陷检测性能提升1.24倍、错误率从6.5%降至5.26%,纳米材料缺陷识别性能提升3.2倍),而在自然图像(如ImageNet)中效果有限,说明该方法特别适用于具有明显线性结构的工业场景。

链接: https://arxiv.org/abs/2603.16592
作者: Frederik Beuth,Danny Kowerko
机构: Chemnitz University of Technology (开姆尼茨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Collinearity is a visual perception phenomenon in the human brain that amplifies spatially aligned edges arranged along a straight line. However, it remains unclear what purpose this principle serves for humans in the real world, and its use in computer vision and engineering applications is a largely unexplored field. In this work, our goal is to transfer the collinearity principle to computer vision, and we explore the potential usages of this novel principle for computer vision applications. We developed a prototype model to exemplify the principle, then tested it systematically, and benchmarked it in the context of four use cases. Our cases are selected to span a broad range of potential applications and scenarios: combining collinearity with deep learning (cases I and II), using collinearity with saliency models (case II), and as a feature detector (case I). In the first use case, we found that collinearity improves the fault detection of wafers, with a performance increase by a factor of 1.24 (decrease of the error rate from 6.5% to 5.26%). In the second use case, we test defect recognition in nanotechnology materials and achieve a performance increase of 3.2x via collinearity (deep learning, error from 21.65% to 6.64%), and also explore saliency models. As a third experiment, we cover occlusions, and as a fourth, we test ImageNet, observing that collinearity might not be very beneficial there. We can therefore assemble a list of scenarios for which collinearity is beneficial (wafers, nanotechnology, occlusions) and for which it is not (ImageNet). Hence, we infer that collinearity might be suitable for industrial applications, as it helps when the image structures of interest are man-made, because such structures often consist of lines. Our work provides another tool for computer vision, hoping to capture some of the power of human visual processing.
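共线性原理本身可以用一个极简的草图说明:用细长的方向直线核累积梯度响应,沿直线排列的边缘会得到放大,而孤立边缘不会。下面的核形状、方向数量与玩具图像均为假设,并非论文的原型模型:

```python
import numpy as np

def oriented_line_kernel(length=7, angle_deg=0.0):
    """生成一个细长的直线卷积核, 用于聚合沿某方向排列的边缘响应。"""
    k = np.zeros((length, length))
    c = length // 2
    t = np.deg2rad(angle_deg)
    for r in np.linspace(-c, c, 2 * length):
        x = int(round(c + r * np.cos(t)))
        y = int(round(c + r * np.sin(t)))
        if 0 <= x < length and 0 <= y < length:
            k[y, x] = 1.0
    return k / k.sum()

def collinearity_map(edge_mag, angles=(0, 45, 90, 135)):
    """对每个方向用直线核做卷积, 取各方向最大值: 共线排列的边缘被放大。"""
    h, w = edge_mag.shape
    best = np.zeros_like(edge_mag)
    for a in angles:
        k = oriented_line_kernel(7, a)
        kh, kw = k.shape
        pad = np.pad(edge_mag, kh // 2)
        resp = np.zeros_like(edge_mag)
        for i in range(h):
            for j in range(w):
                resp[i, j] = (pad[i:i + kh, j:j + kw] * k).sum()
        best = np.maximum(best, resp)
    return edge_mag * best   # 原始边缘响应乘以其"共线性支持度"

# 玩具例子: 一条水平直线上的边缘 vs. 一个孤立边缘点
img = np.zeros((15, 15))
img[7, 3:12] = 1.0   # 共线边缘
img[2, 2] = 1.0      # 孤立点
out = collinearity_map(img)   # 直线上的响应远大于孤立点
```

这正是"人造结构多由直线构成,因此共线性对工业图像有利"这一结论的直观来源。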

[CV-40] REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models ICME2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 中图像生成模型(Image Generation Models, IGMs)在执行概念移除(Unlearning)后,对对抗性输入(尤其是黑盒场景下的图像提示攻击)仍存在显著脆弱性的问题。当前 IGMU 方法虽能有效移除特定有害概念,但缺乏对多模态对抗攻击的鲁棒性保障,导致潜在安全风险未被充分消除。解决方案的关键在于提出 REFORGE,一个基于黑盒红队测试的评估框架,其核心创新为采用基于交叉注意力引导的掩码策略优化扰动分配,将噪声精准注入与目标概念相关的区域,在保证视觉保真度的同时大幅提升攻击成功率,从而系统性暴露现有 IGMU 方法的漏洞,并推动面向对抗鲁棒性的概念移除机制发展。

链接: https://arxiv.org/abs/2603.16576
作者: Yong Zou,Haoran Li,Fanxiao Li,Shenyang Wei,Yunyun Dong,Li Tang,Wei Zhou,Renyang Liu
机构: Yunnan University (云南大学); Northeastern University (东北大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted by ICME 2026

点击查看摘要

Abstract:Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: this https URL.

[CV-41] Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration CVPR2026

【速读】:该论文旨在解决现有图像修复方法在处理全场景(包括身体和背景)时的局限性问题:一方面,基于参考的面部修复(Ref-FR)模型仅关注面部区域,忽视了场景中其他部分的退化;另一方面,全场景修复模型缺乏退化线索,导致预测欠定并产生视觉伪影。解决方案的关键在于提出一个两阶段的修复框架 Face2Scene,其核心创新是利用面部作为感知“先知”(perceptual oracle),从修复前后的人脸差异中提取出退化编码(degradation code),并将其转化为多尺度的退化感知令牌(degradation-aware tokens),进而指导扩散模型完成整个图像的单步恢复,从而实现对身体和背景等全场景内容的高质量重建。

链接: https://arxiv.org/abs/2603.16570
作者: Amirhossein Kazerouni,Maitreya Suin,Tristan Aumentado-Armstrong,Sina Honari,Amanpreet Walia,Iqbal Mohomed,Konstantinos G. Derpanis,Babak Taati,Alex Levinshtein
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所); AI Center–Toronto, Samsung Electronics (三星电子人工智能中心-多伦多); University Health Network (大学健康网络); York University (约克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.
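"从修复前后的人脸对提取退化编码"这一思路可以用一个手工特征版本来直观理解:把(退化, 修复后)图像对的差异压缩成少量退化属性。下面的噪声强度与高频能量比两个统计量均为假设的示意,论文中的退化编码由网络学习得到:

```python
import numpy as np

def degradation_code(face_degraded, face_restored):
    """从 (退化, 修复后) 人脸对估计一个玩具退化编码:
    [噪声强度, 高频能量比]。仅为直观示意, 并非论文实现。"""
    diff = face_degraded - face_restored
    noise_level = diff.std()                 # 噪声强度: 残差的标准差

    def hf_energy(x):                        # 高频能量: 梯度平方均值
        gx = np.diff(x, axis=1)
        gy = np.diff(x, axis=0)
        return (gx ** 2).mean() + (gy ** 2).mean()

    hf_ratio = hf_energy(face_degraded) / (hf_energy(face_restored) + 1e-12)
    return np.array([noise_level, hf_ratio])

rng = np.random.default_rng(5)
clean = rng.uniform(size=(64, 64))
noisy = clean + 0.1 * rng.normal(size=(64, 64))   # 模拟仅含噪声的退化
code = degradation_code(noisy, clean)
```

得到的退化编码随后可作为条件注入全图修复模型,指导身体与背景的恢复。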

[CV-42] VideoMatGen: PBR Materials through Joint Generative Modeling

【速读】:该论文旨在解决3D形状中物理合理材质生成的难题,特别是如何基于输入几何结构和文本描述,联合建模多种材质属性(如基础颜色、粗糙度、金属度和高度图)以生成高质量且符合物理规律的材质。其解决方案的关键在于提出了一种基于视频扩散Transformer架构的方法,并引入了一个定制的变分自编码器(Variational Auto-Encoder, VAE),将多种材质模态编码至紧凑的潜在空间,从而在不增加token数量的前提下实现多模态材质的联合生成,最终支持与主流内容创作工具兼容的高质量材质输出。

链接: https://arxiv.org/abs/2603.16566
作者: Jon Hasselgren,Zheng Zeng,Milos Hasan,Jacob Munkberg
机构: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.

[CV-43] Understanding Cell Fate Decisions with Temporal Attention

【速读】:该论文旨在解决癌症治疗中细胞命运决定的非遗传因素识别问题,即在相同治疗条件下,基因型相同的癌细胞为何会表现出不同的生物学结局(如分裂或凋亡)。其核心挑战在于如何从原始长时程活细胞成像数据中自动提取与细胞命运相关的关键时空特征,而无需依赖人工定义的形态学或分子特征。解决方案的关键在于提出一种基于Transformer架构的深度学习模型,该模型可直接从原始图像序列中预测细胞命运,并结合注意力机制和掩码实验构建了一个全面的可解释性框架,揭示了细胞命运决策过程中形态变化与p53信号通路在不同时间点上的贡献差异。结果表明,仅凭视频信息即可实现高精度预测(平衡准确率为0.94,F1分数为0.93),且预测信号可在事件发生前长达10小时被识别,从而为理解非遗传调控机制提供了新范式。

链接: https://arxiv.org/abs/2603.16562
作者: Florian Bürger,Martim Dias Gomes,Adrián E. Granada,Noémie Moreau,Katarzyna Bozek
机构: University of Cologne (科隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Understanding non-genetic determinants of cell fate is critical for developing and improving cancer therapies, as genetically identical cells can exhibit divergent outcomes under the same treatment conditions. In this work, we present a deep learning approach for cell fate prediction from raw long-term live-cell recordings of cancer cell populations under chemotherapeutic treatment. Our Transformer model is trained to predict cell fate directly from raw image sequences, without relying on predefined morphological or molecular features. Beyond classification, we introduce a comprehensive explainability framework for interpreting the temporal and morphological cues guiding the model’s predictions. We demonstrate that prediction of cell outcomes is possible based on the video alone: our model achieves a balanced accuracy of 0.94 and an F1-score of 0.93. Attention and masking experiments further indicate that the signal predictive of the cell fate is not uniquely located in the final frames of a cell trajectory, as reliable predictions are possible up to 10 h before the event. Our analysis reveals distinct temporal distribution of predictive information in the mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling in determining cell outcomes. Together, these findings demonstrate that attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making. The code is available at this https URL.

[CV-44] Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的物体幻觉(object hallucinations)问题,即模型在推理过程中生成不存在的物体信息,从而严重影响其可靠性和实用性。现有研究多聚焦于文本模态,认为幻觉源于过强的语言先验和不足的视觉 grounding,而本文发现视觉模态内部异常的注意力模式同样会导致幻觉。解决方案的关键在于提出基于分割的注意力熵(Segmentation-based Attention Entropy, SAE),通过语义分割将视觉注意力不确定性量化到对象级语义空间中,并据此设计了一个用于幻觉检测的可靠性评分机制以及一种无需额外训练即可在推理时调整视觉注意力的 SAE 指导方法,有效降低了幻觉发生率,提升了 LVLM 在真实场景(如四足机器人)中的感知与决策可信度。

链接: https://arxiv.org/abs/2603.16558
作者: Jiale Song,Jiaxin Luo,Xue-song Tang,Kuangrong Hao,Mingbo Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
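SAE 的核心计算可以用几行代码说明:把 patch 级注意力按语义分割聚合到物体级,再对物体级质量分布求熵。注意力越弥散(熵越高),说明模型在物体语义空间中越不确定。下面的 patch 数、分割标签与注意力分布均为假设的玩具数据:

```python
import numpy as np

def segmentation_attention_entropy(attn, seg):
    """按语义分割把 patch 级注意力聚合到物体级, 再计算熵。
    attn: 形状 (N,) 的注意力权重 (和为 1); seg: 形状 (N,) 的物体标签。"""
    labels = np.unique(seg)
    mass = np.array([attn[seg == l].sum() for l in labels])
    mass = mass / mass.sum()
    # 熵越高, 注意力在物体语义空间中越不确定, 幻觉风险越大
    return -(mass * np.log(mass + 1e-12)).sum()

# 玩具例子: 16 个 patch, 背景 + 两个物体
seg = np.array([0] * 8 + [1] * 4 + [2] * 4)     # 0=背景, 1/2=物体
focused = np.zeros(16)
focused[8:12] = 0.25                             # 注意力集中在物体 1
diffuse = np.full(16, 1 / 16)                    # 注意力均匀弥散
e_focused = segmentation_attention_entropy(focused, seg)
e_diffuse = segmentation_attention_entropy(diffuse, seg)
```

基于这样的熵值即可构造可靠性评分,并在推理时对高熵的视觉注意力进行调整。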

[CV-45] CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

【速读】:该论文旨在解决生成式医学图像模型中存在的“不平衡生成器问题”(imbalanced generator problem),即生成模型在不同人口统计学群体间产生图像质量不一致的问题,尤其当某些子群体或交叉群体(如罕见性别-种族组合)在训练数据中缺失时,模型性能显著下降。解决方案的关键在于提出一种分层结构化的扩散框架CompDiff,其核心创新是引入一个分层条件网络(Hierarchical Conditioner Network, HCN),通过将人口统计学条件进行分解并生成结构化的“人口统计学标记”(demographic token),与CLIP嵌入共同作为交叉注意力上下文输入,从而实现跨子群体的参数共享和对未见交叉群体的组合泛化能力,有效提升图像质量和公平性表现。

链接: https://arxiv.org/abs/2603.16551
作者: Mahmoud Ibrahim,Bart Elen,Chang Sun,Gokhan Ertaylan,Michel Dumontier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at this https URL.
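"分解人口统计学条件、生成结构化 demographic token"可以用一个极简的线性草图说明:各属性独立查表嵌入、拼接后投影为一个 token。因为因子独立嵌入,训练中未见过的交叉组合(如罕见的性别-种族组合)仍能组合出合法 token,这正是组合泛化的来源。嵌入维度、属性取值与线性投影均为假设,并非论文的 HCN 实现:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # 假设的嵌入维度
sex_emb = {"F": rng.normal(size=D), "M": rng.normal(size=D)}
race_emb = {"A": rng.normal(size=D), "B": rng.normal(size=D)}
W = rng.normal(size=(D, 2 * D))         # 分层条件网络的线性投影 (示意)

def demographic_token(sex, race):
    """把人口统计学因子分解嵌入后拼接并投影为一个 token。
    未见过的交叉组合也能由已见过的单个因子组合得到 token。"""
    return W @ np.concatenate([sex_emb[sex], race_emb[race]])

t_seen = demographic_token("F", "A")
t_unseen = demographic_token("M", "B")  # 假设该交叉组合在训练中缺失
```

该 token 随后与 CLIP 嵌入拼接,作为扩散模型交叉注意力的上下文。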

[CV-46] Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation

【速读】:该论文旨在解决扫描透射电子显微镜(Scanning Transmission Electron Microscope, STEM)校准中的关键难题,即如何在高维、噪声严重的诊断图像中准确估计最优的显微镜参数,以克服光学像差对成像质量的影响。传统方法通常依赖单一图像提取标量特征,难以实现鲁棒校准。其解决方案的关键在于:首先利用变分自编码器(Variational Autoencoder, VAE)从模拟数据中学习图像的低维表示,从而实现信息压缩;其次通过期望最大化(Expectation Maximization, EM)算法联合估计从校准参数到编码表示的映射模型与最优校准参数,有效缓解了基于数字孪生模拟数据训练时存在的“仿真到现实差距”(simulation-to-reality gap)问题;此外,利用光学系统的已知对称性确保了联合估计问题的全局可识别性,保障唯一最优解的存在。实验表明,该方法相比现有技术显著提升了校准速度和一致性,在真实STEM上实现了2倍的误差降低且所需观测次数更少。

链接: https://arxiv.org/abs/2603.16549
作者: Jilles S. van Hulst,W.P.M.H.(Maurice)Heemels,Duarte J. Antunes
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Electron microscopy has enabled many scientific breakthroughs across multiple fields. A key challenge is the tuning of microscope parameters based on images to overcome optical aberrations that deteriorate image quality. This calibration problem is challenging due to the high-dimensional and noisy nature of the diagnostic images, and the fact that optimal parameters cannot be identified from a single image. We tackle the calibration problem for Scanning Transmission Electron Microscopes (STEM) by employing variational autoencoders (VAEs), trained on simulated data, to learn low-dimensional representations of images, whereas most existing methods extract only scalar values. We then simultaneously estimate the model that maps calibration parameters to encoded representations and the optimal calibration parameters using an expectation maximization (EM) approach. This joint estimation explicitly addresses the simulation-to-reality gap inherent in data-driven methods that train on simulated data from a digital twin. We leverage the known symmetry property of the optical system to establish global identifiability of the joint estimation problem, ensuring that a unique optimum exists. We demonstrate that our approach is substantially faster and more consistent than existing methods on a real STEM, achieving a 2x reduction in estimation error while requiring fewer observations. This represents a notable advance in automated STEM calibration and demonstrates the potential of VAEs for information compression in images. Beyond microscopy, the VAE-EM framework applies to inverse problems where simulated training data introduces a reality gap and where non-injective mappings would otherwise prevent unique solutions.
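论文用 EM 联合估计"校准参数到编码表示"的映射模型与最优参数。下面给出一个线性化的最小示意(并非论文的 VAE-EM 实现):假设编码近似为参数的线性函数 z ≈ A(p - p*),且无像差时编码接近零,则可先用最小二乘拟合 z ≈ A p + b,再反解使预测编码最小的 p*。数据与模型均为合成玩具:

```python
import numpy as np

rng = np.random.default_rng(1)
p_star = np.array([0.6, -0.3])                   # 真实最优校准参数 (玩具)
A_true = np.array([[2.0, 0.5], [-0.4, 1.5], [0.8, 0.8]])

# 采集: 在不同参数下"成像", VAE 把诊断图像压缩为低维编码 z (此处直接合成)
P = rng.uniform(-1, 1, size=(50, 2))
Z = (P - p_star) @ A_true.T + 0.01 * rng.normal(size=(50, 3))

# 第一步: 最小二乘拟合线性模型 z ≈ A p + b
X = np.hstack([P, np.ones((50, 1))])
coef, *_ = np.linalg.lstsq(X, Z, rcond=None)     # coef: (3, 3)
A_hat, b_hat = coef[:2].T, coef[2]

# 第二步: 反解使预测编码为零的参数, 即 A p = -b 的最小二乘解
p_hat, *_ = np.linalg.lstsq(A_hat, -b_hat, rcond=None)
```

真实系统是非线性的,需要像论文那样交替(EM)更新映射与参数,并借助光学对称性保证解唯一。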

[CV-47] SAMSEM – A Generic and Scalable Approach for IC Metal Line Segmentation

【速读】:该论文旨在解决集成电路(Integrated Circuits, ICs)后制造验证中金属线段分割的难题,尤其是在全球化供应链背景下,如何在不可信制造环境中准确识别扫描电子显微镜(Scanning Electron Microscope, SEM)图像中的金属线路以确保无恶意电路植入。传统方法依赖于针对特定IC定制参数与算法,导致模型泛化能力差,难以跨工艺节点、材料和成像技术迁移。解决方案的关键在于提出SAMSEM——通过适配Meta的Segment Anything Model 2(SAM2),构建一个具备多尺度分割能力的通用模型,并引入拓扑约束损失函数,使分割结果更关注电气连通性而非像素级精度,从而显著提升模型在不同制造条件下的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2603.16548
作者: Christian Gehrmann,Jonas Ricker,Simon Damm,Deruo Cheng,Julian Speith,Yiqiong Shi,Asja Fischer,Christof Paar
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In light of globalized hardware supply chains, the assurance of hardware components has gained significant interest, particularly in cryptographic applications and high-stakes scenarios. Identifying metal lines on scanning electron microscope (SEM) images of integrated circuits (ICs) is one essential step in verifying the absence of malicious circuitry in chips manufactured in untrusted environments. Due to varying manufacturing processes and technologies, such verification usually requires tuning parameters and algorithms for each target IC. Often, a machine learning model trained on images of one IC fails to accurately detect metal lines on other ICs. To address this challenge, we create SAMSEM by adapting Meta’s Segment Anything Model 2 (SAM2) to the domain of IC metal line segmentation. Specifically, we develop a multi-scale segmentation approach that can handle SEM images of varying sizes, resolutions, and magnifications. Furthermore, we deploy a topology-based loss alongside pixel-based losses to focus our segmentation on electrical connectivity rather than pixel-level accuracy. Based on a hyperparameter optimization, we then fine-tune the SAM2 model to obtain a model that generalizes across different technology nodes, manufacturing materials, sample preparation methods, and SEM imaging technologies. To this end, we leverage an unprecedented dataset of SEM images obtained from 48 metal layers across 14 different ICs. When fine-tuned on seven ICs, SAMSEM achieves an error rate as low as 0.72% when evaluated on other images from the same ICs. For the remaining seven unseen ICs, it still achieves error rates as low as 5.53%. Finally, when fine-tuned on all 14 ICs, we observe an error rate of 0.62%. Hence, SAMSEM proves to be a reliable tool that significantly advances the frontier in metal line segmentation, a key challenge in post-manufacturing IC verification.

[CV-48] Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty CVPR2026

【速读】:该论文旨在解决基于3D Gaussian Splatting (3DGS) 的位姿精化(pose refinement)在实际应用中对初始相机位姿和重建几何的敏感性问题,尤其是在存在姿态先验不确定性和几何不确定性时,容易导致重投影几何失真和优化不稳定。其解决方案的关键在于提出一种结合蒙特卡洛姿态采样(Monte Carlo pose sampling)与基于Fisher信息的PnP优化(Fisher Information-based PnP optimization)的重定位框架,该方法显式建模并缓解两种不确定性来源,无需重新训练或额外监督,在多种室内外基准测试中显著提升了定位精度和鲁棒性。

链接: https://arxiv.org/abs/2603.16538
作者: Mangyu Kong,Jaewon Lee,Seongwon Lee,Euntai Kim
机构: Yonsei University (延世大学); Kookmin University (国民大学); Korea Institute of Science and Technology (韩国科学技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures, CVPR 2026

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.
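蒙特卡洛位姿采样的思想可以在一个退化的二维玩具问题上演示:围绕带噪声的位姿先验采样候选位姿,用重投影误差打分并取最优,从而显式覆盖位姿先验的不确定性。下面只考虑二维平移位姿,场景点与观测均为合成数据,并非论文的 3DGS / Fisher 信息 PnP 实现:

```python
import numpy as np

rng = np.random.default_rng(2)
landmarks = rng.uniform(-5, 5, size=(20, 2))    # 场景点 (示意用 2D 平移位姿)
t_true = np.array([1.0, -2.0])
obs = landmarks - t_true                         # 相机坐标系下的观测

def reproj_error(t):
    """候选位姿 t 下的平均重投影误差。"""
    return np.mean(np.sum((landmarks - t - obs) ** 2, axis=1))

t_prior = t_true + np.array([0.8, -0.6])         # 带不确定性的位姿先验
# 蒙特卡洛位姿采样: 围绕先验采样候选, 按重投影误差挑选最优
cands = t_prior + rng.normal(scale=0.5, size=(500, 2))
errs = np.array([reproj_error(t) for t in cands])
t_best = cands[np.argmin(errs)]
```

采样得到的最优位姿随后才进入精化阶段,这也正是该框架对先验误差更鲁棒的原因。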

[CV-49] An approximate graph elicits detonation lattice

【速读】:该论文旨在解决爆轰(detonation)研究中长期存在的胞格结构(detonation cells)精确分割与测量难题,传统方法依赖人工或简单的二维边缘检测,存在精度低、效率差的问题。其解决方案的关键在于提出一种基于图论(graph theory)、无需训练(training-free)的算法,通过构建图结构对三维压力数据中的胞格模式进行自动提取与量化分析,实现了对复杂爆轰波中胞格形态的高精度识别,且具备良好的泛化能力,可适用于多种几何构型的胞格结构,为后续三重点碰撞(triple-point collision)等深入研究提供可靠基础。

链接: https://arxiv.org/abs/2603.16524
作者: Vansh Sharma,Venkat Raman
机构: University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
备注:

点击查看摘要

Abstract:This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonations research. First, the efficacy of segmentation on generated data is shown with a prediction error below 2%. Next, 3D simulation data is used to establish performance of the graph-based workflow. The results of statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.
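"以图结构提取并量化胞格"的基本思想可以在一个二维玩具压力迹线上演示:把胞格内部像素视为图结点、4-邻接为边,连通分量即单个胞格,其结点数即面积。下面的网格数据为手工构造的示意,并非论文的三维工作流:

```python
from collections import deque

# 玩具压力迹线: 1 = 胞格边界 (三波点轨迹), 0 = 胞格内部
trace = [
    "1111111111",
    "1000110001",
    "1000110001",
    "1111111111",
    "1000011001",
    "1000011001",
    "1111111111",
]
grid = [[int(c) for c in row] for row in trace]
H, W = len(grid), len(grid[0])

def cell_areas(grid):
    """把胞格内部像素视为图结点, 4-邻接为边, 用 BFS 找连通分量即单个胞格。"""
    seen = [[False] * W for _ in range(H)]
    areas = []
    for i in range(H):
        for j in range(W):
            if grid[i][j] == 0 and not seen[i][j]:
                q, area = deque([(i, j)]), 0
                seen[i][j] = True
                while q:
                    y, x = q.popleft()
                    area += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and grid[ny][nx] == 0 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                areas.append(area)
    return sorted(areas)

areas = cell_areas(grid)   # 每个胞格的面积 (结点数)
```

在三维数据上,同样的连通性思路扩展到体素图,即可统计胞格体积并计算联合概率密度。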

[CV-50] VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

【速读】:该论文旨在解决多视角视觉推理(multi-view visual reasoning)在真实场景中面临的挑战,即如何从稀疏且离散的视角中整合局部观测信息以理解复杂环境,而现有方法主要集中在单图或时序密集视频设置下,难以迁移至现实世界。解决方案的关键在于构建一个基于物理模拟的可扩展数据生成引擎,用于创建高保真3D场景及精确的每视图元数据,从而支持大规模、带标注的问题-答案对训练;在此基础上提出VIEW2SPACE基准和分离式训练划分,结合Grounded Chain-of-Thought with Visual Evidence方法,在保证几何感知能力的同时显著提升跨数据集的稀疏多视角推理性能,揭示了模型规模、数据量、推理深度与可见性约束之间的难度感知关系,指出深层组合推理仍是当前核心瓶颈。

链接: https://arxiv.org/abs/2603.16506
作者: Fucai Ke,Zhixi Cai,Boying Li,Long Chen,Beibei Lin,Weiqing Wang,Pari Delir Haghighi,Gholamreza Haffari,Hamid Rezatofighi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.

[CV-51] Unlearning for One-Step Generative Models via Unbalanced Optimal Transport

【速读】:该论文旨在解决生成式 AI(Generative AI)中“机器遗忘”(machine unlearning)问题,即如何安全地从训练好的一步生成模型(如流映射模型)中移除特定类别的数据分布,同时保持其余类别的生成质量。现有扩散模型的遗忘方法因依赖多步去噪过程,无法直接应用于高效的一步生成框架。其解决方案的关键在于提出一种基于非平衡最优传输(Unbalanced Optimal Transport, UOT)的即插即用类遗忘框架 UOT-Unlearn:通过在遗忘代价(forget cost)与 f-散度惩罚(f-divergence penalty)之间建立权衡关系,利用松弛的边际约束实现目标类概率质量向剩余类的平滑重分配,从而避免生成低质量或噪声样本,显著提升遗忘成功率(PUL)和保留类别的生成保真度(u-FID)。

链接: https://arxiv.org/abs/2603.16489
作者: Hyundo Choi,Junhyeong An,Jinseong Park,Jaewoong Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 10 figures

点击查看摘要

Abstract:Recent advances in one-step generative frameworks, such as flow map models, have significantly improved the efficiency of image generation by learning direct noise-to-data mappings in a single forward pass. However, machine unlearning for ensuring the safety of these powerful generators remains entirely unexplored. Existing diffusion unlearning methods are inherently incompatible with these one-step models, as they rely on a multi-step iterative denoising process. In this work, we propose UOT-Unlearn, a novel plug-and-play class unlearning framework for one-step generative models based on the Unbalanced Optimal Transport (UOT). Our method formulates unlearning as a principled trade-off between a forget cost, which suppresses the target class, and an f -divergence penalty, which preserves overall generation fidelity via relaxed marginal constraints. By leveraging UOT, our method enables the probability mass of the forgotten class to be smoothly redistributed to the remaining classes, rather than collapsing into low-quality or noise-like samples. Experimental results on CIFAR-10 and ImageNet-256 demonstrate that our framework achieves superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.
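"遗忘代价 + f-散度惩罚"的权衡可以在类别概率分布这个玩具层面上算一遍:把遗忘类的质量按比例重分配给剩余类时,遗忘代价降为零而剩余类的相对比例不变(散度为零);而让质量坍缩到单一类则付出很大的散度代价。下面以 KL 作为 f-散度,分布均为假设数据,并非论文在生成分布上的实际目标:

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])   # 原模型各类别的生成概率, 类别 0 为待遗忘类
forget = 0

def uot_objective(q, lam=1.0, eps=1e-12):
    """遗忘代价 + f-散度惩罚 (此处取 KL) 的玩具目标。
    松弛边际: 只约束剩余类的相对比例, 允许总质量重新分配。"""
    forget_cost = q[forget]                      # 压低目标类的概率质量
    keep = np.arange(len(p)) != forget
    pr = p[keep] / p[keep].sum()
    qr = q[keep] / q[keep].sum()
    kl = np.sum(qr * np.log((qr + eps) / (pr + eps)))
    return forget_cost + lam * kl

# 平滑重分配: 目标类质量按比例转移给剩余类 (UOT 的行为)
q_uot = p.copy()
q_uot[forget] = 0.0
q_uot = q_uot / q_uot.sum()

# 朴素做法: 质量坍缩到单一类 (保真度受损)
q_collapse = np.array([0.0, 0.0, 0.0, 1.0])
```

按比例重分配使目标函数恰好为零,而坍缩方案因 KL 惩罚而被排斥,这正是 UOT 松弛边际带来的好处。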

[CV-52] DST-Net: A Dual-Stream Transformer with Illumination-Independent Feature Guidance and Multi-Scale Spatial Convolution for Low-Light Image Enhancement

【速读】:该论文旨在解决低光照图像增强中因信号退化(如亮度衰减和结构破坏)导致的固有信号先验丢失问题,从而提升图像的可见性和细节保真度。其解决方案的关键在于提出了一种双流Transformer网络(DST-Net),通过两个核心机制实现:一是设计特征提取模块,融合Difference of Gaussians (DoG)、LAB颜色空间变换与VGG-16,提取解耦的光照无关特征作为信号先验,持续引导增强过程;二是构建双流交互架构,利用跨模态注意力机制动态校正增强图像的退化信号表示,并结合可微分曲线估计实现迭代优化。此外,引入多尺度空间融合模块(MSFB),通过伪3D与3D梯度算子卷积恢复高频边缘并捕捉多尺度空间相关性,有效保留精细结构与纹理。

链接: https://arxiv.org/abs/2603.16482
作者: Yicui Shi,Yuhan Chen,Xiangfei Huang,Zhenguo Wang,Wenxuan Yu,Ying Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-light image enhancement aims to restore the visibility of images captured by visual sensors in dim environments by addressing their inherent signal degradations, such as luminance attenuation and structural corruption. Although numerous algorithms attempt to improve image quality, existing methods often cause a severe loss of intrinsic signal priors. To overcome these challenges, we propose a Dual-Stream Transformer Network (DST-Net) based on illumination-agnostic signal prior guidance and multi-scale spatial convolutions. First, to address the loss of critical signal features under low-light conditions, we design a feature extraction module. This module integrates Difference of Gaussians (DoG), LAB color space transformations, and VGG-16 for texture extraction, utilizing decoupled illumination-agnostic features as signal priors to continuously guide the enhancement process. Second, we construct a dual-stream interaction architecture. By employing a cross-modal attention mechanism, the network leverages the extracted priors to dynamically rectify the deteriorated signal representation of the enhanced image, ultimately achieving iterative enhancement through differentiable curve estimation. Furthermore, to overcome the inability of existing methods to preserve fine structures and textures, we propose a Multi-Scale Spatial Fusion Block (MSFB) featuring pseudo-3D and 3D gradient operator convolutions. This module integrates explicit gradient operators to recover high-frequency edges while capturing inter-channel spatial correlations via multi-scale spatial convolutions. Extensive evaluations and ablation studies demonstrate that DST-Net achieves superior performance in subjective visual quality and objective metrics. Specifically, our method achieves a PSNR of 25.64 dB on the LOL dataset. Subsequent validation on the LSRW dataset further confirms its robust cross-scene generalization.
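DoG 为什么能充当"光照无关"的信号先验,可以用其线性与带通性质直接验证:对图像做全局亮度缩放加偏移后,DoG 响应(远离边界处)只是按同一比例缩放,均匀偏移被完全滤除。下面用可分离高斯卷积实现一个最小的 DoG,图像为随机合成数据,并非论文的特征提取模块:

```python
import numpy as np

def gauss_kernel1d(sigma, radius=None):
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    k = gauss_kernel1d(sigma)
    # 可分离高斯: 先按行、再按列做一维卷积
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def dog(img, s1=1.0, s2=2.0):
    """Difference of Gaussians: 带通滤波, 对均匀亮度偏移不敏感。"""
    return blur(img, s1) - blur(img, s2)

rng = np.random.default_rng(3)
scene = rng.uniform(size=(32, 32))
dark = 0.2 * scene + 0.05        # 低光照: 亮度衰减 + 均匀偏移
d_bright = dog(scene)
d_dark = dog(dark)               # 内部区域恰为 0.2 * d_bright, 偏移被滤除
```

正因为这种对亮度变化的稳定性,DoG 结构可与 LAB 变换、VGG 纹理特征一起构成解耦的光照无关先验。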

[CV-53] GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在仅依赖纯RGB图像输入时,3D空间感知能力不足的问题。尽管现有方法利用了来自3D重建模型的隐式几何先验,仍存在显著性能差距,这并非源于几何先验不足,而是训练范式不匹配:以文本为主导的微调策略未能激活MLLMs中的几何表征。解决方案的关键在于提出一种几何对齐预训练范式(Geometry-Aligned Pre-training, GAP-MLLM),其核心是通过引入视觉提示联合任务,强制模型同时预测稀疏点云图(sparse pointmaps)与语义标签,从而显式激活几何感知;并设计多级渐进融合模块与token级门控机制,实现几何先验的自适应融合,避免抑制语义推理能力,最终显著提升在3D视觉定位、3D密集描述和3D视频目标检测等任务上的性能。

链接: https://arxiv.org/abs/2603.16461
作者: Jiaxin Zhang,Junjun Jiang,Haijie Li,Youyu Chen,Kui Jiang,Dave Zhenyu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLMs to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently enhances performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.
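摘要中的"token 级门控机制"大致思路是:对每个 token 计算一个 0~1 的门控权重,自适应地混合几何特征与语义特征。以下是一个假设性的 numpy 示意(实际模型中门控由可学习参数经 sigmoid 得到,此处参数与维度均为假设):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def token_gated_fusion(sem, geo, w, b):
    """token 级门控:gate 趋近 1 时多采几何特征,趋近 0 时保留语义特征。
    sem, geo: (N, D) 语义 / 几何 token;w: (2D,)、b: 标量为假设的可学习参数。"""
    gate = sigmoid(np.concatenate([sem, geo], axis=-1) @ w + b)   # (N,)
    return gate[:, None] * geo + (1.0 - gate)[:, None] * sem
```

门控按 token 独立计算,因此可以在需要几何证据的 token 上注入先验,而在纯语义 token 上不干扰原有推理,这对应摘要中"不抑制语义推理"的设计目标。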

[CV-54] Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval CVPR2026

【速读】:该论文旨在解决视觉-语言模型(Visual-Language Models, VLMs)在处理真实世界文档时因数据异构性和非结构化特性导致的跨模态嵌入一致性破坏问题,以及传统训练策略在有限样本和静态优化下难以适应模型动态演化所引发的跨模态检索混淆问题。其解决方案的关键在于提出Evo-Retriever框架,该框架通过创新的“视角-路径协同机制”实现模型自适应进化:首先利用多视角图像对齐增强细粒度匹配;其次采用双向对比学习生成难例查询并构建互补学习路径以平衡监督信号;最终将上述协作所得的模型状态摘要输入大语言模型(Large Language Model, LLM)元控制器,基于专家知识动态调整训练课程,从而促进模型持续演进。

链接: https://arxiv.org/abs/2603.16455
作者: Weiqing Li,Jinyue Guo,Yaqi Wang,Haiyang Xiao,Yuewei Zhang,Guohua Liu,Hao Henry Wang
机构: Alibaba Cloud Computing (阿里云计算)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model’s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates “hard queries” and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model’s evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
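摘要中的评测指标 nDCG@5 衡量前 5 个检索结果的排序质量。一个极简的计算示意如下(采用常见的指数增益形式;IDCG 此处用同一列表的理想排序近似,为简化假设):

```python
import numpy as np

def ndcg_at_k(relevances, k=5):
    """relevances: 按检索排名顺序给出的相关性分数(越大越相关)。"""
    rel = np.asarray(relevances, dtype=float)
    m = min(k, rel.size)
    discounts = np.log2(np.arange(2, m + 2))        # 位置折扣
    dcg = np.sum((2 ** rel[:m] - 1) / discounts)
    ideal = np.sort(rel)[::-1][:m]                  # 理想排序下的前 k 项
    idcg = np.sum((2 ** ideal - 1) / discounts)
    return dcg / idcg if idcg > 0 else 0.0
```

当检索结果已按相关性降序排列时 nDCG@k 取最大值 1.0,排序越差分数越低。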

[CV-55] TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection

【速读】:该论文旨在解决工业质量控制中缺陷检测的挑战,即在标注异常样本稀缺的情况下实现高效、实时的异常检测。针对现有自监督方法(如GLASS)计算资源消耗高、难以部署于边缘设备的问题,其关键解决方案是提出TinyGLASS——一种轻量化架构,通过将原始WideResNet-50主干网络替换为紧凑的ResNet-18,并引入面向部署的优化策略(包括静态图追踪和INT8量化),使得模型能在索尼IMX500智能视觉传感器上实现端侧实时运行。该方案在保持竞争性检测性能的同时,实现了8.7倍参数压缩,且满足内存(8 MB)与功耗限制,在MVTec-AD数据集上达到94.2%图像级AUROC,并以20 FPS运行,展现出高能效比(470 GMAC/J)。

链接: https://arxiv.org/abs/2603.16451
作者: Pietro Bonazzi,Rafael Sutter,Luigi Capogrosso,Mischa Buob,Michele Magno
机构: ETH Zürich (苏黎世联邦理工学院); Interdisciplinary Transformation University of Austria (奥地利跨学科转型大学); Swiss Engineering Partners AG (瑞士工程合作伙伴公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony’s Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.
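摘要中端侧部署的关键一步是 INT8 量化。下面用 numpy 示意对称逐张量(per-tensor)量化的基本原理(仅为原理草图,并非 Sony Model Compression Toolkit 的实际实现):

```python
import numpy as np

def quantize_int8(x):
    """对称量化:按最大绝对值确定缩放因子,把浮点张量映射到 [-127, 127]。"""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(x)
max_err = np.max(np.abs(dequantize(q, scale) - x))   # 误差不超过约半个量化步长
```

权重与激活从 32 位浮点压缩到 8 位整数后,存储减少约 4 倍,这正是模型能装入 IMX500 8 MB 内存约束的原因之一。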

[CV-56] ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars CVPR2026

【速读】:该论文旨在解决实时扩展现实(XR)与远程存在(telepresence)应用中因网络带宽和计算资源动态波动而导致的3D虚拟人像渲染质量不稳定问题。其解决方案的关键在于提出一种基于自适应隐式细分构建的层次化3D高斯表示方法——ProgressiveAvatars,该方法通过在模板网格上逐级生长3D高斯,并将其定义在面局部坐标系中以保持不同表情和头部运动下的可动画性;同时,利用屏幕空间信号判断细节缺失区域并动态分配资源,结合重要性排序实现增量加载与渲染,从而在资源受限条件下平滑提升视觉质量。

链接: https://arxiv.org/abs/2603.16447
作者: Kaiwen Song,Jinkai Cui,Juyong Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to CVPR 2026, Project page: this https URL

点击查看摘要

Abstract:In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.

[CV-57] Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline

【速读】:该论文旨在解决雨天通过玻璃表面或挡风玻璃拍摄图像时,雨滴与反射现象共存导致图像可见度显著下降的问题(即复合退化问题)。此前的去雨滴、去反射及一体化模型均未能有效处理此类联合退化。为此,作者首次正式定义了“统一去除雨滴与反射”(Unified Removal of Raindrops and Reflections, UR³)任务,并构建了一个真实场景下的高质量图像数据集RDRF,为该问题提供新的基准。解决方案的关键在于提出了一种基于扩散机制的新框架DiffUR³,通过利用强大的生成先验能力,同时有效去除雨滴和反射两种退化类型,在自建基准和野外复杂图像上均达到当前最优性能。

链接: https://arxiv.org/abs/2603.16446
作者: Xingyu Liu,Zewei He,Yu Chen,Chunyu Zhu,Zixuan Chen,Xing Luo,Zhe-Ming Lu
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:When capturing images through glass surfaces or windshields on rainy days, raindrops and reflections frequently co-occur to significantly reduce the visibility of captured images. This practical problem lacks attention and needs to be resolved urgently. Prior de-raindrop, de-reflection, and all-in-one models have failed to address this composite degradation. To this end, we formally define the unified removal of raindrops and reflections (UR^3) task for the first time and construct a real-shot dataset, namely RainDrop and ReFlection (RDRF), which provides a new benchmark with substantial, high-quality, diverse image pairs. Then, we propose a novel diffusion-based framework (i.e., DiffUR^3) with several target designs to address this challenging task. By leveraging the powerful generative prior, DiffUR^3 successfully removes both types of degradations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on our benchmark and on challenging in-the-wild images. The RDRF dataset and the codes will be made public upon acceptance.

[CV-58] Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation

【速读】:该论文旨在解决当前3D手部重建模型在资源受限设备(如VR/AR头显、智能手机和嵌入式系统)上部署困难的问题,这些问题主要源于现有方法依赖于计算密集型的大型神经网络,导致推理速度慢且难以实时运行。解决方案的关键在于采用轻量化神经网络结构与知识蒸馏(Knowledge Distillation)相结合的方法:首先用更小的骨干网络(如MobileNet、MobileViT、ConvNeXt和ResNet)替代原模型HaMeR中的ViT-H主干,其次通过输出层级、特征层级及两者结合的知识蒸馏策略来保留原始模型的高精度重建能力。实验表明,使用仅占原模型35%参数量的轻量骨干网络,在保持0.4mm微小精度损失的前提下实现了1.5倍的推理加速,验证了该方案在效率与性能之间取得良好平衡的可行性。

链接: https://arxiv.org/abs/2603.16444
作者: Hunain Ahmed Jillani,Ahmed Tawfik Aboukhadra,Ahmed Elhayek,Jameel Malik,Nadia Robertini,Didier Stricker
机构: RPTU, Kaiserslautern, Germany; DFKI-AV, Kaiserslautern, Germany; UPM, Saudi Arabia; NUST-SEECS, Islamabad, Pakistan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how the use of lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that using lightweight backbones that are only 35% the size of the original achieves 1.5x faster inference speed while preserving similar performance quality with only a minimal accuracy difference of 0.4mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under this https URL.
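文中比较了输出级与特征级两种蒸馏策略以及二者的混合。其组合损失可以示意如下(加权系数、特征维度对齐方式均为假设,仅展示损失构成):

```python
import numpy as np

def hybrid_distill_loss(student_out, teacher_out,
                        student_feat, teacher_feat, alpha=0.5):
    """混合蒸馏损失:输出级 MSE 与特征级 MSE 的加权和。
    假设学生特征已经由投影层对齐到教师特征的维度。"""
    out_loss = np.mean((student_out - teacher_out) ** 2)
    feat_loss = np.mean((student_feat - teacher_feat) ** 2)
    return alpha * out_loss + (1 - alpha) * feat_loss

out = np.random.randn(8, 3)       # 示意:8 个关节的 3D 坐标
feat = np.random.randn(8, 16)     # 示意:某层中间特征
loss_same = hybrid_distill_loss(out, out, feat, feat)
```

alpha 趋近 1 即纯输出级蒸馏,趋近 0 即纯特征级蒸馏,对应文中评估的三种策略。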

[CV-59] CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection ICRA2026

【速读】:该论文旨在解决单源域训练下目标检测模型在未见目标域上的泛化能力不足问题,尤其针对因天气、光照或场景条件变化导致的域偏移(domain shift)挑战。解决方案的关键在于提出跨域特征知识蒸馏(Cross-Domain Feature Knowledge Distillation, CD-FKD),通过引入全局和实例级特征蒸馏机制,使学生网络从教师网络中学习更具鲁棒性的对象中心特征;其中教师网络使用原始源域数据,而学生网络则通过降尺度与扰动等多样化数据增强策略进行训练,从而显著提升对复杂场景中难检目标的识别能力,实现源域性能与目标域泛化能力的双重优化。

链接: https://arxiv.org/abs/2603.16439
作者: Junseok Lee,Sungho Shin,Seongju Lee,Kyoobin Lee
机构: LG Electronics( LG 电子); Naver( Naver); Gwangju Institute of Science and Technology (GIST)(光州科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICRA 2026

点击查看摘要

Abstract:Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, like autonomous driving and surveillance, where robust object detection in diverse environments is crucial.
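CD-FKD 中教师网络接收原图、学生网络接收"降尺度 + 扰动"后的多样化输入。这一数据处理思路可作如下假设性示意(下采样与噪声方式均为简化):

```python
import numpy as np

def diversify(img, scale=2, noise_std=0.05, seed=0):
    """学生输入:先降尺度,再加高斯噪声扰动;教师直接使用原图(示意)。"""
    rng = np.random.default_rng(seed)
    small = img[::scale, ::scale]                        # 简单下采样
    noisy = small + rng.normal(0.0, noise_std, small.shape)
    return np.clip(noisy, 0.0, 1.0)

teacher_input = np.random.rand(128, 128)
student_input = diversify(teacher_input)
```

学生在退化输入上模仿教师在干净输入上的特征,从而学会对域偏移(模糊、噪声等)保持不变的对象中心表征。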

[CV-60] IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video

【速读】:该论文旨在解决无监督物理参数估计领域缺乏统一基准的问题,现有方法通常在非重叠的合成数据上评估,唯一的真实世界数据集仅限于单体系统,且尚未建立针对控制方程识别的标准化评估协议。其解决方案的关键在于提出IRIS基准,这是一个高保真度的数据集,包含220个在受控实验室条件下采集的4K分辨率、60 fps真实世界视频,涵盖单体与多体动力学系统,并提供独立测量的地面真实参数及不确定性估计;同时定义了一套标准化评估协议,涵盖参数精度、可辨识性、外推能力、鲁棒性和控制方程选择等维度,并通过多种基线模型(包括多步物理损失和四种互补的方程识别策略)验证了该基准的有效性,为未来研究提供了参考性能和系统性失败模式分析。

链接: https://arxiv.org/abs/2603.16432
作者: Rasul Khanbayov,Mohamed Rayan Barhdadi,Erchin Serpedin,Hasan Kurban
机构: Hamad Bin Khalifa University (哈马德本哈利法大学); Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60 fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined encompassing parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling), establishing reference performance across all IRIS scenarios and exposing systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.
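"从视频反演物理参数"的最小示例:假设已从 60 fps 视频中跟踪得到自由落体的位移序列,用二次多项式拟合即可反解重力加速度(纯原理示意,与基准中的基线方法无关):

```python
import numpy as np

fps = 60
t = np.arange(fps) / fps            # 1 秒、60 帧的时间戳
g_true = 9.81
y = 0.5 * g_true * t ** 2           # 理想无噪声自由落体位移(米)
coeffs = np.polyfit(t, y, deg=2)    # y ≈ a*t^2 + b*t + c
g_est = 2.0 * coeffs[0]             # 由 a = g/2 反解出 g
```

真实视频中轨迹带噪声、系统更复杂(多体、摩擦等),这正是 IRIS 基准要系统评测的难点。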

[CV-61] Cross-modal learning for plankton recognition

【速读】:该论文旨在解决自动浮游生物图像识别中对大量人工标注数据依赖的问题,同时探索如何有效利用多模态未标注数据(如图像与光学测量数据)提升识别性能。其关键解决方案是采用自监督跨模态协同学习策略,通过仅需二元监督信号(即图像与光学剖面是否来自同一颗粒)来训练图像和测量数据的编码器,从而实现无需人工标注即可引导模型学习跨模态一致性特征;在此基础上,结合少量已知物种的标签样本与k近邻(k-NN)分类器进行最终识别,构建出一个天然具备多模态融合能力的浮游生物识别模型,显著降低了标注成本并优于仅使用图像的自监督基线方法。

链接: https://arxiv.org/abs/2603.16427
作者: Joona Kareinen,Veikka Immonen,Tuomas Eerola,Lumi Haraguchi,Lasse Lensu,Kaisa Kraft,Sanna Suikkanen,Heikki Kälviäinen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a k-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at this https URL.
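受 CLIP 启发的跨模态协同训练可用对比目标示意:同一颗粒的图像-剖面嵌入对落在相似度矩阵对角线上,作为唯一的"正样本"监督信号。以下为假设性 numpy 草图(温度系数与形状均为示意):

```python
import numpy as np

def info_nce(img_emb, prof_emb, temperature=0.07):
    """img_emb, prof_emb: (N, D),第 i 行来自同一颗粒(二元监督)。"""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    prof = prof_emb / np.linalg.norm(prof_emb, axis=1, keepdims=True)
    logits = img @ prof.T / temperature             # (N, N) 相似度矩阵
    logits -= logits.max(axis=1, keepdims=True)     # 数值稳定
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(probs)))         # 对角线即正样本对
```

当两个编码器把同一颗粒的图像与光学剖面映射到相近向量时损失趋于零,错配时损失增大,这便是无需人工标签的"是否同一颗粒"监督。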

[CV-62] 3D Fourier-based Global Feature Extraction for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像分类(HSIC)中两大核心问题:一是基于Transformer的模型因自注意力机制具有二次复杂度而导致扩展性差;二是现有基于傅里叶变换的方法通常仅使用二维空间快速傅里叶变换(2D spatial FFT),忽略了高光谱数据中关键的跨波段光谱依赖关系。解决方案的关键在于提出一种新型混合架构HGFNet,其核心创新包括:1)引入三种互补的频域变换——光谱傅里叶变换(Spectral Fourier Transform)、空间傅里叶变换(Spatial Fourier Transform)和时空傅里叶变换(Spatial-Spatial Fourier Transform),实现对高维频率特征的全面建模;2)结合局部3D卷积层提取细粒度的空间-光谱结构与傅里叶域全局滤波模块高效建模长程依赖并抑制噪声,从而实现高效且鲁棒的空间-光谱表征学习;3)嵌入自适应焦点损失(Adaptive Focal Loss, AFL)以缓解HSIC中常见的类别不平衡问题,提升对少数类别的判别能力。

链接: https://arxiv.org/abs/2603.16426
作者: Muhammad Ahmad
机构: SDAIA-KFUPM, Joint Research Center for Artificial Intelligence (JRCAI), King Fahd University of Petroleum and Minerals (国王法赫德石油大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral image classification (HSIC) has been significantly advanced by deep learning methods that exploit rich spatial-spectral correlations. However, existing approaches still face fundamental limitations: transformer-based models suffer from poor scalability due to the quadratic complexity of self-attention, while recent Fourier transform-based methods typically rely on 2D spatial FFTs and largely ignore critical inter-band spectral dependencies inherent to hyperspectral data. To address these challenges, we propose Hybrid GFNet (HGFNet), a novel architecture that integrates localized 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks for efficient and robust spatial-spectral representation learning. HGFNet introduces three complementary frequency transforms tailored to hyperspectral imagery: Spectral Fourier Transform (a 1D FFT along the spectral axis), Spatial Fourier Transform (a 2D FFT over spatial dimensions), and Spatial-Spatial Fourier Transform (a 3D FFT jointly over spectral and spatial dimensions), enabling comprehensive and high-dimensional frequency modeling. The 3D convolutional layers capture fine-grained local spatial-spectral structures, while the Fourier-based global filtering modules efficiently model long-range dependencies and suppress noise. To further mitigate the severe class imbalance commonly observed in HSIC, HGFNet incorporates an Adaptive Focal Loss (AFL) that dynamically adjusts class-wise focusing and weighting, improving discrimination for underrepresented classes.
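"光谱傅里叶变换"即沿波段维做 1D FFT,在频域乘以(可学习的)全局滤波器后再逆变换。以下 numpy 草图展示该机制的骨架(滤波器权重为假设参数,实际模型中为可学习复数权重):

```python
import numpy as np

def spectral_global_filter(cube, w):
    """cube: (H, W, B) 高光谱块;w: (B,) 频域滤波器。"""
    freq = np.fft.fft(cube, axis=-1)            # 沿光谱轴的 1D FFT
    return np.fft.ifft(freq * w, axis=-1).real  # 频域滤波后逆变换回光谱域

cube = np.random.rand(8, 8, 16)
same = spectral_global_filter(cube, np.ones(16))   # 全通滤波应还原输入
```

频域逐点相乘等价于光谱域的循环卷积,因此一次 FFT 即可建模全波段的长程依赖,复杂度远低于自注意力的二次开销;Spatial 与 Spatial-Spectral 变体只是把 FFT 换成沿空间维的 2D / 联合 3D 版本。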

[CV-63] SF-Mamba: Rethinking State Space Model for Vision

【速读】:该论文旨在解决视觉领域中Mamba架构因单向扫描机制导致的非因果交互受限问题,以及在短序列长度下计算效率较低的问题。现有方法虽尝试通过多扫描策略缓解上述限制,但存在扫描设计不合理和频繁数据重排带来的效率瓶颈。解决方案的关键在于提出SF-Mamba,其核心创新包括:1)辅助补丁交换(auxiliary patch swapping),在单向扫描下实现双向信息流编码;2)批量折叠与周期性状态重置(batch folding with periodic state reset),优化GPU并行计算效率。实验表明,SF-Mamba在图像分类、目标检测及实例/语义分割任务中显著优于现有最优基线,且在不同模型规模下均提升吞吐量。

链接: https://arxiv.org/abs/2603.16423
作者: Masakazu Yoshimura,Teruaki Hayashi,Yuki Hoshino,Wei-Yao Wang,Takeshi Ohashi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages

点击查看摘要

Abstract:The realm of Mamba for vision has been advanced in recent years to strike for the alternatives of Vision Transformers (ViTs) that suffer from the quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under an unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.

[CV-64] HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction ICME2026

【速读】:该论文旨在解决蛋白质标志物(protein markers)与组织病理图像(histopathology images)联合预后潜力尚未被充分挖掘的问题,尤其针对蛋白表达谱分析成本高、数据获取受限的挑战。解决方案的关键在于提出一种基于Mamba架构的多模态框架HGP-Mamba,其核心创新包括:1)设计了一个蛋白特征提取器(Protein Feature Extractor, PFE),利用预训练基础模型从全切片图像(Whole Slide Images, WSIs)中直接生成高通量蛋白嵌入(protein embeddings),实现分子信息的数据高效融合;2)引入局部交互感知Mamba(Local Interaction-aware Mamba, LiAM)以捕捉细粒度特征交互,并结合全局交互增强Mamba(Global Interaction-enhanced Mamba, GiEM)促进整张切片层面的模态融合,从而有效建模复杂的跨模态依赖关系。

链接: https://arxiv.org/abs/2603.16421
作者: Jing Dai,Chen Wu,Ming Wu,Qibin Zhang,Zexi Wu,Jingdong Zhang,Hongming Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE ICME 2026. This arXiv version includes additional supplementary experiments and extended discussions beyond the conference version

点击查看摘要

Abstract:Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP-Mamba, a Mamba-based multimodal framework that efficiently integrates histological with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high-throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data-efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction and the Global Interaction-enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capture complex cross-modal dependencies. Experiments on four public cancer datasets demonstrate that HGP-Mamba achieves state-of-the-art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at this https URL.

[CV-65] Near-light Photometric Stereo with Symmetric Lights

【速读】:该论文旨在解决近光(near-light)条件下光度立体法(photometric stereo)中因非凸优化导致的深度初始化敏感和光照标定复杂的问题。其解决方案的关键在于利用对称布置的多组邻近光源对,通过几何约束推导出表面法向量和深度的闭式解(closed-form solution),从而无需初始值即可直接求解,且对光源空间偏移的校准不敏感,仅需保证光源关于任意点对称分布即可实现高精度形状恢复。

链接: https://arxiv.org/abs/2603.16404
作者: Lilika Makabe,Heng Guo,Hiroaki Santo,Fumio Okura,Yasuyuki Matsushita
机构: Graduate School of Information Science and Technology, Osaka University, Japan(大阪大学信息科学与技术研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper describes a linear solution method for near-light photometric stereo by exploiting symmetric light source arrangements. Unlike conventional non-convex optimization approaches, by arranging multiple sets of symmetric nearby light source pairs, our method derives a closed-form solution for surface normal and depth without requiring initialization. In addition, our method works as long as the light sources are symmetrically distributed about an arbitrary point even when the entire spatial offset is uncalibrated. Experiments showcase the accuracy of shape recovery accuracy of our method, achieving comparable results to the state-of-the-art calibrated near-light photometric stereo method while significantly reducing requirements of careful depth initialization and light calibration.
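作为背景,经典的远光(distant-light)光度立体在朗伯假设下本身就是线性问题:已知至少 3 个光照方向与对应的观测亮度,即可最小二乘解出表面法向;本文的贡献是借助对称光源布置把这种线性可解性推广到更难的近光情形。远光版本的极简示意如下(非论文的近光闭式解):

```python
import numpy as np

L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 1.0]])
L /= np.linalg.norm(L, axis=1, keepdims=True)    # 三个已标定的光照方向
n_true = np.array([0.2, -0.1, 1.0])
n_true /= np.linalg.norm(n_true)                 # 真实表面法向
I = L @ n_true                                   # 朗伯模型下的观测亮度
n_est, *_ = np.linalg.lstsq(L, I, rcond=None)    # 线性最小二乘求解法向
n_est /= np.linalg.norm(n_est)
```

近光情形下光照方向随表面点位置变化,问题一般变为非凸;对称光源对的几何约束使未知的位置依赖项相互抵消,从而恢复出类似上式的闭式结构。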

[CV-66] DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

【速读】:该论文旨在解决皮肤病变分类系统在实际应用中因临床数据集规模小、多样性不足及标注质量差导致的类别不平衡问题,进而影响模型泛化性能。其核心解决方案是提出一种基于修正流(rectified flow)的文本到图像生成框架 DermaFlux,该框架可从自然语言描述的皮肤病学特征中合成具有临床意义的皮肤病变图像;关键创新在于利用 LoRA(Low-Rank Adaptation)对 Flux.1 模型进行参数高效微调,并结合 Llama 3.2 自动生成符合 dermatological criteria(如不对称性、边界不规则性和颜色变异)的图像-文本对,从而显著提升小样本场景下的分类准确率和 AUC 值。

链接: https://arxiv.org/abs/2603.16392
作者: Stathis Galanakis,Alexandros Koliousis,Stefanos Zafeiriou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.
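DermaFlux 用 LoRA 对 Flux.1 做参数高效微调。LoRA 的核心是把权重增量约束为低秩乘积 ΔW = (α/r)·B·A,只训练 A、B 两个小矩阵而冻结原权重 W。一个假设性的前向示意(维度与初始化为常见约定,非 DermaFlux 实际配置):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """x: (N, d_in);W: (d_out, d_in) 冻结的预训练权重;
    A: (r, d_in)、B: (d_out, r) 为可训练低秩矩阵,ΔW = (alpha/r)*B@A。"""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 16, 4
x = rng.standard_normal((2, d_in))
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))            # B 初始化为 0:微调起点与原模型完全一致
```

可训练参数量为 r·(d_in + d_out),远小于 d_in·d_out,这正是 LoRA 能以低成本适配大型生成模型的原因。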

[CV-67] Unpaired Cross-Domain Calibration of DMSP to VIIRS Nighttime Light Data Based on CUT Network

【速读】:该论文旨在解决防御气象卫星计划(DMSP-OLS)与Suomi国家极轨卫星伙伴关系(SNPP-VIIRS)夜间灯光(NTL)数据因传感器不兼容导致的长期时序分析难题。其核心解决方案是提出一种基于对比无配对翻译(Contrastive Unpaired Translation, CUT)网络的跨传感器校准方法,通过多层局部块级对比学习最大化对应图像块间的互信息,从而在保持内容一致性的同时学习跨域相似性,实现将DMSP数据转换为VIIRS风格的数据。该方法利用2012–2013年重叠时段数据训练模型,并成功生成1992–2013年期间高质量的VIIRS类影像,验证结果显示生成数据与实际VIIRS观测及社会经济指标高度一致(R² > 0.87),有效解决了跨传感器数据融合与DMSP缺陷校正问题。

链接: https://arxiv.org/abs/2603.16385
作者: Zhan Tong,ChenXu Zhou,Fei Tang,Yiming Tu,Tianyu Qin,Kaihao Fang
机构: Nanjing Institute of Technology (南京工程学院); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, 8 tables. Submitted to Remote Sensing of Environment. Code and data available at: this https URL [your-repo-link]

点击查看摘要

Abstract:Defense Meteorological Satellite Program (DMSP-OLS) and Suomi National Polar-orbiting Partnership (SNPP-VIIRS) nighttime light (NTL) data are vital for monitoring urbanization, yet sensor incompatibilities hinder long-term analysis. This study proposes a cross-sensor calibration method using Contrastive Unpaired Translation (CUT) network to transform DMSP data into VIIRS-like format, correcting DMSP defects. The method employs multilayer patch-wise contrastive learning to maximize mutual information between corresponding patches, preserving content consistency while learning cross-domain similarity. Utilizing 2012-2013 overlapping data for training, the network processes 1992-2013 DMSP imagery to generate enhanced VIIRS-style raster data. Validation results demonstrate that generated VIIRS-like data exhibits high consistency with actual VIIRS observations (R-squared greater than 0.87) and socioeconomic indicators. This approach effectively resolves cross-sensor data fusion issues and calibrates DMSP defects, providing reliable attempt for extended NTL time-series.

[CV-68] Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

【速读】:该论文旨在解决现有视觉分词器(visual tokenizer)在图像表示中难以捕捉紧凑全局语义的问题。当前方法通常将图像映射到固定二维空间网格,并侧重于像素级重建,导致语义信息冗余且表达效率低。解决方案的关键在于提出一种语义一维分词器(SemTok),其核心创新包括:1)二维到一维的分词机制,将图像压缩为具有高层语义的离散一维token;2)语义对齐约束,增强token与图像语义的一致性;3)两阶段生成训练策略,提升重建质量和生成能力。该框架在图像重建任务中实现了最先进的性能,同时显著降低token表示的维度,为下游图像生成任务提供了更高效的语义基础。

链接: https://arxiv.org/abs/2603.16373
作者: Yunpeng Qu,Kaidong Zhang,Yukang Ding,Ying Chen,Jian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages,12 figures

点击查看摘要

Abstract:Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose \textbfSemTok, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.
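把连续的一维 token 序列离散化通常通过向量量化完成:每个连续 token 被替换为码本中最近码字的索引。以下为该机制的假设性示意(SemTok 的具体离散化设计以原文为准):

```python
import numpy as np

def vq_tokenize(latents, codebook):
    """latents: (N, D) 连续一维 token;codebook: (K, D) 码本。返回 (N,) 离散索引。"""
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) 距离
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 8))
tokens = vq_tokenize(rng.standard_normal((32, 8)), codebook)
```

离散索引序列即可直接作为自回归 / 掩码生成模型的词表输入,这是"1D 离散 token"能与下游生成框架无缝衔接的原因。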

[CV-69] InViC: Intent-aware Visual Cues for Medical Visual Question Answering

【Quick Read】: This paper tackles the "shortcut answering" problem of current multimodal large language models (MLLMs) in medical visual question answering (Med-VQA): models over-rely on language priors or dataset biases to produce plausible but visually ungrounded answers, undermining clinical reliability, especially when subtle imaging findings are decisive. The key to the solution is a lightweight plug-in framework, Intent-aware Visual Cues (InViC), with two core mechanisms: 1) a Cue Tokens Extraction (CTE) module that distills dense visual tokens into K question-conditioned cue tokens, injected into the LLM decoder as structured visual intermediaries to promote intent-aligned use of visual evidence; and 2) a two-stage fine-tuning strategy in which Stage I uses a cue-bottleneck attention mask to force all visual information through the cue pathway, and Stage II restores standard causal attention to jointly optimize the use of visual and cue tokens, effectively discouraging direct access to raw visual features and improving the trustworthiness of answer generation.

Link: https://arxiv.org/abs/2603.16372
Authors: Zhisong Wang,Ziyang Chen,Zanting Ye,Hongze Zhu,Yefeng Zheng,Yong Xia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM’s direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.

[CV-70] Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions

【Quick Read】: This paper addresses the pervasive quality degradation in underwater image enhancement, namely color distortion and reduced contrast caused by strong light absorption and scattering in water, while keeping the model lightweight enough for real-time deployment on underwater devices. The key to the solution is a real-time enhancement framework that combines accurate color restoration with low computational complexity: first, an Adaptive Weighted Channel Compensation module dynamically restores the red and blue channels using the green channel as a reference anchor; second, a Multi-branch Re-parameterized Dilated Convolution fuses multi-branch features during training and applies structural re-parameterization at inference, achieving a large receptive field at low computational cost; finally, a Statistical Global Color Adjustment module based on statistical priors optimizes overall color consistency. With only 3,880 inference parameters and an inference speed of 409 FPS, the method improves multiple evaluation metrics, including a 29.7% gain in UCIQE, and its deployment on real ROV platforms validates its practicality and superiority.

Link: https://arxiv.org/abs/2603.16363
Authors: Yiqiang Zhou,Yifan Chen,Zhe Sun,Jijun Lu,Ye Zheng,Xuelong Li
Affiliations: Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China; College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China; College of Future Information Technology, Fudan University, Shanghai 200433, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, China; Fujian Ocean Innovation Center, Xiamen 361102, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.
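
As a rough illustration of the green-referenced idea: the paper's Adaptive Weighted Channel Compensation learns its weighting, so the fixed heuristic below is only a classic hand-crafted analogue, with the gain `alpha` and toy image values as assumptions.

```python
import numpy as np

def compensate_channels(img, alpha=1.0):
    """Green-referenced color compensation (hand-crafted sketch of the idea).

    img: float array (H, W, 3) in [0, 1], channels ordered R, G, B.
    The attenuated red/blue channels are boosted toward the green channel
    in proportion to the per-channel mean deficit, modulated by local green
    intensity; the paper replaces this fixed rule with adaptive weights.
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    out = img.copy()
    out[..., 0] = np.clip(r + alpha * (g.mean() - r.mean()) * (1 - r) * g, 0, 1)
    out[..., 2] = np.clip(b + alpha * (g.mean() - b.mean()) * (1 - b) * g, 0, 1)
    return out

# toy underwater-like image: strong green cast, weak red and blue
img = np.zeros((4, 4, 3))
img[..., 0], img[..., 1], img[..., 2] = 0.1, 0.6, 0.2
enhanced = compensate_channels(img)
```

The compensation only lifts the deficient channels; the green reference itself is left untouched.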

[CV-71] D^3-RSMDE: 40× Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

【Quick Read】: This paper addresses the accuracy-efficiency trade-off in remote sensing monocular depth estimation: existing methods either use Vision Transformer (ViT) backbones that are fast but perceptually poor, or diffusion models that are accurate but computationally prohibitive. The key to the solution is an efficient framework, D^3-RSMDE, with two core innovations: first, a ViT-based module rapidly produces a high-quality preliminary depth map as a structural prior, replacing the time-consuming initial structure generation stage of diffusion models; second, a Progressive Linear Blending Refinement (PLBR) strategy uses a lightweight U-Net to refine details in only a few iterations within a compact latent space supported by a Variational Autoencoder (VAE). The approach significantly improves perceptual quality (an 11.85% reduction in LPIPS) while achieving over a 40x inference speedup with VRAM usage comparable to lightweight ViT models.

Link: https://arxiv.org/abs/2603.16362
Authors: Ruizhi Wang,Weihan Li,Zunlei Feng,Haofei Zhang,Mingli Song,Jiayu Wang,Jie Song,Li Sun
Affiliations: 1. University of Science and Technology of China; 2. Institute of Automation, Chinese Academy of Sciences; 3. University of Chinese Academy of Sciences; 4. Tsinghua University; 5. Huawei Technologies Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although using Vision Transformer (ViT) backbones for dense prediction is fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation (D^3-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map construction, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that D^3-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.

[CV-72] Automated identification of Ichneumonoidea wasps via YOLO-based deep learning: Integrating HiresCam for Explainable AI

【Quick Read】: This paper targets the problem that taxonomic identification of parasitoid wasps in the superfamily Ichneumonoidea remains labor-intensive and expertise-dependent owing to high morphological similarity, small body size, and subtle interspecific variation. The key to the solution is a YOLO-based deep learning framework integrated with High-Resolution Class Activation Mapping (HiResCAM) for interpretability, enabling automated family-level identification of parasitoid wasps from high-resolution images. Experiments on 3,556 Hymenoptera specimen images show over 96% accuracy, and HiResCAM visualizations confirm that the model attends to taxonomically relevant anatomical regions (such as wing venation, antennae segmentation, and metasomal structures), improving the biological plausibility and trustworthiness of its decisions.

Link: https://arxiv.org/abs/2603.16351
Authors: Joao Manoel Herrera Pinheiro,Gabriela Do Nascimento Herrera,Alvaro Doria Dos Santos,Luciana Bueno Dos Reis Fernandes,Ricardo V. Godoy,Eduardo A. B. Almeida,Helena Carolina Onody,Marcelo Andrade Da Costa Vieira,Angelica Maria Penteado-Dias,Marcelo Becker
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 20 figures

Click to view abstract

Abstract:Accurate taxonomic identification of parasitoid wasps within the superfamily Ichneumonoidea is essential for biodiversity assessment, ecological monitoring, and biological control programs. However, morphological similarity, small body size, and fine-grained interspecific variation make manual identification labor-intensive and expertise-dependent. This study proposes a deep learning-based framework for the automated identification of Ichneumonoidea wasps using a YOLO-based architecture integrated with High-Resolution Class Activation Mapping (HiResCAM) to enhance interpretability. The proposed system simultaneously identifies wasp families from high-resolution images. The dataset comprises 3556 high-resolution images of Hymenoptera specimens. The taxonomic distribution is primarily concentrated among the families Ichneumonidae (n = 786), Braconidae (n = 648), Apidae (n = 466), and Vespidae (n = 460). Extensive experiments were conducted using a curated dataset, with model performance evaluated through precision, recall, F1 score, and accuracy. The results demonstrate high accuracy of over 96 % and robust generalization across morphological variations. HiResCAM visualizations confirm that the model focuses on taxonomically relevant anatomical regions, such as wing venation, antennae segmentation, and metasomal structures, thereby validating the biological plausibility of the learned features. The integration of explainable AI techniques improves transparency and trustworthiness, making the system suitable for entomological research to accelerate biodiversity characterization in an under-described parasitoid superfamily.
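
HiResCAM itself has a simple core: instead of Grad-CAM's per-channel averaged gradient weights, it multiplies gradients and activations element-wise before summing over channels, which preserves spatial gradient information. A minimal NumPy sketch with toy activations and gradients (not the paper's model or data):

```python
import numpy as np

def hirescam(activations, gradients):
    """HiResCAM heatmap: element-wise gradient * activation, summed over channels.

    activations, gradients: (C, H, W) arrays from the layer of interest.
    Grad-CAM would first global-average-pool `gradients` into one weight per
    channel; HiResCAM keeps the full spatial gradient, yielding more faithful,
    higher-resolution attribution maps.
    """
    cam = (gradients * activations).sum(axis=0)
    cam = np.maximum(cam, 0)      # ReLU: keep positively contributing regions
    if cam.max() > 0:
        cam = cam / cam.max()     # normalize to [0, 1] for visualization
    return cam

# toy example: one channel strongly active only in the top-left corner
acts = np.zeros((2, 4, 4))
acts[0, 0, 0] = 5.0
grads = np.ones((2, 4, 4))
heat = hirescam(acts, grads)
```

In practice the resulting map is upsampled and overlaid on the input image, which is how the wing-venation and antennae visualizations in the paper are produced.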

[CV-73] Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds

【Quick Read】: This paper addresses robust 3D human pose estimation from LiDAR point clouds, focusing on two challenges in human-object interaction (HOI) regions: spatial ambiguity and class imbalance. The key to the solution is the HOIL (Human-Object Interaction Learning) framework: first, human-object interaction-aware contrastive learning (HOICL) enhances feature discrimination between human and object points within interaction regions, mitigating spatial ambiguity; second, contact-aware part-guided pooling (CPPool) adaptively reallocates representational capacity, compressing overrepresented points while preserving informative points from interacting body parts, alleviating class imbalance; in addition, a contact-based temporal refinement module exploits contact cues over time to refine per-frame keypoint predictions.

Link: https://arxiv.org/abs/2603.16343
Authors: Daniel Sungho Jung,Dohee Cho,Kyoung Mu Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Understanding humans from LiDAR point clouds is one of the most critical tasks in autonomous driving due to its close relationships with pedestrian safety, yet it remains challenging in the presence of diverse human-object interactions and cluttered backgrounds. Nevertheless, existing methods largely overlook the potential of leveraging human-object interactions to build robust 3D human pose estimation frameworks. There are two major challenges that motivate the incorporation of human-object interaction. First, human-object interactions introduce spatial ambiguity between human and object points, which often leads to erroneous 3D human keypoint predictions in interaction regions. Second, there exists severe class imbalance in the number of points between interacting and non-interacting body parts, with the interaction-frequent regions such as hand and foot being sparsely observed in LiDAR data. To address these challenges, we propose a Human-Object Interaction Learning (HOIL) framework for robust 3D human pose estimation from LiDAR point clouds. To mitigate the spatial ambiguity issue, we present human-object interaction-aware contrastive learning (HOICL) that effectively enhances feature discrimination between human and object points, particularly in interaction regions. To alleviate the class imbalance issue, we introduce contact-aware part-guided pooling (CPPool) that adaptively reallocates representational capacity by compressing overrepresented points while preserving informative points from interacting body parts. In addition, we present an optional contact-based temporal refinement that refines erroneous per-frame keypoint estimates using contact cues over time. As a result, our HOIL effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions. Codes will be released.

[CV-74] PKINet-v2: Towards Powerful and Efficient Poly-Kernel Remote Sensing Object Detection

【Quick Read】: This paper addresses the coexisting geometric and spatial complexity in object detection for remote sensing images (RSIs): targets may appear with diverse aspect ratios while spanning a wide range of sizes across scenes. Existing methods handle the two problems separately, e.g., anisotropic strip kernels for slender targets or isotropic large kernels for global context, but this isolation yields complementary drawbacks: strip-only designs disrupt spatial coherence for regular-shaped objects and weaken tiny details, while isotropic large kernels introduce severe background noise and geometric mismatch for slender structures. The key to the solution is the Poly Kernel Inception Network v2 (PKINet-v2), which synergizes anisotropic axial-strip convolutions with isotropic square kernels in a unified framework, building a multi-scope receptive field that preserves fine local textures while progressively aggregating long-range context across scales; a further Heterogeneous Kernel Re-parameterization (HKR) strategy fuses all heterogeneous branches into a single depth-wise convolution for efficient inference, eliminating fragmented kernel launches without accuracy loss and surpassing existing remote sensing backbones in both accuracy and efficiency.

Link: https://arxiv.org/abs/2603.16341
Authors: Xinhao Cai,Liulei Li,Gensheng Pei,Zeren Sun,Yazhou Yao,Wenguan Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios, while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet, and present a powerful and efficient backbone that jointly handles both challenges within a unified paradigm named Poly Kernel Inception Network v2 (PKINet-v2). PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) Strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely-used benchmarks, including DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a 3.9× FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.
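
The HKR fusion relies on the linearity of convolution: parallel branches whose outputs are summed can be merged into one kernel by zero-padding each branch kernel into the largest kernel's footprint and adding. A small NumPy sketch with assumed branch shapes (one dense 3x3 kernel, one 1x1 kernel, one dilation-2 3x3 kernel occupying a 5x5 footprint; the paper's exact branch configuration may differ):

```python
import numpy as np

def fuse_branches(k3, k1, kd):
    """Fuse parallel depthwise branches into one equivalent kernel.

    k3: (3, 3) dense kernel; k1: (1, 1) kernel; kd: (3, 3) kernel applied
    with dilation 2, i.e., its taps land on a 5x5 grid at stride 2.
    Returns a single (5, 5) kernel equivalent to summing the three
    branch outputs, valid because convolution is linear in the kernel.
    """
    fused = np.zeros((5, 5))
    fused[1:4, 1:4] += k3       # 3x3 kernel centered inside the 5x5 grid
    fused[2, 2] += k1[0, 0]     # 1x1 kernel contributes only at the center
    fused[::2, ::2] += kd       # dilation-2 taps at rows/cols 0, 2, 4
    return fused

k3 = np.full((3, 3), 1.0)
k1 = np.array([[2.0]])
kd = np.full((3, 3), 0.5)
fused = fuse_branches(k3, k1, kd)
```

At inference time, a single convolution with `fused` replaces three separate kernel launches, which is where the reported FPS gain comes from.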

[CV-75] Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation CVPR2026

【Quick Read】: This paper addresses two problems in monocular depth estimation (MDE): conventional feed-forward methods rely on massive training data yet still miss details, while diffusion-based methods struggle with synthetic-to-real domain transfer. The key to the solution is a deterministic framework named Iris, whose core innovation is a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the first stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real-world priors and preserve structure while leaving high-frequency details unconstrained; the second stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency spectral fidelity and refine with synthetic ground truth. The two stages share weights and execute in a high-to-low timestep order, achieving strong generalization and detail preservation with limited training data.

Link: https://arxiv.org/abs/2603.16340
Authors: Xinhao Cai,Gensheng Pei,Zeren Sun,Yazhou Yao,Fumin Shen,Wenguan Wang
Affiliations: Nanjing University of Science and Technology; State Key Laboratory of Intelligent Manufacturing of Advanced Construction Machinery; Sungkyunkwan University; University of Electronic Science and Technology of China; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.

[CV-76] SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks

【Quick Read】: This paper addresses the scarcity of labeled data for training event-based vision sensors with spiking neural networks (SNNs), which limits their potential for efficient perception on embedded systems. The key to the solution is SpikeCLR, a contrastive self-supervised learning framework that adapts frame-based methods to the spiking domain via surrogate gradient training and introduces a suite of event-specific augmentations exploiting spatial, temporal, and polarity transformations to capture spatio-temporal invariances in event data. Experiments show that self-supervised pretraining followed by fine-tuning substantially outperforms supervised learning in low-data regimes, and the learned representations transfer across datasets, offering an effective path for event-based models in label-scarce settings.

Link: https://arxiv.org/abs/2603.16338
Authors: Maxime Vaillant,Axel Carlier,Lai Xing Ng,Christophe Hurter,Benoit R. Cottereau
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures

Click to view abstract

Abstract:Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.
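
The event-specific augmentations can be sketched directly on a raw event array. The three minimal transforms below (horizontal flip, time reversal, polarity flip) are illustrative stand-ins for the paper's spatial/temporal/polarity families, with an assumed (x, y, t, polarity) column layout and polarity in {-1, +1}:

```python
import numpy as np

def augment_events(events, width, mode):
    """Apply one event-stream augmentation to an (N, 4) array of (x, y, t, p).

    Hypothetical minimal versions of three augmentation families: spatial
    (horizontal flip), temporal (time reversal with re-sorting), and
    polarity (sign flip). The input array is left unmodified.
    """
    out = events.copy()
    if mode == "hflip":
        out[:, 0] = width - 1 - out[:, 0]
    elif mode == "time_reverse":
        out[:, 2] = out[:, 2].max() - out[:, 2]
        out = out[np.argsort(out[:, 2], kind="stable")]  # keep events time-ordered
    elif mode == "polarity_flip":
        out[:, 3] = -out[:, 3]
    return out

ev = np.array([[0.0, 1.0, 0.0, 1.0],
               [5.0, 2.0, 1.0, -1.0],
               [3.0, 0.0, 2.0, 1.0]])
flipped = augment_events(ev, width=8, mode="hflip")
reversed_t = augment_events(ev, width=8, mode="time_reverse")
```

In a SimCLR-style setup, two independently augmented views of the same event stream would form the positive pair for the contrastive loss.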

[CV-77] An Interpretable Machine Learning Framework for Non-Small Cell Lung Cancer Drug Response Analysis

【Quick Read】: This paper addresses the limited effectiveness of conventional treatments (surgery, chemotherapy, and radiation) for personalized lung cancer care caused by tumor heterogeneity. The core solution is a machine-learning predictive model built on multi-omics data to assess a patient's sensitivity to specific drugs, measured by LN-IC50. The key is to use genomic features from the Genomics of Drug Sensitivity in Cancer database with an XGBoost regressor to extract molecular and cellular features, with hyperparameters tuned via cross-validation and randomized search to improve predictive accuracy; SHAP values explain the contribution of each feature to individual predictions, and the DeepSeek large language model verifies biological plausibility, supporting interpretable, personalized treatment decisions.

Link: https://arxiv.org/abs/2603.16330
Authors: Ann Rachel,Pranav M Pawar,Mithun Mukharjee,Raja M,Tojo Mathew
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 8 figures

Click to view abstract

Abstract:Lung cancer is a condition where there is abnormal growth of malignant cells that spread in an uncontrollable fashion in the lungs. Some common treatment strategies are surgery, chemotherapy, and radiation which aren’t the best options due to the heterogeneous nature of cancer. In personalized medicine, treatments are tailored according to the individual’s genetic information along with lifestyle aspects. In addition, AI-based deep learning methods can analyze large sets of data to find early signs of cancer, types of tumor, and prospects of treatment. The paper focuses on the development of personalized treatment plans using specific patient data focusing primarily on the genetic profile. Multi-Omics data from Genomics of Drug Sensitivity in Cancer have been used to build a predictive model along with machine learning techniques. The value of the target variable, LN-IC50, determines how sensitive or resistive a drug is. An XGBoost regressor is utilized to predict the drug response focusing on molecular and cellular features extracted from cancer datasets. Cross-validation and Randomized Search are performed for hyperparameter tuning to further optimize the model’s predictive performance. For explanation purposes, SHAP (SHapley Additive exPlanations) was used. SHAP values measure each feature’s impact on an individual prediction. Furthermore, interpreting feature relationships was performed using DeepSeek, a large language model trained to verify the biological validity of the features. Contextual explanations regarding the most important genes or pathways were provided by DeepSeek alongside the top SHAP value constituents, supporting the predictability of the model.
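
The additivity property that makes SHAP attributions interpretable can be shown without the XGBoost/SHAP stack: for a linear model, the exact SHAP value of feature i is w_i * (x_i - mean(x_i)), and the attributions sum to the gap between the prediction and the baseline. A dependency-free sketch on synthetic, hypothetical "omics" features (not the paper's data or model, which uses tree-based SHAP on XGBoost):

```python
import numpy as np

# synthetic stand-in for gene-expression features and an LN-IC50-like target
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, 0.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

# least-squares fit with an intercept column
A = np.hstack([X, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:3], coef[3]

def linear_shap(x):
    """Exact SHAP attribution of one sample under a linear model."""
    return w * (x - X.mean(axis=0))

phi = linear_shap(X[0])
pred = X[0] @ w + b
baseline = X.mean(axis=0) @ w + b
# additivity: pred == baseline + sum of per-feature attributions
```

For tree ensembles the per-feature values are computed differently (TreeSHAP), but the same additivity holds, which is what lets the paper rank genes by their contribution to a single patient's predicted drug response.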

[CV-78] DriveFix: Spatio-Temporally Coherent Driving Scene Restoration

【Quick Read】: This paper addresses the lack of spatio-temporal coherence in current 4D scene reconstruction methods for autonomous driving, where independent per-view processing causes spatial misalignment across cameras and temporal drift over sequences. The key to the solution is the DriveFix framework, whose core innovation is an interleaved diffusion transformer architecture with specialized blocks that explicitly model temporal dependencies and cross-camera spatial consistency; by conditioning generation on historical context and applying geometry-aware training losses, restored views are constrained to adhere to a unified 3D geometry, enabling consistent propagation of high-fidelity textures and significantly fewer artifacts.

Link: https://arxiv.org/abs/2603.16306
Authors: Heyu Si,Brandon James Denis,Muyang Sun,Dragos Datcu,Yaoru Li,Xin Jin,Ruiju Fu,Yuliia Tatarinova,Federico Landi,Jie Song,Mingli Song,Qi Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.

[CV-79] Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection

【Quick Read】: This paper addresses the insufficient perception of subtle emotional cues in existing micro-expression action unit (Micro-AU) detection methods, which rely on global facial image features and ignore the inherent locality of facial muscle movements. The key is a novel independence-to-dependency modeling framework, Micro-AU CLIP: first, in the local semantic independence (LSI) stage, Patch Token Attention (PTA) maps local features within AU regions into a shared feature space to reinforce local independence; second, in the global semantic dependency (GSD) stage, Global Dependency Attention (GDA) and a Global Dependency Loss (GDLoss) explicitly model cross-region dependencies among Micro-AUs to enhance feature representation; furthermore, to overcome CLIP's limitations in micro-semantic alignment, a micro-AU contrastive loss (MiAUCL) achieves fine-grained alignment of visual and text features, significantly improving Micro-AU detection accuracy and enabling emotion-label-free micro-expression recognition.

Link: https://arxiv.org/abs/2603.16302
Authors: Jinsheng Wei,Fengzhou Guo,Yante Li,Haoyu Chen,Guanming Lu,Guoying Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Micro-expression (ME) action units (Micro-AUs) provide objective clues for fine-grained genuine emotion analysis. Most existing Micro-AU detection methods learn AU features from the whole facial image/video, which conflicts with the inherent locality of AU, resulting in insufficient perception of AU regions. In fact, each AU independently corresponds to specific localized facial muscle movements (local independence), while there is an inherent dependency between some AUs under specific emotional states (global dependency). Thus, this paper explores the effectiveness of the independence-to-dependency pattern and proposes a novel micro-AU detection framework, micro-AU CLIP, that uniquely decomposes the AU detection process into local semantic independence modeling (LSI) and global semantic dependency (GSD) modeling. In LSI, Patch Token Attention (PTA) is designed, mapping several local features within the AU region to the same feature space; In GSD, Global Dependency Attention (GDA) and Global Dependency Loss (GDLoss) are presented to model the global dependency relationships between different AUs, thereby enhancing each AU feature. Furthermore, considering CLIP’s native limitations in micro-semantic alignment, a microAU contrastive loss (MiAUCL) is designed to learn AU features by a fine-grained alignment of visual and text features. Also, Micro-AU CLIP is effectively applied to ME recognition in an emotion-label-free way. The experimental results demonstrate that Micro-AU CLIP can fully learn fine-grained micro-AU features, achieving state-of-the-art performance.

[CV-80] VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

【Quick Read】: This paper addresses two shortcomings of existing benchmarks for web-browsing agents built on multimodal large language models (MLLMs): insufficient evaluation of visual reasoning, and neglect of web pages' native visual information within reasoning chains. The key to the solution is a new benchmark for visual-native search, VisBrowse-Bench, comprising 169 VQA instances across multiple domains that systematically evaluate a model's use of, and reasoning over, visual information during search through multimodal evidence cross-validation via text-image retrieval and joint reasoning; the authors also propose a new agent workflow that drives browsing agents to actively collect and reason over visual information. Experiments show that even the strongest closed-source model, Claude-4.6-Opus, reaches only 47.6% accuracy, highlighting the difficulty of the task and the value of the benchmark.

Link: https://arxiv.org/abs/2603.16289
Authors: Zhengbo Zhang,Jinbo Su,Zhaowen Zhou,Changtao Miao,Yuhan Hong,Qimeng Wu,Yumeng Liu,Feier Wu,Yihe Tian,Yuhao Liang,Zitong Shan,Wanke Xia,Yi-Fan Zhang,Bo Zhang,Zhe Li,Shiming Xiang,Ying Yan
Affiliations: CASIA (Institute of Automation, Chinese Academy of Sciences); Ant Digital Technologies; Ant Group; RUC (Renmin University of China); FZU (Fuzhou University); THU (Tsinghua University); USTB (University of Science and Technology Beijing); PKU (Peking University)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. But existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models’ visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3-deep-research only achieves an accuracy of 41.1%. The code and data can be accessed at: this https URL

[CV-81] Persistent Story World Simulation with Continuous Character Customization

【Quick Read】: This paper addresses the difficulty current story visualization methods have in jointly achieving accurate character customization, semantic alignment, and continuous integration of new characters. The key to the solution is an All-in-One-World Character Integrator that enables continuous character adaptation within a unified LoRA module, avoiding the per-character optimization of previous methods; a Character Quality Gate based on MLLM-as-Judge uses chain-of-thought reasoning to decide whether the model may proceed to the next character or must continue training the current one, safeguarding adaptation fidelity; and a Character-Aware Region-Focus Sampling strategy mitigates identity degradation and layout conflicts in multi-character scenes, harmonizing local character details with global scene context at higher efficiency.

Link: https://arxiv.org/abs/2603.16285
Authors: Jinlu Zhang,Qiyun Wang,Baoxiang Du,Jiayi Ji,Jing He,Rongsheng Zhang,Tangjie Lv,Xiaoshuai Sun,Rongrong Ji
Affiliations: Xiamen University; The Hong Kong University of Science and Technology (Guangzhou); Fuxi AI Lab, Netease Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within unified LoRA module, eliminating the need for per-character optimization modules of previous methods. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or require additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation by harmonizing local character-specific details with global scene context with higher efficiency. Experimental results show that our EverTale achieves superior performance against a wider range of compared methods on both single- and multi-character story visualization. Codes will be available.

[CV-82] Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation CVPR2026

【Quick Read】: This paper addresses hallucinations in large vision-language models (LVLMs), i.e., generated text inconsistent with the input, which undermines reliability and limits practical deployment. Existing feature-steering methods apply a uniform intensity across all layers, ignoring layer-wise differences in hallucination contribution; unrelated layers may be disturbed, degrading performance on general tasks. The key is a plug-and-play framework, Locate-Then-Sparsify for Feature Steering (LTS-FS): first, a synthetic dataset of token-level and sentence-level hallucination cases is constructed; then an attribution method based on causal interventions quantifies each layer's hallucination relevance; finally, feature-steering intensity is allocated per layer according to attribution scores, enabling precise adjustment of only hallucination-relevant layers, mitigating hallucination while preserving overall model performance.

Link: https://arxiv.org/abs/2603.16284
Authors: TianTian Dang,Chao Bi,Shufan Shen,Jinzhe Liu,Qingming Huang,Shuhui Wang
Affiliations: University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.

[CV-83] VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

【Quick Read】: This paper addresses the inconsistency artifacts, such as object deformation, spatial drift, and depth violations, that arise because video diffusion models receive no explicit geometric supervision during training. The key idea is a geometry-based reward model that uses pretrained geometric foundation models to assess multi-view consistency via cross-frame reprojection error; unlike prior metrics that measure inconsistency in pixel space, the error is computed pointwise, yielding a more physically grounded and robust measure. A geometry-aware sampling strategy further filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences for greater stability and accuracy.

Link: https://arxiv.org/abs/2603.16271
Authors: Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
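
A pointwise cross-frame reprojection error of the kind the reward builds on can be sketched with a pinhole-camera model. The function below is our simplified assumption (known relative pose, intrinsics, and point correspondences), not the paper's implementation:

```python
import numpy as np

def pointwise_reprojection_error(pts_a, R, t, K, matched_px_b):
    """Pointwise cross-frame reprojection error.

    pts_a:        (N, 3) 3D points in frame A's camera coordinates.
    R, t:         relative pose mapping frame-A coordinates into frame B.
    K:            (3, 3) pinhole camera intrinsics.
    matched_px_b: (N, 2) pixel locations of the same points seen in frame B.
    Returns the mean per-point pixel error.
    """
    pts_b = pts_a @ R.T + t           # rigid transform into frame B
    proj = pts_b @ K.T                # apply intrinsics
    px = proj[:, :2] / proj[:, 2:3]   # perspective divide
    err = np.linalg.norm(px - matched_px_b, axis=1)
    return err.mean()

# Sanity check: an identity pose with perfectly matched points gives zero error.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts = np.array([[0., 0., 2.], [0.4, -0.2, 3.]])
px = pts @ K.T
px = px[:, :2] / px[:, 2:3]
print(pointwise_reprojection_error(pts, np.eye(3), np.zeros(3), K, px))
```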

[CV-84] FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition

【Quick Read】: This paper tackles micro-gesture recognition (MGR), which is difficult because inter-class differences are subtle; existing methods rely on category-level supervision and struggle to capture fine local motion differences. The proposed Fine-Grained Semantic Guidance Learning (FG-SGL) framework jointly exploits fine-grained and category-level semantics to guide a vision-language model's perception of local micro-gesture motion: the FG-SA module uses fine-grained semantic cues to guide local motion feature learning, while the CP-A module improves the separability of micro-gesture features with category-level semantics. The authors also build a human-annotated fine-grained text dataset covering four refined semantic dimensions and design a Multi-Level Contrastive Optimization strategy that jointly optimizes both modules in a coarse-to-fine manner, substantially improving MGR performance.

Link: https://arxiv.org/abs/2603.16269
Authors: Jinsheng Wei, Zhaodi Xu, Guanming Lu, Haoyu Chen, Jingjie Yan
Institutions: Nanjing University of Posts and Telecommunications; University of Oulu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. Thus, this paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision–language models in perceiving local MG motions. FG-SA adopts fine-grained semantic cues to guide the learning of local motion features, while CP-A enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs in four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine pattern. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.

[CV-85] AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection

【Quick Read】: This paper addresses the degradation of 3D object detection under adverse weather: existing methods simply mix samples from different weather conditions during training, ignoring distribution discrepancies across meteorological scenarios and causing performance conflicts across conditions. The proposed AW-MoE framework innovatively brings Mixture of Experts (MoE) into multi-modal 3D object detection. Its key component is Image-guided Weather-aware Routing (IWR), which exploits the strong weather discriminability and scene invariance of image features for precise weather classification, then dynamically selects the most relevant Weather-Specific Experts (WSE) to handle the distribution discrepancies, improving detection robustness across all adverse weather conditions.

Link: https://arxiv.org/abs/2603.16261
Authors: Hongwei Lin, Xun Huang, Chenglu Wen, Cheng Wang
Institutions: Xiamen University; Zhongguancun Academy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW-MoE, the framework that innovatively integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection approaches. AW-MoE incorporates Image-guided Weather-aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top-K most relevant Weather-Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual-modal data augmentation while preserving the realism of scenes. Extensive experiments on the real-world dataset demonstrate that AW-MoE achieves ~ 15% improvement in adverse-weather performance over state-of-the-art methods, while incurring negligible inference overhead. Moreover, integrating AW-MoE into established baseline detectors yields performance improvements surpassing current state-of-the-art methods. These results show the effectiveness and strong scalability of our AW-MoE. We will release the code publicly at this https URL.
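
The IWR routing step, classifying the weather from image features and then gating the top-K Weather-Specific Experts, can be sketched as follows. The softmax gating and weight renormalization are standard MoE conventions assumed here, not details from the paper:

```python
import math

def route_weather_experts(logits, top_k=2):
    """Softmax over weather-classification logits (derived from image
    features), then select the top-k Weather-Specific Experts and
    renormalize their gate weights to sum to one."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    idx = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    total = sum(probs[i] for i in idx)
    return {i: probs[i] / total for i in idx}

# Logits for [clear, rain, fog, snow]: a rainy-foggy scene activates
# the rain and fog experts, with the rain expert weighted more heavily.
gates = route_weather_experts([0.1, 2.0, 1.5, -0.5], top_k=2)
print(gates)
```

Each selected expert's output would then be combined with these gate weights; unselected experts are skipped entirely, which is why the routing adds negligible inference overhead.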

[CV-86] Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection

【Quick Read】: This paper targets the high cost of dense pixel-level annotation in infrared small target detection (IRSTD) and the limitations of existing methods on tiny targets with weak texture and ambiguous boundaries. The key of the proposed Point-to-Mask framework is a closed loop between two modules: a Physics-driven Adaptive Mask Generation (PAMG) module that converts cheap point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as center localization plus effective-radius regression using spatiotemporal motion cues. During training, PAMG produces pseudo masks and geometric supervision; at inference, RPR-Net's geometric predictions are fed back to PAMG to recover pixel-level masks, approaching fully supervised performance under point supervision at substantially lower annotation cost.

Link: https://arxiv.org/abs/2603.16257
Authors: Weihua Gao, Wenlong Niu, Jie Tang, Man Yang, Jiafeng Zhang, Xiaodong Peng
Institutions: Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: this https URL.

[CV-87] When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

【Quick Read】: This paper addresses "visual anchor drifting" in video question answering, where multimodal large language models (MLLMs) using chain-of-thought (CoT) reasoning over-rely on self-generated text and neglect the visual input, causing hallucinations and performance drops. The key of the proposed lightweight, automated FrameRepeat framework is a frame scoring module together with a novel training strategy, Add-One-In (AOI), which uses MLLM output probabilities to build supervision signals representing the repeat gain. This trains a frame scoring network that autonomously identifies which frames to reinforce, improving the retention and use of key visual cues during reasoning while generalizing well across models.

Link: https://arxiv.org/abs/2603.16256
Authors: Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to ``visual anchor drifting’', where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.
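
A minimal sketch of the frame-repetition mechanism, assuming a simple policy of duplicating the top-scoring frames in place (the actual repetition policy and the learned scoring network are the paper's contributions and are not reproduced here):

```python
def repeat_important_frames(frames, scores, top_k=2, repeats=2):
    """Duplicate the top_k highest-scoring frames in place within the
    sequence, so the model re-attends to key visual evidence during
    reasoning. The in-place duplication policy is our assumption."""
    keep = sorted(sorted(range(len(frames)), key=lambda i: scores[i],
                         reverse=True)[:top_k])
    out = []
    for i, f in enumerate(frames):
        out.extend([f] * (repeats if i in keep else 1))
    return out

# Frames 1 and 3 score highest, so each appears twice in the new sequence.
seq = repeat_important_frames(["f0", "f1", "f2", "f3"], [0.1, 0.9, 0.2, 0.8])
print(seq)  # ['f0', 'f1', 'f1', 'f2', 'f3', 'f3']
```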

[CV-88] Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

【Quick Read】: This paper addresses the entanglement of perception and reasoning when vision-language process reward models (VL-PRMs) score reasoning steps: a low score may reflect a genuine reasoning mistake or merely the verifier's misperception of the image, leading to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correctly grounded statements) that hurt reranking and error localization. The key of the proposed Explicit Visual Premise Verification (EVPV) is a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on: the policy is prompted to produce a step-wise visual checklist making required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image; matching checklist claims against these constraints yields a scalar visual reliability signal, and reliability gating calibrates PRM step rewards by attenuating rewards of visually dependent steps when reliability is low and preserving them when it is high. Without per-step tool calls, this decouples perceptual uncertainty from logical judgment. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show improved step-level verification accuracy and Best-of-N reranking, and controlled constraint-corruption experiments provide causal evidence that the gains come from constraint fidelity and explicit premise verification rather than incidental prompt effects.

Link: https://arxiv.org/abs/2603.16253
Authors: Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Institutions: Alibaba; Chinese Academy of Sciences; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 4 figures, 10 tables. Evaluated on VisualProcessBench and six multimodal reasoning benchmarks (LogicVista, MMMU, MathVerse-VO, MathVision, MathVista, WeMath). Includes ablations and causal analysis via controlled constraint corruption. Code: this https URL

View abstract

Abstract:Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier’s misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: this https URL
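
The reliability-gating step can be sketched directly from the abstract's description: attenuate rewards of visually dependent steps when the scalar reliability signal is low, and preserve them when it is high. The linear gate and its floor value below are our assumptions:

```python
def gate_step_rewards(rewards, visual_dependent, reliability, floor=0.2):
    """Calibrate PRM step rewards with a visual-reliability gate.

    reliability values lie in [0, 1]; a visually dependent step's reward
    is scaled by floor + (1 - floor) * reliability, so low reliability
    attenuates it while high reliability leaves it nearly intact.
    Steps that do not depend on vision pass through unchanged.
    The linear form and the floor constant are illustrative assumptions.
    """
    gated = []
    for r, dep, rel in zip(rewards, visual_dependent, reliability):
        scale = floor + (1.0 - floor) * rel if dep else 1.0
        gated.append(r * scale)
    return gated

# Step 0 rests on an unreliable visual premise and is heavily attenuated;
# step 1 is purely logical and untouched; step 2 is well grounded.
steps = gate_step_rewards(
    rewards=[0.9, 0.8, 0.7],
    visual_dependent=[True, False, True],
    reliability=[0.1, 1.0, 0.9],
)
print(steps)
```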

[CV-89] Visual Prompt Discovery via Semantic Exploration

【Quick Read】: This paper addresses severe perception failures of large vision-language models (LVLMs) in image understanding and visual reasoning, noting that prior visual-prompt generation methods focus on tool selection rather than diagnosing and mitigating the root causes of those failures. The key is SEVEX, an automated semantic exploration framework that tackles two obstacles, the distraction of lengthy low-level code and the vast unstructured space of visual prompts, by searching in an abstract idea space, using a novelty-guided selection algorithm, and driving ideation with semantic feedback. This enables efficient, diverse discovery of task-wise visual prompts and markedly improves LVLM task accuracy and exploration stability on perception benchmarks.

Link: https://arxiv.org/abs/2603.16250
Authors: Jaechang Kim, Yotaro Shimose, Zhao Wang, Kuang-Da Wang, Jungseul Ok, Shingo Takamatsu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. While emerged as a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts. 

[CV-90] Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

【Quick Read】: This paper addresses white blood cell (WBC) classification under extreme class imbalance, long-tail distributions, and domain shift, which cause deep models to overfit common classes and miss rare subtypes. The key is a hybrid framework: a Pix2Pix-based generative restoration module removes imaging artifacts to improve data quality; a Swin Transformer ensemble combined with MedSigLIP contrastive embeddings provides robust representation learning; and a biologically inspired refinement step uses geometric spikiness and Mahalanobis distance to constrain morphology and recover out-of-distribution predictions. The method achieves a Macro-F1 of 0.77139 on the WBCBench 2026 private leaderboard, demonstrating the value of biological priors for rare-class generalization in hematological image analysis.

Link: https://arxiv.org/abs/2603.16249
Authors: Trong-Duc Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE ISBI 2026

View abstract

Abstract:Automated white blood cell (WBC) classification is essential for leukemia screening but remains challenged by extreme class imbalance, long-tail distributions, and domain shift, leading deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization that integrates a generative Pix2Pix-based restoration module for artifact removal, a Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and a biologically-inspired refinement step using geometric spikiness and Mahalanobis-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, demonstrating strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.
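
A sketch of the Mahalanobis-based refinement idea: reassign a prediction when its feature lies too far from the predicted class centroid. The diagonal-covariance simplification, the threshold, and the toy class statistics are all our assumptions, not the paper's values:

```python
import math

def mahalanobis(x, mean, inv_var):
    """Mahalanobis distance under a diagonal covariance,
    with inv_var holding 1/sigma^2 per feature dimension."""
    return math.sqrt(sum(iv * (xi - mi) ** 2
                         for xi, mi, iv in zip(x, mean, inv_var)))

def refine_prediction(feat, pred, class_stats, threshold=3.0):
    """If the feature is implausibly far from its predicted class
    centroid, treat it as out-of-distribution for that class and
    reassign it to the nearest class instead."""
    if mahalanobis(feat, *class_stats[pred]) <= threshold:
        return pred
    return min(class_stats,
               key=lambda c: mahalanobis(feat, *class_stats[c]))

# Two hypothetical WBC classes described by (centroid, inverse variance).
stats = {"lymphocyte": ([1.0, 1.0], [4.0, 4.0]),
         "blast":      ([5.0, 5.0], [1.0, 1.0])}
# A feature near the blast centroid gets recovered from a wrong prediction.
print(refine_prediction([4.8, 5.1], pred="lymphocyte", class_stats=stats))
```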

[CV-91] RASLF: Representation-Aware State Space Model for Light Field Super-Resolution

【Quick Read】: This paper addresses the failure of current state-space-model (SSM) based light field super-resolution (LFSR) methods to exploit the complementarity among multiple light-field representations, which leads to lost fine textures and multi-view geometric misalignment. The proposed representation-aware state-space framework (RASLF) has three key components: a Progressive Geometric Refinement (PGR) block that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences and fuse different LF representations; a Representation-Aware Asymmetric Scanning (RAAS) mechanism that adapts scanning paths to the physical properties of each representation space, balancing performance and efficiency through path pruning; and a Dual-Anchor Aggregation (DAA) module that improves hierarchical feature flow, reducing redundant deep-layer features while prioritizing information critical for reconstruction.

Link: https://arxiv.org/abs/2603.16243
Authors: Zeqiang Wei, Kai Jin, Kuan Song, Xiuzhuang Zhou, Wenlong Chen, Min Xu
Institutions: Capital Normal University Information Engineering College; Bigo Technology Pte. Ltd.; Explorer Global (Suzhou) Artificial Intelligence Technology Co., Ltd.; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

View abstract

Abstract:Current SSM-based light field super-resolution (LFSR) methods often fail to fully leverage the complementarity among various LF representations, leading to the loss of fine textures and geometric misalignments across views. To address these issues, we propose RASLF, a representation-aware state-space framework that explicitly models structural correlations across multiple LF representations. Specifically, a Progressive Geometric Refinement (PGR) block is created that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences, thereby enabling integration across different LF representations. Furthermore, we introduce a Representation Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths based on the physical properties of different representation spaces, optimizing the balance between performance and efficiency through path pruning. Additionally, a Dual-Anchor Aggregation (DAA) module improves hierarchical feature flow, reducing redundant deeplayer features and prioritizing important reconstruction information. Experiments on various public benchmarks show that RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.

[CV-92] Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting

【Quick Read】: This paper addresses the difficulty of learning fine-grained structural semantics from sparse point annotations in semi-supervised crowd analysis, where dense pixel labels are costly and individuals in dense scenes are hard to segment. The key is an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), built on a Nearest Neighbor Exclusion Circle (NNEC) constraint, that generates mask supervision from point labels, together with Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation via a discriminative mask objective and improves feature continuity and training stability through Gaussian smoothing and a differentiable center sampling strategy. Building on XMask, a semi-supervised counting framework uses instance mask priors, which carry richer shape information than point cues, as pseudo-labels, achieving strong results on ShanghaiTech A, UCF-QNRF, and JHU++ under sparse annotation.

Link: https://arxiv.org/abs/2603.16241
Authors: Jiyang Huang, Hongru Cheng, Wei Lin, Jia Wan, Antoni B. Chan
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse annotations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.

[CV-93] PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

【Quick Read】: This paper addresses the limitation of monocular depth estimation (MDE) methods that rely on geometric features, proposing PureCLIP-Depth, a fully prompt-free and decoder-free approach. Its key idea is to abandon geometry-driven modeling and instead exploit the conceptual information in the CLIP embedding space, learning a direct end-to-end mapping from the RGB domain to the depth domain entirely within CLIP's semantic space, which yields efficient and accurate depth estimation.

Link: https://arxiv.org/abs/2603.16238
Authors: Ryutaro Miya, Kazuyoshi Fushinobu, Tatsuya Kawaguchi
Institutions: Institute of Science Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures

View abstract

Abstract:We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: this https URL

[CV-94] Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

【Quick Read】: This paper addresses the limited physical plausibility of motion reconstructed by existing IMU-only motion capture methods, which struggle to model ground interaction. The proposed Ground Reaction Inertial Poser (GRIP) fuses IMU signals with insole pressure data and drives reconstruction through a digital-twin humanoid in a physics simulator. The key is its two-module design: KinematicsNet estimates body poses and velocities from sensor data, while DynamicsNet controls the simulated humanoid using the residual between the KinematicsNet prediction and the simulated state, producing faithful and physically plausible motion.

Link: https://arxiv.org/abs/2603.16233
Authors: Ryosuke Hori, Jyun-Ting Song, Zhengyi Luo, Jinkun Cao, Soyong Shin, Hideo Saito, Kris Kitani
Institutions: Carnegie Mellon University; Keio University; Keio AI Research Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

View abstract

Abstract:We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.

[CV-95] Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

【Quick Read】: This paper addresses geometric inconsistency in extrapolated views produced by current feed-forward 3D reconstruction methods, where diffusion models used to fix rendering artifacts lack geometric constraints and fail to fill missing regions. The key of the proposed Leveling3D framework is a geometry-aware "leveling adapter", a lightweight module that aligns the diffusion model's internal knowledge with the geometric prior from the feed-forward reconstruction model, enabling geometrically consistent generative repair of artifacts on extrapolated views caused by under-constrained regions of the 3D representation. A palette filtering strategy for training and a test-time masking refinement further improve generation diversity and boundary cleanliness, and the enhanced views feed back into feed-forward 3DGS reconstruction, jointly improving novel-view synthesis and depth estimation.

Link: https://arxiv.org/abs/2603.16211
Authors: Yiming Huang, Baixiang Huang, Beilei Cui, Chi Kit Ng, Long Bai, Hongliang Ren
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 10 figures

View abstract

Abstract:Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they lack geometric concern and fail at filling the missing area on the extrapolated view. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrical-consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation on the artifact area of the extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse distributed generation, we introduce the palette filtering strategy for training, and a test-time masking refinement to prevent messy boundaries along the fixing regions. More importantly, the enhanced extrapolated novel views from Leveling3D could be used as the inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.

[CV-96] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

【Quick Read】: This paper addresses the tension in current video action models (VAMs) between real-time inference and high-fidelity visual foresight: existing approaches rely on either slow multi-step video generation or noisy one-step feature extraction. The key of the proposed Shortcut Video-Action Model (S-VAM) is to foresee coherent geometric and semantic representations in a single forward pass, which greatly simplifies action prediction. Its core is a novel self-distillation strategy that condenses the structured generative priors of multi-step diffusion denoising into one-step inference: vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos serve as teacher targets, and lightweight decouplers, acting as students, learn to map noisy one-step features directly to those targets, enabling efficient and precise action prediction.

Link: https://arxiv.org/abs/2603.16195
Authors: Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

View abstract

Abstract:Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model’s own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is this https URL

[CV-97] Reliable Reasoning in SVG-LLM s via Multi-Task Multi-Reward Reinforcement Learning

【Quick Read】: This paper addresses three core problems in SVG generation with current vision-language models: limited generalization, redundant paths in generated code, and the absence of explicit reasoning. The key of the proposed unified CTRL-S (Chain-of-Thought Reinforcement Learning for SVG) framework is a chain-of-thought mechanism that exposes the model's reasoning during SVG generation, supported by SVG-Sophia, a high-quality multi-task dataset of 145K samples spanning SVG code refinement, Text-to-SVG, and Image-to-SVG. Using the GRPO algorithm with a multi-reward optimization framework that combines DINO feature consistency, image-text similarity, format compliance, and code efficiency, joint multi-task training with multi-reward optimization substantially improves structural coherence, visual fidelity, and overall SVG generation quality.

Link: https://arxiv.org/abs/2603.16189
Authors: Haomin Wang, Qi Wei, Qianli Ma, Shengyuan Ding, Jinhui Yin, Kai Chen, Hongjie Zhang
Institutions: Shanghai Jiao Tong University; Shanghai AI Laboratory; Fudan University; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model’s reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
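
The multi-reward GRPO setup can be sketched as a weighted reward combination followed by group-relative advantage computation. The weights and the toy reward values below are illustrative assumptions; the paper's actual reward functions (DINO, image-text similarity, format, code efficiency) are learned or computed from the generated SVGs:

```python
def combine_rewards(components, weights):
    """Weighted sum of per-sample reward signals; the four keys mirror
    the reward types named in the abstract, with assumed weights."""
    return sum(weights[k] * components[k] for k in weights)

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each sample's total reward
    against the mean and std of its sampling group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mu) / (std if std > 0 else 1.0) for r in rewards]

# Two candidate SVGs sampled for the same prompt: the one with better
# visual fidelity gets a positive advantage, the other a negative one.
w = {"dino": 0.4, "clip_sim": 0.3, "format": 0.2, "efficiency": 0.1}
group = [
    combine_rewards({"dino": 0.9, "clip_sim": 0.8, "format": 1.0, "efficiency": 0.6}, w),
    combine_rewards({"dino": 0.5, "clip_sim": 0.4, "format": 1.0, "efficiency": 0.9}, w),
]
print(group_relative_advantages(group))
```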

[CV-98] ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

【Quick Read】: This paper tackles whole-body motion control of humanoid robots under natural-language instructions, i.e., how to efficiently and safely translate abstract language descriptions into low-level motion trajectories the robot can execute. The key is ECHO, an edge-cloud collaborative framework: the cloud hosts a diffusion-based text-to-motion generator that encodes text with CLIP and combines a 1D convolutional UNet with cross-attention to synthesize high-quality motion references in roughly one second per query; the edge runs a reinforcement-learning tracker with a teacher-student architecture, using an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. The two sides are bridged by a compact 38-dimensional robot-native motion representation (joint angles, root planar velocity, root height, and a continuous 6D orientation), which eliminates retargeting from human body models, remains directly compatible with low-level PD controllers, and thereby ensures motion safety and trajectory consistency.

Link: https://arxiv.org/abs/2603.16188
Authors: Haozhe Jia, Jianfei Song, Yuan Zhang, Honglei Jin, Youcheng Fan, Wenshuo Chen, Wei Zhang, Yutao Yue
Institutions: The Hong Kong University of Science and Technology (Guangzhou); LimX Dynamics Technology Co., Ltd.; Shandong University; Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present ECHO, an edge–cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher–Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
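The per-frame layout of the 38-dimensional motion representation can be sketched as below. The abstract lists joint angles, root planar velocity (2), root height (1), and a continuous 6D root orientation (6); the count of 29 joint angles is inferred arithmetic (38 - 2 - 1 - 6 = 29), not a figure stated in the abstract.

```python
# Sketch of the robot-native 38-D per-frame motion vector from the abstract.
# NUM_JOINTS = 29 is inferred from 38 - (2 + 1 + 6); treat it as an assumption.
NUM_JOINTS = 29

def pack_frame(joints, vel_xy, height, rot6d):
    """Concatenate the four components into one 38-D frame vector."""
    assert len(joints) == NUM_JOINTS and len(vel_xy) == 2 and len(rot6d) == 6
    return list(joints) + list(vel_xy) + [height] + list(rot6d)

def unpack_frame(vec):
    """Split a 38-D frame vector back into its four components."""
    assert len(vec) == 38
    joints = vec[:NUM_JOINTS]
    vel_xy = vec[NUM_JOINTS:NUM_JOINTS + 2]
    height = vec[NUM_JOINTS + 2]
    rot6d = vec[NUM_JOINTS + 3:]
    return joints, vel_xy, height, rot6d

frame = pack_frame([0.0] * NUM_JOINTS, [0.1, 0.0], 0.75, [1, 0, 0, 0, 1, 0])
print(len(frame))  # 38
```

Because every field is already expressed in the robot's own state space, a tracker consuming this vector can feed joint targets straight to PD control without a human-to-robot retargeting step.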

[CV-99] KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification Object Detection OCR and Contextual Reasoning for Child Safety

【Quick Read】: This paper targets efficient and accurate multimodal content moderation for child safety, i.e., identifying potentially harmful content that mixes visual and textual information. The key is KidsNanny, a two-stage multimodal moderation architecture: Stage 1 combines a Vision Transformer (ViT) with an object detector for fast visual screening (11.7 ms) and passes its outputs to Stage 2 as text rather than raw pixels; Stage 2 performs contextual reasoning via optical character recognition (OCR) and a 7B-parameter language model (120 ms total latency). This design keeps high accuracy (81.40%) and F1 (86.16%) while substantially reducing latency, outperforming baselines such as ShieldGemma-2 and LlavaGuard, and shows higher recall (100%) and precision (75.76%) on text-only embedded threats, validating the benefit of OCR-driven text awareness for sensitive-content detection.

Link: https://arxiv.org/abs/2603.16181
Authors: Viraj Panchal, Tanmay Talsaniya, Parag Patel, Meet Patel
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 12 pages, 2 figures, 6 tables

Click to view abstract

Abstract:We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); outputs are routed as text not raw pixels to Stage 2, which applies OCR and a text based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. Results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.
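The routing logic of such a two-stage pipeline can be sketched as control flow: a fast visual screen either short-circuits on a confident detection or forwards its findings as text to a slower reasoning stage. The threshold, stub classifiers, and keyword check below are illustrative assumptions, not the paper's actual models.

```python
# Minimal two-stage moderation routing sketch. Stage outputs, the 0.5
# threshold, and the keyword-based "language model" stub are hypothetical.
def stage1_visual_screen(image: dict) -> dict:
    # Stand-in for the ViT classifier + object detector (~11.7 ms in the paper).
    return {"unsafe_score": image.get("unsafe_score", 0.0),
            "detected_objects": image.get("objects", []),
            "ocr_text": image.get("text", "")}

def stage2_contextual_reasoning(findings: dict) -> str:
    # Stand-in for the 7B language model reasoning over text, not pixels.
    blob = " ".join([findings["ocr_text"], *findings["detected_objects"]])
    return "unsafe" if "explicit" in blob.lower() else "safe"

def moderate(image: dict, visual_threshold: float = 0.5) -> str:
    findings = stage1_visual_screen(image)
    if findings["unsafe_score"] >= visual_threshold:
        return "unsafe"  # confident visual detection: stop early
    return stage2_contextual_reasoning(findings)  # text-aware second pass

print(moderate({"unsafe_score": 0.1, "text": "EXPLICIT content ahead"}))
```

Routing text instead of pixels to Stage 2 is what keeps the slow stage's input small and the total budget near the reported 120 ms.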

[CV-100] 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method

【Quick Read】: This paper addresses the limited ability of multimodal large language models (MLLMs) to perceive and understand 360° images, in particular their weaknesses with geometric distortion and complex spatial relations. Existing work focuses on conventional 2D images, whereas 360° images, with their panoramic view and spherical geometry, demand stronger spatial reasoning. To systematically assess this capability, the authors build 360Bench, a visual question answering (VQA) benchmark of 7K-resolution 360° images, and empirically expose clear shortcomings of mainstream MLLMs on such tasks. The key to the solution is Free360, a training-free scene-graph-based framework: it decomposes complex reasoning into modular steps, dynamically applies adaptive spherical image transformations at each step to mitigate geometric distortion, and finally fuses the information from all stages into a unified graph representation for answer generation, enabling effective perception and accurate reasoning over high-resolution 360° images.

Link: https://arxiv.org/abs/2603.16179
Authors: Huyen T. T. Tran, Van-Quang Nguyen, Farros Alferro, Kang-Jun Liu, Takayuki Okatani
Institutions: Tohoku University; RIKEN AIP
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs’ capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images, seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.

[CV-101] SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments

【Quick Read】: This paper addresses embodied navigation guided by the semantic hints of signage in large-scale indoor (LSI) environments: without a pre-built map, the agent must interpret dynamically changing signage and reason over its current observation to navigate efficiently and accurately to the destination. The key is the Spatial-Temporal Aware Transformer (START), in which a spatial-aware module grounds the semantic information of signage into physical space and a temporal-aware module models long-range dependencies between historical states and the current observation. Combined with a two-stage training strategy using Dataset Aggregation (DAgger), the model generalizes markedly better to unseen scenes, achieving an 80% success rate (SR) and 0.74 nDTW (normalized Dynamic Time Warping) on the val-unseen split.

Link: https://arxiv.org/abs/2603.16166
Authors: Jian Sun, Yuming Huang, He Li, Shuqi Xiao, Shenyan Guo, Maani Ghaffari, Qingbiao Li, Chengzhong Xu, Hui Kong
Institutions: Faculty of Science and Technology, University of Macau; Department of Robotics, University of Michigan
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Humans routinely leverage semantic hints provided by signage to navigate to destinations within novel Large-Scale Indoor (LSI) environments, such as hospitals and airport terminals. However, this capability remains underexplored within the field of embodied navigation. This paper introduces a novel embodied navigation task, SignNav, which requires the agent to interpret semantic hint from signage and reason about the subsequent action based on current observation. To facilitate research in this domain, we construct the LSI-Dataset for the training and evaluation of various SignNav agents. Dynamically changing semantic hints and sparse placement of signage in LSI environments present significant challenges to the SignNav task. To address these challenges, we propose the Spatial-Temporal Aware Transformer (START) model for end-to-end decision-making. The spatial-aware module grounds the semantic hint of signage into physical world, while the temporal-aware module captures long-range dependencies between historical states and current observation. Leveraging a two-stage training strategy with Dataset Aggregation (DAgger), our approach achieves state-of-the-art performance, recording an 80% Success Rate (SR) and 0.74 NDTW on val-unseen split. Real-world deployment further demonstrates the practicality of our method in physical environment without pre-built map.

[CV-102] Homogeneous and Heterogeneous Consistency progressive Re-ranking for Visible-Infrared Person Re-identification

【Quick Read】: This paper addresses the matching difficulty in visible-infrared cross-modal person re-identification caused by the large gap between modalities, in particular the inability of existing re-ranking algorithms to handle intra-modal variation and inter-modal discrepancy at the same time. The key is a novel Progressive Modal Relationship Re-ranking (PMRR) method with two core modules: heterogeneous consistency re-ranking, which models the relationship between the query and gallery across modalities, and homogeneous consistency re-ranking, which mines the intrinsic structural consistency within each modality. On this basis, the authors build a consistency re-ranking inference network (CRI) as a baseline; experiments show that the re-ranking method generalizes well and that both the re-ranking and the baseline reach state-of-the-art performance.

Link: https://arxiv.org/abs/2603.16165
Authors: Yiming Wang
Institutions: Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Visible-infrared person re-identification faces greater challenges than traditional person re-identification due to the significant differences between modalities. In particular, the differences between these modalities make effective matching even more challenging, mainly because existing re-ranking algorithms cannot simultaneously address the intra-modal variations and inter-modal discrepancy in cross-modal person re-identification. To address this problem, we propose a novel Progressive Modal Relationship Re-ranking method consisting of two modules, called heterogeneous and homogeneous consistency re-ranking (HHCR). The first module, heterogeneous consistency re-ranking, explores the relationship between the query and the gallery modalities in the test set. The second module, homogeneous consistency re-ranking, investigates the intrinsic relationship within each modality between the query and the gallery in the test set. Based on this, we propose a baseline for cross-modal person re-identification, called a consistency re-ranking inference network (CRI). We conducted comprehensive experiments demonstrating that our proposed re-ranking method is generalized, and both the re-ranking and the baseline achieve state-of-the-art performance.

[CV-103] Segmentation-before-Staining Improves Structural Fidelity in Virtual IHC-to-Multiplex IF Translation MICCAI2026

【Quick Read】: This paper addresses the barriers to routine clinical use of multiplex immunofluorescence (mIF), namely high reagent cost, multi-round staining protocols, and the need for specialized imaging platforms, by synthesizing mIF channels from widely available brightfield immunohistochemistry (IHC) images via virtual staining. The key innovation is a supervision-free, architecture-agnostic conditioning strategy: a continuous cell probability map from a pretrained nuclei-segmentation foundation model is injected as an explicit input prior, combined with a variance-preserving regularization term that matches local intensity statistics, thereby retaining cell-level heterogeneity and improving nuclei-count accuracy. Without task-specific tuning, the method consistently improves the nuclei-count fidelity and perceptual quality of the synthesized fluorescence channels.

Link: https://arxiv.org/abs/2603.16160
Authors: Junhyeok Lee, Han Jang, Heeseong Eum, Joon Jang, Kyu Sung Choi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures, 2 tables. Submitted to MICCAI 2026

Click to view abstract

Abstract:Multiplex immunofluorescence (mIF) enables simultaneous single-cell quantification of multiple biomarkers within intact tissue architecture, yet its high reagent cost, multi-round staining protocols, and need for specialized imaging platforms limit routine clinical adoption. Virtual staining can synthesize mIF channels from widely available brightfield immunohistochemistry (IHC), but current translators optimize pixel-level fidelity without explicitly constraining nuclear morphology. In pathology, this gap is clinically consequential: subtle distortions in nuclei count, shape, or spatial arrangement propagate directly to quantification endpoints such as the Ki67 proliferation index, where errors of a few percent can shift treatment-relevant risk categories. This work introduces a supervision-free, architecture-agnostic conditioning strategy that injects a continuous cell probability map from a pretrained nuclei segmentation foundation model as an explicit input prior, together with a variance-preserving regularization term that matches local intensity statistics to maintain cell-level heterogeneity in synthesized fluorescence channels. The soft prior retains gradient-level boundary information lost by binary thresholding, providing a richer conditioning signal without task-specific tuning. Controlled experiments across Pix2Pix with U-Net and ResNet generators, deterministic regression U-Net, and conditional diffusion on two independent datasets demonstrate consistent improvements in nuclei count fidelity and perceptual quality, as the sole modifications. Code will be made publicly available upon acceptance.

[CV-104] AI-Generated Figures in Academic Publishing: Policies Tools and Practical Guidelines

【Quick Read】: This paper examines the inconsistent and ambiguous policies of academic publishers on AI-generated images as generative AI becomes widely used for scientific figures, a gap that raises risks around reproducibility, authorship attribution, and visual misinformation. The key to the proposed solution is a set of best-practice guidelines emphasizing appropriate disclosure and quality control, so that researchers can use AI figure-generation tools (such as SciDraw) in a compliant and transparent manner, accelerating scientific communication without compromising research integrity.

Link: https://arxiv.org/abs/2603.16159
Authors: Davie Chen
Institutions: University of Arts in Poznań
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:The rapid advancement of generative AI has introduced a new class of tools capable of producing publication-quality scientific figures, graphical abstracts, and data visualizations. However, academic publishers have responded with inconsistent and often ambiguous policies regarding AI-generated imagery. This paper surveys the current stance of major journals and publishers – including Nature, Science, Cell Press, Elsevier, and PLOS – on the use of AI-generated figures. We identify key concerns raised by publishers, including reproducibility, authorship attribution, and potential for visual misinformation. Drawing on practical examples from tools such as SciDraw, an AI-powered platform designed specifically for scientific illustration, we propose a set of best-practice guidelines for researchers seeking to use AI figure-generation tools in a compliant and transparent manner. Our findings suggest that, with appropriate disclosure and quality control, AI-generated figures can meaningfully accelerate scientific communication without compromising integrity.

[CV-105] GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation

【Quick Read】: This paper addresses two core challenges in 4D point cloud video understanding: temporal scale bias across different frame rates, and distributional uncertainty caused by irregular point cloud distributions. These issues limit existing CNN- and Transformer-based methods for dynamic scene perception, the former through restricted receptive fields and the latter through quadratic computational complexity. The key is a dual-invariant framework, Gaussian Aware Temporal Scaling (GATS), with two complementary modules: Uncertainty Guided Gaussian Convolution (UGGC) incorporates local Gaussian statistics and an uncertainty-aware gating mechanism for robust neighborhood aggregation under density variation, noise, and occlusion; Temporal Scaling Attention (TSA) normalizes temporal distances with a learnable scaling factor, ensuring frame-partition invariance and consistent velocity estimation across frame rates. The two modules cooperate: TSA first normalizes the temporal information, and UGGC then strengthens robustness to irregular point distributions, yielding an efficient and principled paradigm for 4D point cloud video understanding.

Link: https://arxiv.org/abs/2603.16154
Authors: Jiayi Tian, Jiaze Wang
Institutions: Xi'an Jiaotong University; Harbin Institute of Technology, Shenzhen, Pengcheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN or Transformer based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual invariant framework, termed Gaussian Aware Temporal Scaling (GATS), which explicitly resolves both distributional inconsistencies and temporal scale bias. The proposed Uncertainty Guided Gaussian Convolution (UGGC) incorporates local Gaussian statistics and uncertainty aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the Temporal Scaling Attention (TSA) introduces a learnable scaling factor to normalize temporal distances, ensuring frame partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on mainstream benchmarks MSR-Action3D (+6.62% accuracy), NTU RGBD (+1.4% accuracy), and Synthia4D (+1.8% mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer based counterparts.
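The frame-rate invariance that the temporal scaling idea targets can be illustrated numerically: raw inter-frame time gaps differ across frame rates, but multiplying by a scaling factor (learnable in the paper; set to the frame rate here purely for illustration) yields identical normalized temporal distances.

```python
# Toy illustration of temporal-distance normalization. Choosing the scale
# equal to the frame rate is an illustrative assumption, not the paper's
# learned factor.
def temporal_distances(num_frames: int, fps: float, scale: float) -> list:
    """Scaled time offsets of each frame relative to the first frame."""
    dt = 1.0 / fps
    return [scale * i * dt for i in range(num_frames)]

def close(a: list, b: list) -> bool:
    return len(a) == len(b) and all(abs(x - y) < 1e-9 for x, y in zip(a, b))

raw_30 = temporal_distances(4, fps=30.0, scale=1.0)
raw_15 = temporal_distances(4, fps=15.0, scale=1.0)
norm_30 = temporal_distances(4, fps=30.0, scale=30.0)
norm_15 = temporal_distances(4, fps=15.0, scale=15.0)

print(close(raw_30, raw_15))    # False: raw gaps depend on frame rate
print(close(norm_30, norm_15))  # True: scaled gaps are frame-rate invariant
```

Any attention weight computed from the normalized distances is then the same whether the clip was sampled at 15 or 30 fps, which is the invariance property the abstract claims for TSA.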

[CV-106] EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation

【Quick Read】: This paper addresses two problems of current diffusion-based dexterous grasp generation: the generation process relies on stochastic differential equations (SDEs), requiring many sampling steps and hurting efficiency, and the trajectories are unstable, easily producing physically infeasible grasp poses. The key is EFF-Grasp, a flow-matching framework that reformulates grasp synthesis as a deterministic ordinary differential equation (ODE) process, enabling efficient and stable generation through smooth probability flows, together with a training-free physics-aware energy-guidance strategy: a target distribution is defined via explicit physical energy functions, and the guidance term is estimated at inference with a local Monte Carlo approximation, dynamically steering the generation trajectory toward physically feasible regions without additional physics simulation or training feedback.

Link: https://arxiv.org/abs/2603.16151
Authors: Yukun Zhao, Zichen Zhong, Yongshun Gong, Yilong Yin, Haoliang Sun
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Denoising generative models have recently become the dominant paradigm for dexterous grasp generation, owing to their ability to model complex grasp distributions from large-scale data. However, existing diffusion-based methods typically formulate generation as a stochastic differential equation (SDE), which often requires many sequential denoising steps and introduces trajectory instability that can lead to physically infeasible grasps. In this paper, we propose EFF-Grasp, a novel Flow-Matching-based framework for physics-aware dexterous grasp generation. Specifically, we reformulate grasp synthesis as a deterministic ordinary differential equation (ODE) process, which enables efficient and stable generation through smooth probability flows. To further enforce physical feasibility, we introduce a training-free physics-aware energy guidance strategy. Our method defines an energy-guided target distribution using adapted explicit physical energy functions that capture key grasp constraints, and estimates the corresponding guidance term via a local Monte Carlo approximation during inference. In this way, EFF-Grasp dynamically steers the generation trajectory toward physically feasible regions without requiring additional physics-based training or simulation feedback. Extensive experiments on five benchmark datasets show that EFF-Grasp achieves superior performance in grasp quality and physical feasibility, while requiring substantially fewer sampling steps than diffusion-based baselines.
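To make the ODE-versus-SDE distinction concrete, the sketch below integrates a 1-D deterministic flow with Euler steps and adds a training-free guidance term estimated by local random perturbations. The velocity field, energy function, and Monte Carlo estimator are toy stand-ins under stated assumptions, not the paper's learned model.

```python
# Toy 1-D sketch of deterministic flow-matching sampling with a
# training-free, energy-based guidance term. All functions are
# illustrative stand-ins.
import random

def velocity(x: float, t: float, target: float = 1.0) -> float:
    # Linear-interpolation flow that drives x toward the target as t -> 1.
    return (target - x) / max(1.0 - t, 1e-3)

def energy(x: float) -> float:
    # Stand-in physical energy: penalize leaving a "feasible" region x <= 1.2.
    return max(0.0, x - 1.2) ** 2

def guidance(x: float, eps: float = 1e-2, n: int = 8) -> float:
    # Local Monte Carlo estimate of -dE/dx via random finite differences.
    random.seed(0)  # keeps the whole sampler deterministic
    est = 0.0
    for _ in range(n):
        d = random.uniform(-eps, eps)
        est += (energy(x + d) - energy(x)) * d
    return -est / (n * eps * eps)

def sample(steps: int = 10, guide_weight: float = 1.0) -> float:
    x, dt = 0.0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x += dt * (velocity(x, t) + guide_weight * guidance(x))  # Euler ODE step
    return x

print(round(sample(), 3))
```

Because the update is an ODE step rather than a noisy SDE step, repeated runs trace the same trajectory, which is the source of the stability and step-count savings the abstract describes.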

[CV-107] Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

【Quick Read】: This paper targets two bottlenecks in pre-training the visual generation components of unified multimodal models (UMMs): reliance on inefficient training paradigms, and dependence on scarce, high-quality text-image paired data. The key is Image-Only Training for UMMs (IOMM), a data-efficient two-stage framework: the first stage pre-trains the visual generation component using only abundant unlabeled images, removing the dependency on paired data; the second stage fine-tunes on a mixture of unlabeled images and a small curated set of text-image pairs to improve instruction alignment and generation quality. Experiments show that IOMM achieves state-of-the-art training efficiency and performance; for example, IOMM-B (3.6B parameters) is trained from scratch in roughly 1050 H800 GPU hours and scores 0.89 on GenEval and 0.55 on WISE, clearly surpassing several strong baselines.

Link: https://arxiv.org/abs/2603.16139
Authors: Peng Sun, Jun Xie, Tao Lin
Institutions: Zhejiang University; Shanghai Innovation Institute; Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Click to view abstract

Abstract:Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 / 0.55) and BLIP3-o-4B (0.84 / 0.50). Code is available at this https URL.

[CV-108] When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems

【Quick Read】: This paper investigates the failure modes of generative AI augmentation for mitigating class imbalance under low-data conditions, evaluating how different augmentation strategies affect model bias in fine-grained animal classification. The key lies in a controlled benchmark comparing three strategies: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). It finds that with very few training samples, FastGAN not only underperforms but significantly amplifies classifier bias (bias gap increase +20.7%, Cohen's d = +5.03), while LoRA-fine-tuned Stable Diffusion performs best, reaching 0.9125 macro F1 and reducing the bias gap by 13.1%. This indicates that, in low-data settings, parameter-efficient fine-tuned diffusion models can avoid mode collapse and improve fairness.

Link: https://arxiv.org/abs/2603.16134
Authors: Shesh Narayan Gupta, Nik Bear Brown
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are poorly understood. This paper reports a controlled benchmark comparing three augmentation strategies applied to a fine-grained animal classification task: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very low training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen’s d = +5.03, p = 0.013). The effect size here is large enough to give confidence in the direction of the finding despite the small number of seeds. Feature embedding analysis using t-distributed Stochastic Neighbor Embedding reveals that FastGAN images for severe-minority breeds form tight isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation produced the best results overall, achieving the highest macro F1 (0.9125 plus or minus 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary somewhere between 20 and 50 training images per class below which GAN augmentation becomes harmful in this setting, though further work across additional domains is needed to establish where that boundary sits more precisely. All experiments run on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.
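The two headline statistics in the abstract, the bias gap and Cohen's d between small groups of per-seed results, follow standard formulas that can be sketched directly. The per-seed numbers below are made up for illustration; only the formulas are standard.

```python
# Sketch of the bias-gap and Cohen's d computations. The example per-seed
# bias gaps are hypothetical, not the paper's measurements.
from statistics import mean, stdev
from math import sqrt

def bias_gap(majority_f1: float, minority_f1: float) -> float:
    """Gap between majority-class and minority-class performance."""
    return majority_f1 - minority_f1

def cohens_d(group_a: list, group_b: list) -> float:
    """Cohen's d using the pooled standard deviation of two samples."""
    na, nb = len(group_a), len(group_b)
    pooled = sqrt(((na - 1) * stdev(group_a) ** 2 +
                   (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / pooled

# Hypothetical per-seed bias gaps, three seeds each (as in the study design).
gan_gaps = [0.41, 0.43, 0.42]
baseline_gaps = [0.21, 0.22, 0.20]
print(round(bias_gap(0.90, 0.69), 2))
print(round(cohens_d(gan_gaps, baseline_gaps), 2))
```

With only three seeds per condition, the pooled standard deviation is estimated from four degrees of freedom, which is why a very large d (like the reported +5.03) is needed before the direction of the effect can be trusted.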

[CV-109] DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives

【Quick Read】: This paper addresses a common problem in neural reconstruction: fidelity is pursued at the expense of structural compactness and regularity, producing overly dense meshes with irregular topology and weak part boundaries that hinder editing, animation, and downstream asset reuse. The key is DualPrim, which represents shapes with an additive-subtractive mixture of positive and negative superquadrics: positive superquadrics build the main structure, while negative superquadrics carve local cavities and concavities through a differentiable operator, enabling topology-aware shape modeling while keeping the representation compact and differentiable. This design markedly increases expressive power, supports end-to-end training from multi-view images, and allows seamless export of structured meshes via a closed-form boolean difference.

Link: https://arxiv.org/abs/2603.16133
Authors: Xiaoxu Meng, Zhongmin Chen, Bo Yang, Weikai Chen, Weixiao Liu, Lin Gao
Institutions: Waymo LLC; Lucid Motors; Institute of Computing Technology, Chinese Academy of Sciences; School of Advanced Interdisciplinary Science, University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Neural reconstructions often trade structure for fidelity, yielding dense and unstructured meshes with irregular topology and weak part boundaries that hinder editing, animation, and downstream asset reuse. We present DualPrim, a compact and structured 3D reconstruction framework. Unlike additive-only implicit or primitive methods, DualPrim represents shapes with positive and negative superquadrics: the former builds the bases while the latter carves local volumes through a differentiable operator, enabling topology-aware modeling of holes and concavities. This additive-subtractive design increases the representational power without sacrificing compactness or differentiability. We embed DualPrim in a volumetric differentiable renderer, enabling end-to-end learning from multi-view images and seamless mesh export via closed-form boolean difference. Empirically, DualPrim delivers state-of-the-art accuracy and produces compact, structured, and interpretable outputs that better satisfy downstream needs than additive-only alternatives.

[CV-110] EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image Fusion

【Quick Read】: This paper addresses the loss of critical visual information caused by overexposure in infrared and visible image fusion, where existing methods perform poorly in highly bright regions. The key is the EPOFusion model with three core components: a guidance module that helps the encoder extract fine-grained infrared features from overexposed regions; an iterative decoder with a multiscale context fusion module that progressively enhances the fused image, ensuring detail consistency and high visual quality; and an adaptive loss function that dynamically constrains the fusion process, balancing the contributions of the two modalities under varying exposure conditions. The authors also construct IVOE, the first infrared and visible overexposure dataset, with high-quality infrared-guided annotations for overexposed regions to support training and evaluation. Experiments show that EPOFusion preserves infrared cues in overexposed regions while achieving high-fidelity fusion in non-overexposed areas, improving both visual quality and downstream task performance.

Link: https://arxiv.org/abs/2603.16130
Authors: Zhiwei Wang, Yayu Zheng, Defeng He, Li Zhao, Xiaoqin Zhang, Yuxing Li, Edmund Y. Lam
Institutions: Zhejiang University of Technology; Zhejiang Shuren University; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Overexposure frequently occurs in practical scenarios, causing the loss of critical visual information. However, existing infrared and visible fusion methods still exhibit unsatisfactory performance in highly bright regions. To address this, we propose EPOFusion, an exposure-aware fusion model. Specifically, a guidance module is introduced to facilitate the encoder in extracting fine-grained infrared features from overexposed regions. Meanwhile, an iterative decoder incorporating a multiscale context fusion module is designed to progressively enhance the fused image, ensuring consistent details and superior visual quality. Finally, an adaptive loss function dynamically constrains the fusion process, enabling an effective balance between the modalities under varying exposure conditions. To achieve better exposure awareness, we construct the first infrared and visible overexposure dataset (IVOE) with high quality infrared guided annotations for overexposed regions. Extensive experiments show that EPOFusion outperforms existing methods. It maintains infrared cues in overexposed regions while achieving visually faithful fusion in non-overexposed areas, thereby enhancing both visual fidelity and downstream task performance. Code, fusion results and IVOE dataset will be made available at this https URL.
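An exposure-adaptive fusion loss in the spirit of the abstract can be sketched per pixel: where the visible image is overexposed, the weight shifts toward preserving the infrared input, and elsewhere it favors the visible input. The weighting rule, threshold, and toy pixel values below are illustrative assumptions, not the paper's actual loss.

```python
# Hypothetical per-pixel exposure-adaptive fusion loss sketch.
# The ramp threshold of 0.85 and the pixel values are illustrative.
def exposure_weight(visible_intensity: float, threshold: float = 0.85) -> float:
    """Ramp from 0 (well exposed) to 1 (fully overexposed)."""
    if visible_intensity <= threshold:
        return 0.0
    return (visible_intensity - threshold) / (1.0 - threshold)

def adaptive_loss(fused: list, visible: list, infrared: list) -> float:
    """Mean squared error whose IR/visible balance depends on exposure."""
    total = 0.0
    for f, v, ir in zip(fused, visible, infrared):
        w = exposure_weight(v)
        total += w * (f - ir) ** 2 + (1.0 - w) * (f - v) ** 2
    return total / len(fused)

visible = [0.30, 0.95, 1.00]            # last two pixels overexposed
infrared = [0.40, 0.60, 0.70]
fused_ir_biased = [0.30, 0.60, 0.70]    # keeps IR detail in bright regions
fused_vis_only = [0.30, 0.95, 1.00]     # copies the washed-out visible pixels
print(adaptive_loss(fused_ir_biased, visible, infrared) <
      adaptive_loss(fused_vis_only, visible, infrared))  # True
```

The loss prefers the fusion that falls back on infrared structure exactly where the visible channel has saturated, which mirrors the modality balancing the adaptive loss in the paper is meant to enforce.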

[CV-111] Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting CVPR2026

【Quick Read】: This paper addresses shortcomings of zero-shot object counting (ZSOC) that arise from treating counting as a coarse retrieval task: insufficient quantity awareness, poor spatial sensitivity, and degraded generalization caused by feature-space distortion. The key is the QICA framework with two innovations: a Synergistic Prompting Strategy (SPS) that adapts the vision and language encoders through numerically conditioned prompts, aligning semantic recognition with quantitative reasoning; and a Cost Aggregation Decoder (CAD) that performs spatial aggregation directly on vision-text similarity maps, mitigating feature distortion while preserving zero-shot transferability. In addition, a multi-level quantity alignment loss (L_MQA) enforces numerical consistency across the entire pipeline, markedly improving generalization to unseen domains.

Link: https://arxiv.org/abs/2603.16129
Authors: Da Zhang, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao
Institutions: Northwestern Polytechnical University; Institute of Artificial Intelligence (TeleAI), China Telecom; University of Science and Technology of China; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present **QICA**, a novel framework that synergizes **q**uantity percept**i**on with robust spatial **c**ost **a**ggregation. Specifically, we introduce a Synergistic Prompting Strategy (**SPS**) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (**CAD**) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{\text{MQA}}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.

[CV-112] Out-of-Distribution Object Detection in Street Scenes via Synthetic Outlier Exposure and Transfer Learning

【速读】:该论文旨在解决分布外(Out-of-Distribution, OOD)目标检测中存在的关键问题:现有检测模型在遇到未见过的、分布外的目标时,常会将其错误地识别为背景,导致漏检,且当前方法多依赖复杂架构或辅助分支,缺乏对分布内(In-Distribution, ID)与OOD对象的统一处理框架。解决方案的关键在于提出一种名为SynOE-OD(Synthetic Outlier-Exposure-based Object Detection)的新框架,其核心创新是利用强大的生成模型(如Stable Diffusion)和开放词汇目标检测器(Open-Vocabulary Object Detectors, OVODs),在训练阶段合成语义合理的、作为异常样本的对象级数据,并通过迁移学习提升模型在ID任务上的性能同时增强对OOD对象的鲁棒性检测能力。

链接: https://arxiv.org/abs/2603.16122
作者: Sadia Ilyas,Annika Mütze,Klaus Friedrichs,Thomas Kurbiel,Matthias Rottmann
机构: 1: University of Stuttgart (斯图加特大学); 2: Fraunhofer Institute for Manufacturing Engineering and Automation IPA (弗劳恩霍夫制造工程与自动化研究所 IPA); 3: Bosch Rexroth AG (博世力士乐公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) object detection is an important yet underexplored task. A reliable object detector should be able to handle OOD objects by localizing and correctly classifying them as OOD. However, a critical issue arises when such atypical objects are completely missed by the object detector and incorrectly treated as background. Existing OOD detection approaches in object detection often rely on complex architectures or auxiliary branches and typically do not provide a framework that treats in-distribution (ID) and OOD in a unified way. In this work, we address these limitations by enabling a single detector to detect OOD objects, that are otherwise silently overlooked, alongside ID objects. We present **SynOE-OD**, a **S**ynthetic **O**utlier-**E**xposure-based **O**bject **D**etection framework that leverages strong generative models, like Stable Diffusion, and Open-Vocabulary Object Detectors (OVODs) to generate semantically meaningful, object-level data that serve as outliers during training. The generated data is used for transfer-learning to establish strong ID task performance and supplement detection models with OOD object detection robustness. Our approach achieves state-of-the-art average precision on an established OOD object detection benchmark, where OVODs, such as GroundingDINO, show limited zero-shot performance in detecting OOD objects in street scenes.

[CV-113] PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在数字病理学中缺乏可靠、自动化评估指标的问题,尤其针对模型在临床场景下可能出现的细微失败(如幻觉)难以被有效识别的困境。解决方案的关键在于提出 PathGLS,一种无参考的评估框架,通过三个维度全面衡量视觉-语言模型(Vision-Language Models, VLMs)的可信度:Grounding(细粒度视觉-文本对齐)、Logic(基于自然语言推理的蕴涵图一致性)和 Stability(对抗性视觉-语义扰动下的输出稳定性)。该框架支持切片级与全切片图像(Whole-Slide Image, WSI)级分析,并生成综合信任分数,实验证明其在检测幻觉报告和匹配专家定义的临床错误层级方面显著优于现有基于大语言模型(Large Language Model, LLM)的方法。

链接: https://arxiv.org/abs/2603.16113
作者: Minbing Chen,Zhu Meng,Fei Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference-free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine-grained visual-text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual-semantic perturbations). PathGLS supports both patch-level and whole-slide image (WSI)-level analysis, yielding a comprehensive trust score. Experiments on Quilt-1M, TCGA, REG2025, PathMMU and TCGA-Sarcoma datasets demonstrate the superiority of PathGLS. Specifically, on the Quilt-1M dataset, PathGLS reveals a steep sensitivity drop of 40.2% for hallucinated reports compared to only 2.1% for BERTScore. Moreover, validation against expert-defined clinical error hierarchies reveals that PathGLS achieves a strong Spearman’s rank correlation of $\rho=0.71$ ($p < 0.0001$), significantly outperforming Large Language Model (LLM)-based approaches (Gemini 3.0 Pro: $\rho=0.39$, $p < 0.0001$). These results establish PathGLS as a robust reference-free metric. By directly quantifying hallucination rates and domain shift robustness, it serves as a reliable criterion for benchmarking VLMs on private clinical datasets and informing safe deployment. Code can be found at: this https URL
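上文以 Spearman 秩相关($\rho$)衡量 PathGLS 分数与专家错误层级的一致性。其计算即对秩次求 Pearson 相关,可用如下最小示意复现(笔者自拟,未处理并列秩;实际评测应使用带并列秩平均的实现):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Sketch only: assigns ranks 1..n and does not average tied ranks."""
    def rank(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(1, len(a) + 1)
        return r
    rx, ry = rank(np.asarray(x, float)), rank(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

由于只依赖秩次,任何单调变换(如 $y=x^3$)都给出 $\rho=1$。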

[CV-114] NanoGS: Training-Free Gaussian Splat Simplification

【速读】:该论文旨在解决3D Gaussian Splat (3DGS) 在实际应用中因海量Splats导致的存储与传输成本过高问题,同时克服现有压缩方法依赖GPU密集型后训练优化和校准图像的局限性。其解决方案的关键在于提出NanoGS——一种无需训练、轻量化的高斯点简化框架,通过在稀疏空间图上进行局部成对合并操作实现压缩:利用质量保持的矩匹配近似一对高斯分布为单个基元,并基于原始混合模型与其近似之间的合理合并代价评估合并质量;通过限制候选合并对象于局部邻域并高效筛选兼容对,NanoGS在保留场景结构和外观的前提下显著减少高斯基元数量,且可在CPU上高效运行,兼容标准3DGS参数化,便于集成至现有渲染管线。

链接: https://arxiv.org/abs/2603.16103
作者: Butian Xiong,Rong Liu,Tiantian Zhou,Meida Chen,Zhiwen Fan,Andrew Feng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:3D Gaussian Splat (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of Splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass preserved moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at this https URL.
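摘要中的 "mass preserved moment matching" 指用单个高斯匹配一对加权高斯的混合零阶、一阶与二阶矩。以下为笔者对单次合并步骤的示意(省略了 NanoGS 的合并代价评估与稀疏邻域筛选,函数名为假设):

```python
import numpy as np

def merge_gaussians(w1, mu1, S1, w2, mu2, S2):
    """Merge two weighted Gaussians into one by moment matching:
    total mass, mixture mean, and mixture covariance are preserved."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    d1, d2 = mu1 - mu, mu2 - mu
    S = (w1 * (S1 + np.outer(d1, d1)) + w2 * (S2 + np.outer(d2, d2))) / w
    return w, mu, S
```

协方差项中的外积修正保证了合并后二阶矩与原混合一致,这也是被合并的两个分量相距越远、替换误差越大的原因。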

[CV-115] Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP CVPR’26

【速读】:该论文旨在解决当前主流的对比语言图像预训练模型(如CLIP)在纯图像任务中表现不佳的问题,其核心假设是这些模型因忽视图像-图像(intra-modal)对齐而造成嵌入空间中图像距离校准不良。然而,本文通过理论分析和实证验证质疑了这一假设:首先,理论证明图像嵌入距离不存在所谓“自由度”导致的失配;其次,实验发现语言-图像训练模型(CLIP、SigLIP)与图像-图像训练模型(DINO、SigLIP2)在相同评估指标下表现一致,表明问题并非源于特定于前者的设计缺陷。因此,论文指出提升图像任务性能的关键在于缓解任务歧义(task ambiguity),而非修复所谓的跨模态对齐偏差。

链接: https://arxiv.org/abs/2603.16100
作者: Jonas Herzog,Yue Wang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for CVPR’26

点击查看摘要

Abstract:Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.

[CV-116] OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

【速读】:该论文旨在解决现有基于扩散模型的3D场景生成方法主要在2D图像或视频隐空间中操作,导致跨视角外观和几何一致性难以维持的问题。其解决方案的关键在于提出OneWorld框架,该框架直接在统一的3D表示空间中进行扩散生成;核心创新包括:(1) 3D统一表示自编码器(3D Unified Representation Autoencoder, 3D-URAE),通过引入外观信息并蒸馏语义信息到统一的3D隐空间中,增强预训练3D基础模型的几何中心特性;(2) 基于token级别的跨视图对应(Cross-View-Correspondence, CVC)一致性损失,显式约束多视角间的结构对齐;(3) 流形漂移强制(Manifold-Drift Forcing, MDF)机制,通过混合漂移与原始表示来缓解训练-推理暴露偏差,构建稳健的3D流形。实验表明,该方法在跨视角一致性方面显著优于当前主流2D基方法。

链接: https://arxiv.org/abs/2603.16099
作者: Sensen Gao,Zhaoqing Wang,Qihang Cao,Dongdong Yu,Changhu Wang,Tongliang Liu,Mingming Gong,Jiawang Bian
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); AISphere; Shanghai Jiao Tong University (上海交通大学); University of Sydney (悉尼大学); University of Melbourne (墨尔本大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at this https URL.

[CV-117] LICA: Layered Image Composition Annotations for Graphic Design Research

【速读】:该论文旨在解决当前视觉-语言模型在图形设计理解与生成任务中缺乏结构化表示和系统性建模能力的问题。现有方法通常仅处理像素级图像,难以捕捉设计元素间的层次关系与语义属性,限制了对复杂布局的精确控制与动态演化建模。解决方案的关键在于提出LICA(Layered Image Composition Annotations)数据集,其核心创新是将每个图形设计表示为具有类型标签(如文本、图像、矢量、组元素)的分层组件结构,并附带丰富的每元素元数据(包括空间几何、字体属性、透明度、可见性等),同时引入动画布局视频以支持时间维度上的设计演化分析。这一结构化表示范式使得模型能够直接操作设计结构而非仅依赖像素信息,从而推动层感知修复、结构化版面生成、可控编辑及时间感知生成建模等新研究方向的发展。

链接: https://arxiv.org/abs/2603.16098
作者: Elad Hirsch,Shubham Yadav,Mohit Garg,Purvanshi Mehta
机构: lica.world
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts. In addition to rendered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investigations into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By representing design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.

[CV-118] Diffusion Models for Joint Audio-Video Generation

【速读】:该论文旨在解决多模态生成模型在联合音频-视频生成任务中仍面临的关键挑战,即如何实现语义一致且时序同步的跨模态内容生成。其核心解决方案在于提出一种分步式文本到音视频的生成流程:首先基于文本提示生成高质量视频,随后在视频输出和原始文本提示的双重条件下,合成与视频在时间上严格对齐的音频。这一模块化方法有效提升了生成结果的保真度与协同性,为多模态生成模型的实际应用提供了可行路径。

链接: https://arxiv.org/abs/2603.16093
作者: Alejandro Paredes La Torre
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I present four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets, consisting of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on our datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity, temporally synchronized audio-video generations.

[CV-119] Parallel In-context Learning for Large Vision Language Models CVPR2026

【速读】:该论文旨在解决大规模视觉语言模型(Large Vision-Language Models, LVLMs)在多模态上下文学习(Multi-modal In-Context Learning, MM-ICL)中面临的准确性与效率之间的权衡问题。具体而言,随着演示样本数量增加,MM-ICL性能提升,但因Transformer注意力机制的计算复杂度随上下文长度呈二次增长,导致推理延迟显著增加。解决方案的关键在于提出并行上下文学习(Parallel In-Context Learning, Parallel-ICL),其核心思想是将长上下文演示序列划分为多个较短且可并行处理的片段,通过加权的专家产品(Product-of-Experts, PoE)集成策略在logit层融合各片段预测结果,从而近似全上下文输出。该方法结合基于聚类的分块策略以增强片段间多样性,并引入基于相似性的预测加权机制以提升查询相关性,实验证明其可在保持与完整上下文MM-ICL相当性能的同时大幅降低推理开销。

链接: https://arxiv.org/abs/2603.16092
作者: Shin’ya Yamaguchi,Daiki Chijiwa,Tamao Sakao,Taku Hasegawa
机构: NTT
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026 (Findings); Code is available at this https URL

点击查看摘要

Abstract:Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
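logit 层的加权 Product-of-Experts 集成有封闭形式:对各 softmax 专家取加权幂再归一化,等价于对各块 logits 做加权求和。以下为笔者的最小示意(每块一条 logit 向量、权重对应查询相关性,接口均为假设):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def parallel_icl_poe(chunk_logits, weights):
    """Weighted Product-of-Experts over per-chunk predictions:
    p(y) is proportional to prod_k p_k(y)^{w_k}, i.e. a weighted sum of logits."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                      # normalize relevance weights
    fused = np.tensordot(w, np.asarray(chunk_logits, dtype=float), axes=1)
    return softmax(fused)
```

各块可并行前向计算,最后只需一次 logits 加权融合,这正是论文中准确率与推理开销折中的来源。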

[CV-120] owards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在实时声学感知与动态环境交互中的局限性,尤其是现有方法将声音视为静态提示或仅关注人类语音,导致在任务执行过程中容易遗漏关键的瞬时环境声学信息,特别是在开环执行下的“盲执行区间”(Blind Execution Interval)中。其核心问题在于缺乏对连续音频流的因果建模能力以及对时间动态的显式学习机制,从而限制了机器人在复杂环境中进行声学驱动的精准操作。解决方案的关键是提出一种新的连续控制范式——视觉-声音-语言-动作(Vision-Sound-Language-Action, VSLA),并设计HEAR框架:通过一个流式历史记录器(Historizer)维持因果音频上下文、一个基于多模态基础模型的想象器(Envisioner)进行跨感官推理、一个音频世界模型(Advancer)预测未来音频码以学习时间动态,以及一个基于流匹配的执行器(Realizer)生成平滑动作块,实现端到端的声音驱动决策与执行。

链接: https://arxiv.org/abs/2603.16086
作者: Chang Nie,Tianchen Deng,Guangming Wang,Zhe Liu,Hesheng Wang
机构: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University and Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai 200240, China; Department of Engineering, Cambridge University, Cambridge, CB2 1TN, UK
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at this https URL.

[CV-121] Interact3D: Compositional 3D Generation of Interactive Objects

【速读】:该论文旨在解决从单张图像中生成具有物理合理交互关系的3D组合物体的问题,尤其在存在遮挡情况下如何保持隐藏区域的几何细节和物体间的空间关系(Object-Object Spatial Relationships, OOR)。其解决方案的关键在于提出了一种名为Interact3D的新框架,包含两个核心环节:首先通过统一的3D引导场景利用先进的生成先验构建高质量个体资产;其次设计了一个两阶段组合流程,其中主物体通过全局到局部几何对齐(registration)进行锚定,其余物体则采用基于可微分符号距离场(Signed Distance Field, SDF)的优化方法以显式惩罚几何交叠,并引入闭环代理精化策略——由视觉-语言模型(Vision-Language Model, VLM)自动分析多视角渲染结果并生成针对性修正提示,驱动图像编辑模块迭代自纠正生成过程,从而实现碰撞感知、几何保真度高且空间关系一致的3D组合物体合成。

链接: https://arxiv.org/abs/2603.16085
作者: Hui Shan,Keyang Luo,Ming Li,Sizhe Zheng,Yanwei Fu,Zhen Chen,Xiangru Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images–particularly under occlusions–remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present a novel framework Interact3D designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D successfully produces promising collision-aware compositions with improved geometric fidelity and consistent spatial relationships.
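基于可微 SDF 的交叠惩罚可以用解析 SDF 直观说明:对一个物体的采样点在另一物体的符号距离场中取值,只惩罚为负(位于内部)的部分。以下为笔者以球面 SDF 构造的玩具示意(Interact3D 实际作用于学习到的 SDF,并通过梯度优化物体位姿):

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Analytic signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=1) - radius

def intersection_penalty(points, sdf):
    """Sum of squared negative SDF values over sample points:
    zero when nothing penetrates, grows smoothly with penetration depth."""
    d = sdf(points)
    return float(np.sum(np.minimum(d, 0.0) ** 2))
```

惩罚项对穿透深度平滑可导,因此可直接并入基于梯度的位姿优化。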

[CV-122] Structured prototype regularization for synthetic-to-real driving scene parsing

【速读】:该论文旨在解决合成数据到真实场景的域适应问题(synthetic-to-real domain adaptation),即模型在合成数据上训练后,在真实驾驶场景中性能显著下降的问题。其关键解决方案在于提出一种新颖的无监督域适应框架,通过显式地正则化语义特征结构来增强模型对真实场景的泛化能力:具体而言,利用类别特定原型(class-specific prototypes)强制类间分离与类内紧凑性,从而提升特征簇的可区分性和结构一致性;同时结合基于熵的噪声过滤策略提高伪标签可靠性,并引入像素级注意力机制进一步优化特征对齐效果。

链接: https://arxiv.org/abs/2603.16083
作者: Jiahe Fan,Xiao Ma,Sergey Vityazev,George Giakos,Shaolong Shu,Rui Fan
机构: Tongji University (同济大学); Beijing Institute of Aerospace Control Devices (北京航空航天控制设备研究所); Ryazan State Radio Engineering University (俄罗斯里亚赞州立无线电工程大学); Manhattan University (曼哈顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Driving scene parsing is critical for autonomous vehicles to operate reliably in complex real-world traffic environments. To reduce the reliance on costly pixel-level annotations, synthetic datasets with automatically generated labels have become a popular alternative. However, models trained on synthetic data often perform poorly when applied to real-world scenes due to the synthetic-to-real domain gap. Despite the success of unsupervised domain adaptation in narrowing this gap, most existing methods mainly focus on global feature alignment while overlooking the semantic structure of the feature space. As a result, semantic relations among classes are insufficiently modeled, limiting the model’s ability to generalize. To address these challenges, this study introduces a novel unsupervised domain adaptation framework that explicitly regularizes semantic feature structures to significantly enhance driving scene parsing performance in real-world scenarios. Specifically, the proposed method enforces inter-class separation and intra-class compactness by leveraging class-specific prototypes, thereby enhancing the discriminability and structural coherence of feature clusters. An entropy-based noise filtering strategy improves the reliability of pseudo labels, while a pixel-level attention mechanism further refines feature alignment. Extensive experiments on representative benchmarks demonstrate that the proposed method consistently outperforms recent state-of-the-art methods. These results underscore the importance of preserving semantic structure for robust synthetic-to-real adaptation in driving scene parsing tasks.
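类内紧凑与类间分离约束可写成一个基于类别原型的正则项。以下为笔者的 NumPy 示意(以批内均值作原型、类间采用 hinge 间隔,均为简化假设,未必与原文公式一致):

```python
import numpy as np

def prototype_regularizer(feats, labels, margin=1.0):
    """Pull features toward their class prototype (intra-class compactness)
    and push prototypes at least `margin` apart (inter-class separation)."""
    classes = np.unique(labels)
    protos = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # intra-class compactness: mean squared distance to the own prototype
    compact = np.mean([np.sum((feats[labels == c] - protos[i]) ** 2, axis=1).mean()
                       for i, c in enumerate(classes)])
    # inter-class separation: hinge on pairwise prototype distances
    sep, n = 0.0, 0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = np.linalg.norm(protos[i] - protos[j])
            sep += max(0.0, margin - d) ** 2
            n += 1
    return float(compact + sep / max(n, 1))
```

当各类特征簇既紧凑又彼此相距超过间隔时,正则项为零;簇重叠或松散时损失上升,从而显式约束特征空间的语义结构。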

[CV-123] Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI

【速读】:该论文旨在解决基于隐式神经表示的解剖形状间密集体积对应关系建立难题,现有方法通常仅依赖零等值面附近的监督信号,导致仅能获得表面对应关系,而内部形变缺乏约束。其解决方案的关键在于提出一种体积一致的隐式模型,通过耦合符号距离函数(Signed Distance Function, SDF)重建与神经微分同胚流(neural diffeomorphic flow),学习胎盘的共享标准模板;同时引入体积正则化项(包括雅可比行列式惩罚和双调和惩罚),有效抑制局部折叠并促进全局一致的形变,从而实现个体胎盘在统一标准空间中的体素级配准与强度映射,显著提升几何保真度和体积对齐效果,适用于群体层面的解剖学分析。

链接: https://arxiv.org/abs/2603.16078
作者: Athena Taymourtash,S. Mazdak Abulnaga,Esra Abaci Turk,P. Ellen Grant,Polina Golland
机构: MIT Computer Science and Artificial Intelligence Laboratory (麻省理工学院计算机科学与人工智能实验室); Massachusetts General Hospital, Harvard Medical School (马萨诸塞州总医院,哈佛医学院); Boston Children’s Hospital, Harvard Medical School (波士顿儿童医院,哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations. Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.
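Jacobian 行列式惩罚的含义是:形变 φ(x)=x+u(x) 的雅可比为 I+∂u/∂x,det≤0 处即发生局部折叠。以下为笔者基于有限差分的数值示意(原文在神经微分同胚流中解析地施加该约束,此处仅作说明):

```python
import numpy as np

def jacobian_det_penalty(disp, spacing=1.0):
    """Folding penalty for a 3D displacement field `disp` of shape (D, H, W, 3).
    The Jacobian of phi(x) = x + u(x) is I + du/dx (finite differences here);
    non-positive determinants (local folding) are penalized quadratically."""
    grads = np.stack([np.stack(np.gradient(disp[..., k], spacing), axis=-1)
                      for k in range(3)], axis=-2)      # (D,H,W,3,3): du_k/dx_j
    J = grads + np.eye(3)
    det = np.linalg.det(J)
    return float(np.mean(np.maximum(0.0, -det) ** 2))
```

恒等形变(零位移场)处 det(J)=1,惩罚为零;任何使行列式变为非正的位移都会被平方惩罚抑制。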

[CV-124] Attribution Upsampling should Redistribute Not Interpolate

【速读】:该论文旨在解决现有可解释人工智能(Explainable AI)中归因方法在上采样过程中的信号失真问题,即传统双线性(bilinear)和双三次(bicubic)插值技术因未考虑模型语义边界,在将低分辨率归因图(saliency map)上采样至原始图像分辨率时引入混叠(aliasing)、振铃(ringing)和边界溢出(boundary bleeding),从而产生虚假的高重要性区域,误导对模型决策逻辑的理解。解决方案的关键在于将归因上采样重新建模为一种基于语义边界的质量重分配问题(mass redistribution problem),而非孤立的插值问题,并提出通用语义感知上采样(Universal Semantic-Aware Upsampling, USU),其核心是通过比例形式的质量重分配算子(ratio-form mass redistribution operators)实现归因质量的严格守恒与相对重要性顺序的保持,同时满足四个理想属性(desiderata),其中三个强制要求重分配算子必须采用比例形式,第四个则确定唯一最优解,从而在理论和实验层面均实现了更忠实、语义一致的解释结果。

链接: https://arxiv.org/abs/2603.16067
作者: Vincenzo Buono,Peyman Sheikholharam Mashhadi,Mahmoud Rahat,Prayag Tiwari,Stefan Byttner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Attribution methods in explainable AI rely on upsampling techniques that were designed for natural images, not saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent model reasoning. We identify that the core issue is treating attribution upsampling as an interpolation problem that operates in isolation from the model’s reasoning, rather than a mass redistribution problem where model-derived semantic boundaries must govern how importance flows. We present Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators, provably preserving attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. These same three force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, yielding USU. Controlled experiments on models with known attribution priors verify USU’s formal guarantees; evaluation across ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively superior, semantically coherent explanations.

[CV-125] ViT-AdaLA: Adapting Vision Transformers with Linear Attention

【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在处理长序列时因注意力机制复杂度为二次方而带来的可扩展性瓶颈问题,同时克服现有线性注意力方法在训练时需从头开始、计算资源消耗大,以及大型语言模型中的线性化方法难以直接迁移至 ViT 的局限。其解决方案的关键在于提出 ViT-AdaLA 框架,通过三个阶段实现知识的有效迁移与适应:首先在注意力对齐阶段,将原始 softmax 注意力与线性注意力在每一层进行行为近似;其次在特征对齐阶段,通过微调线性化 ViT 使其最终层特征与冻结的 softmax-ViT 教师模型对齐以缓解残差误差累积;最后通过监督微调将适配后的先验知识迁移至下游任务。该方法实现了高效且通用的线性注意力 ViT 知识迁移策略。

链接: https://arxiv.org/abs/2603.16063
作者: Yifan Li,Seunghyun Yoon,Viet Dac Lai,Franck Dernoncourt,Jason Kuen,Yu Kong,Trung Bui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
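这里被对齐的 vanilla linear attention 用核特征映射 φ 替换 softmax(QKᵀ)V,使复杂度对序列长度线性。以下为单头 NumPy 示意,采用常见的 φ(x)=elu(x)+1(摘要未指明具体特征映射,此为笔者假设):

```python
import numpy as np

def elu1(x):
    """phi(x) = elu(x) + 1: a positive feature map for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(N d^2) attention: out_i = phi(q_i) (sum_j phi(k_j) v_j^T),
    normalized by phi(q_i) . sum_j phi(k_j); no N x N matrix is formed."""
    Qf, Kf = elu1(Q), elu1(K)
    KV = Kf.T @ V                                # (d, d_v) summary, independent of N
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T     # (N, 1) normalizer
    return (Qf @ KV) / (Z + eps)
```

由于权重恒为正且逐行归一化,输出仍是 V 各行的凸组合;只有一个键值对时,任意查询的输出都退化为该值向量。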

[CV-126] he Era of End-to-End Autonomy: Transitioning from Rule-Based Driving to Large Driving Models

【速读】:该论文旨在解决自动驾驶技术从传统模块化规则驱动流水线向端到端(End-to-End, E2E)学习系统演进过程中的关键挑战,包括架构设计、部署策略、安全性保障及产业影响等问题。其解决方案的关键在于通过分析特斯拉FSD V12/V14、Rivian统一智能平台、NVIDIA Cosmos等最新进展,揭示大驾驶模型(Large Driving Models, LDMs)如何直接将原始传感器输入映射为驾驶动作,并推动监督式端到端驾驶(Supervised E2E Driving)成为主流商业策略——这类系统可在复杂环境中执行大部分动态驾驶任务(Dynamic Driving Task, DDT),同时依赖人类驾驶员进行安全监督,从而实现对长尾场景的更好泛化能力与商业化落地可行性。

链接: https://arxiv.org/abs/2603.16050
作者: Eduardo Nebot,Julie Stephany Berrio Perez
机构: The Australian Centre for Robotics (澳大利亚机器人中心); The University of Sydney (悉尼大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Autonomous driving is undergoing a shift from modular rule-based pipelines toward end-to-end (E2E) learning systems. This paper examines this transition by tracing the evolution from classical sense-perceive-plan-control architectures to large driving models (LDMs) capable of mapping raw sensor input directly to driving actions. We analyze recent developments including Tesla's Full Self-Driving (FSD) V12-V14, Rivian's Unified Intelligence platform, NVIDIA Cosmos, and emerging commercial robotaxi deployments, focusing on architectural design, deployment strategies, safety considerations and industry implications. A key emerging product category is supervised E2E driving, often referred to as FSD (Supervised) or L2++, which several manufacturers plan to deploy from 2026 onwards. These systems can perform most of the Dynamic Driving Task (DDT) in complex environments while requiring human supervision, shifting the driver's role to safety oversight. Early operational evidence suggests E2E learning handles the long-tail distribution of real-world driving scenarios and is becoming a dominant commercial strategy. We also discuss how similar architectural advances may extend beyond autonomous vehicles (AV) to other embodied AI systems, including humanoid robotics.

[CV-127] Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

【速读】:该论文旨在解决可穿戴惯性传感器在人体活动识别(Human Activity Recognition, HAR)中因跨用户差异(如生理特征、运动习惯和传感器位置不同)导致的性能下降问题。现有领域泛化方法要么忽略传感器数据流中的时序依赖性,要么依赖不切实际的目标域标注。其解决方案的关键在于提出一种基于强化学习的协同时序特征生成框架(CTFG),通过Transformer驱动的自回归生成器逐帧构建特征标记序列,每个特征标记均受先前上下文与编码后的传感器输入条件约束;优化过程采用无评判器的Group-Relative Policy Optimization算法,利用组内归一化而非预训练价值估计来获得稳定的优势信号,从而消除分布依赖偏差并实现自校准优化。该设计结合类判别、跨用户不变性和时序保真度三重奖励机制,有效提升模型在异构用户分布下的泛化能力。

链接: https://arxiv.org/abs/2603.16043
作者: Xiaozhou Ye,Feng Jiang,Zihan Wang,Xiulai Wang,Yutao Zhang,Kevin I-Kai Wang
机构: Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.
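摘要中无评判器(critic-free)的组相对优势计算可以用几行 NumPy 表达:同一输入采样一组序列,用组内均值/标准差归一化奖励,替代学习得到的价值估计。三目标奖励的权重为假设值,仅作示意:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """组相对优势:对同一输入采样的一组序列,
    用组内归一化的奖励作为优势信号,无需价值网络。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def tri_objective_reward(r_cls, r_inv, r_temp, w=(1.0, 1.0, 1.0)):
    # 三目标奖励示意:类判别 + 跨用户不变性 + 时序保真度(权重为假设值)
    return w[0] * r_cls + w[1] * r_inv + w[2] * r_temp

# 模拟同一输入下三条生成序列的各项奖励分量
rewards = [tri_objective_reward(*r)
           for r in [(0.9, 0.5, 0.7), (0.6, 0.8, 0.4), (0.3, 0.2, 0.9)]]
adv = group_relative_advantages(rewards)
```

组内归一化保证优势均值为零,因此优化信号只取决于序列之间的相对好坏,这也是其跨异构用户分布保持稳定的原因。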

[CV-128] Speak Segment Track Navigate: An Interactive System for Video-Guided Skull-Base Surgery

【速读】:该论文旨在解决传统图像引导导航系统在颅底手术中依赖外部光学追踪设备和额外硬件部署所带来的操作复杂性与流程割裂问题。解决方案的关键在于提出一种语音驱动的具身智能体(speech-guided embodied agent)框架,该框架直接在术中实时视频流上实现自然语言交互与视觉感知的动态协同,通过交互式分割与标注手术器械并将其作为空间锚点进行自主跟踪,从而支持包括术前三维模型交互配准、单目视频估计工具位姿及实时解剖结构引导等下游任务,无需额外硬件即可实现高精度的空间定位与工作流整合。

链接: https://arxiv.org/abs/2603.16024
作者: Jecia Z.Y. Mao,Francis X. Creighton,Russell H. Taylor,Manish Sahu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and support for image guidance through real-time anatomical visualization. We evaluate the proposed system in video-guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Results demonstrate that speech-guided embodied agents can achieve competitive spatial accuracy while improving workflow integration and enabling rapid deployment of video-guided surgical systems.

[CV-129] FlatLands: Generative Floormap Completion From a Single Egocentric View

【速读】:该论文旨在解决单张第一人称视角(egocentric)图像仅能捕捉局部地面信息,难以构建完整室内环境可通行性地图的问题,从而提升室内导航等应用的精度与鲁棒性。其核心解决方案是提出FlatLands数据集与基准测试平台,包含270,575条来自17,656个真实室内场景的对齐观测数据(含可见性、有效性及真值鸟瞰图BEV maps),并设计了分布内与分布外评估协议,系统比较了无训练方法、确定性模型、集成方法和随机生成模型的表现,最终实现从单目RGB图像到完整地面地图的端到端映射流程,为不确定性感知的室内建图与生成式补全提供了严谨的测试基准。

链接: https://arxiv.org/abs/2603.16016
作者: Subhransu S. Bhattacharjee,Dylan Campbell,Rahul Shome
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Under review

点击查看摘要

Abstract:A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird’s-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.
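作为这类带有效性掩码(validity map)的 BEV 补全基准评测方式的一个示意,下面只在有效栅格上计算可通行区域的 IoU。指标定义为本文假设的简化写法,具体协议以官方基准为准:

```python
import numpy as np

def masked_floor_iou(pred, gt, valid):
    """只在 valid 栅格上计算可通行地面(True)的 IoU;
    被遮挡/无效区域不参与评分。"""
    p, g = pred[valid], gt[valid]
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return float(inter / union) if union else 1.0

# 4x4 的小型 BEV 示例:真值地面占左侧两列,预测多补了一列
gt = np.zeros((4, 4), bool); gt[:, :2] = True
pred = np.zeros((4, 4), bool); pred[:, :3] = True
valid = np.ones((4, 4), bool)
iou = masked_floor_iou(pred, gt, valid)
```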

[CV-130] UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

【速读】:该论文旨在解决如何有效且高效地利用单一用途的生成式运动基础模型(Large-scale foundation models, LFMs),即仅支持文本到运动生成的任务,在更广泛的跨模态和上下文感知的下游运动生成任务中进行泛化的问题。现有方法通常对每个下游任务单独适配预训练的生成先验,缺乏统一性与灵活性。本文提出了一种名为UMO(Unified Motion Operations)的通用框架,其核心创新在于将多样化的下游任务建模为原子级帧操作的组合,并通过引入可学习的帧级元操作嵌入(meta-operation embeddings)来指定每帧意图,同时采用轻量级时序融合机制注入上下文线索,从而在不显著增加运行时开销的前提下,解锁预训练DiT架构运动模型的生成先验能力。这一设计使单一模型能够支持多种此前不被支持的任务,如时间插补、文本引导的运动编辑、文本序列化的几何约束以及多身份反应生成等。

链接: https://arxiv.org/abs/2603.15975
作者: Xiaoyan Cong,Zekun Li,Zhiyang Dou,Hongyu Li,Omid Taheri,Chuan Guo,Abhay Mittal,Sizhe An,Taku Komura,Wojciech Matusik,Michael J. Black,Srinath Sridhar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead compared to the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support diverse previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: this https URL
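论文的帧级元操作嵌入机制可以粗略理解为"按帧查表、叠加到帧 token 上"。下面是一个假设性的 NumPy 示意:操作名称与融合方式均为虚构的简化,真实实现是可学习嵌入加轻量时序融合:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 6, 8                                  # 帧数与特征维度(示意值)
OPS = {"keep": 0, "inpaint": 1, "edit": 2}   # 假设的三种帧级元操作
meta_emb = rng.standard_normal((len(OPS), D)) * 0.02  # "可学习"元操作嵌入(此处随机初始化)

def inject_meta_ops(frame_tokens, per_frame_ops):
    # 逐帧查表并叠加对应元操作嵌入,为每一帧指定意图(轻量融合的极简形式)
    ids = np.array([OPS[o] for o in per_frame_ops])
    return frame_tokens + meta_emb[ids]

tokens = rng.standard_normal((T, D))
out = inject_meta_ops(tokens, ["keep", "keep", "inpaint", "inpaint", "edit", "keep"])
```

同一条指令下,只需改变每帧的操作标记即可把时间插补、编辑等不同任务表达为原子帧操作的组合。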

[CV-131] A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Histopathology

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 领域中,组织病理学基础模型(Histopathology Foundation Models, HFMs)在非癌性慢性肾脏疾病(chronic kidney disease, CKD)场景下的适用性尚不明确的问题。尽管肾脏病理常与肾细胞癌和尿路上皮癌等恶性肿瘤共存,但现有HFMs在CKD相关任务中的表现仍缺乏系统评估。解决方案的关键在于构建一个全面的多任务评估框架,涵盖11种公开可获取的HFMs,在11个肾脏特异性下游任务中进行测试,这些任务覆盖多种染色方式(PAS、HE、PASM和IHC)、空间尺度(切片级与图块级)、任务类型(分类、回归和复制检测)以及临床目标(检测、诊断和预后)。通过严格的交叉验证策略(重复分层组交叉验证与嵌套分层交叉验证)及统计显著性检验(Friedman检验结合Wilcoxon符号秩检验与Holm-Bonferroni校正),研究揭示了当前HFMs主要擅长基于粗粒度中尺度肾结构特征的任务(如诊断分类和显著结构异常检测),而在需要精细微结构辨别、复杂生物学表型识别或切片级预后推断的任务中表现明显下降,且这一局限性不受染色方式影响。因此,论文强调开发针对肾脏的多染色、多模态基础模型对于实现肾病临床决策可靠性的必要性。

链接: https://arxiv.org/abs/2603.15967
作者: Harishwar Reddy Kasireddy(1),Patricio S. La Rosa(1 and 2),Akshita Gupta(1),Anindya S. Paul(1),Jamie L. Fermin(1),William L. Clapp(1),Meryl A. Waldman(3),Tarek M. El-Ashkar(4),Sanjay Jain(5),Luis Rodrigues(6),Kuang Yu Jen(7),Avi Z. Rosenberg(8),Michael T. Eadon(4),Jeffrey B. Hodgin(9),Pinaki Sarder(1) ((1) University of Florida, (2) Bayer Company, (3) National Institutes of Health, (4) Indiana University School of Medicine, (5) Washington University School of Medicine, (6) Universidade de Coimbra, (7) University of California Davis, (8) Johns Hopkins University, (9) University of Michigan)
机构: University of Florida(佛罗里达大学); Bayer Company(拜耳公司); University of Florida College of Medicine(佛罗里达大学医学院); National Institute of Diabetes and Digestive and Kidney Diseases(国家糖尿病与消化和肾脏疾病研究所); Indiana University School of Medicine(印第安纳大学医学院); Washington University School of Medicine(华盛顿大学医学院); Universidade de Coimbra(科英布拉大学); University of California at Davis School of Medicine(加州大学戴维斯分校医学院); Johns Hopkins University School of Medicine(约翰霍普金斯大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 Pages, 14 Tables, 12 figures, Co-correspondence to jhodgin@med. this http URL and this http URL @ufl.edu

点击查看摘要

Abstract:Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, HE, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at this https URL, that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
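论文统计检验中使用的 Holm-Bonferroni 校正是一个标准的逐步降压(step-down)多重比较过程,可以直接写成几行纯 Python:

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm-Bonferroni 校正:按 p 值升序依次与 alpha/(m-i) 比较,
    一旦某个假设不能拒绝,其后(更大 p 值的)假设全部不拒绝。
    返回与输入同序的拒绝标记列表。"""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# 四个两两比较的 p 值(模拟数据):只有 0.001 能通过逐步校正
flags = holm_bonferroni([0.001, 0.04, 0.03, 0.2])
```

与一次性的 Bonferroni(统一用 alpha/m)相比,逐步放宽的阈值在控制族错误率的同时保留了更高的检验功效。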

[CV-132] owards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation

【速读】:该论文旨在解决胸部计算机断层扫描(CT)图像自动诊断在临床部署中面临的两个核心问题:不同采集站点之间的分布偏移(distribution shift)以及不同人口统计学亚组(如性别)间的性能差异。为同时应对这两个挑战,作者提出了一种结合轻量级MobileViT-XXS切片编码器与双层SliceTransformer聚合器的框架,用于三维体积推理,并采用KL正则化的分组分布鲁棒优化(Group Distributionally Robust Optimization, Group DRO)目标函数进行训练。该方法通过自适应提升表现较差的采集中心和人口亚组权重,在保障最差群体性能的同时避免平均性能显著下降(由KL惩罚项防止组权重坍塌),从而实现公平性与整体性能的平衡。实验表明,该方案在两项任务上均显著优于现有基准,尤其在女性鳞状细胞癌等严重低频组合类别上提升明显。

链接: https://arxiv.org/abs/2603.15941
作者: Samuel Johnny,Blessed Guda,Frank Ebeledike,Goodness Obasi,Moise Busogi
机构: Carnegie Mellon University Africa, Kigali, Rwanda; Carnegie Mellon University, Pittsburgh, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated diagnosis from chest computed tomography (CT) scans faces two persistent challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups. We address both simultaneously across two complementary tasks: binary COVID-19 classification from multi-site CT volumes (Task 1) and four-class lung pathology recognition with gender-based fairness constraints (Task 2). Our framework combines a lightweight MobileViT-XXS slice encoder with a two-layer SliceTransformer aggregator for volumetric reasoning, and trains with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that adaptively upweights underperforming acquisition centres and demographic subgroups. Unlike standard Group DRO, the KL penalty prevents group weight collapse, providing a stable balance between worst-case protection and average performance. For Task 2, we define groups at the granularity of gender × class, directly targeting severely underrepresented combinations such as female Squamous cell carcinoma. On Task 1, our best configuration achieves a challenge F1 of 0.835, surpassing the best published challenge entry by +5.9 pp. On Task 2, Group DRO with α = 0.5 achieves a mean per-gender macro F1 of 0.815, outperforming the best challenge entry by +11.1 pp and improving Female Squamous F1 by +17.4 pp over the Focal Loss baseline.
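KL 正则化 Group DRO 的组权重有一个简洁的闭式形式:在最坏情况目标 max_q ⟨q, L⟩ 上加上 α·KL(q‖uniform) 惩罚后,最优组权重就是组损失的 softmax(温度为 α)。下面用 NumPy 演示这一性质,α 的取值仅为示意:

```python
import numpy as np

def kl_regularized_group_weights(group_losses, alpha=0.5):
    """KL 正则化 Group DRO 的组权重闭式解:softmax(L / alpha)。
    alpha 越大越接近均匀加权(ERM),越小越偏向最差组;
    KL 惩罚避免权重坍塌到单个组。"""
    L = np.asarray(group_losses, dtype=float)
    z = L / alpha
    z = z - z.max()          # 数值稳定
    q = np.exp(z)
    return q / q.sum()

w = kl_regularized_group_weights([1.2, 0.4, 0.9], alpha=0.5)
```

损失最大的组获得最大权重,但其余组的权重不会被压到零,这正是摘要中"防止组权重坍塌"的含义。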

[CV-133] Do Not Leave a Gap: Hallucination-Free Object Concealment in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对对抗性攻击时的脆弱性问题,特别是现有基于目标区域抑制的隐藏攻击方法会因引入语义断层而引发幻觉(hallucination),即模型生成看似合理但错误的对象。解决方案的关键在于提出一种新的背景一致的对象隐藏攻击(background-consistent object concealment attacks),其核心思想是通过重新编码目标对象的视觉表示,使其在统计和语义上与周围背景区域保持一致,从而避免因表示空洞导致的幻觉;同时,该方法保留了Transformer层中的token结构和注意力流,确保全局场景语义不变,实验证明该方法可有效隐藏目标对象并显著降低幻觉发生率(最多减少3倍),同时维持高达86%的非目标对象识别准确率。

链接: https://arxiv.org/abs/2603.15940
作者: Amira Guesmi,Muhammad Shafique
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have recently shown remarkable capabilities in visual understanding and generation, but remain vulnerable to adversarial manipulations of visual content. Prior object-hiding attacks primarily rely on suppressing or blocking region-specific representations, often creating semantic gaps that inadvertently induce hallucination, where models invent plausible but incorrect objects. In this work, we demonstrate that hallucination arises not from object absence per se, but from semantic discontinuity introduced by such suppression-based attacks. We propose a new class of background-consistent object concealment attacks, which hide target objects by re-encoding their visual representations to be statistically and semantically consistent with surrounding background regions. Crucially, our approach preserves token structure and attention flow, avoiding representational voids that trigger hallucination. We present a pixel-level optimization framework that enforces background-consistent re-encoding across multiple transformer layers while preserving global scene semantics. Extensive experiments on state-of-the-art vision-language models show that our method effectively conceals target objects while preserving up to 86% of non-target objects and reducing grounded hallucination by up to 3× compared to attention-suppression-based attacks.
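"背景一致的重编码"思想与风格迁移中的统计匹配(如 AdaIN)相近:把目标区域 token 的均值/方差替换为背景统计,而不是直接抑制。下面是一个纯统计层面的假设性示意,与论文基于逐像素优化的真实方法不同:

```python
import numpy as np

def background_consistent_fill(feat, mask):
    """将目标 token 的逐维统计替换为背景统计(均值/方差匹配),
    避免抑制式攻击留下的表示空洞。
    feat: (N, D) token 特征;mask: (N,) 目标 token 标记。"""
    bg, tgt = feat[~mask], feat[mask]
    mu_b, sd_b = bg.mean(0), bg.std(0) + 1e-6
    mu_t, sd_t = tgt.mean(0), tgt.std(0) + 1e-6
    out = feat.copy()
    out[mask] = (tgt - mu_t) / sd_t * sd_b + mu_b  # 逐维标准化后套用背景统计
    return out

rng = np.random.default_rng(2)
feat = rng.standard_normal((20, 4))
mask = np.zeros(20, dtype=bool); mask[:5] = True
feat[mask] += 5.0                     # 目标 token 与背景分布明显偏离
out = background_consistent_fill(feat, mask)
```

处理后目标区域的一阶/二阶统计与背景一致,直观上解释了为什么这类攻击不会像"挖空"式抑制那样诱发幻觉。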

[CV-134] Nodule-Aligned Latent Space Learning with LLM -Driven Multimodal Diffusion for Lung Nodule Progression Prediction

【速读】:该论文旨在解决肺癌早期诊断中因生物不确定性及对结节进展生物学机制理解不足所带来的挑战。其核心解决方案是提出一种名为Nodule-Aligned Multimodal (Latent) Diffusion (NAMD) 的新框架,关键在于构建一个结节对齐的潜在空间(nodule-aligned latent space),使得潜在表示之间的距离直接对应于结节特征的变化,并引入基于大语言模型(LLM)的控制机制,以患者和结节电子健康记录(EHR)为条件引导扩散模型生成1年随访的结节CT图像。该方法在NLST数据集上实现了优于基线和现有合成方法的恶性预测性能(AUROC: 0.805, AUPRC: 0.346),接近真实随访扫描的表现(AUROC: 0.819, AUPRC: 0.393),证明其能有效捕捉临床相关的结节进展特征,从而支持更早、更精准的肺癌诊断。

链接: https://arxiv.org/abs/2603.15932
作者: James Song,Yifan Wang,Chuan Zhou,Liyue Shen
机构: University of Michigan Medical School (密歇根大学医学院); Department of EECS, University of Michigan (密歇根大学电子工程与计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Early diagnosis of lung cancer is challenging due to biological uncertainty and the limited understanding of the biological mechanisms driving nodule progression. To address this, we propose Nodule-Aligned Multimodal (Latent) Diffusion (NAMD), a novel framework that predicts lung nodule progression by generating 1-year follow-up nodule computed tomography images with baseline scans and the patient’s and nodule’s Electronic Health Record (EHR). NAMD introduces a nodule-aligned latent space, where distances between latents directly correspond to changes in nodule attributes, and utilizes an LLM-driven control mechanism to condition the diffusion backbone on patient data. On the National Lung Screening Trial (NLST) dataset, our method synthesizes follow-up nodule images that achieve an AUROC of 0.805 and an AUPRC of 0.346 for lung nodule malignancy prediction, significantly outperforming both baseline scans and state-of-the-art synthesis methods, while closely approaching the performance of real follow-up scans (AUROC: 0.819, AUPRC: 0.393). These results demonstrate that NAMD captures clinically relevant features of lung nodule progression, facilitating earlier and more accurate diagnosis.

[CV-135] Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers

【速读】:该论文试图解决的问题是:结构稀疏性(structural sparsity)是否能够提升视觉Transformer模型的语义可解释性(semantic interpretability)。尽管已有研究指出权重稀疏性可能在语言模型中生成更紧凑的功能电路,但其对视觉模型可解释性的实际影响尚不明确。论文的关键解决方案在于提出一个多层次的评估框架——IMPACT,该框架从神经元、层表示、任务电路和模型级归因四个互补层面系统评估可解释性:通过BatchTopK稀疏自动编码器分析层表示,利用可学习节点掩码提取任务电路,并基于插入与删除指标衡量归因忠实度。实验结果表明,虽然稀疏模型确实减少了电路边数(约2.5倍),但活跃节点比例并未显著降低,且在神经元选择性、SAE特征可解释性和归因忠实度等方面均未表现出系统性提升,说明结构稀疏性本身不足以保证更好的可解释性,强调了超越电路紧凑性的多维评估的重要性。

链接: https://arxiv.org/abs/2603.15919
作者: Siyu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity itself leads to improved semantic interpretability. In this work, we systematically evaluate the relationship between weight sparsity and interpretability in Vision Transformers using DeiT-III B/16 models pruned with Wanda. To assess interpretability comprehensively, we introduce \textbfIMPACT, a multi-level framework that evaluates interpretability across four complementary levels: neurons, layer representations, task circuits, and model-level attribution. Layer representations are analyzed using BatchTopK sparse autoencoders, circuits are extracted via learnable node masking, and explanations are evaluated with transformer attribution using insertion and deletion metrics. Our results reveal a clear structural effect but limited interpretability gains. Sparse models produce circuits with approximately 2.5× fewer edges than dense models, yet the fraction of active nodes remains similar or higher, indicating that pruning redistributes computation rather than isolating simpler functional modules. Consistent with this observation, sparse models show no systematic improvements in neuron-level selectivity, SAE feature interpretability, or attribution faithfulness. These findings suggest that structural sparsity alone does not reliably yield more interpretable vision models, highlighting the importance of evaluation frameworks that assess interpretability beyond circuit compactness.
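论文用于获得稀疏模型的 Wanda 剪枝,其核心度量是权重幅值与对应输入激活范数的乘积 |W_ij|·‖X_j‖₂。下面是逐输出行剪枝的最小 NumPy 示意,校准激活为随机模拟数据:

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Wanda 剪枝示意:重要性 = |W_ij| * ||X_j||_2(X 为校准激活,
    列 j 对应输入通道),按行将得分最低的 sparsity 比例权重置零。"""
    score = np.abs(W) * np.linalg.norm(X, axis=0)   # (out, in) 与 (in,) 广播
    k = int(W.shape[1] * sparsity)                  # 每行裁掉 k 个
    pruned = W.copy()
    idx = np.argsort(score, axis=1)[:, :k]          # 每行得分最低的 k 个索引
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 8))    # 线性层权重
X = rng.standard_normal((16, 8))   # 16 条校准样本的输入激活
Wp = wanda_prune(W, X, sparsity=0.5)
```

与只看 |W| 的幅值剪枝不同,激活范数项使经常被大激活放大的权重更难被裁掉,这也是 Wanda 无需重训练的关键。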

[CV-136] Federated Learning for Privacy-Preserving Medical AI

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)分类任务中联邦学习(federated learning, FL)面临的三大问题:不切实际的数据划分方式、隐私保障不足以及缺乏充分的基准测试,从而限制了其在医疗领域的实际部署。其关键解决方案包括:一是提出一种站点感知的数据划分策略(site-aware data partitioning),确保数据划分尊重机构边界,更好地模拟多中心协作场景下的数据异质性;二是设计了一种自适应局部差分隐私机制(Adaptive Local Differential Privacy, ALDP),根据训练进程和参数特性动态调整隐私预算,显著优于传统固定噪声的局部差分隐私方法,在隐私与模型性能之间实现更优权衡。实证结果表明,结合FedProx优化算法与ALDP机制可在两客户端配置下达到80.4%准确率,较固定噪声局部差分隐私提升5–7个百分点,并展现出更强的训练稳定性,为医疗影像领域隐私保护型协同AI提供了可量化、可复现的方法论基础和实践指南。

链接: https://arxiv.org/abs/2603.15901
作者: Tin Hoang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: MSc Dissertation

点击查看摘要

Abstract:This dissertation investigates privacy-preserving federated learning for Alzheimer’s disease classification using three-dimensional MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Existing methodologies often suffer from unrealistic data partitioning, inadequate privacy guarantees, and insufficient benchmarking, limiting their practical deployment in healthcare. To address these gaps, this research proposes a novel site-aware data partitioning strategy that preserves institutional boundaries, reflecting real-world multi-institutional collaborations and data heterogeneity. Furthermore, an Adaptive Local Differential Privacy (ALDP) mechanism is introduced, dynamically adjusting privacy parameters based on training progression and parameter characteristics, thereby significantly improving the privacy-utility trade-off over traditional fixed-noise approaches. Systematic empirical evaluation across multiple client federations and privacy budgets demonstrated that advanced federated optimisation algorithms, particularly FedProx, could equal or surpass centralised training performance while ensuring rigorous privacy protection. Notably, ALDP achieved up to 80.4% accuracy in a two-client configuration, surpassing fixed-noise Local DP by 5-7 percentage points and demonstrating substantially greater training stability. The comprehensive ablation studies and benchmarking establish quantitative standards for privacy-preserving collaborative medical AI, providing practical guidelines for real-world deployment. This work thereby advances the state-of-the-art in federated learning for medical imaging, establishing both methodological foundations and empirical evidence necessary for future privacy-compliant AI adoption in healthcare.
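自适应局部差分隐私(ALDP)的基本流程是"裁剪敏感度 + 加高斯噪声,噪声尺度随训练进程调整"。下面给出一个线性衰减调度的假设性示意;论文的调度还依赖参数特性,此处从简:

```python
import numpy as np

def adaptive_ldp_noise(params, round_idx, total_rounds,
                       sigma_max=0.1, sigma_min=0.01, clip=1.0, rng=None):
    """ALDP 单步示意:先对参数更新做 L2 裁剪(限制单客户端敏感度),
    再加高斯噪声;噪声尺度 sigma 随训练轮次从 sigma_max 线性衰减到 sigma_min
    (具体调度与参数均为假设值)。返回 (加噪后的参数, 本轮 sigma)。"""
    rng = rng or np.random.default_rng()
    p = np.asarray(params, dtype=float)
    norm = np.linalg.norm(p)
    if norm > clip:
        p = p * (clip / norm)
    frac = round_idx / max(total_rounds - 1, 1)
    sigma = sigma_max + (sigma_min - sigma_max) * frac
    return p + rng.normal(0.0, sigma, size=p.shape), sigma

noisy, s0 = adaptive_ldp_noise(np.ones(10), 0, 100, rng=np.random.default_rng(4))
_, s_last = adaptive_ldp_noise(np.ones(10), 99, 100, rng=np.random.default_rng(4))
```

早期大噪声、后期小噪声的直觉是:训练初期梯度信息冗余、对噪声更鲁棒,可以把隐私预算留给更敏感的收敛后期。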

[CV-137] AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

【速读】:该论文旨在解决生成式 AI 在具身智能(Embodied AI)场景中进行交互式规划(Interactive Planning)的能力评估问题,尤其关注代理在执行过程中如何基于视觉观察动态调整计划,而非依赖导航或低级操作。其核心挑战在于现有基准测试往往混淆了推理与导航任务,或提供过于丰富的纠正反馈(如精确的物理控制信号),从而掩盖了模型对环境状态的理解和适应能力。解决方案的关键在于提出 AsgardBench 基准测试平台:它通过限制输入为图像、动作历史和轻量级成功/失败信号,排除低级控制噪声,在受控仿真环境中隔离出“交互式规划”这一能力维度;并通过系统性变化物体状态、放置位置和场景配置,构建具有条件分支的任务实例(共 108 个任务实例,覆盖 12 类任务),迫使模型根据实际视觉感知实时修正行动计划,从而精准衡量模型的视觉接地(Visual Grounding)与状态追踪能力。

链接: https://arxiv.org/abs/2603.15888
作者: Andrea Tupini,Lars Liden,Reuben Tan,Yu Wang,Jianfeng Gao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 19 figures, 6 tables, including appendix

点击查看摘要

Abstract:With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected? 

[CV-138] EvoIQA - Explaining Image Distortions with Evolved White-Box Logic

【速读】:该论文旨在解决传统图像质量评估(Image Quality Assessment, IQA)方法在可解释性与性能之间难以平衡的问题:一方面,手工设计的数学模型虽然结构清晰但泛化能力有限;另一方面,深度学习模型虽能实现高精度预测,却缺乏可解释性。其解决方案的关键在于提出EvoIQA框架,该框架基于遗传编程(Genetic Programming, GP)进行符号回归,通过演化生成显式、人类可读的数学公式来实现IQA建模。该方法利用来自VSI、VIF、FSIM和HaarPSI等指标的丰富终端集合,将结构、色彩及信息论退化特征映射为可观测的数学表达式,从而在保持高度可解释性的同时达到与先进深度学习模型相当的性能。

链接: https://arxiv.org/abs/2603.15887
作者: Ruchika Gupta,Illya Bakurov,Nathan Haut,Wolfgang Banzhaf
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:Traditional Image Quality Assessment (IQA) metrics typically fall into one of two extremes: rigid, hand-crafted mathematical models or “black-box” deep learning architectures that completely lack interpretability. To bridge this gap, we propose EvoIQA, a fully explainable symbolic regression framework based on Genetic Programming that Evolves explicit, human-readable mathematical formulas for image quality assessment (IQA). Utilizing a rich terminal set from the VSI, VIF, FSIM, and HaarPSI metrics, our framework inherently maps structural, chromatic, and information-theoretic degradations into observable mathematical equations. Our results demonstrate that the evolved GP models consistently achieve strong alignment between the predictions and human visual preferences. Furthermore, they not only outperform traditional hand-crafted metrics but also achieve performance parity with complex, state-of-the-art deep learning models like DB-CNN, proving that we no longer have to sacrifice interpretability for state-of-the-art performance.
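在这类符号回归框架中,候选公式的适应度通常用其输出与主观质量分数(MOS)的秩相关来衡量。下面用模拟数据演示一个假设的线性组合公式及其 Spearman 适应度计算;系数与数据均为虚构,真实公式由遗传编程演化得到:

```python
import numpy as np

def spearman(a, b):
    # Spearman 秩相关:先转秩次再算 Pearson(无并列秩的简化写法)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra @ rb) / (np.linalg.norm(ra) * np.linalg.norm(rb)))

def candidate_formula(fsim, vsi, haarpsi):
    # 一个"演化得到"的候选公式示意:由 FSIM/VSI/HaarPSI 等终端组合而成
    return 0.5 * fsim + 0.3 * vsi + 0.2 * haarpsi

rng = np.random.default_rng(5)
mos = rng.uniform(0, 1, 30)   # 模拟的主观质量分数
# 模拟三个指标:都与 MOS 相关,但带不同强度的噪声
fsim, vsi, haar = (mos + rng.normal(0, s, 30) for s in (0.1, 0.2, 0.3))
fitness = spearman(candidate_formula(fsim, vsi, haar), mos)
```

由于适应度只依赖秩次,演化压力鼓励与人类偏好单调一致的公式,而不强求绝对数值对齐。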

[CV-139] Self-supervised Disentanglement of Disease Effects from Aging in 3D Medical Shapes

【速读】:该论文旨在解决在诊断标签有限或不可用的情况下,如何从三维医学形态中分离病理变化与生理老化的问题,这对开发可解释的生物标志物和患者分层至关重要。其解决方案的关键在于提出一个两阶段框架:第一阶段利用带符号距离函数的隐式神经模型学习稳定的形状嵌入,并通过聚类获得伪疾病标签;第二阶段基于第一阶段发现的伪疾病标签和真实年龄标签,在紧凑的变分空间中通过多目标解耦损失(结合协方差和监督对比损失)实现因子解耦,从而在无需真实诊断标签的前提下有效分离疾病与年龄相关的形状变化,同时保持高保真重建、可控合成和基于因子的可解释性。

链接: https://arxiv.org/abs/2603.15862
作者: Jakaria Rabbi,Nilanjan Ray,Dana Cobzas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages

点击查看摘要

Abstract:Disentangling pathological changes from physiological aging in 3D medical shapes is crucial for developing interpretable biomarkers and patient stratification. However, this separation is challenging when diagnosis labels are limited or unavailable, since disease and aging often produce overlapping effects on shape changes, obscuring clinically relevant shape patterns. To address this challenge, we propose a two-stage framework combining unsupervised disease discovery with self-supervised disentanglement of implicit shape representations. In the first stage, we train an implicit neural model with signed distance functions to learn stable shape embeddings. We then apply clustering on the shape latent space, which yields pseudo disease labels without using ground-truth diagnosis during discovery. In the second stage, we disentangle factors in a compact variational space using pseudo disease labels discovered in the first stage and the ground truth age labels available for all subjects. We enforce separation and controllability with a multi-objective disentanglement loss combining covariance and a supervised contrastive loss. On ADNI hippocampus and OAI distal femur shapes, we achieve near-supervised performance, improving disentanglement and reconstruction over state-of-the-art unsupervised baselines, while enabling high-fidelity reconstruction, controllable synthesis, and factor-based explainability. Code and checkpoints are available at this https URL
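第一阶段"在形状潜在空间上聚类得到伪疾病标签"的思路,可用一个极简 k-means 在模拟嵌入上演示。真实流程作用于隐式 SDF 模型学到的形状嵌入,此处数据为两团高斯模拟:

```python
import numpy as np

def kmeans_pseudo_labels(Z, k=2, iters=50, seed=0):
    """极简 k-means:聚类 id 即伪疾病标签,全程不使用真实诊断标签。"""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), k, replace=False)]  # 随机样本初始化(拷贝)
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(Z[:, None] - centers[None], axis=-1)  # (N, k) 距离
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(0)
    return labels

rng = np.random.default_rng(6)
# 模拟两组分离良好的 5 维形状嵌入(前 20 条一组,后 20 条一组)
Z = np.vstack([rng.normal(0, 0.3, (20, 5)), rng.normal(3, 0.3, (20, 5))])
labels = kmeans_pseudo_labels(Z, k=2)
```

得到的伪标签随后与真实年龄标签一起作为第二阶段解耦损失的监督信号。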

[CV-140] FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding

【速读】:该论文旨在解决物理动作理解中缺乏对接触力信息的有效建模问题,尤其是在第一人称视角(egocentric)视频中,如何利用力信号提升对物体交互的细粒度理解与动作表征学习。其核心解决方案是构建首个大规模、自然场景下的力-视频同步数据集FEEL(Force-Enhanced Egocentric Learning),通过定制压阻手套采集力信号并与第一人称视频精确对齐,从而实现以力为先验的物理交互建模。关键创新在于:1)利用力作为驱动物理交互的根本因果变量,推动接触理解任务(如时间上接触分割和像素级接触物体分割)无需人工标注即可达到SOTA性能;2)将力预测设计为自监督预训练目标,有效提升视频骨干网络在多个动作识别基准(如EPIC-Kitchens、SomethingSomething-V2等)上的迁移能力,且完全不依赖人工标签。

链接: https://arxiv.org/abs/2603.15847
作者: Eadom Dessalene,Botao He,Michael Maynord,Yonatan Tussa,Pavan Mantripragada,Yianni Karabati,Nirupam Roy,Yiannis Aloimonos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and, (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.

[CV-141] Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation

【速读】:该论文旨在解决3D CT图像自动生成放射学报告时存在的病理覆盖不全问题,其核心症结在于视觉表征层面存在维度集中现象(dimensional concentration),即尽管对比学习得到的3D CT嵌入具备区分病理的能力,但其有效维度极低(仅约2个有效维度),导致生成模型难以充分捕捉病变信息。解决方案的关键在于提出AdaRAG-CT框架,通过引入可控检索机制获取补充文本信息,并在生成过程中选择性融合这些外部语义内容,从而突破视觉表征瓶颈,显著提升临床有效性(Clinical F1从0.420提升至0.480)。

链接: https://arxiv.org/abs/2603.15822
作者: Renjie Liang,Yiling Ma,Yang Xing,Zhengkang Fan,Jinqian Pan,Chengkun Sun,Li Li,Kuang Gong,Jie Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. We provide empirical evidence that this limitation stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals, yet exhibit severe dimensional concentration, with as few as 2 effective dimensions out of 512. Corroborating this, scaling the language model yields no measurable improvement, suggesting that the bottleneck lies in the visual representation rather than the generator. This bottleneck limits both generation and retrieval; naive static retrieval fails to improve clinical efficacy and can even degrade performance. We propose AdaRAG-CT, an adaptive augmentation framework that compensates for this visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation. On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm that both the retrieval and generation components contribute to the improvement. Code is available at this https URL.
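摘要中提到嵌入空间存在“维度集中”(仅约 2 个有效维度),但未给出具体的有效维度估计方法;下面以常见的参与率(participation ratio)作为一种示意性度量,演示 512 维嵌入的方差如何集中在极少数方向上(仅为假设性示例,非论文原始实现):

```python
import numpy as np

def effective_dimensions(embeddings: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum:
    (sum of eigenvalues)^2 / sum of squared eigenvalues.
    Near 1 means variance is concentrated in one direction;
    near D means variance is spread over all D axes."""
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = X.T @ X / max(len(X) - 1, 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eig.sum() ** 2 / (np.square(eig).sum() + 1e-12))

# Embeddings whose variance lives almost entirely in 2 of 512 axes
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512)) * 0.01
X[:, :2] += rng.normal(size=(1000, 2)) * 10.0
print(effective_dimensions(X))  # close to 2
```

当嵌入各维方差均匀时,该度量接近维度总数;当少数方向主导方差时,它会塌缩到很小的值,这正是“维度集中”现象的直观刻画。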

[CV-142] Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

【速读】:该论文旨在解决自动识别个体在临床场景中表现出的矛盾与犹豫状态(Ambivalence and Hesitancy, A/H)的问题,这类状态通常表现为言语、语音和面部表情之间的不一致。传统基于文本的方法因忽视多模态信息的协同作用而容易误判,导致对A/H的过度检测(高F1-AH)且难以准确确认其不存在(低F1-NoAH)。解决方案的关键在于提出一种冲突感知的多模态框架ConflictAwareAH:通过三个预训练编码器分别提取视频、音频和文本表征,并引入成对的冲突特征——即模态嵌入间的逐元素绝对差值,作为双向线索:显著的跨模态差异标志A/H状态,而较小差异则体现行为一致性并锚定负类样本。这一设计有效缓解了文本主导模型的偏差,使F1-NoAH提升4.6点,并结合文本引导的晚期融合策略进一步提升整体性能,最终在BAH数据集上实现0.715的Macro F1得分,优于现有基线超10点。

链接: https://arxiv.org/abs/2603.15818
作者: Salah Eddine Bekhouche,Hichem Telli,Azeddine Benlamoudi,Salah Eddine Herrouz,Abdelmalik Taleb-Ahmed,Abdenour Hadid
机构: University of the Basque Country UPV/EHU, San Sebastian, Spain; Laboratory of LESIA, University of Biskra, Algeria; Lab. de Génie Electrique (LAGE), University Kasdi Merbah Ouargla, Ouargla, Algeria; Institute of Electronics, Microelectronics and Nanotechnology (IEMN), Polytechnic University of Hauts-de-France, University of Lille, Valenciennes, France; Sorbonne Center for Artificial Intelligence, Sorbonne University Abu Dhabi, UAE
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels – saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the disagreements between what is said, how it sounds, and what the face shows. We present ConflictAwareAH, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features – element-wise absolute differences between modality embeddings – serve as bidirectional cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary text-guided late fusion strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches 0.694 Macro F1 on the labelled test split and 0.715 on the private leaderboard, outperforming published multimodal baselines by over 10 points – all on a single GPU in under 25 minutes of training.
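上文所述的“冲突特征”即各模态嵌入之间的逐元素绝对差。下面给出一个最小示意(假设三种模态嵌入已投影到同一维度;函数与变量名为说明而设,非论文官方实现):

```python
import numpy as np

def conflict_features(video: np.ndarray, audio: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Pairwise conflict cues: element-wise |a - b| for each modality pair,
    concatenated. Large values flag cross-modal disagreement (an A/H cue);
    small values indicate behavioural consistency."""
    return np.concatenate([
        np.abs(video - audio),
        np.abs(video - text),
        np.abs(audio - text),
    ], axis=-1)

v = np.array([1.0, 0.0]); a = np.array([1.0, 0.0]); t = np.array([-1.0, 0.0])
f = conflict_features(v, a, t)
print(f)  # [0. 0. 2. 0. 2. 0.] -- text disagrees with both video and audio
```

由于差值同时编码“冲突”与“一致”两种信息,这类特征既能标记 A/H,也能为负类(无 A/H)提供锚定信号。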

[CV-143] ModTrack: Sensor-Agnostic Multi-View Tracking via Identity-Informed PHD Filtering with Covariance Propagation

【速读】:该论文旨在解决多视角多目标跟踪(Multi-View Multi-Object Tracking, MV-MOT)中因视角变化和遮挡导致的身份一致性难以维持的问题,同时克服现有端到端方法缺乏不确定性建模且对传感器布局、模态或数据集依赖性强、泛化能力差的局限。其解决方案的关键在于提出一种模块化系统 ModTrack,将学习型方法仅限于检测与特征提取阶段,其余流程(融合、关联与跟踪)均采用闭式解析方法:通过将每个传感器输出统一为校准的位置-协方差对 (z, R),利用跨视图聚类和精度加权融合获得统一估计 (ẑ, R̂),并结合反馈耦合的身份感知高斯混合概率假设密度(GM-PHD)滤波器与隐马尔可夫模型(HMM)运动模式,在漏检和严重遮挡下仍能稳定维持身份,从而实现媲美端到端方法的性能,同时具备跨模态、传感器无关的迁移能力和可解释的不确定性表征。

链接: https://arxiv.org/abs/2603.15812
作者: Aditya Iyer,Jack Roberts,Nora Ayanian
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-View Multi-Object Tracking (MV-MOT) aims to localize and maintain consistent identities of objects observed by multiple sensors. This task is challenging, as viewpoint changes and occlusion disrupt identity consistency across views and time. Recent end-to-end approaches address this by jointly learning 2D Bird’s Eye View (BEV) representations and identity associations, achieving high tracking accuracy. However, these methods offer no principled uncertainty accounting and remain tightly coupled to their training configuration, limiting generalization across sensor layouts, modalities, or datasets without retraining. We propose ModTrack, a modular MV-MOT system that matches end-to-end performance while providing cross-modal, sensor-agnostic generalization and traceable uncertainty. ModTrack confines learning methods to just the Detection and Feature Extraction stage of the MV-MOT pipeline, performing all fusion, association, and tracking with closed-form analytical methods. Our design reduces each sensor’s output to calibrated position-covariance pairs (z, R); cross-view clustering and precision-weighted fusion then yield unified estimates (ẑ, R̂) for identity assignment and temporal tracking. A feedback-coupled, identity-informed Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter with HMM motion modes uses these fused estimates to maintain identities under missed detections and heavy occlusion. ModTrack achieves 95.5 IDF1 and 91.4 MOTA on WildTrack, surpassing all prior modular methods by over 21 points and rivaling the state-of-the-art end-to-end methods while providing deployment flexibility they cannot. Specifically, the same tracker core transfers unchanged to MultiviewX and RadarScenes, with only perception-module replacement required to extend to new domains and sensor modalities.
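摘要中的精度加权融合,即把各传感器的 (z, R) 融合为统一估计 (ẑ, R̂),通常对应逆协方差(精度)加权的闭式解;以下为基于该标准公式的示意实现(非论文官方代码):

```python
import numpy as np

def fuse(observations):
    """Precision-weighted fusion of position-covariance pairs (z_i, R_i):
    R_hat = (sum_i R_i^{-1})^{-1},  z_hat = R_hat @ sum_i R_i^{-1} z_i."""
    total_precision = sum(np.linalg.inv(R) for _, R in observations)
    R_hat = np.linalg.inv(total_precision)
    z_hat = R_hat @ sum(np.linalg.inv(R) @ z for z, R in observations)
    return z_hat, R_hat

# Two sensors see the same object; the second is 4x more certain.
obs = [(np.array([0.0, 0.0]), np.eye(2)),
       (np.array([1.0, 1.0]), 0.25 * np.eye(2))]
z_hat, R_hat = fuse(obs)
print(z_hat)   # [0.8 0.8] -- pulled toward the more precise sensor
print(R_hat)   # 0.2 * I   -- fused estimate is tighter than either input
```

注意融合后的协方差小于任一单传感器的协方差,这正是“可追溯不确定性”的来源:每一步估计都带有解析的置信度。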

[CV-144] Feed-forward Gaussian Registration for Head Avatar Creation and Editing

【速读】:该论文旨在解决当前多视角头部虚拟形象(head avatar)生成与编辑中存在的时间消耗大、流程复杂的问题,现有方法通常依赖耗时的头部追踪和昂贵的优化步骤,导致整体创建时间超过一天。其解决方案的关键在于提出MATCH(Multi-view Avatars from Topologically Corresponding Heads),一种基于拓扑对应头部的多视角高斯注册方法,能够仅用0.5秒/帧直接从校准的多视角图像中预测具有对应关系的高斯点纹理,无需数据预处理。该方法通过一个基于Transformer的端到端模型实现跨帧和跨主体的对应关系学习,并引入创新的注册引导注意力模块(registration-guided attention block),使UV图上的每个token仅关注对应网格区域的图像特征,显著提升效率与性能,从而在保持高质量的同时将头像生成速度提升至现有最优基线的10倍。

链接: https://arxiv.org/abs/2603.15811
作者: Malte Prinzler,Paulo Gotardo,Siyu Tang,Timo Bolkart
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Website: this https URL ; Video: this https URL

点击查看摘要

Abstract:We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh. To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline. The code and model weights are available on the project website.

[CV-145] Parallelised Differentiable Straightest Geodesics for 3D Meshes CVPR2026

【速读】:该论文旨在解决在网格(mesh)表示的黎曼流形上进行机器学习时,缺乏几何精确且可微分的运算方法这一关键问题。传统方法受限于闭式黎曼算子缺失、离散形式不可微以及并行化能力差等瓶颈,难以有效支持复杂几何结构上的学习任务。解决方案的核心在于提出一种基于“最直测地线”(straightest geodesics)的可微分指数映射(exponential map)框架,该框架不仅能够准确计算测地线路径和向量平行传输,还通过GPU并行实现与两种不同的反向传播策略——一种利用外嵌代理函数(extrinsic proxy function),另一种基于测地有限差分(geodesic finite differences scheme)——实现了高效的梯度传递。此方法显著提升了在一般几何体上的学习与优化性能,并成功应用于新型测地卷积层、流匹配(flow matching)方法及二阶优化器设计等场景。

链接: https://arxiv.org/abs/2603.15780
作者: Hippolyte Verninas,Caner Korkmaz,Stefanos Zafeiriou,Tolga Birdal,Simone Foti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can improve learning and optimisation pipelines on general geometries. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation. Our code, models, and pip-installable library (digeo) are available at: this http URL.

[CV-146] Domain Adaptation Without the Compute Burden for Efficient Whole Slide Image Analysis

【速读】:该论文旨在解决全切片图像(Whole Slide Images, WSI)在病理分析中因超高分辨率导致的端到端训练不现实的问题,以及现有方法依赖ImageNet预训练特征提取器所造成的领域特异性不足和任务特异性缺失问题。解决方案的关键在于提出EfficientWSI(eWSI),通过将参数高效微调(Parameter-Efficient-Fine-Tuning, PEFT)与多实例学习(Multiple Instance Learning, MIL)进行精心整合,实现基于WSI的端到端训练,从而在降低计算成本的同时增强任务特定信息的捕捉能力。实验表明,eWSI在多个公开数据集上表现出色,既可替代昂贵的域内预训练,又能在已有域内特征基础上进一步提升性能。

链接: https://arxiv.org/abs/2603.15774
作者: Umar Marikkar,Muhammad Awais,Sara Atito
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computational methods for analyzing Whole Slide Images (WSIs) enable early diagnosis and treatment by supporting pathologists in the detection and classification of tumors. However, the extremely high resolution of WSIs makes end-to-end training impractical compared to typical image analysis tasks. To address this, most approaches use pre-trained feature extractors to obtain fixed representations of whole slides, which are then combined with Multiple Instance Learning (MIL) for downstream tasks. These feature extractors are typically pre-trained on natural image datasets such as ImageNet, which fail to capture domain-specific characteristics. Although domain-specific pre-training on histopathology data yields more relevant feature representations, it remains computationally expensive and fails to capture task-specific characteristics within the domain. To address the computational cost and lack of task-specificity in domain-specific pre-training, we propose EfficientWSI (eWSI), a careful integration of Parameter-Efficient-Fine-Tuning (PEFT) and Multiple Instance Learning (MIL) that enables end-to-end training on WSI tasks. We evaluate eWSI on seven WSI-level tasks over the Camelyon16, TCGA and BRACS datasets. Our results show that eWSI, when applied with ImageNet feature extractors, yields strong classification performance, matching or outperforming MILs with in-domain feature extractors and alleviating the need for extensive in-domain pre-training. Furthermore, when eWSI is applied with in-domain feature extractors, it further improves classification performance in most cases, demonstrating its ability to capture task-specific information where beneficial. Our findings suggest that eWSI provides a task-targeted, computationally efficient path for WSI tasks, offering a promising direction for task-specific learning in computational pathology.
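摘要未给出 eWSI 所用 MIL 头的具体形式;下面以常见的注意力 MIL 池化(attention-based MIL pooling)为例,示意如何把一张切片的多个 patch 特征聚合为单个切片级表示(权重向量 w 在此用固定值代替学习参数,仅作演示,非论文的精确结构):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(patches: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Attention-based MIL pooling: each patch embedding h_i gets a scalar
    score w . tanh(h_i); a softmax over patches gives weights a_i, and the
    slide embedding is sum_i a_i * h_i."""
    scores = np.tanh(patches) @ w          # one score per patch
    a = softmax(scores)                    # attention weights sum to 1
    return a @ patches                     # weighted slide-level embedding

rng = np.random.default_rng(1)
patches = rng.normal(size=(64, 8))         # 64 patch embeddings of dim 8
slide_vec = attention_mil_pool(patches, w=np.ones(8))
print(slide_vec.shape)  # (8,)
```

这种聚合使得下游分类器只需处理一个切片级向量,同时注意力权重可回溯到具体 patch,便于解释。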

[CV-147] CLRNet: Targetless Extrinsic Calibration for Camera Lidar and 4D Radar Using Deep Learning

【速读】:该论文旨在解决多模态传感器(相机、激光雷达和4D雷达)之间的外参标定问题,尤其针对雷达数据稀疏导致的标定精度不足难题。其核心解决方案是提出CLRNet——一种端到端的深度学习标定网络,通过引入等距圆柱投影(equirectangular projection)、基于相机的深度图预测、额外的雷达通道,并利用激光雷达在共享特征空间中的表示及回环闭合损失(loop closure loss),实现相机-激光雷达-雷达三者联合标定或任意两者间的配对标定,从而显著提升标定精度,在View-of-Delft和Dual-Radar数据集上将中值平移和旋转误差均降低至少50%。

链接: https://arxiv.org/abs/2603.15767
作者: Marcell Kegl,Andras Palffy,Csaba Benedek,Dariu M. Gavrila
机构: HUN-REN SZTAKI(匈牙利研究网络信息科学与控制研究所); Pázmány Péter Catholic University(帕兹曼尼·彼得天主教大学); TU Delft(代尔夫特理工大学); Perciv AI(Perceiv AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Intelligent Vehicles

点击查看摘要

Abstract:In this paper, we address extrinsic calibration for camera, lidar, and 4D radar sensors. Accurate extrinsic calibration of radar remains a challenge due to the sparsity of its data. We propose CLRNet, a novel, multi-modal end-to-end deep learning (DL) calibration network capable of addressing joint camera-lidar-radar calibration, or pairwise calibration between any two of these sensors. We incorporate equirectangular projection, camera-based depth image prediction, additional radar channels, and leverage lidar with a shared feature space and loop closure loss. In extensive experiments using the View-of-Delft and Dual-Radar datasets, we demonstrate superior calibration accuracy compared to existing state-of-the-art methods, reducing both median translational and rotational calibration errors by at least 50%. Finally, we examine the domain transfer capabilities of the proposed network and baselines, when evaluating across datasets. The code will be made publicly available upon acceptance at: this https URL.
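CLRNet 在网络中引入了等距圆柱投影(equirectangular projection);以下示意将 3D 点映射为方位角-俯仰角像素坐标的常见做法(坐标轴与朝向约定为假设,未必与论文一致):

```python
import numpy as np

def equirectangular(points: np.ndarray, width: int, height: int) -> np.ndarray:
    """Project 3D points (x, y, z) to equirectangular pixel coordinates.
    Azimuth = atan2(y, x) spans the image width; elevation = asin(z / r)
    spans the height. Axis conventions here are illustrative only."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    az = np.arctan2(y, x)                                      # [-pi, pi]
    el = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1, 1))    # [-pi/2, pi/2]
    u = (az + np.pi) / (2 * np.pi) * (width - 1)
    v = (np.pi / 2 - el) / np.pi * (height - 1)
    return np.stack([u, v], axis=1)

pts = np.array([[1.0, 0.0, 0.0],    # straight ahead, on the horizon
                [0.0, 0.0, 1.0]])   # straight up
print(equirectangular(pts, width=101, height=51))
```

这种投影把稀疏的激光雷达/雷达点云铺到一张稠密的 2D 网格上,使其可以与相机图像共用卷积骨干网络。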

[CV-148] GLANCE: Gaze-Led Attention Network for Compressed Edge-inference

【速读】:该论文旨在解决增强现实/虚拟现实(AR/VR)系统中实时目标检测面临的计算资源受限问题,特别是如何在亚10毫秒延迟和严格功耗预算下实现高效的目标检测。其解决方案的关键在于提出一种两阶段流水线:首先利用可微分无权神经网络(differentiable weightless neural networks)通过内存查找而非乘累加(MAC)运算实现超高效的注视点估计(angular error为8.32°,仅需393 MACs和2.2 KiB内存),随后基于注视引导的感兴趣区域(Region-of-Interest, ROI)进行目标检测,从而将整体计算负担降低40–50%,能耗减少65%。该方法在Arduino Nano 33 BLE平台上实现了48.1%的COCO mAP(attended objects上达51.8%),同时满足亚10毫秒延迟要求,显著优于全局YOLOv12n基线模型,并验证了以内存为中心的架构结合显式注意力建模在资源受限可穿戴平台上的优越性。

链接: https://arxiv.org/abs/2603.15717
作者: Neeraj Solanki,Hong Ding,Sepehr Tabrizchi,Ali Shafiee Sarvestani,Shaahin Angizi,David Z. Pan,Arman Roohi
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); New Jersey Institute of Technology (新泽西理工学院); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10 ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate computations, achieving an angular error of 8.32° with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50% and energy consumption by 65%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1% mAP on COCO (51.8% on attended objects) while maintaining sub-10 ms latency, meeting stringent AR/VR requirements by improving the communication time by a factor of 177. Compared to the global YOLOv12n baseline, which achieves 39.2%, 63.4%, and 83.1% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3%, 72.1%, and 88.1% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
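两阶段流水线中,注视点估计结果用于裁剪检测的感兴趣区域(ROI),检测器只处理画幅的一小部分。下面的示意函数演示如何把检测窗口夹取(clamp)在画幅内(窗口尺寸等参数仅为示例):

```python
def roi_from_gaze(gx, gy, frame_w, frame_h, roi_w, roi_h):
    """Clamp a detection window of size roi_w x roi_h around the gaze
    point (gx, gy) so it stays inside the frame.
    Returns (x0, y0, x1, y1)."""
    x0 = min(max(gx - roi_w // 2, 0), frame_w - roi_w)
    y0 = min(max(gy - roi_h // 2, 0), frame_h - roi_h)
    return x0, y0, x0 + roi_w, y0 + roi_h

# 640x480 frame, 320x240 attended window: the detector sees 25% of the pixels.
box = roi_from_gaze(600, 20, 640, 480, 320, 240)
print(box)  # (320, 0, 640, 240) -- clamped to the top-right corner
```

窗口面积与画幅面积之比直接决定了检测阶段节省的像素量,这与摘要中报告的 40–50% 计算量下降是同一量级的直觉。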

[CV-149] Transition Flow Matching

【速读】:该论文旨在解决主流流匹配(Flow Matching)方法依赖局部速度场学习导致生成过程需多步积分的问题。其解决方案的关键在于提出一种新范式,直接学习“过渡流”(Transition Flow),该量作为全局变量天然支持单步生成或任意时间点的生成,并通过理论分析揭示其与均值速度流(Mean Velocity Flow)的内在联系,从而构建统一的理论框架。

链接: https://arxiv.org/abs/2603.15689
作者: Chenrui Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mainstream flow matching methods typically focus on learning the local velocity field, which inherently requires multiple integration steps during generation. In contrast, Mean Velocity Flow models establish a relationship between the local velocity field and the global mean velocity, enabling the latter to be learned through a mathematically grounded formulation and allowing generation to be transferred to arbitrary future time points. In this work, we propose a new paradigm that directly learns the transition flow. As a global quantity, the transition flow naturally supports generation in a single step or at arbitrary time points. Furthermore, we demonstrate the connection between our approach and Mean Velocity Flow, establishing a unified theoretical perspective. Extensive experiments validate the effectiveness of our method and support our theoretical claims.

[CV-150] DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

【速读】:该论文旨在解决多模态大语言模型(Omnimodal Large Language Models, OmniLLMs)在处理音频-视觉流时,因生成长序列的多模态标记(token)而导致推理成本过高的问题。现有压缩方法依赖固定窗口划分和基于注意力机制的剪枝策略,忽视了音频-视觉信号的分段语义结构,在激进的标记削减下易失效。其解决方案的关键在于提出一种无需训练的动态语义分块框架——DASH(Dynamic Audio-driven Semantic cHunking),该框架以音频嵌入作为语义锚点,通过余弦相似度突变检测边界候选点,从而生成与语义结构对齐的动态、可变长度片段;进一步将这些边界投影至视频标记以实现跨模态分割,并在每个片段内利用三信号重要性估计器(融合结构边界线索、表征独特性和注意力显著性)进行精细化标记保留,有效缓解纯注意力选择导致的稀疏偏差,从而在保持关键过渡信息的同时减少冗余区域,显著提升压缩比并维持优异性能。

链接: https://arxiv.org/abs/2603.15685
作者: Bingzhou Li,Tao Huang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: this https URL.
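DASH 的边界检测基于相邻音频嵌入余弦相似度的突变;以下为该思路的最小示意实现(阈值等均为假设值,非论文的精确流程):

```python
import numpy as np

def semantic_boundaries(audio_emb: np.ndarray, threshold: float) -> list:
    """Boundary candidates where cosine similarity between consecutive
    audio embeddings drops below `threshold` (a similarity discontinuity).
    Returns indices i such that a new segment starts at frame i."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    sims = np.sum(a[:-1] * a[1:], axis=1)   # cosine sim of adjacent frames
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# Two coherent segments with an abrupt semantic change at frame 3
emb = np.array([[1, 0], [0.99, 0.1], [0.98, 0.2],
                [0, 1], [0.1, 0.99]], dtype=float)
print(semantic_boundaries(emb, threshold=0.5))  # [3]
```

得到的边界再投影到视频 token 上,即可获得与语义结构对齐的变长片段,替代固定窗口划分。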

[CV-151] IdentityGuard: Context-Aware Restriction and Provenance for Personalized Synthesis ICASSP

【速读】:该论文旨在解决个性化文本到图像生成模型(text-to-image models)中存在的安全挑战,即传统全局无上下文感知的过滤方法在防止滥用时会误伤模型的广泛用途,导致概念被完全擦除,产生不可接受的副作用。解决方案的关键在于提出一种更具针对性的安全机制——IDENTITYGUARD,其核心是将安全性与个性化概念紧密绑定:通过条件性限制仅在特定身份(personalized identity)与有害内容结合时触发拦截,并引入概念特异性水印以实现精准溯源。该方法在保障模型功能不受损的同时,提升了对滥用行为的防范能力与可追溯性。

链接: https://arxiv.org/abs/2603.15679
作者: Lingyun Zhang,Yu Xie,Ping Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures, Accepted to ICASSP

点击查看摘要

Abstract:The nature of personalized text-to-image models poses a unique safety challenge that generic context-blind methods are ill-equipped to handle. Such global filters create a dilemma: to prevent misuse, they are forced to damage the model’s broader utility by erasing concepts entirely, causing unacceptable collateral damage. This work presents a more precisely targeted approach, built on the principle that security should be as context-aware as the threat itself, intrinsically bound to the personalized concept. We present IDENTITYGUARD, which realizes this principle through a conditional restriction that blocks harmful content only when combined with the personalized identity, and a concept-specific watermark for precise traceability. Experiments show our approach prevents misuse while preserving the model’s utility and enabling robust traceability. By moving beyond blunt, global filters, our work demonstrates a more effective and responsible path toward AI safety.

[CV-152] OrthoAI v2: From Single-Agent Segmentation to Dual-Agent Treatment Planning for Clear Aligners

【速读】:该论文旨在解决生成式AI (Generative AI) 在隐形矫治器治疗规划中精度不足、缺乏多维度评估及无法实现动态模拟的问题。其关键解决方案在于:(1)引入第二代理,采用条件热图回归方法(Conditioned Heatmap Regression Methodology, CHARM),实现无需分割的牙位标志点精准检测,并与第一代理通过置信度加权协调器融合;(2)构建包含生物力学、分期、附件、邻面去釉、咬合和可预测性六类指标的复合评分模型,替代原版本单一通过/失败判定;(3)开发多帧治疗模拟器,基于SLERP插值和循证分期规则生成时序一致的6自由度牙齿运动轨迹,支持ClinCheck 4D可视化,显著提升规划质量与临床实用性。

链接: https://arxiv.org/abs/2603.15663
作者: Lansiaux Edouard,Leman Margaux
机构: STaR-AI Research Group (STaR-AI 研究组); Lille University Hospital (里尔大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present OrthoAI v2, the second iteration of our open-source pipeline for AI-assisted orthodontic treatment planning with clear aligners, substantially extending the single-agent framework previously introduced. The first version established a proof-of-concept based on Dynamic Graph Convolutional Neural Networks (DGCNN) for tooth segmentation but was limited to per-tooth centroid extraction, lacked landmark-level precision, and produced a scalar quality score without staging simulation. OrthoAI v2 addresses all three limitations through three principal contributions: (i) a second agent adopting the Conditioned Heatmap Regression Methodology (CHARM) [Rodriguez et al., 2025] for direct, segmentation-free dental landmark detection, fused with Agent 1 via a confidence-weighted orchestrator in three modes (parallel, sequential, single-agent); (ii) a composite six-category biomechanical scoring model (biomechanics × 0.30 + staging × 0.20 + attachments × 0.15 + IPR × 0.10 + occlusion × 0.10 + predictability × 0.15) replacing the binary pass/fail check of v1; (iii) a multi-frame treatment simulator generating F = A × r temporally coherent 6-DoF tooth trajectories via SLERP interpolation and evidence-based staging rules, enabling ClinCheck 4D visualisation. On a synthetic benchmark of 200 crowding scenarios, the parallel ensemble of OrthoAI v2 reaches a planning quality score of 92.8 ± 4.1 vs. 76.4 ± 8.3 for OrthoAI v1, a +21% relative gain, while maintaining full CPU deployability (4.2 ± 0.8 s).
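摘要给出的六类复合评分是固定权重的线性加权(权重之和为 1);下面按该公式写出示意实现(输入量纲假定为 0–100 分):

```python
def composite_plan_score(biomechanics, staging, attachments,
                         ipr, occlusion, predictability):
    """Weighted six-category planning score from the abstract:
    biomechanics*0.30 + staging*0.20 + attachments*0.15
    + IPR*0.10 + occlusion*0.10 + predictability*0.15."""
    return (0.30 * biomechanics + 0.20 * staging + 0.15 * attachments
            + 0.10 * ipr + 0.10 * occlusion + 0.15 * predictability)

# Weights sum to 1.0, so a plan scoring 100 in every category scores 100.
print(composite_plan_score(90, 85, 80, 75, 70, 88))
```

相比 v1 的二值通过/失败判定,这种连续评分能区分“勉强可行”与“高质量”的方案,并指出哪一维度拖低了总分。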

[CV-153] Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors CVPR2026

【速读】:该论文旨在解决神经网络模型在受损样本的非鲁棒特征上表现不可靠的问题,此类问题通常导致模型性能下降,而传统修复方法依赖繁琐的数据清洗和模型重训练,计算与人工成本高昂。其解决方案的关键在于提出一种基于秩一模型编辑(rank-one model editing)的归因引导型模型修正框架(attribution-guided model rectification framework),通过量化各层的可编辑性(editability)并定位最可能导致误判的层,实现对模型不可靠行为的精准纠正;实验表明,该方法仅需单个清洁样本即可达成修正目标,显著降低了对大规模清洁数据集的依赖,提升了模型修正的效率与实用性。

链接: https://arxiv.org/abs/2603.15656
作者: Peiyu Yang,Naveed Akhtar,Jiantong Jiang,Ajmal Mian
机构: The University of Melbourne(墨尔本大学); The University of Western Australia(西澳大利亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:The performance of neural network models deteriorates due to their unreliable behavior on non-robust features of corrupted samples. Owing to their opaque nature, rectifying models to address this problem often necessitates arduous data cleaning and model retraining, resulting in huge computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects model unreliable behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck of model rectifying arising from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an attribution-guided layer localization method that quantifies layer-wise editability and identifies the layer most responsible for unreliabilities. Extensive experiments demonstrate the effectiveness of our method in correcting unreliabilities observed for neural Trojans, spurious correlations and feature leakage. Our method shows remarkable performance by achieving its editing objective with as few as a single cleansed sample, which makes it appealing for practice.

[CV-154] Discovering the Hidden Role of Gini Index In Prompt-based Classification

【速读】:该论文旨在解决长尾分布下少数类(minority classes)在分类任务中准确率持续偏低的问题,这类类别的预测往往最为关键,却常被高准确率的多数类所掩盖。其核心问题是揭示并优化类别间准确率的不均衡性,从而提升模型对弱势类别的识别能力。解决方案的关键在于引入基尼指数(Gini Index)作为检测和优化此类偏差的工具:一方面,Gini指数可量化相对准确率的不平衡程度;另一方面,它也可直接用作优化目标,驱动模型在后处理阶段实现无模型依赖的偏差缓解。通过实证分析与多场景实验(如少样本新闻、生物医学文本及零样本图像分类),作者提出了一种基于Gini指标的后处理去偏方法,显著降低了相对与绝对准确率差异,有效削弱了主导类别的优势并提升了最弱类别的性能。

链接: https://arxiv.org/abs/2603.15654
作者: Ruixi Lin
机构: National University of Singapore (新加坡国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In classification tasks, the long-tailed minority classes usually offer the predictions that are most important. Yet these classes consistently exhibit low accuracies, whereas a few high-performing classes dominate the game. We pursue a foundational understanding of the hidden role of Gini Index as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. We introduce the intuitions, benchmark Gini scores in real-world LLMs and vision models, and thoroughly discuss the insights of Gini not only as a measure of relative accuracy dominance but also as a direct optimization metric. Through rigorous case analyses, we first show that weak to strong relative accuracy imbalance exists in both prompt-based, text and image classification results and regardless of whether the classification is high-dimensional or low-dimensional. Then, we harness the Gini metric to propose a post-hoc model-agnostic bias mitigation method. Experimental results across few-shot news, biomedical, and zero-shot image classification show that our method significantly reduces both relative and absolute accuracy imbalances, minimizing top class relative dominance while elevating weakest classes.
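以每类准确率的基尼指数(Gini Index)度量类别间不平衡,可用排序值上的经典闭式计算;以下为示意实现(论文的具体定义以原文为准):

```python
import numpy as np

def gini_index(class_accuracies) -> float:
    """Gini index of per-class accuracies: 0 means perfectly balanced
    classes; values toward 1 mean a few classes dominate. Uses the
    standard closed form on sorted values:
    G = sum_i (2i - n - 1) a_i / (n * sum_i a_i)."""
    a = np.sort(np.asarray(class_accuracies, dtype=float))
    n = a.size
    if a.sum() == 0:
        return 0.0
    return float((2 * np.arange(1, n + 1) - n - 1) @ a / (n * a.sum()))

print(gini_index([0.9, 0.9, 0.9, 0.9]))    # ~0 -- balanced classes
print(gini_index([0.95, 0.9, 0.3, 0.2]))   # higher -- imbalanced classes
```

作为标量,基尼指数既可用于检测相对准确率失衡,也可如文中那样直接作为后处理去偏的优化目标。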

[CV-155] How to Achieve Prototypical Birth and Death for OOD Detection?

【Quick Read】: This paper addresses the limited adaptability of existing prototype-based learning methods for out-of-distribution (OOD) detection: they rely on a fixed number of prototypes and cannot adjust that number to the varying intra-class complexity of the data, which hurts the compactness and separability of in-distribution (ID) embeddings. The key to the proposed PID (Prototype bIrth and Death), inspired by cell proliferation and apoptosis in biology, is a pair of dynamic mechanisms: a birth mechanism that detects overloaded prototypes and instantiates new ones in under-represented regions to capture intra-class substructure, and a death mechanism that evaluates prototype discriminability and prunes prototypes with ambiguous boundaries to sharpen the decision boundary. Working together, the two mechanisms let the prototype count adapt to data complexity, significantly improving OOD detection and reaching state-of-the-art performance, especially on the FPR95 metric.

Link: https://arxiv.org/abs/2603.15650
Authors: Ningkang Peng, Qianfeng Yu, Xiaoqian Peng, Linjing Qian, Yafei Liu, Canran Xiao, Xinyu Lu, Tingyu Lu, Zhichao Zheng, Yanhui Gu
Affiliations: Nanjing Normal University; Nanjing University of Chinese Medicine; Sun Yat-sen University; Tohoku University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Out-of-Distribution (OOD) detection is crucial for the secure deployment of machine learning models, and prototype-based learning methods are among the mainstream strategies for achieving OOD detection. Existing prototype-based learning methods generally rely on a fixed number of prototypes. This static assumption fails to adapt to the inherent complexity differences across various categories. Currently, there is still a lack of a mechanism that can adaptively adjust the number of prototypes based on data complexity. Inspired by the processes of cell birth and death in biology, we propose a novel method named PID (Prototype bIrth and Death) to adaptively adjust the prototype count based on data complexity. This method relies on two dynamic mechanisms during the training process: prototype birth and prototype death. The birth mechanism instantiates new prototypes in data regions with insufficient representation by identifying the overload level of existing prototypes, thereby meticulously capturing intra-class substructures. Conversely, the death mechanism reinforces the decision boundary by pruning prototypes with ambiguous class boundaries through evaluating their discriminability. Through birth and death, the number of prototypes can be dynamically adjusted according to the data complexity, leading to the learning of more compact and better-separated In-Distribution (ID) embeddings, which significantly enhances the capability to detect OOD samples. Experiments demonstrate that our dynamic method, PID, significantly outperforms existing methods on benchmarks such as CIFAR-100, achieving State-of-the-Art (SOTA) performance, especially on the FPR95 metric.
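A toy illustration of the birth/death idea on 1-D embeddings. The overload and discriminability criteria below are simplified stand-ins for the paper's mechanisms, and the thresholds are arbitrary assumptions:

```python
import numpy as np

def birth_death_step(X, protos, spread_thresh=0.5, min_load=2):
    """One illustrative birth/death update on 1-D embeddings X.

    Birth: a prototype whose assigned points are too spread out spawns a
    new prototype at its farthest assigned point. Death: a prototype
    claiming fewer than min_load points is pruned as uninformative.
    (Toy criteria; PID's actual overload/discriminability measures differ.)
    """
    d = np.abs(X[:, None] - protos[None, :])   # point-prototype distances
    assign = d.argmin(axis=1)
    kept, born = [], []
    for k, p in enumerate(protos):
        pts = X[assign == k]
        if len(pts) < min_load:                # death: underloaded prototype
            continue
        kept.append(p)
        if np.abs(pts - p).mean() > spread_thresh:   # birth: overloaded spread
            born.append(pts[np.argmax(np.abs(pts - p))])
    return np.array(kept + born)

# Two clusters around 0 and 5; one prototype at 0 and a stray one at 20.
X = np.array([0.0, 0.1, -0.1, 4.9, 5.0, 5.1])
protos = np.array([0.0, 20.0])
print(birth_death_step(X, protos))   # stray prototype dies, a new one is born near 5
```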

[CV-156] Improving Generative Adversarial Network Generalization for Facial Expression Synthesis

【Quick Read】: This paper targets the limited generalization of existing conditional generative adversarial networks (cGANs) for facial expression synthesis, whose performance degrades markedly when test images differ from the training distribution. The key to the proposed Regression GAN (RegGAN) is an intermediate representation that improves adaptation to out-of-distribution samples: a regression layer with local receptive fields minimizes reconstruction error under a ridge regression loss to capture expression details precisely, while an adversarially trained refinement network enhances the realism of the generated images. This two-stage design markedly improves expression quality and identity preservation on cross-domain images such as celebrity photos, statues, and avatar renderings.

Link: https://arxiv.org/abs/2603.15648
Authors: Arbish Akram, Nazar Khan, Arif Mahmood
Affiliations: University of the Punjab, Lahore, Pakistan; Information Technology University, Lahore, Pakistan
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fréchet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.
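The ridge regression loss behind RegGAN's regression layer admits a closed-form solution, W = (XᵀX + λI)⁻¹XᵀY. A generic sketch of that closed form (the model's actual layer uses local receptive fields, not this dense formulation):

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# With noiseless targets and a tiny penalty, the true mapping is recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W_true = rng.normal(size=(5, 3))
Y = X @ W_true
W = ridge_fit(X, Y, lam=1e-6)
print(np.allclose(W, W_true, atol=1e-3))   # → True
```

The λ penalty keeps the solve well-conditioned, which is why ridge (rather than plain least squares) is a natural choice for reconstructing fine expression details.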

[CV-157] Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

【Quick Read】: This paper investigates how vision-language models (VLMs) can assist people with blindness and low vision (pBLV) in navigation. The key to its approach is a systematic evaluation of leading closed- and open-source VLMs on foundational visual skills (counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding scene understanding) and on simulated real-world navigation tasks. The study finds that GPT-4o clearly outperforms the other models, particularly in spatial reasoning and scene understanding, while open-source models are limited by weak adaptability to complex environments and imprecise spatial perception. The authors further argue that tighter alignment with human feedback and stronger spatial reasoning would make VLMs substantially more useful for pBLV navigation assistance.

Link: https://arxiv.org/abs/2603.15624
Authors: Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher, Marco Mezzavilla, Yao Wang, Sundeep Rangan, Maurizio Porfiri, Zhou Yu, John-Ross Rizzo
Affiliations: Columbia University; New York University; Politecnico di Milano
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.

[CV-158] SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning

【Quick Read】: This paper tackles the computational inefficiency of Neural Radiance Fields (NeRF) caused by dense ray sampling during volume rendering. The core of the proposed SAC-NeRF is a reinforcement-learning framework that learns an adaptive sampling policy with Soft Actor-Critic (SAC), formulating sampling as a Markov decision process so the agent can allocate samples according to scene characteristics. Its key components are: (1) a Gaussian-mixture color model that provides uncertainty estimates; (2) a multi-component reward balancing rendering quality, efficiency, and consistency; and (3) a two-stage training strategy that mitigates environment non-stationarity. Experiments show the method cuts sampling points by 35-48% while keeping visual quality within 0.3-0.8 dB PSNR of dense-sampling baselines.

Link: https://arxiv.org/abs/2603.15622
Authors: Chenyu Ge
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.
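The multi-component reward can be sketched as a weighted sum of quality, efficiency, and consistency terms. The component definitions, normalization, and weights below are illustrative assumptions, not SAC-NeRF's actual values:

```python
def sampling_reward(psnr, n_samples, n_max, frame_consistency,
                    w_q=1.0, w_e=0.5, w_c=0.25):
    """Illustrative multi-component reward for adaptive ray sampling.

    Rewards rendering quality (PSNR, normalized to a 40 dB ceiling),
    sample efficiency (fraction of the budget left unused), and a
    consistency score in [0, 1]. Weights are assumptions for this sketch.
    """
    quality = min(psnr / 40.0, 1.0)
    efficiency = 1.0 - n_samples / n_max
    return w_q * quality + w_e * efficiency + w_c * frame_consistency

# A policy that nearly matches dense quality with far fewer samples
# should be rewarded more than the dense baseline.
dense = sampling_reward(psnr=32.0, n_samples=192, n_max=192, frame_consistency=0.9)
sparse = sampling_reward(psnr=31.5, n_samples=100, n_max=192, frame_consistency=0.9)
print(sparse > dense)   # → True
```

Balancing the weights controls the quality/efficiency trade-off the learned policy converges to.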

[CV-159] HistoAtlas: A Pan-Cancer Morphology Atlas Linking Histomics to Molecular Programs and Clinical Outcomes

【Quick Read】: This paper addresses the problem of systematically mining biologically meaningful and clinically useful morphological biomarkers from routine hematoxylin-and-eosin (H&E) pathology slides, which traditionally depends on subjective expert judgment or specialized molecular assays and is hard to scale, quantify, and interpret. The key to the solution is HistoAtlas, a pan-cancer computational atlas that automatically extracts 38 interpretable histomic features and links them to survival, gene expression, somatic mutations, and immune subtypes through covariate-adjusted, multiple-testing-corrected association analyses graded by evidence strength, yielding a traceable, statistically calibrated, and openly queryable pipeline from H&E images to functional mechanisms.

Link: https://arxiv.org/abs/2603.16587
Authors: Pierre-Antoine Bannier
Affiliations: unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:We present HistoAtlas, a pan-cancer computational atlas that extracts 38 interpretable histomic features from 6,745 diagnostic H&E slides across 21 TCGA cancer types and systematically links every feature to survival, gene expression, somatic mutations, and immune subtypes. All associations are covariate-adjusted, multiple-testing corrected, and classified into evidence-strength tiers. The atlas recovers known biology, from immune infiltration and prognosis to proliferation and kinase signaling, while uncovering compartment-specific immune signals and morphological subtypes with divergent outcomes. Every result is spatially traceable to tissue compartments and individual cells, statistically calibrated, and openly queryable. HistoAtlas enables systematic, large-scale biomarker discovery from routine H&E without specialized staining or sequencing. Data and an interactive web atlas are freely available at this https URL .
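Multiple-testing correction of the kind applied to the atlas's association tests is commonly done with the Benjamini-Hochberg step-up procedure. A minimal sketch (the atlas's exact correction method and evidence tiers are not specified here):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Mark p-values rejected at FDR level alpha using the
    Benjamini-Hochberg step-up procedure. Returns a boolean list
    aligned with the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    # Reject all hypotheses up to that rank.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5]
print(benjamini_hochberg(pvals))   # only the two smallest survive at alpha=0.05
```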

[CV-160] LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting CVPR

【Quick Read】: This paper addresses the need of ground-based time-domain observatories for minute-level, site-scale awareness of cloud cover, given that existing all-sky datasets are short, daylight-biased, or lack astrometric calibration. The key to the solution is LenghuSky-8, an eight-year (2018-2025) all-sky dataset of 429,620 frames at 512×512 resolution, 81.2% of them night-time, with star-aware cloud masks, background masks, and per-pixel altitude-azimuth (Alt-Az) calibration. A linear probe trained on DINOv3 local features achieves accurate cloud segmentation (93.3% ± 1.1%), and stellar astrometry maps each pixel to local Alt-Az coordinates with calibration uncertainties of about 0.37° at zenith and 1.34° at 30° altitude, sufficient for integration with telescope schedulers. The authors also introduce a short-horizon nowcasting benchmark over per-pixel three-class logits (sky/cloud/contamination), showing that near-term cloud-evolution prediction remains challenging for current methods.

Link: https://arxiv.org/abs/2603.16429
Authors: Yicheng Rui, Xiao-Wei Duan, Licai Deng, Fan Yang, Zhengming Dang, Zhengjun Du, Junhao Peng, Wenhao Chu, Umut Mahmut, Kexin Li, Yiyun Wu, Fabo Feng
Affiliations: Shanghai Jiao Tong University; National Astronomical Observatories, Chinese Academy of Sciences; Qinghai University
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR Findings accepted. 20 pages, 8 figures

Abstract:Ground-based time-domain observatories require minute-by-minute, site-scale awareness of cloud cover, yet existing all-sky datasets are short, daylight-biased, or lack astrometric calibration. We present LenghuSky-8, an eight-year (2018-2025) all-sky imaging dataset from a premier astronomical site, comprising 429,620 frames at 512×512 resolution with 81.2% night-time coverage, star-aware cloud masks, background masks, and per-pixel altitude-azimuth (Alt-Az) calibration. For robust cloud segmentation across day, night, and lunar phases, we train a linear probe on DINOv3 local features and obtain 93.3% ± 1.1% overall accuracy on a balanced, manually labeled set of 1,111 images. Using stellar astrometry, we map each pixel to local alt-az coordinates and measure calibration uncertainties of approximately 0.37 deg at zenith and approximately 1.34 deg at 30 deg altitude, sufficient for integration with telescope schedulers. Beyond segmentation, we introduce a short-horizon nowcasting benchmark over per-pixel three-class logits (sky/cloud/contamination) with four baselines: persistence (copying the last frame), optical flow, ConvLSTM, and VideoGPT. ConvLSTM performs best but yields only limited gains over persistence, underscoring the difficulty of near-term cloud evolution. We release the dataset, calibrations, and an open-source toolkit for loading, evaluation, and scheduler-ready alt-az maps to boost research in segmentation, nowcasting, and autonomous observatory operations.
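The persistence baseline (copying the last frame) is simple to reproduce. A sketch of per-pixel accuracy over three-class frames, independent of the released toolkit:

```python
import numpy as np

def persistence_accuracy(frames):
    """Per-pixel accuracy of the persistence baseline: predict frame t+1
    by copying frame t. `frames` is a (T, H, W) array of class ids
    (0 = sky, 1 = cloud, 2 = contamination)."""
    preds = frames[:-1]   # each frame predicts the next one
    truth = frames[1:]
    return (preds == truth).mean()

# A static sky scores perfectly; a moving cloud front does not.
static = np.zeros((3, 4, 4), dtype=int)
moving = np.zeros((3, 4, 4), dtype=int)
for t in range(3):
    moving[t, :, : t + 1] = 1          # cloud edge advances one column per frame
print(persistence_accuracy(static), persistence_accuracy(moving))  # → 1.0 0.75
```

Any learned nowcaster has to beat this trivially strong baseline, which is exactly the comparison the benchmark formalizes.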

[CV-161] 3D tomography of exchange phase in a Si/SiGe quantum dot device

【Quick Read】: This paper addresses the difficulty of extracting the exchange coefficient J(V) in spin-qubit devices, a gate-voltage-dependent quantity that is essential for high-fidelity operation and faithful device simulation. Because experiments typically yield only a cosine modulation of an accumulated phase (the time integral of exchange), inverting for J(V) faces phase ambiguity, noise-sensitive phase unwrapping, and the inversion of the integral itself. To tackle the first two challenges, the authors propose a multidimensional phase-reconstruction approach: a measurement technique analogous to phase-shifting digital holography yields the wrapped phase, and the max-flow/min-cut phase unwrapping algorithm (PUMA) robustly unwraps it in 3D voltage space to recover the accumulated phase φ(V). A model of this phase is then optimized to locate a minimal-gradient π exchange pulse point. The approach improves robustness to device non-uniformity and drift and offers a scalable route to device calibration, error attribution, and control optimization.

Link: https://arxiv.org/abs/2603.16025
Authors: Dylan Albrecht, Sarah Thompson, N. Tobias Jacobson, Ryan Jock
Affiliations: Sandia National Laboratories
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
Comments: 11 pages, 6 figures

Abstract:The exchange interaction is a foundational building block for the operation of spin-based quantum processors. Extracting the exchange interaction coefficient J(V), as a function of gate electrode voltages, is important for understanding disorder, faithfully simulating device performance, and operating spin qubits with high fidelity. Typical coherent measurements of exchange in spin qubit devices yield a modulated cosine of an accumulated phase, which in turn is the time integral of exchange. As such, extracting J(V) from experimental data is difficult due to the ambiguity of inverting a cosine, the sensitivity to noise when unwrapping phase, as well as the problem of inverting the integral. As a step toward obtaining J(V), we tackle the first two challenges to reveal the accumulated phase, φ(V). We incorporate techniques from a wide range of fields to robustly extract and model a 3D phase volume for spin qubit devices from a sequence of 2D measurements. In particular, we present a measurement technique to obtain the wrapped phase, as done in phase-shifting digital holography, and utilize the max-flow/min-cut phase unwrapping method (PUMA) to unwrap the phase in 3D voltage space. We show this method is robust to the minimal observed drift in the device, which we confirm by increasing scan resolution. Upon building a model of the extracted phase, we optimize over the model to locate a minimal-gradient π exchange pulse point in voltage space. Our measurement protocol may provide detailed information useful for understanding the origins of device variability governing device yield, enable calibrating device models to specific devices during operation for more sophisticated error attribution, and enable a systematic optimization of qubit control. We anticipate that the methods presented here may be applicable to other qubit platforms.
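The wrapped-to-unwrapped phase step can be illustrated in one dimension with numpy.unwrap; the paper's PUMA algorithm solves the same problem far more robustly in 3D voltage space with noisy data:

```python
import numpy as np

# A smoothly accumulating phase, as exchange would produce along a voltage axis.
true_phase = np.linspace(0.0, 6.0 * np.pi, 200)

# Measurements only reveal the phase modulo 2*pi, wrapped into (-pi, pi].
wrapped = np.angle(np.exp(1j * true_phase))

# Sequential 1-D unwrapping adds back the 2*pi jumps; this works because
# consecutive samples differ by much less than pi.
unwrapped = np.unwrap(wrapped)

print(np.allclose(unwrapped, true_phase, atol=1e-8))   # → True
```

This sequential rule fails once noise pushes a sample across a branch cut, which is why the paper resorts to a global max-flow/min-cut formulation.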

[CV-162] Spectral Hierarchy of the Cosmic Web

【速读】:该论文旨在解决宇宙网(cosmic web)分类方法在不同尺度下信息表达不统一的问题,特别是如何将大尺度非局域性和小尺度局部性特征整合到一个统一框架中,以更好地刻画暗物质晕(halo)的环境依赖型聚类和组装偏置(assembly bias)。其解决方案的关键在于提出一种基于谱层次(spectral hierarchy)的分类体系:通过在密度场中引入简单尺度加权核函数(scale-weighting kernels),对第二阶导数进行滤波处理,从而构建出一系列逐步强调更小尺度结构的分类层级。这一方法自然地衔接了重整化偏差(renormalised bias)和大尺度结构的有效描述中的算子族,使各层级分类与长程和短程非局域偏差成分直接对应,同时提供了一个可量化、高保真度的“web contrast”场用于跨尺度分析,显著提升了模拟星系生成与环境条件建模的效率与解释力。

链接: https://arxiv.org/abs/2603.15834
作者: Francisco-Shu Kitaura,Francesco Sinigaglia
机构: 未知
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 7 figures, 1 table

点击查看摘要

Abstract:We introduce a spectral hierarchy of cosmic-web classifications obtained by applying simple scale-weighting kernels to the density field before performing a standard eigenvalue-based web classification. This unifies and extends several widely used web definitions within a single framework: the familiar potential/tidal web (large-scale, nonlocal), a curvature-based web (more local, peak- and ridge-sensitive), and additional higher-derivative levels that progressively emphasize smaller-scale structure. Because the classification is built from second derivatives of the filtered field, successive hierarchy levels align naturally with operator families that appear in renormalised bias and effective descriptions of large-scale structure, providing an explicit bridge between cosmic-web environments and long- and short-range nonlocal bias ingredients. We quantify the information content of the hierarchy with a compact statistic: we map each cell to one of four ordered web types (void, sheet, filament, knot), construct a corresponding "web contrast" field, and measure its cross-correlation with halos from the AbacusSummit simulation suite on a coarse mesh with ΔL ≈ 5.5 h⁻¹ Mpc. We find that the hierarchy retains significant tracer-relevant information from very large scales down to the mesh Nyquist limit, with the more local (curvature/higher-derivative) levels dominating toward nonlinear scales. This makes the spectral hierarchy a practical, interpretable conditioning basis for fast mock-galaxy production and field-level modelling, and a flexible tool for studying environment-dependent clustering and assembly bias.
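The standard eigenvalue-based web classification underlying each hierarchy level counts the eigenvalues of the (symmetric) second-derivative tensor above a threshold. A sketch on toy diagonal tensors, with an illustrative threshold of zero:

```python
import numpy as np

WEB_TYPES = ["void", "sheet", "filament", "knot"]

def classify(tensor, lam_th=0.0):
    """Classic eigenvalue-count web classifier: the number of eigenvalues
    of the symmetric second-derivative tensor above lam_th gives the web
    type: 0 -> void, 1 -> sheet, 2 -> filament, 3 -> knot."""
    eigvals = np.linalg.eigvalsh(tensor)
    return WEB_TYPES[int((eigvals > lam_th).sum())]

knot = np.diag([0.5, 0.8, 1.2])        # collapsing along all three axes
filament = np.diag([-0.3, 0.6, 0.9])   # collapsing along two axes
sheet = np.diag([-0.5, -0.2, 0.7])     # collapsing along one axis
void = np.diag([-0.4, -0.2, -0.1])     # expanding along all axes
print(classify(knot), classify(filament), classify(sheet), classify(void))
```

The spectral hierarchy keeps this classifier fixed and varies only the scale-weighting kernel applied to the field beforehand.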

Artificial Intelligence

[AI-0] ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

【Quick Read】: This paper addresses the shortage, in both scale and diversity, of data-generation-ready digital assets for simulation-based robotic manipulation, which limits the scalability of robot learning. The key to the solution is ManiTwin, an automated and efficient pipeline that turns a single image into a simulation-ready 3D asset annotated with physical properties, semantics, and functional information, enabling rapid construction of high-quality digital twins at scale. Using this pipeline, the authors build ManiTwin-100K, a dataset of 100K annotated assets that markedly improves the efficiency and diversity of data synthesis for manipulation scenes and provides a solid foundation for policy learning and VQA tasks.

Link: https://arxiv.org/abs/2603.16866
Authors: Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Website: this https URL

Abstract:Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at this https URL.

[AI-1] SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

【Quick Read】: This paper addresses the inadequacy of current evaluations of social interactivity in omni-modal large language models (OLMs): existing benchmarks remain static and accuracy-centric, and fail to measure a model's ability to handle dynamic cues in natural dialogue. The key to the solution is SocialOmni, a benchmark that evaluates conversational interactivity along three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). It comprises 2,000 perception samples and 209 high-quality interaction-generation instances under strict temporal and contextual constraints, plus controlled audio-visual inconsistency scenarios to test robustness. The results reveal a pronounced decoupling between perceptual accuracy and interruption-generation ability, providing actionable diagnostic signals for tightening the link between perception and interaction in future OLMs.

Link: https://arxiv.org/abs/2603.16859
Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL and dataset is available at this https URL

Abstract:Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmarked 12 leading OLMs, which uncovers significant variance in their social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model’s perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

[AI-2] Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks

【Quick Read】: This thesis addresses the serial bottleneck that makes dynamical systems (such as recurrent neural networks and Markov chain Monte Carlo) hard to parallelize in large-scale machine learning, where existing parallel algorithms have been limited in efficiency, stability, and convergence guarantees. The key to the solution is to reframe evaluating a dynamical system as solving a system of nonlinear equations with Newton's method combined with a parallel associative scan, parallelizing across the sequence dimension; quasi-Newton and trust-region variants then substantially improve scalability and numerical stability. On the theoretical side, the work unifies fixed-point methods such as Picard and Jacobi iterations within the same parallel Newton framework, establishes convergence-rate analyses grounded in dynamical stability, and gives a precise condition (the sign of the largest Lyapunov exponent) characterizing when parallelization provably accelerates the computation.

Link: https://arxiv.org/abs/2603.16850
Authors: Xavier Gonzalez
Affiliations: unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: PhD Dissertation; Stanford University

Abstract:Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton’s method using a parallel associative scan. However, these parallel Newton methods struggled with limitations, primarily inefficiency, instability, and lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method’s approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the Largest Lyapunov Exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.
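The simplest member of this fixed-point family, Picard/Jacobi iteration, already shows how a sequential recursion x_{t+1} = f(x_t) can be evaluated by sweeps over the whole trajectory at once: each sweep updates every timestep in parallel, and for a contractive map the sweeps converge to the sequential rollout. A minimal sketch (the map f and sweep count are illustrative):

```python
import numpy as np

def f(x):
    return 0.5 * np.tanh(x) + 0.3     # contractive scalar map (|f'| <= 0.5)

def sequential_rollout(x0, T):
    """Standard one-step-at-a-time evaluation: inherently serial."""
    xs = [x0]
    for _ in range(T):
        xs.append(f(xs[-1]))
    return np.array(xs[1:])

def jacobi_rollout(x0, T, sweeps):
    """Picard/Jacobi parallel-in-time evaluation: start from a guess for
    the whole trajectory and repeatedly apply x[t] <- f(x[t-1]) to every
    t simultaneously. Each sweep is embarrassingly parallel across t."""
    x = np.zeros(T)                    # initial guess for x_1 .. x_T
    for _ in range(sweeps):
        prev = np.concatenate(([x0], x[:-1]))
        x = f(prev)                    # one parallel update of all timesteps
    return x

seq = sequential_rollout(0.0, 50)
par = jacobi_rollout(0.0, 50, sweeps=30)
print(np.allclose(seq, par, atol=1e-6))   # → True
```

Newton and quasi-Newton variants replace this plain sweep with a linearized solve per iteration, which is what buys the faster convergence the thesis analyzes.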

[AI-3] Internalizing Agency from Reflective Experience ICML2026

【Quick Read】: This paper addresses the "distribution sharpening" problem of outcome-driven post-training (e.g., RL with verifiable rewards) for LLM agents in long-horizon interaction: the model merely reproduces already-successful behavior patterns and fails to exploit rich environment feedback to expand its problem-solving capacity (e.g., Pass@k). The key to the proposed LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) framework is to summarize environment feedback into actionable experience during exploration, backtrack to earlier decision points to explore revised action branches, and then internalize these experience-guided corrections via supervised fine-tuning, strengthening the policy's recovery ability and generalization in future interactions.

Link: https://arxiv.org/abs/2603.16843
Authors: Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures; Submitted to ICML 2026

Abstract:Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.

[AI-4] Learning to Present: Inverse Specification Rewards for Agentic Slide Generation KR

【Quick Read】: This paper addresses the core challenge of automated presentation generation: jointly optimizing content coherence, visual design, and audience-aware communication. The key to the solution is an OpenEnv-compatible reinforcement-learning environment in which an LLM agent learns, through tool use, to research topics, plan content, and generate professional HTML slides, together with a multi-component reward combining structural validation, render-quality assessment, LLM-based aesthetic scoring, content-quality metrics, and a novel "inverse specification reward" that measures fidelity to intent by having another LLM recover the original brief from the generated slides. Fine-tuning only 0.5% of parameters on expert demonstrations collected with Claude Opus 4.6, the method reaches 91.2% of Claude Opus 4.6's quality on 48 cross-industry business briefs, a 33.1% improvement over the base model, suggesting that instruction adherence and tool-use compliance matter more than raw model size for agentic tasks.

Link: https://arxiv.org/abs/2603.16839
Authors: Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 11 figures, 13 tables, 26 references. Code: this https URL Dataset: this https URL

Abstract:Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an “inverse task” where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6’s quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: this https URL Code: this https URL

[AI-5] SurgΣ: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

【Quick Read】: This paper addresses the task-specific nature of current surgical AI frameworks, which generalize poorly across procedures and institutions, a limitation rooted in the lack of large-scale, systematically curated multimodal data for surgical foundation models (especially multimodal large language models). The centerpiece of the proposed SurgΣ framework is SurgΣ-DB, a large-scale multimodal data foundation that consolidates open-source datasets, in-house clinical collections, and web-sourced data into a unified schema for better label consistency and standardization. SurgΣ-DB provides image- and video-level annotations across 18 practical surgical tasks and adds hierarchical reasoning annotations that strengthen semantic understanding of complex surgical scenes, markedly improving cross-task generalization and interpretability.

Link: https://arxiv.org/abs/2603.16822
Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce SurgΣ, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies SurgΣ-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. SurgΣ-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to improve label consistency and data standardization across heterogeneous datasets. SurgΣ-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, SurgΣ-DB incorporates hierarchical reasoning annotations, providing richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built upon SurgΣ-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.

[AI-6] Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost

【Quick Read】: This paper addresses the limited forecasting accuracy and decision efficiency of traditional inventory optimization in complex supply chains, in particular how to integrate forecasting models into multi-echelon inventory systems to cut inventory cost and raise service levels. The key to the solution is an integrated digital forecasting-inventory optimization pipeline that unifies classical statistical models, machine-learning regressors, and deep sequence models within a single inventory-simulation framework. Evaluating seven forecasting methods on the M5 Walmart dataset, the study finds that Temporal CNN and LSTM models clearly outperform statistical baselines, lowering inventory costs and improving fill rates in both single- and two-echelon newsvendor systems, with robustness and scalability that make the pipeline a practical data-driven decision-support tool for modern supply chains.

Link: https://arxiv.org/abs/2603.16815
Authors: Swata Marik, Swayamjit Saha, Garga Chatterjee
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 tables

Abstract:This study develops a digitalized forecasting-inventory optimization pipeline integrating traditional forecasting models, machine learning regressors, and deep sequence models within a unified inventory simulation framework. Using the M5 Walmart dataset, we evaluate seven forecasting approaches and assess their operational impact under single- and two-echelon newsvendor systems. Results indicate that Temporal CNN and LSTM models significantly reduce inventory costs and improve fill rates compared to statistical baselines. Sensitivity and multi-echelon analyses demonstrate robustness and scalability, offering a data-driven decision-support tool for modern supply chains.
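论文在库存仿真环节采用报童(newsvendor)模型评估预测结果的运营影响。其核心是"临界比率"分位数:最优订货量 q* 满足 F(q*) = cu/(cu+co),其中 cu、co 分别为缺货与积压的单位成本。下面给出一个极简的纯 Python 示意(成本参数与需求样本均为假设,并非原文实现):

```python
# 报童模型最小示意:最优订货量为需求分布在临界比率处的分位数。
# 注:仅为概念性示意,非论文实现;成本数值均为假设。

def newsvendor_order(demand_samples, underage_cost, overage_cost):
    """按经验分位数求报童最优订货量 q*,满足 F(q*) = cu / (cu + co)。"""
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    s = sorted(demand_samples)
    # 取经验分布在 critical_ratio 处的分位数
    k = min(len(s) - 1, max(0, int(critical_ratio * len(s))))
    return s[k]

def expected_cost(q, demand_samples, underage_cost, overage_cost):
    """在样本上评估订货量 q 的平均库存成本(缺货成本 + 积压成本)。"""
    total = 0.0
    for d in demand_samples:
        total += underage_cost * max(d - q, 0) + overage_cost * max(q - d, 0)
    return total / len(demand_samples)

if __name__ == "__main__":
    demand = [8, 10, 12, 9, 14, 11, 10, 13, 7, 12]  # 假想需求样本
    q = newsvendor_order(demand, underage_cost=4.0, overage_cost=1.0)
    print(q, expected_cost(q, demand, 4.0, 1.0))
```

论文即是在这一成本口径下比较不同预测方法产出的订货决策:预测越准,订货量越贴近真实需求分位数,库存成本越低、满足率越高。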

[AI-7] ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation

【速读】:该论文旨在解决芯粒(chiplet)架构下CPU-GPU子系统在预硅验证阶段面临的复杂挑战,包括验证框架搭建困难、设计规模庞大、高并发性、非确定性执行以及芯粒边界处协议交互复杂等问题,这些问题常导致集成周期冗长。解决方案的关键在于提出一种基于重放(replay-driven)的验证方法,通过利用单一设计数据库在仿真与仿真加速(emulation)环境中实现确定性波形捕获与重放,从而可靠地复现复杂的GPU工作负载和协议序列,显著提升调试效率、增强集成信心,并支持在单个季度内完成端到端系统启动及工作负载执行,验证了该方法在芯粒系统中的可扩展性和有效性。

链接: https://arxiv.org/abs/2603.16812
作者: Nij Dorairaj,Debabrata Chatterjee,Hong Wang,Hong Jiang,Alankar Saxena,Altug Koker,Thiam Ern Lim,Cathrane Teoh,Chuan Yin Loo,Bishara Shomar,Anthony Lester
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Integration of CPU and GPU technologies is a key enabler for modern AI and graphics workloads, combining control-oriented processing with massive parallel compute capability. As systems evolve toward chiplet-based architectures, pre-silicon validation of tightly coupled CPU-GPU subsystems becomes increasingly challenging due to complex validation framework setup, large design scale, high concurrency, non-deterministic execution, and intricate protocol interactions at chiplet boundaries, often resulting in long integration cycles. This paper presents a replay-driven validation methodology developed during the integration of a CPU subsystem, multiple Xe GPU cores, and a configurable Network-on-Chip (NoC) within a foundational SoC building block targeting the ODIN integrated chiplet architecture. By leveraging deterministic waveform capture and replay across both simulation and emulation using a single design database, complex GPU workloads and protocol sequences can be reproduced reliably at the system level. This approach significantly accelerates debug, improves integration confidence, and enables end-to-end system boot and workload execution within a single quarter, demonstrating the effectiveness of replay-based validation as a scalable methodology for chiplet-based systems.

[AI-8] CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation

【速读】:该论文试图解决行为树(Behavior Tree, BT)系统构建中的“接地”(Grounding)问题,即自动化地生成一个完整且一致的BT系统,包括高层动作模型和低层控制策略,而无需大量专家知识和人工干预。解决方案的关键在于提出CABTO(Context-Aware Behavior Tree grOunding)框架,该框架利用预训练大模型(Large Models, LMs)在动作模型与控制策略空间中进行启发式搜索,并通过BT规划器和环境观测提供的上下文反馈进行引导,从而高效实现BT系统的自动构造。

链接: https://arxiv.org/abs/2603.16809
作者: Yishuai Cai,Xinglin Chen,Yunxin Mao,Kun Hu,Minglong Li,Yaodong Yang,Yuanpei Chen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Behavior Trees (BTs) offer a powerful paradigm for designing modular and reactive robot controllers. BT planning, an emerging field, provides theoretical guarantees for the automated generation of reliable BTs. However, BT planning typically assumes that a well-designed BT system is already grounded – comprising high-level action models and low-level control policies – which often requires extensive expert knowledge and manual effort. In this paper, we formalize the BT Grounding problem: the automated construction of a complete and consistent BT system. We analyze its complexity and introduce CABTO (Context-Aware Behavior Tree grOunding), the first framework to efficiently solve this challenge. CABTO leverages pre-trained Large Models (LMs) to heuristically search the space of action models and control policies, guided by contextual feedback from BT planners and environmental observations. Experiments spanning seven task sets across three distinct robotic manipulation scenarios demonstrate CABTO’s effectiveness and efficiency in generating complete and consistent behavior tree systems.

[AI-9] DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping

【速读】:该论文旨在解决多样化灵巧手硬件在零样本跨形态抓取(zero-shot cross-embodiment grasping)中的迁移难题,即如何在不进行冗余再训练的情况下,使一个抓取策略能够直接适配未见过的机器人手型。现有方法通常通过预测中间运动目标并将其重定向到不同手型来实现迁移,但这种策略易引入误差且可能违反特定手型的物理约束,从而限制了跨形态性能。解决方案的关键在于提出 DexGrasp-Zero 策略,其核心创新是设计了一种形态对齐图表示(morphology-aligned graph representation),将每只手的运动学关键点映射到解剖学基础节点,并为每个节点赋予三轴正交运动基元,从而实现结构与语义层面的跨形态对齐;进一步构建了形态对齐图卷积网络(MAGCN),并引入物理属性注入机制(Physical Property Injection),将手型特异性物理约束融合进图特征中,实现对不同连杆长度和驱动极限的自适应补偿,从而保障抓取的精度与稳定性。

链接: https://arxiv.org/abs/2603.16806
作者: Yuliang Wu,Yanhan Lin,WengKit Lao,Yuhao Lin,Yi-Lin Wei,Wei-Shi Zheng,Ancong Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose DexGrasp-Zero, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand’s kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning. MAGCN incorporates a Physical Property Injection mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82% average success rate on unseen objects.

[AI-10] InCoder-32B: Code Foundation Model for Industrial Scenarios

【速读】:该论文旨在解决当前代码大语言模型在工业场景中性能显著下降的问题,尤其是在涉及硬件语义推理、专用语言构造和严格资源约束等复杂任务时。其关键解决方案是提出首个32B参数的代码基础模型InCoder-32B(Industrial-Coder-32B),通过统一芯片设计、GPU内核优化、嵌入式系统、编译器优化和3D建模等五大工业领域的能力,结合从头训练策略、精选工业代码退火训练(curated industrial code annealing)、将上下文长度从8K逐步扩展至128K tokens的中期训练以及基于执行验证的后训练机制,从而在14个主流通用代码基准和9个工业基准上均取得极具竞争力的表现,为工业代码智能提供了强有力的开源基线。

链接: https://arxiv.org/abs/2603.16790
作者: Jian Yang,Wei Zhang,Jiajun Wu,Junhang Cheng,Shawn Guo,Haowen Wang,Weicheng Gu,Yaxin Du,Joseph Li,Fanglin Xu,Yizhi Li,Lin Jing,Yuanbo Wang,Yuhan Gao,Ruihao Gong,Chuan Hao,Ran Tao,Aishan Liu,Tuney Zheng,Ganqu Cui,Zhoujun Li,Mingjie Tang,Chenghua Lin,Wayne Xin Zhao,Xianglong Liu,Ming Zhou,Bryan Dai,Weifeng Lv
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.

[AI-11] Anticipatory Planning for Multimodal AI Agents CVPR2026

【速读】:该论文旨在解决当前多模态智能体(multimodal agents)在执行高阶、多步骤任务时缺乏前瞻性规划的问题,即现有系统大多为反应式(reactive)行为模式,仅优化单步动作而未对未来的状态或长期目标进行推理,导致规划连贯性差且难以可靠完成复杂任务。其解决方案的关键在于提出一种两阶段强化学习框架——TraceR1,通过显式训练前瞻推理能力:第一阶段基于预测的短时程轨迹进行轨迹级强化学习,以全局一致性奖励约束动作序列;第二阶段利用冻结工具代理(frozen tool agents)的执行反馈进行接地强化微调(grounded reinforcement fine-tuning),提升每一步动作的准确性与可执行性。此方法显著提升了规划稳定性、执行鲁棒性和泛化能力,在多个在线/离线计算机使用及多模态工具推理基准上优于反应式和单阶段基线模型。

链接: https://arxiv.org/abs/2603.16777
作者: Yongyuan Liang,Shijie Zhou,Yu Gu,Hao Tan,Gang Wu,Franck Dernoncourt,Jihyung Kil,Ryan A. Rossi,Ruiyi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published at CVPR 2026 Findings Track

点击查看摘要

Abstract:Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated across seven benchmarks, covering online computer-use, offline computer-use benchmarks, and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.

[AI-12] Finding Common Ground in a Sea of Alternatives

【速读】:该论文旨在解决在多元群体偏好中选择一个能达成共识的陈述(statement)的问题,其核心挑战在于如何从理论上定义“共识”并高效地找到满足该共识条件的陈述。传统方法如Habermas机器依赖投票规则决定生成语句,但未明确定义“共识”的含义。为此,作者提出基于社会选择理论中的比例否决核心(proportional veto core)的正式模型,并设计了一种基于采样的高效算法,在仅通过查询未知偏好分布和选民分布的情况下,以高概率返回近似比例否决核心内的陈述。该方案的关键创新在于将无限多候选陈述场景下的共识判定问题转化为可计算的采样优化问题,并通过匹配的下界证明表明所提算法在查询复杂度上是最优的。

链接: https://arxiv.org/abs/2603.16751
作者: Jay Chooi,Paul Gölz,Ariel D. Procaccia,Benjamin Schiffer,Shirley Zhang
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study the problem of selecting a statement that finds common ground across diverse population preferences. Generative AI is uniquely suited for this task because it can access a practically infinite set of statements, but AI systems like the Habermas machine leave the choice of generated statement to a voting rule. What it means for this rule to find common ground, however, is not well-defined. In this work, we propose a formal model for finding common ground in the infinite alternative setting based on the proportional veto core from social choice. To provide guarantees relative to these infinitely many alternatives and a large population, we wish to satisfy a notion of proportional veto core using only query access to the unknown distribution of alternatives and voters. We design an efficient sampling-based algorithm that returns an alternative in the (approximate) proportional veto core with high probability and prove matching lower bounds, which show that no algorithm can do the same using fewer queries. On a synthetic dataset of preferences over text, we confirm the effectiveness of our sampling-based algorithm and compare other social choice methods as well as LLM-based methods in terms of how reliably they produce statements in the proportional veto core.

[AI-13] Nonstandard Errors in AI Agents

【速读】:该论文旨在解决一个关键问题:当使用最先进的AI代码代理(AI coding agents)在相同数据和研究问题下进行实证分析时,是否能够产生一致的 empirical results(实证结果)。研究表明,尽管输入条件一致,AI代理之间仍存在显著的非标准误差(nonstandard errors, NSEs),这源于代理间分析选择的差异,如指标选取(例如自相关 vs. 方差比、金额交易量 vs. 股数交易量)以及模型家族偏好(如Sonnet 4.6与Opus 4.6)。解决方案的关键在于引入三阶段反馈协议,其中暴露于高质量示例论文(top-rated exemplar papers)可使估计值的四分位间距缩小80–99%,且收敛主要通过同一方法族内部估计收紧或代理完全切换方法族实现,但这种收敛体现的是模仿而非对方法机制的理解。这一发现对AI在自动化政策评估与实证研究中的可靠性提出了重要警示。

链接: https://arxiv.org/abs/2603.16744
作者: Ruijiang Gao,Steven Chong Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 45 pages

点击查看摘要

Abstract:We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable nonstandard errors (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable “empirical styles,” reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within converging measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.

[AI-14] MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning

【速读】:该论文旨在解决医学语言模型在持续学习(continual learning)过程中因顺序更新导致的灾难性遗忘(catastrophic forgetting)问题,同时缺乏统一、任务多样且标准化的评估基准来系统比较不同持续学习策略的有效性。解决方案的关键在于提出 MedCL-Bench,这是一个涵盖十个生物医学自然语言处理(NLP)数据集、覆盖五类任务的基准测试框架,能够标准化评估十一种持续学习方法在八种任务顺序下的性能表现,包括保留率(retention)、迁移能力(transfer)和GPU小时成本(GPU-hour cost),从而为模型迭代更新提供可复现的审计机制。

链接: https://arxiv.org/abs/2603.16738
作者: Min Zeng,Shuang Zhou,Zaifu Zhan,Rui Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
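摘要中报告的 retention(保留)与灾难性遗忘,在持续学习文献中常用准确率矩阵与 backward transfer(BWT)来量化:R[i][j] 表示训练完第 i 个任务后在第 j 个任务上的准确率。以下为这类指标的通用计算示意(GEM 式定义,并非 MedCL-Bench 的官方口径;矩阵数值为假设):

```python
# 持续学习评测的保留/遗忘指标示意(数值为假设,非论文数据)。

def average_accuracy(R):
    """训练完全部任务后,在所有任务上的平均准确率(取矩阵最后一行)。"""
    last = R[-1]
    return sum(last) / len(last)

def backward_transfer(R):
    """BWT = mean_j ( R[T-1][j] - R[j][j] ), j < T-1。
    负值表示遗忘,绝对值越大遗忘越严重。"""
    T = len(R)
    diffs = [R[T - 1][j] - R[j][j] for j in range(T - 1)]
    return sum(diffs) / len(diffs)

if __name__ == "__main__":
    # 三任务顺序微调的假想准确率矩阵
    R = [
        [0.90, 0.00, 0.00],
        [0.70, 0.85, 0.00],
        [0.60, 0.65, 0.88],
    ]
    print(average_accuracy(R), backward_transfer(R))  # BWT 为负,说明发生了遗忘
```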

[AI-15] Differential Harm Propensity in Personalized LLM Agents : The Curious Case of Mental Health Disclosure

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)作为工具型代理(tool-using agents)部署时,因忽视用户个性化信号(如心理状态披露)而导致的有害任务完成风险评估不足的问题。解决方案的关键在于构建一个受控实验框架,在AgentHarm基准基础上引入三种用户上下文条件(无生物信息、仅生物信息、生物信息+心理健康披露),并结合轻量级越狱注入(jailbreak injection),系统评估不同模型在多步骤恶意任务中的行为变化。结果表明,个性化信息可作为弱保护因素降低有害行为,但其效果易被轻微对抗性提示削弱,凸显了在用户上下文多样化场景下进行鲁棒性安全评估与防护机制设计的重要性。

链接: https://arxiv.org/abs/2603.16734
作者: Caglar Yildirim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety–utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.

[AI-16] Federated Learning with Multi-Partner OneFlorida Consortium Data for Predicting Major Postoperative Complications

【速读】:该论文旨在解决多中心医疗数据在用于构建术后并发症预测模型时面临的隐私保护与模型泛化能力不足的问题。其解决方案的关键在于采用联邦学习(Federated Learning)框架,通过在不共享原始数据的前提下聚合各医疗机构的本地模型参数,从而实现跨中心的高性能、高隐私保护的预测模型构建。研究结果表明,所提出的联邦学习模型在多个临床结局(如ICU入院、机械通气、急性肾损伤和院内死亡)上的表现优于或至少等同于单中心本地模型,并具备良好的外部泛化性能,验证了联邦学习在临床决策支持系统中的可行性与有效性。

链接: https://arxiv.org/abs/2603.16723
作者: Yuanfang Ren,Varun Sai Vemuri,Zhenhong Hu,Benjamin Shickel,Ziyuan Guan,Tyler J. Loftus,Parisa Rashidi,Tezcan Ozrazgat-Baslanti,Azra Bihorac
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 1 figure, 6 tables

点击查看摘要

Abstract:Background: This study aims to develop and validate federated learning models for predicting major postoperative complications and mortality using a large multicenter dataset from the OneFlorida Data Trust. We hypothesize that federated learning models will offer robust generalizability while preserving data privacy and security. Methods: This retrospective, longitudinal, multicenter cohort study included 358,644 adult patients admitted to five healthcare institutions, who underwent 494,163 inpatient major surgical procedures from 2012-2023. We developed and internally and externally validated federated learning models to predict the postoperative risk of intensive care unit (ICU) admission, mechanical ventilation (MV) therapy, acute kidney injury (AKI), and in-hospital mortality. These models were compared with local models trained on data from a single center and central models trained on a pooled dataset from all centers. Performance was primarily evaluated using area under the receiver operating characteristics curve (AUROC) and the area under the precision-recall curve (AUPRC) values. Results: Our federated learning models demonstrated strong predictive performance, with AUROC scores consistently comparable or superior performance in terms of AUROC and AUPRC across all outcomes and sites. Our federated learning models also demonstrated strong generalizability, with comparable or superior performance in terms of both AUROC and AUPRC compared to the best local learning model at each site. Conclusions: By leveraging multicenter data, we developed robust, generalizable, and privacy-preserving predictive models for major postoperative complications and mortality. These findings support the feasibility of federated learning in clinical decision support systems.
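论文采用的联邦学习范式中,最基础的聚合方式是 FedAvg:各中心只上传本地模型参数,服务器按样本量加权平均,原始数据不出本地。以下为该聚合步骤的极简示意(假设参数已展平为向量,并非论文所用实现):

```python
# FedAvg 加权聚合示意:仅为概念演示,非论文实现。

def fedavg(client_weights, client_sizes):
    """按各客户端本地样本量加权平均模型参数。

    client_weights: 每个客户端的参数向量(list of list of float)
    client_sizes:   每个客户端的本地样本数
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for k in range(dim):
            global_w[k] += (n / total) * w[k]  # 权重 = 本地样本占比
    return global_w

if __name__ == "__main__":
    # 三个医疗中心的假想参数与样本量:数据不共享,只共享参数
    ws = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    ns = [100, 100, 200]
    print(fedavg(ws, ns))
```

实际系统还会在此之上叠加多轮本地训练、安全聚合与差分隐私等机制,但隐私保护的基本逻辑与上面一致:参数出本地,数据不出本地。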

[AI-17] Cost Trade-offs in Matrix Inversion Updates for Streaming Outlier Detection

【速读】:该论文旨在解决在基于Christoffel函数的在线异常检测中,如何高效更新矩阵逆以适应新数据的问题。其关键解决方案在于系统比较了三种矩阵逆更新方法——直接求逆(Direct Inversion, DI)、迭代Sherman-Morrison公式(Iterative Sherman-Morrison, ISM)和Woodbury矩阵恒等式(Woodbury Matrix Identity, WMI),并从理论计算复杂度和实际仿真两个层面验证其性能差异,最终提出一个简洁、可量化且易记忆的选型准则:当更新秩为1时ISM最优,小秩更新(相对于矩阵尺寸)时WMI表现最佳,其余情况则推荐DI。这一成果为在线异常检测及其他需要频繁矩阵逆更新的应用提供了通用且高效的计算策略。

链接: https://arxiv.org/abs/2603.16697
作者: Florian Grivet,Louise Travé-Massuyès
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Outlier detection identifies data points that deviate significantly from expected patterns, revealing anomalies that may require special attention. Incorporating online learning further improves accuracy by continuously updating the model to reflect the most recent data. When employing the Christoffel function as an outlier score, online learning requires updating the inverse of a matrix following a rank-k update, given the initial inverse. Surprisingly, there is no consensus on the optimal method for this task. This technical note aims to compare three different updating methods: Direct Inversion (DI), Iterative Sherman-Morrison (ISM), and Woodbury Matrix Identity (WMI), to identify the most suitable approach for different scenarios. We first derive the theoretical computational costs of each method and then validate these findings through comprehensive Python simulations run on a CPU. These results allow us to propose a simple, quantitative, and easy-to-remember rule that can be stated qualitatively as follows: ISM is optimal for rank-1 updates, WMI excels for small updates relative to matrix size, and DI is preferable otherwise. This technical note produces a general result for any problem involving a matrix inversion update. In particular, it contributes to the ongoing development of efficient online outlier detection techniques.
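摘要比较的三种方法中,ISM 依赖的 Sherman-Morrison 公式为 (A + uv^T)^{-1} = A^{-1} - (A^{-1}u v^T A^{-1}) / (1 + v^T A^{-1} u):已知 A^{-1} 时,秩-1 更新只需 O(n^2),而直接求逆为 O(n^3)。下面用 2x2 纯 Python 数值验证二者一致(仅为公式示意,与论文的 CPU 基准测试无关):

```python
# 秩-1 更新下 Sherman-Morrison (ISM) 与直接求逆 (DI) 的等价性验证。

def inv2(A):
    """2x2 矩阵直接求逆(DI)。"""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, x):
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

def sherman_morrison(Ainv, u, v):
    """已知 A^{-1},求 (A + u v^T)^{-1},代价 O(n^2)。"""
    Au = matvec(Ainv, u)                    # A^{-1} u
    vA = matvec([[Ainv[0][0], Ainv[1][0]],  # v^T A^{-1} = (A^{-T} v)^T
                 [Ainv[0][1], Ainv[1][1]]], v)
    denom = 1.0 + v[0] * Au[0] + v[1] * Au[1]
    return [[Ainv[i][j] - Au[i] * vA[j] / denom for j in range(2)]
            for i in range(2)]

if __name__ == "__main__":
    A = [[4.0, 1.0], [2.0, 3.0]]
    u, v = [1.0, 0.0], [0.0, 1.0]  # 秩-1 修正项 u v^T
    updated = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
    print(sherman_morrison(inv2(A), u, v))  # 应与 inv2(updated) 一致
```

秩-k 更新时,WMI 把上式推广为对一个 k x k 小矩阵求逆,这正是论文结论"小秩更新用 WMI、秩-1 用 ISM、其余用 DI"背后的代价权衡。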

[AI-18] When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

【速读】:该论文旨在解决具身机器人系统中因依赖大语言模型(Large Language Model, LLM)进行高阶推理而导致的计算延迟与资源开销问题,这些问题可能中断动作执行并降低系统可靠性。其核心挑战在于如何在“推理”与“行动”之间做出动态决策:过度推理会延迟任务执行,而推理不足则易导致错误决策和任务失败。解决方案的关键在于提出一种基于强化学习的分层框架——RARRL(Resource-Aware Reasoning via Reinforcement Learning),该框架不直接学习底层控制策略,而是学习一个高层调度策略,使代理能够根据当前观测、执行历史及剩余资源,自适应地决定是否触发推理、选择何种推理角色以及分配多少计算预算。实验表明,相较于固定或启发式推理策略,RARRL在提升任务成功率的同时显著降低执行延迟并增强鲁棒性,验证了自适应推理控制对构建高效可靠具身机器人的必要性。

链接: https://arxiv.org/abs/2603.16673
作者: Jun Liu,Pu Zhao,Zhenglun Kong,Xuan Shen,Peiyan Dong,Fan Yang,Lin Cui,Hao Tang,Geng Yuan,Wei Niu,Wenbin Zhang,Xue Lin,Gaowen Liu,Yanzhi Wang,Dong Huang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent’s decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.

[AI-19] Machines acquire scientific taste from institutional traces

【速读】:该论文试图解决科学评估中“科学品味”(scientific taste)难以量化与自动化的难题,即如何客观判断哪些未经验证的研究想法值得投入资源进行探索。传统上,这一能力依赖于编辑和资助机构的主观判断,但始终未能被清晰表述、传授或自动化。解决方案的关键在于:将语言模型在期刊发表决策数据上进行微调(fine-tuning),从而从长期积累的学术出版记录中提取出隐含的评价信号。实验表明,此类微调模型在管理学研究提案的多级质量分类任务中达到59%准确率,显著优于11个前沿大模型(平均31%)和专家评审小组(42%),且具备校准置信度的能力,并能迁移至未训练过的成对比较和摘要生成任务,证明了科学品味可通过结构化文献数据有效建模与泛化。

链接: https://arxiv.org/abs/2603.16659
作者: Ziqin Gong,Ning Li,Huaikang Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, spanning major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI’s reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.

[AI-20] What if Pinocchio Were a Reinforcement Learning Agent : A Normative End-to-End Pipeline

【速读】:该论文旨在解决如何使人工智能(AI)系统在复杂社会环境中实现规范合规性(norm compliance)与情境感知能力的问题,以确保其能够安全、有效地融入人类日常生活。解决方案的关键在于提出一个名为 \pino 的混合模型,该模型结合了强化学习(reinforcement learning, RL)代理与基于论证的规范顾问(argumentation-based normative advisors),通过后者对前者的行为进行监督,从而引导代理在决策过程中遵守社会规范。此外,论文还设计了一种新颖算法,用于自动提取顾问决策背后的论证及其关系,并首次定义并提出了缓解“规范规避”(norm avoidance)现象的策略,该现象指代理在训练中有意或无意地规避规范约束。整个方案通过实证评估验证了各模块的有效性。

链接: https://arxiv.org/abs/2603.16651
作者: Benoît Alcaraz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in “Le avventure di Pinocchio - Storia di un burattino”, this thesis proposes a pipeline that addresses the problem of developing norm compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces \pino, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors’ decisions. Finally, this thesis investigates the phenomenon of norm avoidance, providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.

[AI-21] Domain-Independent Dynamic Programming with Constraint Propagation ICAPS2026

【速读】:该论文旨在解决动态规划(Dynamic Programming, DP)与约束编程(Constraint Programming, CP)两种模型驱动范式之间的割裂问题,即如何将约束传播(constraint propagation)机制有效集成到DP框架中,以提升求解效率。其解决方案的关键在于:在领域无关的动态规划(Domain-Independent Dynamic Programming)框架中引入通用约束求解器(general-purpose CP solver),通过约束传播技术对状态和转移进行剪枝,从而减少状态扩展数量并增强求解能力。实验表明,该方法在单机调度带时间窗、资源受限项目调度问题(RCPSP)及带时间窗的旅行商问题(TSPTW)上均显著提升了实例求解率,并在约束较强的场景下证明了传播收益超过计算开销,为DP与CP的融合提供了可验证的建模路径。

链接: https://arxiv.org/abs/2603.16648
作者: Imko Marijnissen,J. Christopher Beck,Emir Demirović,Ryo Kuroiwa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages. To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)

点击查看摘要

Abstract:There are two prevalent model-based paradigms for combinatorial problems: 1) state-based representations, such as heuristic search, dynamic programming (DP), and decision diagrams, and 2) constraint and domain-based representations, such as constraint programming (CP), (mixed-)integer programming, and Boolean satisfiability. In this paper, we bridge the gap between the DP and CP paradigms by integrating constraint propagation into DP, enabling a DP solver to prune states and transitions using constraint propagation. To this end, we implement constraint propagation using a general-purpose CP solver in the Domain-Independent Dynamic Programming framework and evaluate using heuristic search on three combinatorial optimisation problems: Single Machine Scheduling with Time Windows, the Resource Constrained Project Scheduling Problem (RCPSP), and the Travelling Salesperson Problem with Time Windows (TSPTW). Our evaluation shows that constraint propagation significantly reduces the number of state expansions, causing our approach to solve more instances than a DP solver for Single Machine Scheduling and RCPSP, and showing similar improvements for tightly constrained TSPTW instances. The runtime performance indicates that the benefits of propagation outweigh the overhead for constrained instances, but that further work into reducing propagation overhead could improve performance further. Our work is a key step in understanding the value of constraint propagation in DP solvers, providing a model-based approach to integrating DP and CP.
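论文的核心思想是在 DP 状态扩展时用约束传播剪掉不可行状态。下面在一个 4 城市的微型 TSPTW 实例上演示这一思路:对状态(已访问集合、当前城市、当前时刻)做搜索,对比开启时间窗剪枝前后的状态扩展次数。实例数据为假设,仅为概念演示,与论文的 DIDP+CP 求解器实现无关:

```python
# 微型 TSPTW:用时间窗可行性剪枝减少 DP 状态扩展(概念演示,非论文实现)。

def tsptw_expansions(dist, windows, prune):
    """返回 (可行最优回路长度或 None, 状态扩展次数)。城市 0 为起点。"""
    n = len(dist)
    best = [None]
    expansions = [0]

    def dfs(visited, city, t, cost, feasible):
        expansions[0] += 1
        if len(visited) == n:
            if feasible:
                total = cost + dist[city][0]  # 回到起点
                if best[0] is None or total < best[0]:
                    best[0] = total
            return
        for nxt in range(n):
            if nxt in visited:
                continue
            arrive = max(t + dist[city][nxt], windows[nxt][0])  # 早到允许等待
            ok = feasible and arrive <= windows[nxt][1]
            if prune and not ok:
                continue  # 约束传播式剪枝:提前丢弃违反时间窗的状态
            dfs(visited | {nxt}, nxt, arrive, cost + dist[city][nxt], ok)

    dfs(frozenset([0]), 0, 0, 0, True)
    return best[0], expansions[0]

if __name__ == "__main__":
    dist = [[0, 3, 4, 5],
            [3, 0, 2, 6],
            [4, 2, 0, 3],
            [5, 6, 3, 0]]
    windows = [(0, 99), (0, 4), (0, 7), (6, 12)]  # (最早, 最晚) 到达时刻
    print(tsptw_expansions(dist, windows, prune=False))  # 扩展全部状态
    print(tsptw_expansions(dist, windows, prune=True))   # 剪枝后扩展更少
```

两种设置得到相同的最优解,但剪枝使状态扩展次数减半,这正对应论文报告的"传播收益超过开销"的约束紧实例场景。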

[AI-22] Data-driven generalized perimeter control: Zürich case study

【速读】:该论文旨在解决城市交通拥堵问题,其核心挑战在于如何高效利用现有基础设施并实现精准的交通控制。传统模型-based 控制方法在建模过程中成本高、耗时长,而机器学习方法则面临数据稀疏性和难以施加硬性约束的问题。解决方案的关键在于提出一种基于行为系统理论(behavioral systems theory)的交通动态建模新范式,并结合数据驱动的预测控制(data-enabled predictive control, DePC)技术,通过动态交通灯调控来引导交通流演化。该方法无需显式建模即可利用实测数据进行闭环优化,在保证物理约束的前提下显著降低总出行时间和碳排放,验证结果基于对苏黎世市高保真微观仿真(目前文献中最大规模的闭合回路城市交通仿真)得出。

链接: https://arxiv.org/abs/2603.16599
作者: Alessio Rimoldi,Carlo Cenedese,Alberto Padoan,Florian Dörfler,John Lygeros
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
备注: 33 pages, 16 figures

点击查看摘要

Abstract:Urban traffic congestion is a key challenge for the development of modern cities, requiring advanced control techniques to optimize existing infrastructures usage. Despite the extensive availability of data, modeling such complex systems remains an expensive and time consuming step when designing model-based control approaches. On the other hand, machine learning approaches require simulations to bootstrap models, or are unable to deal with the sparse nature of traffic data and enforce hard constraints. We propose a novel formulation of traffic dynamics based on behavioral systems theory and apply data-enabled predictive control to steer traffic dynamics via dynamic traffic light control. A high-fidelity simulation of the city of Zürich, the largest closed-loop microscopic simulation of urban traffic in the literature to the best of our knowledge, is used to validate the performance of the proposed method in terms of total travel time and CO2 emissions.
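To make the behavioral-systems formulation concrete: data-enabled predictive control represents the system not by a model but by measured trajectories arranged in a Hankel matrix, whose column span (under persistency of excitation) captures the system's behaviour. The sketch below only shows that data structure; the signal, window length, and dimensions are toy assumptions, not the Zürich implementation.

```python
import numpy as np

def hankel(w, L):
    """Block-Hankel matrix of window length L built from a data sequence w."""
    T = len(w)
    # Column i is the length-L window of the signal starting at time i.
    return np.column_stack([w[i:i + L] for i in range(T - L + 1)])

data = np.arange(6.0)   # a toy scalar signal w(0), ..., w(5)
H = hankel(data, 3)     # depth-3 Hankel matrix, shape (3, 4)
print(H.shape)          # → (3, 4)
print(H[:, 0])          # → [0. 1. 2.]
```

In DeePC-style controllers, the optimiser then searches for a combination of these columns that matches the measured past and an optimal predicted future, subject to hard constraints.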

[AI-23] Runtime Governance for AI Agents : Policies on Paths

【速读】:该论文旨在解决生成式 AI (Generative AI) 代理在运行时因路径依赖性(path-dependent behavior)导致的治理难题,即如何在确保任务完成率最大化的同时,有效控制法律、数据泄露、声誉等潜在风险。其核心解决方案是将执行路径(execution path)作为治理的核心对象,提出一种形式化框架:将合规策略定义为从代理身份、部分路径、拟执行动作及组织状态到违规概率的确定性映射函数。该框架能够统一处理提示词指令(prompt-level instructions)和静态访问控制等传统方法,并强调运行时评估(runtime evaluation)对于实现全面路径依赖型策略的必要性,从而实现更精细化的风险管控与合规保障。

链接: https://arxiv.org/abs/2603.16586
作者: Maurits Kaptein,Vassilis-Javed Khan,Andriy Podstavnychy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents – systems that plan, reason, and act using large language models – produce non-deterministic, path-dependent behavior that cannot be fully governed at design time, where with governed we mean striking the right balance between as high as possible successful task completion rate and the legal, data-breach, reputational and other costs associated with running agents. We argue that the execution path is the central object for effective runtime governance and formalize compliance policies as deterministic functions mapping agent identity, partial path, proposed next action, and organizational state to a policy violation probability. We show that prompt-level instructions (and “system prompts”), and static access control are special cases of this framework: the former shape the distribution over paths without actually evaluating them; the latter evaluates deterministic policies that ignore the path (i.e., these can only account for a specific subset of all possible paths). In our view, runtime evaluation is the general case, and it is necessary for any path-dependent policy. We develop the formal framework for analyzing AI agent governance, present concrete policy examples (inspired by the AI act), discuss a reference implementation, and identify open problems including risk calibration and the limits of enforced compliance.
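The formalisation above can be illustrated with a tiny sketch: a compliance policy is a deterministic function of (agent identity, partial path, proposed action, organisational state) returning a violation probability. The rules, action names, and probabilities below are hypothetical; the point is that the first rule inspects the execution path, which static access control (the path-ignoring special case) cannot express.

```python
def policy(agent_id, path, action, org_state):
    """Return the estimated probability that taking `action` violates policy."""
    # Path-dependent rule: after reading customer data, external sends are risky.
    if action == "send_external" and "read_customer_data" in path:
        return 0.9
    # Path-independent rule: the static access-control special case.
    if action in org_state.get("forbidden_actions", set()):
        return 1.0
    return 0.0

org = {"forbidden_actions": {"delete_audit_log"}}
print(policy("agent-7", ["read_customer_data"], "send_external", org))  # → 0.9
print(policy("agent-7", [], "send_external", org))                      # → 0.0
print(policy("agent-7", [], "delete_audit_log", org))                   # → 1.0
```

The same action receives different risk scores depending on the path that preceded it, which is exactly why the authors argue evaluation must happen at runtime.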

[AI-24] V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对时间敏感事实时的知识滞后问题,即当前VLMs由于训练数据和评估基准多为静态快照,导致其对动态变化的现实世界知识缺乏更新能力,从而产生过时预测。解决方案的关键在于构建了一个名为V-DyKnow的视觉动态知识基准,用于系统评估VLMs在跨模态(文本与图像)场景下对时间敏感事实的掌握可靠性、知识编辑与多模态检索增强生成(Multi-modal RAG)方法的有效性,并通过数据和机制分析定位过时预测的根本来源。实证结果表明,VLMs普遍存在知识过时现象,且跨模态一致性差,现有对齐策略难以有效实现多模态知识更新,揭示了当前VLMs在获取和维护动态知识方面的根本局限。

链接: https://arxiv.org/abs/2603.16581
作者: Seyed Mahed Mousavi,Christian Moiola,Massimo Rizzoli,Simone Alghisi,Giuseppe Riccardi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models’ knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.

[AI-25] Malicious Or Not: Adding Repository Context to Agent Skill Classification

【速读】:该论文旨在解决当前AI代理技能(Agent Skills)生态系统中安全扫描工具误报率过高的问题,即现有自动化扫描器基于技能描述文本将大量技能标记为恶意行为,导致对整个生态风险的误判。其关键解决方案在于构建一个大规模、多源的数据收集与分析框架,从三大分发平台及GitHub获取238,180个唯一技能,并通过比对其技能描述与对应GitHub仓库内容的一致性,实现更精准的行为验证。此方法显著降低误报率——将原本高达46.8%的恶意技能比例降至仅0.52%,从而提供更可靠的生态风险画像,并揭示了如废弃GitHub仓库被劫持等新型攻击向量。

链接: https://arxiv.org/abs/2603.16572
作者: Florian Holzbauer,David Schmidt,Gabriel Gegenhuber,Sebastian Schrittwieser,Johanna Ullrich
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 23 Pages, 10 Figures

点击查看摘要

Abstract:Agent skills extend local AI agents, such as Claude Code or Open Claw, with additional functionality, and their popularity has led to the emergence of dedicated skill marketplaces, similar to app stores for mobile applications. Simultaneously, automated skill scanners were introduced, analyzing the skill description available in this http URL, to verify their benign behavior. The results for individual marketplaces mark up to 46.8% of skills as malicious. In this paper, we present the largest empirical security analysis of the AI agent skill ecosystem, questioning this high classification of malicious skills. Therefore, we collect 238,180 unique skills from three major distribution platforms and GitHub to systematically analyze their type and behavior. This approach substantially reduces the number of skills flagged as non-benign by security scanners to only 0.52%, which remain in maliciously flagged repositories. Consequently, our methodology substantially reduces false positives and provides a more robust view of the ecosystem’s current risk surface. Beyond that, we extend the security analysis from the mere investigation of the skill description to a comparison of its congruence with the GitHub repository the skill is embedded in, providing additional context. Furthermore, our analysis also uncovers several as-yet undocumented real-world attack vectors, namely hijacking skills hosted on abandoned GitHub repositories.

[AI-26] Manifold-Matching Autoencoders

【速读】:该论文旨在解决自编码器(Autoencoder)在无监督学习中缺乏有效正则化机制,导致潜在空间表示质量不高、难以保持数据流形结构的问题。其解决方案的关键在于提出一种名为流形匹配(Manifold-Matching, MMAE)的简单无监督正则化策略:通过最小化潜在空间与输入空间之间成对距离的均方误差,强制两者在几何结构上对齐。该方法不依赖于坐标级对齐,而是基于距离关系的约束,因此可扩展至低维表示,并在保留最近邻距离和持久同调(persistent homology)度量方面优于现有方法,同时提供多维缩放(Multi-Dimensional Scaling, MDS)的可扩展近似。

链接: https://arxiv.org/abs/2603.16568
作者: Laurent Cheret,Vincent Létourneau,Isar Nejadgholi,Chris Drummond,Hussein Al Osman,Maia Fraser
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study a simple unsupervised regularization scheme for autoencoders called Manifold-Matching (MMAE): we align the pairwise distances in the latent space to those of the input data space by minimizing mean squared error. Because alignment occurs on pairwise distances rather than coordinates, it can also be extended to a lower-dimensional representation of the data, adding flexibility to the method. We find that this regularization outperforms similar methods on metrics based on preservation of nearest-neighbor distances and persistent homology-based measures. We also observe that MMAE provides a scalable approximation of Multi-Dimensional Scaling (MDS).
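The regularizer described above is simple enough to write down directly: the mean squared error between the pairwise-distance matrices of the input batch and its latent codes. A minimal numpy sketch follows; in actual training this term would be added to the reconstruction loss and minimised by gradient descent over the encoder, which is omitted here.

```python
import numpy as np

def pairwise_dists(X):
    """Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def mm_loss(X, Z):
    """MSE between pairwise distances of inputs X and latent codes Z."""
    return ((pairwise_dists(X) - pairwise_dists(Z)) ** 2).mean()

X = np.array([[0.0, 0.0], [3.0, 4.0]])  # input pair at distance 5
Z_good = np.array([[0.0], [5.0]])       # latent pair also at distance 5
Z_bad = np.array([[0.0], [1.0]])        # latent pair at distance 1
print(mm_loss(X, Z_good))               # → 0.0
print(mm_loss(X, Z_bad) > 0)            # → True
```

Because the objective compares distances rather than coordinates, the latent dimension (1 here, against 2 in input space) is free to differ, which is the flexibility the abstract highlights.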

[AI-27] Exploring different approaches to customize language models for domain-specific text-to-code generation

【速读】:该论文旨在解决通用大语言模型(Large Language Models, LLMs)在特定编程领域中表现不足的问题,尤其是在需要使用领域专用库、API 或编码规范的场景下。其核心挑战在于如何以低成本方式提升较小开源模型在特定编程任务中的代码生成能力。解决方案的关键在于通过构建三个 Python 生态领域的合成数据集(涵盖通用 Python 编程、Scikit-learn 机器学习流程和 OpenCV 计算机视觉任务),并对比三种定制策略:少量样本提示(few-shot prompting)、检索增强生成(Retrieval-Augmented Generation, RAG)以及基于低秩适应(Low-Rank Adaptation, LoRA)的参数高效微调。实验表明,LoRA 微调在多数任务中均能实现更高的准确率和更强的领域对齐性,而提示方法虽成本低但对基准指标提升有限,揭示了灵活性、计算开销与性能之间的权衡关系。

链接: https://arxiv.org/abs/2603.16526
作者: Luís Freire,Fernanda A. Andaló,Nicki Skafte Detlefsen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong capabilities in generating executable code from natural language descriptions. However, general-purpose models often struggle in specialized programming contexts where domain-specific libraries, APIs, or conventions must be used. Customizing smaller open-source models offers a cost-effective alternative to relying on large proprietary systems. In this work, we investigate how smaller language models can be adapted for domain-specific code generation using synthetic datasets. We construct datasets of programming exercises across three domains within the Python ecosystem: general Python programming, Scikit-learn machine learning workflows, and OpenCV-based computer vision tasks. Using these datasets, we evaluate three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Performance is evaluated using both benchmark-based metrics and similarity-based metrics that measure alignment with domain-specific code. Our results show that prompting-based approaches such as few-shot learning and RAG can improve domain relevance in a cost-effective manner, although their impact on benchmark accuracy is limited. In contrast, LoRA-based fine-tuning consistently achieves higher accuracy and stronger domain alignment across most tasks. These findings highlight practical trade-offs between flexibility, computational cost, and performance when adapting smaller language models for specialized programming tasks.
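Of the three strategies compared, LoRA is the one with a compact mathematical core: instead of updating a full weight matrix W, one trains a low-rank correction B @ A so the adapted layer computes x @ (W + B @ A).T. The numpy sketch below uses toy shapes of my own choosing, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2            # r << d is the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))            # trainable up-projection, zero-initialised

def lora_forward(x):
    return x @ (W + B @ A).T

x = rng.normal(size=(1, d_in))
# With B = 0 the adapter is a no-op: output equals the frozen layer's output.
print(np.allclose(lora_forward(x), x @ W.T))  # → True
# Trainable parameters: r * (d_in + d_out) = 32, versus 64 for full W.
print(A.size + B.size)                        # → 32
```

The zero initialisation of B is the standard trick that makes fine-tuning start exactly from the pretrained model, which is part of why LoRA tends to adapt stably on small domain datasets.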

[AI-28] FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

【速读】:该论文旨在解决大型结构化数据基础模型(Large Structured-Data Models, LDMs)在实际应用中面临的三大核心问题:一是基于样本级自注意力机制的模型存在O(N²)复杂度,限制了样本数量;二是线性序列模型因隐藏状态压缩和人为因果偏置导致表征能力下降;三是仅使用合成数据预训练难以匹配真实世界分布。解决方案的关键在于提出FEAT,一种线性复杂度的基础模型,其核心创新是多层双轴架构:通过自适应融合双Mamba-2(Adaptive-Fusion bi-Mamba-2, AFBM)捕捉局部样本依赖关系,并结合卷积门控线性注意力(Convolutional Gated Linear Attention, Conv-GLA)实现全局记忆建模,从而在保持表达能力的同时实现跨样本建模的线性复杂度;此外,采用混合结构因果模型流水线和稳定重建目标提升鲁棒性,实验证明该方法在11个真实数据集上显著优于基线模型,且推理速度最高提升40倍。

链接: https://arxiv.org/abs/2603.16513
作者: Zhenghang Song,Tang Qian,Lu Chen,Yushuai Li,Zhengke Hu,Bingbing Fang,Yumeng Song,Junbo Zhao,Sheng Zhang,Tianyi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Structured data is foundational to healthcare, finance, e-commerce, and scientific data management. Large structured-data models (LDMs) extend the foundation model paradigm to unify heterogeneous datasets for tasks such as classification, regression, and decision support. However, existing LDMs face major limitations. First, most rely on sample-wise self-attention, whose O(N^2) complexity limits the sample count. Second, linear sequence models often degrade representations due to hidden-state compression and artificial causal bias. Third, synthetic-only pre-training often fails to match real-world distributions. We propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT introduces a multi-layer dual-axis architecture that replaces quadratic attention with hybrid linear encoding. The architecture combines adaptive-fusion bi-Mamba-2 (AFBM) for local sample dependencies and convolutional gated linear attention (Conv-GLA) for global memory. This design enables linear-complexity cross-sample modeling while preserving expressive representations. To improve robustness, FEAT adopts a hybrid structural causal model pipeline and a stable reconstruction objective. Experiments on 11 real-world datasets show that FEAT consistently outperforms baselines in zero-shot performance, while scaling linearly and achieving up to 40x faster inference.

[AI-29] Bridging the High-Frequency Data Gap: A Millisecond-Resolution Network Dataset for Advancing Time Series Foundation Models

【速读】:该论文旨在解决当前时间序列基础模型(Time Series Foundation Models, TSFMs)在预训练阶段缺乏高频率数据支持的问题,从而限制了其在真实场景中的泛化能力和鲁棒性。现有大规模数据集主要覆盖低频时间序列(采样间隔为秒至年),难以捕捉毫秒级无线网络与交通条件的动态特征。解决方案的关键在于构建一个全新的、具有毫秒级分辨率的无线网络与交通状态数据集,该数据集来源于实际5G部署,首次将高频率数据引入TSFMs的预训练范式,并拓展了无线网络这一新领域。通过在此数据集上对传统机器学习模型和TSFMs进行基准测试,作者发现大多数TSFM配置在零样本和微调设置下表现不佳,凸显出在预训练中融入高频率数据对于提升模型架构设计、微调策略及跨域适应能力的重要性。

链接: https://arxiv.org/abs/2603.16497
作者: Subina Khanal,Seshu Tirupathi,Merim Dzaferagic,Marco Ruffini,Torben Bach Pedersen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series foundation models (TSFMs) require diverse, real-world datasets to adapt across varying domains and temporal frequencies. However, current large-scale datasets predominantly focus on low-frequency time series with sampling intervals, i.e., time resolution, in the range of seconds to years, hindering their ability to capture the nuances of high-frequency time series data. To address this limitation, we introduce a novel dataset that captures millisecond-resolution wireless and traffic conditions from an operational 5G wireless deployment, expanding the scope of TSFMs to incorporate high-frequency data for pre-training. Further, the dataset introduces a new domain, wireless networks, thus complementing existing more general domains like energy and finance. The dataset also provides use cases for short-term forecasting, with prediction horizons spanning from 100 milliseconds (1 step) to 9.6 seconds (96 steps). By benchmarking traditional machine learning models and TSFMs on predictive tasks using this dataset, we demonstrate that most TSFM model configurations perform poorly on this new data distribution in both zero-shot and fine-tuned settings. Our work underscores the importance of incorporating high-frequency datasets during pre-training and forecasting to enhance architectures, fine-tuning strategies, generalization, and robustness of TSFMs in real-world applications.

[AI-30] ExpressMind: A Multimodal Pretrained Large Language Model for Expressway Operation

【速读】:该论文旨在解决当前高速公路运营中依赖规则驱动且孤立的模型所导致的知识跨系统协同分析能力不足的问题,以及通用大语言模型(Large Language Models, LLMs)在非常规场景下难以有效理解交通法规与事件因果关系的局限性。解决方案的关键在于构建首个面向高速公路领域的全栈预训练多模态大语言模型(Multimodal Large Language Model, MLLM)——ExpressMind,其核心创新包括:1)提出基于自监督学习与无监督学习的双层预训练范式;2)设计图增强的检索增强生成(Graph-Augmented RAG)框架以动态索引高速公路知识库;3)引入强化学习对齐的思维链(RL-aligned Chain-of-Thought, RL-CoT)机制,确保模型推理与专家应急处理策略的一致性;4)集成跨模态编码器实现视觉与文本通道动态特征序列的对齐,从而支持视频与图像模态下的交通场景理解。实验表明,ExpressMind在事件检测、安全响应生成及复杂交通分析任务上显著优于现有基线方法。

链接: https://arxiv.org/abs/2603.16495
作者: Zihe Wang,Yihuan Wang,Haiyang Yu,Zhiyong Cui,Xiaojian Liao,Chengcheng Wang,Yonglin Tian,Yongxin Tong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The current expressway operation relies on rule-based and isolated models, which limits the ability to jointly analyze knowledge across different systems. Meanwhile, Large Language Models (LLMs) are increasingly applied in intelligent transportation, advancing traffic models from algorithmic to cognitive intelligence. However, general LLMs are unable to effectively understand the regulations and causal relationships of events in unconventional scenarios in the expressway field. Therefore, this paper constructs a pre-trained multimodal large language model (MLLM) for expressways, ExpressMind, which serves as the cognitive core for intelligent expressway operations. This paper constructs the industry’s first full-stack expressway dataset, encompassing traffic knowledge texts, emergency reasoning chains, and annotated video events to overcome data scarcity. This paper proposes a dual-layer LLM pre-training paradigm based on self-supervised training and unsupervised learning. Additionally, this study introduces a Graph-Augmented RAG framework to dynamically index the expressway knowledge base. To enhance reasoning for expressway incident response strategies, we develop a RL-aligned Chain-of-Thought (RL-CoT) mechanism that enforces consistency between model reasoning and expert problem-solving heuristics for incident handling. Finally, ExpressMind integrates a cross-modal encoder to align the dynamic feature sequences under the visual and textual channels, enabling it to understand traffic scenes in both video and image modalities. Extensive experiments on our newly released multi-modal expressway benchmark demonstrate that ExpressMind comprehensively outperforms existing baselines in event detection, safety response generation, and complex traffic analysis. The code and data are available at: this https URL.

[AI-31] Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures

【速读】:该论文试图解决的问题是:在基于模式引导的推理流程(schema-guided reasoning pipelines)中,模型生成的中间结构(如检查表、验证查询等)是否真正因果性地决定了最终决策,还是仅作为伴随现象存在。为解决这一问题,作者提出了一种因果评估协议(causal evaluation protocol),其关键在于设计任务场景,使得中间结构与最终决策之间存在确定性映射关系——即对中间结构进行可控修改后,可唯一确定正确输出。实验结果表明,尽管大语言模型(LLM)在未干预时表现出与其自身中间结构的一致性,但在约60%的情况下无法根据干预后的中间结构更新预测,说明其“忠实性”在结构变动时极为脆弱;而将决策过程从模型内部转移到外部工具时,这种脆弱性显著降低,表明中间结构本质上更像具有影响力的上下文信息,而非稳定的因果中介。

链接: https://arxiv.org/abs/2603.16475
作者: Oleg Somov,Mikhail Chaichuk,Mikhail Seleznyov,Alexander Panchenko,Elena Tutubalina
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Schema-guided reasoning pipelines ask LLMs to produce explicit intermediate structures – rubrics, checklists, verification queries – before committing to a final decision. But do these structures causally determine the output, or merely accompany it? We introduce a causal evaluation protocol that makes this directly measurable: by selecting tasks where a deterministic function maps intermediate structures to decisions, every controlled edit implies a unique correct output. Across eight models and three benchmarks, models appear self-consistent with their own intermediate structures but fail to update predictions after intervention in up to 60% of cases – revealing that apparent faithfulness is fragile once the intermediate structure changes. When derivation of the final decision from the structure is delegated to an external tool, this fragility largely disappears; however, prompts which ask to prioritize the intermediate structure over the original input do not materially close the gap. Overall, intermediate structures in schema-guided pipelines function as influential context rather than stable causal mediators.
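The evaluation protocol hinges on one property: a deterministic function maps the intermediate structure to the decision, so every controlled edit implies a unique correct output. A toy harness illustrating that scoring logic is sketched below; the checklist format and the stub "model" are invented for illustration, and the stub deliberately exhibits the failure mode the paper reports (ignoring the structure).

```python
def decision_from_checklist(checklist):
    """Deterministic mapping: approve iff every checklist item passed."""
    return "approve" if all(checklist.values()) else "reject"

def faithfulness_score(model, checklists):
    """Fraction of intervened checklists where the model matches the implied answer."""
    hits = sum(model(c) == decision_from_checklist(c) for c in checklists)
    return hits / len(checklists)

# A stub "model" that ignores the checklist entirely.
always_approve = lambda checklist: "approve"

cases = [{"a": True, "b": True},
         {"a": True, "b": False},
         {"a": False, "b": False}]
print(faithfulness_score(always_approve, cases))  # → 0.3333333333333333
```

A model that truly treats the checklist as a causal mediator would score 1.0 under every intervention; scores well below that, as found for up to 60% of cases, indicate the structure is merely influential context.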

[AI-32] Multi-Agent Reinforcement Learning Counteracts Delayed CSI in Multi-Satellite Systems

【速读】:该论文旨在解决卫星通信中因用户与卫星间高传播延迟导致的信道状态信息(Channel State Information, CSI)过时问题,从而影响下行链路传输性能的问题。其解决方案的关键在于提出一种双阶段近端策略优化(Dual Stage Proximal Policy Optimization, DS-PPO)算法,该算法通过分层优化机制应对多卫星作为分布式基站(Distributed Base Station, BS)场景下的连续动作空间大和非独立同分布(non-IID)环境挑战:第一阶段优化单个卫星的用户总速率,第二阶段协同所有卫星形成虚拟多天线系统以进一步提升整体性能。数值结果表明,DS-PPO在CSI不完美条件下仍具鲁棒性,并显著提升了系统吞吐量。

链接: https://arxiv.org/abs/2603.16470
作者: Marios Aristodemou,Yasaman Omid,Sangarapillai Lambotharan,Mahsa Derakhshan,Lajos Hanzo
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 12 pages, 6 figures. Submitted to IEEE Transactions on Vehicular Technology; it has been reviewed once

点击查看摘要

Abstract:The integration of satellite communication networks with next-generation (NG) technologies is a promising approach towards global connectivity. However, the quality of services is highly dependent on the availability of accurate channel state information (CSI). Channel estimation in satellite communications is challenging due to the high propagation delay between terrestrial users and satellites, which results in outdated CSI observations on the satellite side. In this paper, we study the downlink transmission of multiple satellites acting as distributed base stations (BS) to mobile terrestrial users. We propose a multi-agent reinforcement learning (MARL) algorithm which aims for maximising the sum-rate of the users, while coping with the outdated CSI. We design a novel bi-level optimisation procedure, termed dual stage proximal policy optimisation (DS-PPO), for tackling the problem of large continuous action spaces as well as of non-independent and identically distributed (non-IID) environments in MARL. Specifically, the first stage of DS-PPO maximises the sum-rate for an individual satellite and the second stage maximises the sum-rate when all the satellites cooperate to form a distributed multi-antenna BS. Our numerical results demonstrate the robustness of DS-PPO to CSI imperfections as well as the sum-rate improvement attained by the use of DS-PPO. In addition, we provide the convergence analysis for the DS-PPO along with the computational complexity.

[AI-33] RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在现实商业场景中长期、动态环境下保持连贯决策能力的挑战,尤其是面对随机需求和不断变化的外部条件时,现有LLM代理难以维持稳定且高效的长期策略执行。其解决方案的核心是提出Evolving Strategy Execution(ESE)框架,该框架通过将高层战略推理与底层动作执行分离,实现策略层面的自适应演化与可解释性更新,从而在不同时间尺度上应对环境非平稳性和误差累积问题,显著提升了操作稳定性与效率,尽管仍存在任务复杂度上升时性能显著下降的问题,揭示了当前LLM在多因素、长周期决策中的根本局限性。

链接: https://arxiv.org/abs/2603.16453
作者: Linghua Zhang,Jun Wang,Jingtong Wu,Zhisong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.

[AI-34] RUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

【速读】:该论文旨在解决现实企业环境中文本到SQL(Text-to-SQL)解析面临的“未知模式”(Unknown Schema)问题,即数据库包含数百张表且存在大量噪声元数据时,传统方法依赖预加载完整模式(Full Schema Assumption)失效的问题。解决方案的关键在于提出TRUST-SQL框架,其核心创新是将任务建模为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process),并设计了一种结构化的四阶段协议,使推理始终基于验证过的元数据;同时引入双轨GRPO(Dual-Track GRPO)策略,通过token级掩码优势分离探索奖励与执行结果,实现更精准的信用分配,从而显著提升模型在无预加载元数据条件下的性能表现。

链接: https://arxiv.org/abs/2603.16448
作者: Ai Jian,Xiaoyun Zhang,Wanrou Du,Jingqing Ruan,Jiangbo Pei,Weipeng Zhang,Ke Zeng,Xunliang Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.
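The credit-assignment idea behind Dual-Track GRPO can be sketched numerically: token-level masks route different reward signals to different spans of the generated trajectory, so exploration (tool-call) tokens receive the exploration-track advantage while answer tokens receive the execution-track advantage. The masks, token strings, and advantage values below are invented for illustration only.

```python
import numpy as np

tokens = ["CALL", "list_tables", "SELECT", "*", "FROM", "t"]
is_tool = np.array([1, 1, 0, 0, 0, 0], dtype=float)  # exploration-span mask
is_answer = 1.0 - is_tool                            # final-SQL-span mask

adv_explore = 0.5    # advantage from the exploration reward (schema verified)
adv_execute = -1.0   # advantage from the execution outcome (query failed)

# Each token receives only the advantage of the track it belongs to, so a
# failed query does not penalise the useful schema exploration that preceded it.
token_advantage = is_tool * adv_explore + is_answer * adv_execute
print(token_advantage.tolist())  # → [0.5, 0.5, -1.0, -1.0, -1.0, -1.0]
```

With a single undifferentiated advantage (standard GRPO), both spans would share one scalar, entangling the two signals the paper's ablation shows are better kept separate.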

[AI-35] Visual Distraction Undermines Moral Reasoning in Vision-Language Models

【速读】:该论文旨在解决当前人工智能(AI)系统在多模态环境下道德推理一致性不足的问题,特别是当AI从纯文本交互发展为具身代理(embodied agents)时,视觉输入如何影响道德决策机制尚不明确。现有安全技术主要针对文本场景设计,缺乏对视觉因素的系统性控制与评估,导致其在多模态情境下可能失效。解决方案的关键在于提出一个基于道德基础理论(Moral Foundation Theory, MFT)的多模态基准测试框架——道德困境模拟(Moral Dilemma Simulation, MDS),通过正交操控视觉和语境变量,揭示了视觉模态会激活类直觉路径,从而绕过文本中已训练的安全约束机制,暴露出语言调优的安全过滤器在视觉处理中的失效风险,强调了多模态对齐(multimodal safety alignment)的紧迫性。

链接: https://arxiv.org/abs/2603.16445
作者: Xinyi Yang,Chenheng Xu,Weijun Hong,Ce Mo,Qian Wang,Fang Fang,Yixin Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about generalization to visual inputs. Existing moral evaluation benchmarks rely on textonly formats and lack systematic control over variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment.

[AI-36] From Natural Language to Executable Option Strategies via Large Language Models

【速读】:该论文旨在解决将自然语言描述的交易意图准确转化为正确期权策略的问题,这一任务在真实期权市场中尤为复杂,因其需处理多维期权链数据并满足严格的约束条件,传统直接生成方法难以保证逻辑一致性和执行准确性。解决方案的关键在于提出一种领域特定的中间表示——期权查询语言(Option Query Language, OQL),它通过语法化规则将期权市场抽象为高层语义原语,使大型语言模型(LLMs)能够作为可靠的语义解析器而非自由编程者使用;随后,OQL查询由一个确定性引擎验证并执行,从而生成可执行的期权策略,显著提升了执行准确率与逻辑一致性。

链接: https://arxiv.org/abs/2603.16434
作者: Haochen Luo,Zhengzhao Lai,Junjie Xu,Yifan Li,Tang Pok Hin,Yuan Zhang,Chen Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel at general code generation, yet translating natural-language trading intents into correct option strategies remains challenging. Real-world option design requires reasoning over massive, multi-dimensional option chain data with strict constraints, which often overwhelms direct generation methods. We introduce the Option Query Language (OQL), a domain-specific intermediate representation that abstracts option markets into high-level primitives under grammatical rules, enabling LLMs to function as reliable semantic parsers rather than free-form programmers. OQL queries are then validated and executed deterministically by an engine to instantiate executable strategies. We also present a new dataset for this task and demonstrate that our neuro-symbolic pipeline significantly improves execution accuracy and logical consistency over direct baselines.
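The neuro-symbolic split described above, where the LLM emits a structured query and a deterministic engine validates and executes it, can be sketched with a mini-DSL. The grammar, query fields, and toy option chain below are invented stand-ins and not OQL itself; they only show why validation before execution keeps the pipeline's behaviour deterministic.

```python
CHAIN = [  # toy option-chain rows: (type, strike, expiry_days)
    ("call", 95, 30), ("call", 100, 30), ("call", 105, 30), ("put", 100, 30),
]

def validate(query):
    """Reject queries that fall outside the grammar before execution."""
    if query.get("select") not in {"call", "put"}:
        raise ValueError("unknown option type")
    if not isinstance(query.get("min_strike"), (int, float)):
        raise ValueError("min_strike must be numeric")
    return query

def execute(query):
    q = validate(query)
    return [row for row in CHAIN
            if row[0] == q["select"] and row[1] >= q["min_strike"]]

# "Select calls struck at or above 100" as a structured query:
print(execute({"select": "call", "min_strike": 100}))
# → [('call', 100, 30), ('call', 105, 30)]
```

Because the LLM only has to produce the small structured query rather than free-form code over the full chain, parsing errors surface at validation time instead of silently producing a wrong strategy.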

[AI-37] An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)微调过程中因内存密集特性导致的GPU资源瓶颈问题,尤其在单GPU环境下难以实现高效微调的挑战。其核心解决方案在于提出SlideFormer系统,关键创新包括:(1) 一种轻量级异步引擎,将GPU视为滑动窗口,实现GPU计算与CPU更新及多级I/O操作的重叠;(2) 一种高效的异构内存管理机制,显著降低峰值内存占用;(3) 针对关键性能瓶颈优化的Triton内核与先进I/O集成。这一协同设计使得在单张RTX 4090显卡上即可支持123B+参数模型的微调,同时提升吞吐量达1.40x–6.27x,并将CPU/GPU内存使用量减少约50%,且在NVIDIA和AMD GPU上均能维持95%的峰值性能。

链接: https://arxiv.org/abs/2603.16428
作者: Ruijia Yang,Zeyi Wen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory demands exceed the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) A lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O. (2) A highly efficient heterogeneous memory management scheme significantly reduces peak memory usage. (3) Optimized Triton kernels to solve key bottlenecks and integrated advanced I/O. This collaborative design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining 95% peak performance on both NVIDIA and AMD GPUs.

[AI-38] Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences

【速读】:该论文试图解决的问题是:为何仅使用负向反馈(negative-only feedback)训练大语言模型(Large Language Models, LLMs)在某些任务上能与标准的人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)相媲美甚至超越,而目前缺乏一个统一的理论框架来解释这一现象。解决方案的关键在于提出一种结构上的不对称性理论:正向偏好(positive preferences)体现为连续、情境依赖的人类价值观,难以穷尽定义,易导致模型学习表面相关特征(如迎合用户的行为,即sycophancy);而负向约束(negative constraints)则表现为离散、有限且可独立验证的禁止项,能够收敛到稳定边界。这种不对称性根植于波普尔的证伪逻辑(falsification logic)和负知识(negative knowledge)的认识论,从而解释了RLHF中sycophancy失败的原因,并阐明了负向信号方法有效性的根源。论文进一步主张,对齐研究应从“学习人类偏好”转向“学习人类排斥”,并为此提供了可检验的预测。

链接: https://arxiv.org/abs/2603.16417
作者: Quan Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, position paper

点击查看摘要

Abstract:Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences (“which is better”) encode continuously coupled, context-dependent human values that cannot be exhaustively specified – leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints (“what is wrong”) encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry – rooted in Popper’s falsification logic and the epistemology of negative knowledge – explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from “learning what humans prefer” to “learning what humans reject,” and offer testable predictions for this framework.
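The structural asymmetry the position paper argues for can be made concrete with a toy contrast: negative constraints form a finite set of independently verifiable predicates, so compliance is checkable output by output with no ranking model, whereas a positive preference is a comparison that cannot be verified on a single output. The two rules below are toy examples of my own, not drawn from any real constitution.

```python
PROHIBITIONS = [  # each is an independently checkable predicate on an output
    lambda text: "password" in text.lower(),  # toy rule: no credential leakage
    lambda text: len(text) == 0,              # toy rule: no empty responses
]

def violates(text):
    """True iff the output trips any prohibition; no preference model needed."""
    return any(rule(text) for rule in PROHIBITIONS)

print(violates("Here is the plan."))         # → False
print(violates("The password is hunter2."))  # → True
```

The falsificationist point is visible in the types: `violates` is a total, decidable check per output, while "which of two outputs is better" requires a learned, context-dependent ranking, which is where the paper locates the sycophancy failure mode.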

[AI-39] Trained Persistent Memory for Frozen Encoder–Decoder LLMs: Six Architectural Methods

【速读】:该论文旨在解决冻结的编码器-解码器语言模型(frozen encoder–decoder language models)缺乏跨会话持续记忆的问题,这类模型在每次前向传播后都会丢弃潜在表示(latent representation),导致信息无法在不同对话轮次间留存。解决方案的关键在于在连续潜在空间(continuous latent space)中实现可微分的记忆机制:通过六种架构方法,在三个注入点和四种写入机制下,将记忆以密集向量形式存储于一个紧凑的数值数组(memory bank)中;训练仅作用于小规模可训练适配器(trainable adapters),而推理时无需梯度即可持续积累记忆,从而支持对话式学习(conversational learning)。实验表明,在LoCoMo数据集上,即使在资源受限条件下(单个Flan-T5-XL骨干网络、单一数据集),10倍容量下的所有适配器均能产生正向记忆召回曲线,验证了该方案的可行性与容量敏感性。

链接: https://arxiv.org/abs/2603.16413
作者: Hong Jeong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Frozen encoder–decoder language models are stateless: the latent representation is discarded after every forward pass, so no information persists across sessions. This paper presents a proof-of-concept pilot study showing that persistent memory in the continuous latent space of a frozen LLM is feasible – even under severe resource constraints (a single frozen Flan-T5-XL backbone, small trainable adapters, a single dataset). We implement six architectural methods spanning three injection points and four write mechanisms; unlike text-level memory systems, every write and read is a differentiable operation on dense vectors. After training only the adapter, the memory bank continues to accumulate at inference time without gradients, enabling conversational learning. Under a forgetting-curve evaluation on LoCoMo at two capacity scales (1× and 10×), the stateless baseline scores exactly zero; at 10× all six trained adapters produce positive memory-recall curves; at 1× three methods collapse, revealing capacity as a critical design parameter. Because the memory bank is a compact numerical array, it can be scaled to arbitrarily large capacity without altering the backbone. We argue that full end-to-end training with larger models, larger data, and orders-of-magnitude larger memory will yield substantially stronger results; this pilot study establishes the feasibility baseline and design-space taxonomy that such efforts require.
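下面给出一个极简的稠密向量记忆库示意(假设性实现,非论文附带代码):写入与读取均为对记忆槽的软寻址数值运算,写入后无需梯度即可在推理期持续积累,对应摘要中"无状态基线得分为零、训练后记忆召回为正"的设定。

```python
import numpy as np

class LatentMemoryBank:
    """稠密向量记忆库的示意实现(假设性,非论文原始代码)。"""

    def __init__(self, num_slots: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.keys = rng.normal(size=(num_slots, dim)) / np.sqrt(dim)  # 槽位寻址键
        self.values = np.zeros((num_slots, dim))                      # 记忆内容,初始为空

    def _attention(self, query: np.ndarray) -> np.ndarray:
        # 读写共用的 softmax 软寻址
        scores = self.keys @ query
        w = np.exp(scores - scores.max())
        return w / w.sum()

    def write(self, latent: np.ndarray, lr: float = 0.5) -> None:
        # 软寻址写入:每个槽位向 latent 做凸组合靠拢,推理期无需梯度
        w = self._attention(latent)[:, None]
        self.values = (1.0 - lr * w) * self.values + lr * w * latent

    def read(self, query: np.ndarray) -> np.ndarray:
        # 注意力加权读取
        return (self._attention(query)[:, None] * self.values).sum(axis=0)
```

写入前读取为零向量(对应无状态基线);写入一条潜在向量后,用同一查询读取可恢复其方向。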

[AI-40] Robust Physics-Guided Diffusion for Full-Waveform Inversion

【速读】:该论文旨在解决全波形反演(Full-Waveform Inversion, FWI)中因数据噪声、振幅失衡及时间/相位错位导致的重建不稳定与精度不足问题。其解决方案的关键在于提出一种物理引导的扩散框架,将基于评分的生成先验(score-based generative prior)与通过波动方程模拟计算的似然引导(likelihood guidance)相结合;同时采用基于Wasserstein-2距离的数据一致性势能(transport-based data-consistency potential),引入有界加权和观测依赖归一化以增强波场并提升对振幅不平衡和时移敏感性的鲁棒性;在推理阶段进一步设计预条件化的引导逆向扩散机制,动态调整引导强度与空间缩放,从而实现比标准扩散后验采样(Diffusion Posterior Sampling, DPS)更稳定有效的数据一致性引导步骤。

链接: https://arxiv.org/abs/2603.16393
作者: Jishen Peng,Enze Jiang,Zheng Ma,Xiongbin Yan
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We develop a robust physics-guided diffusion framework for full-waveform inversion that combines a score-based generative prior with likelihood guidance computed through wave-equation simulations. We adopt a transport-based data-consistency potential (Wasserstein-2), incorporating wavefield enhancement via bounded weighting and observation-dependent normalization, thereby improving robustness to amplitude imbalance and time/phase misalignment. On the inference side, we introduce a preconditioned guided reverse-diffusion scheme that adapts the guidance strength and spatial scaling throughout the reverse-time dynamics, yielding a more stable and effective data-consistency guidance step than standard diffusion posterior sampling (DPS). Numerical experiments on OpenFWI datasets demonstrate improved reconstruction quality over deterministic optimization baselines and standard DPS under comparable computational budgets.
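以下用一个线性玩具问题示意 DPS 风格的"先验去噪 + 数据一致性引导"单步更新(假设性简化:以 L2 势能代替文中的 Wasserstein-2 势能,自适应预条件化简化为标量引导强度):

```python
import numpy as np

def data_consistency_grad(x: np.ndarray, A: np.ndarray, y: np.ndarray) -> np.ndarray:
    # 简化的数据一致性势能梯度:对 0.5*||A x - y||^2 求导(以 L2 代替论文的 W2,仅作示意)
    return A.T @ (A @ x - y)

def guided_reverse_step(x_t, denoise, A, y, guidance=0.5):
    """单步引导逆向扩散示意:得分先验的去噪估计,再叠加似然引导。"""
    x0_hat = denoise(x_t)                    # 得分先验给出的去噪估计
    g = data_consistency_grad(x0_hat, A, y)  # 波动方程模拟在玩具例中退化为线性算子 A
    return x0_hat - guidance * g             # 论文的预条件/自适应缩放此处简化为常数 guidance
```

反复迭代该步骤会使数据失配 ||A x - y|| 单调收缩,可作为理解引导项作用的最小例子。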

[AI-41] Age Predictors Through the Lens of Generalization Bias Mitigation and Interpretability: Reflections on Causal Implications

【速读】:该论文旨在解决年龄预测模型在分布外(out-of-distribution, OOD)场景下泛化能力不足的问题,其根源在于模型对种族、性别或组织类型等外部属性(exogenous attributes)的敏感性。这些问题不仅导致预测性能下降,还可能引入偏倚并阻碍因果推断的可靠性。解决方案的关键在于通过对抗式表示学习(adversarial representation learning)构建对这些属性不变的表征(invariant representation),从而提升模型的OOD泛化能力,并在预测、因果分析和公平性保护等多个维度实现统一优化。

链接: https://arxiv.org/abs/2603.16377
作者: Debdas Paul,Elisa Ferrari,Irene Gravili,Alessandro Cellerino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chronological age predictors often fail to achieve out-of-distribution (OOD) generalization due to exogenous attributes such as race, gender, or tissue. Learning an invariant representation with respect to those attributes is therefore essential to improve OOD generalization and prevent overly optimistic results. In predictive settings, these attributes motivate bias mitigation; in causal analyses, they appear as confounders; and when protected, their suppression leads to fairness. We coherently explore these concepts with theoretical rigor and discuss the scope of an interpretable neural network model based on adversarial representation learning. Using publicly available mouse transcriptomic datasets, we illustrate the behavior of this model relative to conventional machine learning models. We observe that the outcome of this model is consistent with the predictive results of a published study demonstrating the effects of Elamipretide on mouse skeletal and cardiac muscle. We conclude by discussing the limitations of deriving causal interpretation from such purely predictive models.
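摘要中的对抗式表示学习常借助梯度反转层实现:前向恒等、反向取负,使编码器"欺骗"属性判别器,从而学到对属性不变的表示。以下为该构件的极简示意(假设性手写实现,论文未必采用此具体形式):

```python
class GradReverse:
    """梯度反转层示意:前向恒等,反向把梯度乘以 -lam(假设性手写实现)。"""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x):
        # 前向不改变表示,属性判别器照常接收输入
        return x

    def backward(self, grad_output):
        # 反向取负:编码器收到与判别器目标相反的梯度,从而抹去属性信息
        return [-self.lam * g for g in grad_output]
```

将该层插在编码器与属性判别器之间,判别器越努力预测种族/性别/组织,编码器就越被推向抹除这些信息的方向。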

[AI-42] FederatedFactory: Generative One-Shot Learning for Extremely Non-IID Distributed Scenarios

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在局部标签分布互斥场景下因优化轨迹冲突导致的标准权重聚合失效问题,以及现有方法对预训练基础模型的依赖所引入的不切实际假设。其解决方案的关键在于提出FederatedFactory框架,该框架将联邦单元从判别参数(discriminative parameters)反转为生成先验(generative priors),通过单轮通信交换生成模块,实现无偏的类平衡数据集原位合成(ex nihilo synthesis),从而彻底消除梯度冲突和外部先验偏差。

链接: https://arxiv.org/abs/2603.16370
作者: Andrea Moleri,Christian Internò,Ali Raza,Markus Olhofer,David Klindt,Fabio Stella,Barbara Hammer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables distributed optimization without compromising data sovereignty. Yet, where local label distributions are mutually exclusive, standard weight aggregation fails due to conflicting optimization trajectories. Often, FL methods rely on pretrained foundation models, introducing unrealistic assumptions. We introduce FederatedFactory, a zero-dependency framework that inverts the unit of federation from discriminative parameters to generative priors. By exchanging generative modules in a single communication round, our architecture supports ex nihilo synthesis of universally class balanced datasets, eliminating gradient conflict and external prior bias entirely. Evaluations across diverse medical imagery benchmarks, including MedMNIST and ISIC2019, demonstrate that our approach recovers centralized upper-bound performance. Under pathological heterogeneity, it lifts baseline accuracy from a collapsed 11.36% to 90.57% on CIFAR-10 and restores ISIC2019 AUROC to 90.57%. Additionally, this framework facilitates exact modular unlearning through the deterministic deletion of specific generative modules.
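其"类均衡原位合成"可用如下极简流程示意:各客户端在单轮通信中上交生成模块,服务器按类等量采样,从机制上消除标签分布互斥带来的梯度冲突(接口与数据结构均为假设性设计):

```python
def synthesize_balanced(generators, n_per_class):
    """从各客户端交换来的生成模块按类等量采样(假设性接口示意)。"""
    dataset = []
    for label in sorted(generators):
        gen = generators[label]
        # 每个类合成同样数量的样本,得到全局类均衡数据集
        dataset.extend((gen(), label) for _ in range(n_per_class))
    return dataset
```

删除某一 label 对应的生成模块后重新合成,即对应摘要所述的"精确模块化遗忘"。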

[AI-43] DynamicGate MLP Conditional Computation via Learned Structural Dropout and Input Dependent Gating for Functional Plasticity

【速读】:该论文旨在解决传统Dropout正则化与条件计算(conditional computation)之间目标和机制不一致的问题:Dropout在训练时随机屏蔽隐藏单元以缓解过拟合,而标准推理阶段则执行完整网络进行密集计算,缺乏根据输入动态调整计算路径的能力。解决方案的关键在于提出DynamicGate-MLP框架,通过学习连续的门控概率(continuous gate probabilities)来决定每个单元或模块是否激活,并在推理时将其转换为离散的执行掩码(discrete execution mask),从而实现样本依赖的计算分配——即仅对每个输入所需的网络部分进行计算,提升计算效率。训练过程中引入对期望门控使用率的惩罚项控制计算预算,并采用Straight-Through Estimator(STE)优化离散掩码的梯度传播,使模型同时具备正则化效果与条件计算特性。

链接: https://arxiv.org/abs/2603.16367
作者: Yong Il Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 8 Figures

点击查看摘要

Abstract:Dropout is a representative regularization technique that stochastically deactivates hidden units during training to mitigate overfitting. In contrast, standard inference executes the full network with dense computation, so its goal and mechanism differ from conditional computation, where the executed operations depend on the input. This paper organizes DynamicGate-MLP into a single framework that simultaneously satisfies both the regularization view and the conditional-computation view. Instead of a random mask, the proposed model learns gates that decide whether to use each unit (or block), suppressing unnecessary computation while implementing sample-dependent execution that concentrates computation on the parts needed for each input. To this end, we define continuous gate probabilities and, at inference time, generate a discrete execution mask from them to select an execution path. Training controls the compute budget via a penalty on expected gate usage and uses a Straight-Through Estimator (STE) to optimize the discrete mask. We evaluate DynamicGate-MLP on MNIST, CIFAR-10, Tiny-ImageNet, Speech Commands, and PBMC3k, and compare it with various MLP baselines and MoE-style variants. Compute efficiency is compared under a consistent criterion using gate activation ratios and a layerweighted relative MAC metric, rather than wall-clock latency that depends on hardware and backend kernels.
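门控的核心是"前向用离散掩码、反向按连续概率传梯度"(STE),并以期望使用率约束计算预算。以下为 numpy 前向示意(假设性实现,省略了自动微分框架,STE 的反向仅以注释说明):

```python
import numpy as np

def gate_forward(logits: np.ndarray):
    """DynamicGate 式门控前向示意(假设性实现)。"""
    probs = 1.0 / (1.0 + np.exp(-logits))    # 连续门控概率
    mask = (probs > 0.5).astype(np.float64)  # 推理时的离散执行掩码
    # STE:反向时把离散掩码的梯度直接近似为连续概率的梯度
    return mask, probs

def usage_penalty(probs: np.ndarray, budget: float = 0.3, coef: float = 1.0) -> float:
    # 期望门控使用率超出预算的部分作为惩罚项,控制计算量
    return coef * max(0.0, float(probs.mean()) - budget)
```

门控概率高于 0.5 的单元被执行,其余被跳过;惩罚项把平均激活率压向给定预算。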

[AI-44] FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

【速读】:该论文致力于解决**因子挖掘(alpha factor mining)**问题,即从噪声大、非平稳的市场数据中自动发现具有预测能力的因子,并满足可直接执行、可审计以及计算可扩展性的实际要求。现有符号方法受限于表达能力不足,而神经预测模型虽性能优越却常牺牲可解释性,且对市场状态变化敏感、易过拟合。论文提出FactorEngine(FE),其核心创新在于通过三个关键分离机制提升效率与效果:(i)逻辑修订与参数优化分离,(ii)大型语言模型(LLM)引导的方向性搜索与贝叶斯超参数搜索分离,(iii)LLM使用与本地计算分离。此外,FE引入知识增强的自举模块,将非结构化财务报告转化为可执行因子程序,并建立经验知识库支持轨迹感知的迭代优化(包括从失败中学习)。实证表明,FE在真实OHLCV数据上显著优于基线方法,在信息系数(IC/ICIR)、排序信息系数(Rank IC/ICIR)及年化收益/夏普比率等方面均达到当前最优水平。

链接: https://arxiv.org/abs/2603.16365
作者: Qinhong Lin,Ruitao Feng,Yinglun Feng,Zhenxin Huang,Yukun Chen,Zhongliang Yang,Linna Zhou,Binjie Fei,Jiaqi Liu,Yu Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 10 figures

点击查看摘要

Abstract:We study alpha factor mining, the automated discovery of predictive signals from noisy, non-stationary market data-under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability for performance and remain vulnerable to regime shifts and overfitting. We introduce FactorEngine (FE), a program-level factor discovery framework that casts factors as Turing-complete code and improves both effectiveness and efficiency via three separations: (i) logic revision vs. parameter optimization, (ii) LLM-guided directional search vs. Bayesian hyperparameter search, and (iii) LLM usage vs. local computation. FE further incorporates a knowledge-infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed-loop multi-agent extraction-verification-code-generation pipeline, and an experience knowledge base that supports trajectory-aware refinement (including learning from failures). Across extensive backtests on real-world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact-for example, higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe, than baseline methods, achieving state-of-the-art predictive and portfolio performance.

[AI-45] Toward Experimentation-as-a-Service in 5G/6G: The Plaza6G Prototype for AI-Assisted Trials

【速读】:该论文旨在解决当前无线通信实验平台缺乏统一、易用且可复现的实验环境问题,尤其在6G研究中亟需整合云资源与下一代无线基础设施以支持灵活、自动化和可扩展的实验设计。其解决方案的关键在于构建首个运行中的“实验即服务”(Experiment-as-a-Service, ExaS)平台Plaza6G,该平台通过统一编排GPU加速计算集群、多种5G核心网(包括开源Free5GC与商用Cumucore)、可编程无线接入网(RAN)以及物理或仿真用户设备(UE),实现端到端实验流程自动化。平台采用自然语言交互界面(Web门户或REST API)降低实验设计门槛,并引入基于检索增强生成(Retrieval-Augmented Generation, RAG)的大语言模型(Large Language Model, LLM)辅助系统与低秩适应(Low-Rank Adaptation, LoRA)技术进行持续领域微调,从而提升实验知识获取与动态适配能力;同时结合四腔消声室和双站点室外5G网络(sub-6 GHz与毫米波频段)开展真实空中接口(Over-the-Air, OTA)测试,显著缩短实验准备时间至十分钟以内,并支持可编程传播条件下的交互式测试,最终通过机器可读实验描述符保障实验结果的可复现性。

链接: https://arxiv.org/abs/2603.16356
作者: Sergio Barrachina-Muñoz,Marc Carrascosa-Zamacois,Horacio Bleda,Umair Riaz,Yasir Maqsood,Xavier Calle,Selva Vía,Miquel Payaró,Josep Mangues-Bafalluy
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:This paper presents Plaza6G, the first operational Experiment-as-a-Service (ExaS) platform unifying cloud resources with next-generation wireless infrastructure. Developed at CTTC in Barcelona, Plaza6G integrates GPU-accelerated compute clusters, multiple 5G cores, both open-source (e.g., Free5GC) and commercial (e.g., Cumucore), programmable RANs, and physical or emulated user equipment under unified orchestration. In Plaza6G, the experiment design requires minimal expertise as it is expressed in natural language via a web portal or a REST API. The web portal and REST API are enhanced with a Large Language Model (LLM)-based assistant, which employs retrieval-augmented generation (RAG) for up-to-date experiment knowledge and Low-Rank Adaptation (LoRA) for continuous domain fine-tuning. Over-the-air (OTA) trials leverage a four-chamber anechoic facility and a dual-site outdoor 5G network operating in sub-6~GHz and mmWave bands. Demonstrations include automated CI/CD integration with sub-ten-minute setup and interactive OTA testing under programmable propagation conditions. Machine-readable experiment descriptors ensure reproducibility, while future work targets policy-aware orchestration, safety validation, and federated testbed integration toward open, reproducible wireless experimentation.

[AI-46] Detecting Sentiment Steering Attacks on RAG-enabled Large Language Models

【速读】:该论文旨在解决大规模物联网(Internet of Things, IoT)网络中日益严峻的安全威胁问题,尤其是未经授权设备接入和特定攻击类型利用现有漏洞所带来的风险。其解决方案的关键在于提出两种轻量级深度学习(Deep Learning, DL)驱动的智能入侵检测系统(Intrusion Detection System, IDS):基于卷积神经网络(Convolutional Neural Network, CNN)的IDS和基于长短期记忆网络(Long Short-Term Memory, LSTM)的IDS。这两种模型在CICIoT2023数据集上验证了其有效性,在二分类、分组分类和多类分类任务中均实现了超过98.6%的准确率,显著提升了对多种网络攻击类型的识别与分类能力,从而增强物联网环境下的安全性。

链接: https://arxiv.org/abs/2603.16342
作者: Isha Andrade,Shalaka S Mahadik,Mithun Mukherjee,Pranav M Pawar,Raja Muthalagu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, 7 figures

点击查看摘要

Abstract:The proliferation of large-scale IoT networks has been both a blessing and a curse. Not only has it revolutionized the way organizations operate by increasing the efficiency of automated procedures, but it has also simplified our daily lives. However, while IoT networks have improved convenience and connectivity, they have also increased security risk due to unauthorized devices gaining access to these networks and exploiting existing weaknesses with specific attack types. The research proposes two lightweight deep learning (DL)-based intelligent intrusion detection systems (IDS) to enhance the security of IoT networks: the proposed convolutional neural network (CNN)-based IDS and the proposed long short-term memory (LSTM)-based IDS. The research evaluated the performance of both intelligent IDSs based on DL using the CICIoT2023 dataset. DL-based intelligent IDSs successfully identify and classify various cyber threats using binary, grouped, and multi-class classification. The proposed CNN-based IDS achieves an accuracy of 99.34%, 99.02% and 98.6%, while the proposed LSTM-based IDS achieves an accuracy of 99.42%, 99.13%, and 98.68% for binary, grouped, and multi-class classification, respectively.

[AI-47] A Human-Centred Architecture for Large Language Models-Cognitive Assistants in Manufacturing within Quality Management Systems

【速读】:该论文旨在解决当前制造领域中质量管理体系(Quality Management Systems, QMS)缺乏以人为中心的软件架构,无法有效集成大语言模型认知助手(Large Language Models-Cognitive Assistants, LLM-CAs)的问题。解决方案的关键在于设计了一种基于组件的软件架构,该架构通过需求分析与软件开发流程的系统性整合,确保了灵活性、可扩展性、模块化以及对工作的增强支持,从而为LLM-CAs在QMS中的落地应用提供了可行路径,并通过迭代专家焦点小组验证了其有效性。

链接: https://arxiv.org/abs/2603.16325
作者: Marcos Galdino,Johanna Grahl,Tobias Hamann,Anas Abdelrazeq,Ingrid Isenhardt
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models-Cognitive Assistants (LLM-CAs) can enhance Quality Management Systems (QMS) in manufacturing, fostering continuous process improvement and knowledge management. However, there is no human-centred software architecture focused on QMS that enables the integration of LLM-CAs into manufacturing in the current literature. This study addresses this gap by designing a component-based architecture considering requirement analysis and software development process. Validation was conducted via iterative expert focus groups. The proposed architecture ensures flexibility, scalability, modularity, and work augmentation within QMS. Moreover, it paves the way for its operationalization with industrial partners, showcasing its potential for advancing manufacturing processes.

[AI-48] Learning to Predict Discover and Reason in High-Dimensional Discrete Event Sequences

【速读】:该论文旨在解决现代车辆中由电子控制单元(ECUs)产生的大量异步诊断故障码(DTCs)难以高效、自动地转化为可解释的高阶故障模式(EPs)的问题。随着车辆复杂度提升,传统依赖领域专家手动构建布尔规则的方式已无法满足规模性与准确性要求,且DTC序列具有高维类别空间(类比自然语言词汇量)和长序列依赖特性,使得传统统计方法失效。解决方案的关键在于提出一个统一框架,融合事件序列建模、因果发现与大语言模型(LLMs),具体包括:基于Transformer的预测维护架构、可扩展的样本级与群体级因果发现机制,以及用于自动化合成布尔EP规则的多智能体系统,从而实现从故障预测到因果理解再到推理决策的全流程自动化诊断。

链接: https://arxiv.org/abs/2603.16313
作者: Hugo Math
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: PhD dissertation, 131 pages of main content, 202 pages in total

点击查看摘要

Abstract:Electronic control units (ECUs) embedded within modern vehicles generate a large number of asynchronous events known as diagnostic trouble codes (DTCs). These discrete events form complex temporal sequences that reflect the evolving health of the vehicle’s subsystems. In the automotive industry, domain experts manually group these codes into higher-level error patterns (EPs) using Boolean rules to characterize system faults and ensure safety. However, as vehicle complexity grows, this manual process becomes increasingly costly, error-prone, and difficult to scale. Notably, the number of unique DTCs in a modern vehicle is on the same order of magnitude as the vocabulary of a natural language, often numbering in the tens of thousands. This observation motivates a paradigm shift: treating diagnostic sequences as a language that can be modeled, predicted, and ultimately explained. Traditional statistical approaches fail to capture the rich dependencies and do not scale to high-dimensional datasets characterized by thousands of nodes, large sample sizes, and long sequence lengths. Specifically, the high cardinality of categorical event spaces in industrial logs poses a significant challenge, necessitating new machine learning architectures tailored to such event-driven systems. This thesis addresses automated fault diagnostics by unifying event sequence modeling, causal discovery, and large language models (LLMs) into a coherent framework for high-dimensional event streams. It is structured in three parts, reflecting a progressive transition from prediction to causal understanding and finally to reasoning for vehicle diagnostics. Consequently, we introduce several Transformer-based architectures for predictive maintenance, scalable sample- and population-level causal discovery frameworks and a multi-agent system that automates the synthesis of Boolean EP rules.

[AI-49] NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing

【速读】:该论文旨在解决当前遥感领域多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂场景下缺乏可靠规划能力评估的问题。现有基准主要关注感知与推理能力,未能有效衡量模型在受限条件下的路径规划性能,原因在于规划任务难以规模化构建及验证,或评估协议不准确。其解决方案的关键在于提出NeSy-Route——一个大规模神经符号基准,通过融合高保真语义掩码与启发式搜索的自动化数据生成框架,生成具有可证明最优解的多样化路径规划任务;同时设计三级分层神经符号评估协议,实现对感知、推理与规划能力的协同精确评估,从而系统性揭示MLLMs在遥感场景中的规划短板。

链接: https://arxiv.org/abs/2603.16307
作者: Ming Yang,Zhi Zhou,Shi-Yu Tian,Kun-Yang Yu,Lan-Zhe Guo,Yu-Feng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Remote sensing underpins crucial applications such as disaster relief and ecological field surveys, where systems must understand complex scenes and constraints and make reliable decisions. Current remote-sensing benchmarks mainly focus on evaluating perception and reasoning capabilities of multimodal large language models (MLLMs). They fail to assess planning capability, stemming either from the difficulty of curating and validating planning tasks at scale or from evaluation protocols that are inaccurate and inadequate. To address these limitations, we introduce NeSy-Route, a large-scale neuro-symbolic benchmark for constrained route planning in remote sensing. Within this benchmark, we introduce an automated data-generation framework that integrates high-fidelity semantic masks with heuristic search to produce diverse route-planning tasks with provably optimal solutions. This allows NeSy-Route to comprehensively evaluate planning across 10,821 route-planning samples, nearly 10 times larger than the largest prior benchmark. Furthermore, a three-level hierarchical neuro-symbolic evaluation protocol is developed to enable accurate assessment and support fine-grained analysis on perception, reasoning, and planning simultaneously. Our comprehensive evaluation of various state-of-the-art MLLMs demonstrates that existing MLLMs show significant deficiencies in perception and planning capabilities. We hope NeSy-Route can support further research and development of more powerful MLLMs for remote sensing.
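该基准用启发式搜索在语义掩码上生成带"可证明最优解"的路径任务;在栅格化场景下,这等价于经典最短路搜索。以下为纯 Python 示意(0 为可通行、1 为障碍,四邻域 Dijkstra,非基准官方生成器):

```python
import heapq

def shortest_route(grid, start, goal):
    """在语义掩码栅格上求可证明最优的受限路径(Dijkstra 示意,单位步长)。"""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            # 回溯重建最优路径
            path = [(r, c)]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get((r, c), float("inf")):
            continue  # 过期的堆条目
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    return float("inf"), []  # 不可达
```

由于 Dijkstra 的最优性保证,返回的代价可直接作为评估模型规划结果的黄金标准。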

[AI-50] Surrogate-Assisted Genetic Programming with Rank-Based Phenotypic Characterisation for Dynamic Multi-Mode Project Scheduling

【速读】:该论文旨在解决动态多模式资源约束项目调度问题(Dynamic Multi-mode Resource-Constrained Project Scheduling Problem, DMRCPSP)中遗传编程(Genetic Programming, GP)因大量基于仿真的适应度评估而导致计算成本过高的问题。其核心解决方案是提出一种基于排序的表型表征(phenotypic characterisation, PC)方案,该方案通过启发式驱动的可执行活动-模式对及活动组在决策情境下的排序来构建PC向量,从而使得代理模型(surrogate model)能够高效估计未评估GP个体的适应度。这一PC机制使代理辅助的GP算法能够在显著降低计算开销的前提下,更早地发现高质量的启发式规则,并通过代理模型为后代选择提供有效指导,提升进化效率。

链接: https://arxiv.org/abs/2603.16286
作者: Yuan Tian,Yi Mei,Mengjie Zhang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 7 pages, 7 figures, accepted by IEEE Congress on Evolutionary Computation 2026. This is the version submitted for peer review. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The dynamic multi-mode resource-constrained project scheduling problem (DMRCPSP) is of practical importance, as it requires making real-time decisions under changing project states and resource availability. Genetic Programming (GP) has been shown to effectively evolve heuristic rules for such decision-making tasks; however, the evolutionary process typically relies on a large number of simulation-based fitness evaluations, resulting in high computational cost. Surrogate models offer a promising solution to reduce evaluation cost, but their application to GP requires problem-specific phenotypic characterisation (PC) schemes of heuristic rules. There is currently a lack of suitable PC schemes for GP applied to DMRCPSP. This paper proposes a rank-based PC scheme derived from heuristic-driven ordering of eligible activity-mode pairs and activity groups in decision situations. The resulting PC vectors enable a surrogate model to estimate the fitness of unevaluated GP individuals. Based on this scheme, a surrogate-assisted GP algorithm is developed. Experimental results demonstrate that the proposed surrogate-assisted GP can identify high-quality heuristic rules consistently earlier than the state-of-the-art GP approach for DMRCPSP, while introducing only marginal computational overhead. Further analyses demonstrate that the surrogate model provides useful guidance for offspring selection, leading to improved evolutionary efficiency.
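基于排序的 PC 可理解为:把启发式在若干固定决策情境下对候选项的排名拼成向量,代理模型再按 PC 距离检索最相近的已评估个体来估计适应度。极简示意(假设性实现,候选项以数值代替"活动-模式对"):

```python
def pc_vector(heuristic, decision_situations):
    """把启发式在各决策情境下给出的候选排名拼成表型向量(假设性示意)。"""
    pc = []
    for candidates in decision_situations:
        scores = [heuristic(c) for c in candidates]
        order = sorted(range(len(candidates)), key=scores.__getitem__)
        ranks = [0] * len(candidates)
        for rank, idx in enumerate(order):
            ranks[idx] = rank  # 分数越小排名越靠前
        pc.extend(ranks)
    return pc

def surrogate_fitness(pc_new, archive):
    # 最近邻代理:返回 PC 距离最近的已评估个体 (pc, fitness) 的适应度
    def dist(pc):
        return sum((a - b) ** 2 for a, b in zip(pc_new, pc))
    return min(archive, key=lambda item: dist(item[0]))[1]
```

行为等价的启发式(例如 x 与 2x)产生相同排名,因此能共享同一适应度估计而无需重复仿真。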

[AI-51] Adaptive Theory of Mind for LLM-based Multi-Agent Coordination AAAI2026

【速读】:该论文旨在解决多智能体协作任务中因理论心智(Theory of Mind, ToM)层级不匹配导致的协调失效问题,即当智能体间对彼此心智状态推理深度不一致时,可能引发推理不足或过度,从而损害协作效率。解决方案的关键在于设计一种自适应理论心智(Adaptive ToM, A-ToM)智能体,该智能体能够基于历史交互估计合作方的ToM层级,并据此预测其行为,实现双方在ToM推理深度上的动态对齐,从而提升协作性能。

链接: https://arxiv.org/abs/2603.16264
作者: Chunjiang Mu,Ya Zeng,Qiaosheng Zhang,Kun Shao,Chen Chu,Hao Guo,Danyang Jia,Zhen Wang,Shuyue Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Theory of Mind (ToM) refers to the ability to reason about others’ mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multiagent collaborative tasks. However, we find that misaligned ToM orders-mismatches in the depth of ToM reasoning between agents-can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A-ToM) agent, which can align in ToM orders with its partner. Based on prior interactions, the agent estimates the partner’s likely ToM order and leverages this estimation to predict the partner’s action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi-agent coordination tasks: a repeated matrix game, two grid navigation tasks and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A-ToM agent. Furthermore, we discuss the generalizability of our A-ToM to non-LLM-based agents, as well as what would diminish the importance of ToM alignment.

[AI-52] Generative AI for Quantum Circuits and Quantum Code: A Technical Review and Taxonomy

【速读】:该论文旨在解决当前生成式量子程序(如Qiskit代码、OpenQASM程序)与量子电路图等量子计算相关代码自动生成系统在实际硬件部署层面的评估缺失问题。尽管现有十三种生成系统在语法正确性(syntactic validity)和语义正确性(semantic correctness)方面已取得一定进展,但均未进行端到端的量子硬件执行验证(hardware executability, Layer 3b),导致理论生成结果与真实量子设备应用之间存在显著差距。其解决方案的关键在于构建一个三层系统性评估框架,并基于此对现有生成系统进行全面分析,从而揭示当前研究在实践落地环节的不足,为未来工作指明方向——即必须将生成模型与量子硬件测试紧密结合,以实现从算法设计到物理执行的闭环验证。

链接: https://arxiv.org/abs/2603.16216
作者: Juhani Merilehto
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 20 pages, 4 tables

点击查看摘要

Abstract:We review thirteen generative systems and five supporting datasets for quantum circuit and quantum code generation, identified through a structured scoping review of Hugging Face, arXiv, and provenance tracing (January-February 2026). We organize the field along two axes: artifact type (Qiskit code, OpenQASM programs, circuit graphs); crossed with training regime (supervised fine-tuning, verifier-in-the-loop RL, diffusion/graph generation, agentic optimization); and systematically apply a three-layer evaluation framework covering syntactic validity, semantic correctness, and hardware executability. The central finding is that while all reviewed systems address syntax and most address semantics to some degree, none reports end-to-end evaluation on quantum hardware (Layer 3b), leaving a significant gap between generated circuits and practical deployment. Scope note: quantum code refers throughout to quantum program artifacts (QASM, Qiskit); we do not cover generation of quantum error-correcting codes (QEC).

[AI-53] MOSAIC: Composable Safety Alignment with Modular Control Tokens

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中面临的上下文依赖性安全规则难以实现的问题,即现有方法要么将安全行为与通用能力耦合在参数中(导致灵活性差),要么依赖自然语言提示(约束力弱)。其解决方案的关键在于提出MOSAIC框架,通过可学习的控制标记(control tokens) 实现模块化、组合式安全对齐:每个标记代表一个独立的安全约束,在推理时可灵活激活和组合,从而在不修改冻结主干模型的前提下,动态适应不同用户、区域或应用场景的安全需求。此外,作者引入基于顺序的任务采样和分布级对齐目标,有效降低过度拒绝(over-refusal)现象并保持模型实用性。

链接: https://arxiv.org/abs/2603.16210
作者: Jingyu Peng,Hongyu Chen,Jiancheng Dong,Maolin Wang,Wenxi Li,Yuchen Li,Kai Zhang,Xiangyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
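控制标记的"组合"在推理时可理解为:按当前场景选取若干已学得的标记嵌入,前置拼接到冻结主干的输入序列。numpy 示意(数据结构与标记名均为假设性设计):

```python
import numpy as np

def compose_control_tokens(token_bank, active, input_embeds):
    """按激活的安全约束选取控制标记嵌入并前置拼接(假设性示意)。"""
    dim = input_embeds.shape[1]
    if active:
        # 每个安全约束对应一个可学习的控制标记嵌入
        prefix = np.stack([token_bank[name] for name in active])
    else:
        prefix = np.zeros((0, dim))
    # 冻结主干只看到"前缀 + 原输入",安全行为与模型参数解耦
    return np.concatenate([prefix, input_embeds], axis=0)
```

不同用户或区域只需切换 active 列表即可组合出不同的安全策略,无需改动主干参数。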

[AI-54] Proactive Rejection and Grounded Execution: A Dual-Stage Intent Analysis Paradigm for Safe and Efficient AIoT Smart Homes

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在物联网(Internet of Things, IoT)环境中作为具身代理(embodied agents)时面临的可靠性与交互效率问题。具体而言,LLM直接执行生成的指令常导致实体幻觉(entity hallucinations),例如尝试控制不存在的设备;而现有迭代式框架(如SAGE)则陷入“交互频率困境”(Interaction Frequency Dilemma),在鲁莽执行与过度询问用户之间反复摇摆。解决方案的关键在于提出一种双阶段意图感知(Dual-Stage Intent-Aware, DS-IA)框架:第一阶段作为语义防火墙,通过检查当前环境状态过滤无效指令并澄清模糊命令;第二阶段采用确定性级联验证器(deterministic cascade verifier),按房间、设备和功能逐层验证动作的物理可行性,确保仅在可执行的前提下才执行操作。该设计显著提升了任务成功率与指令拒绝率,同时平衡了自主推理与必要的人工干预,从而最小化用户干扰。

链接: https://arxiv.org/abs/2603.16207
作者: Xinxin Jin,Zhengwei Ni,Zhengguo Sheng,Victor C. M. Leung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) transition from information providers to embodied agents in the Internet of Things (IoT), they face significant challenges regarding reliability and interaction efficiency. Direct execution of LLM-generated commands often leads to entity hallucinations (e.g., trying to control non-existent devices). Meanwhile, existing iterative frameworks (e.g., SAGE) suffer from the Interaction Frequency Dilemma, oscillating between reckless execution and excessive user questioning. To address these issues, we propose a Dual-Stage Intent-Aware (DS-IA) Framework. This framework separates high-level user intent understanding from low-level physical execution. Specifically, Stage 1 serves as a semantic firewall to filter out invalid instructions and resolve vague commands by checking the current state of the home. Stage 2 then employs a deterministic cascade verifier-a strict, step-by-step rule checker that verifies the room, device, and capability in sequence-to ensure the action is actually physically possible before execution. Extensive experiments on the HomeBench and SAGE benchmarks demonstrate that DS-IA achieves an Exact Match (EM) rate of 58.56% (outperforming baselines by over 28%) and improves the rejection rate of invalid instructions to 87.04%. Evaluations on the SAGE benchmark further reveal that DS-IA resolves the Interaction Frequency Dilemma by balancing proactive querying with state-based inference. Specifically, it boosts the Autonomous Success Rate (resolving tasks without unnecessary user intervention) from 42.86% to 71.43%, while maintaining high precision in identifying irreducible ambiguities that truly necessitate human clarification. These results underscore the framework’s ability to minimize user disturbance through accurate environmental grounding.
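第二阶段的级联验证器可示意为严格的逐层规则检查:房间 → 设备 → 功能,任一层失败即拒绝执行。以下为极简实现(家居状态的数据结构为假设性设计):

```python
def cascade_verify(action, home_state):
    """确定性级联验证器示意:依次检查房间、设备、功能是否物理可行。"""
    room = home_state.get(action["room"])
    if room is None:
        return False, "unknown room"            # 实体幻觉:房间不存在
    device = room.get(action["device"])
    if device is None:
        return False, "device not in room"      # 设备不在该房间
    if action["capability"] not in device:
        return False, "capability unsupported"  # 设备不支持该功能
    return True, "ok"
```

只有三层检查全部通过的动作才会被执行,从而在源头上阻断对不存在设备的控制指令。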

[AI-55] Sample-Efficient Adaptation of Drug-Response Models to Patient Tumors under Strong Biological Domain Shift

【速读】:该论文旨在解决预临床数据(如体外细胞系)与患者肿瘤之间存在的显著生物学差异导致的药物反应预测难题,尤其是在临床数据稀缺情况下如何实现高效迁移的问题。其解决方案的关键在于采用分阶段的迁移学习框架:首先利用大规模无标签药理基因组数据,通过基于自编码器的表示学习方法独立提取细胞和药物的结构化表征;随后在细胞系数据上对这些表征进行标签对齐,并借助少量标注样本将模型适配至患者肿瘤数据。该策略在跨域适应场景下显著提升了小样本条件下的患者层面药物反应预测性能,同时保持了与单阶段基线模型相当的细胞系基准表现,从而有效降低了临床监督数据的需求量,为预临床到临床的高效转化提供了可行路径。

链接: https://arxiv.org/abs/2603.16185
作者: Camille Jimenez Cortes,Philippe Lalanda,German Vega
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Predicting drug response in patients from preclinical data remains a major challenge in precision oncology due to the substantial biological gap between in vitro cell lines and patient tumors. Rather than aiming to improve absolute in vitro prediction accuracy, this work examines whether explicitly separating representation learning from task supervision enables more sample-efficient adaptation of drug-response models to patient data under strong biological domain shift. We propose a staged transfer-learning framework in which cellular and drug representations are first learned independently from large collections of unlabeled pharmacogenomic data using autoencoder-based representation learning. These representations are then aligned with drug-response labels on cell-line data and subsequently adapted to patient tumors using few-shot supervision. Through a systematic evaluation spanning in-domain, cross-dataset, and patient-level settings, we show that unsupervised pretraining provides limited benefit when source and target domains overlap substantially, but yields clear gains when adapting to patient tumors with very limited labeled data. In particular, the proposed framework achieves faster performance improvements during few-shot patient-level adaptation while maintaining comparable accuracy to single-phase baselines on standard cell-line benchmarks. Overall, these results demonstrate that learning structured and transferable representations from unlabeled molecular profiles can substantially reduce the amount of clinical supervision required for effective drug-response prediction, offering a practical pathway toward data-efficient preclinical-to-clinical translation.

[AI-56] SQL-ASTRA: Alleviating Sparse Feedback in Agentic SQL via Column-Set Matching and Trajectory Aggregation

【速读】:该论文旨在解决生成式 AI (Generative AI) 在 Text-to-SQL 任务中因采用单轮(single-turn)范式而导致的信用分配(credit assignment)问题,即传统方法仅依赖最终回合反馈进行奖励评估,忽视了中间推理过程,造成奖励信号稀疏且难以精准定位错误来源。解决方案的关键在于提出 Agentic SQL 框架,其核心是双层奖励机制:一是聚合轨迹奖励(Aggregated Trajectory Reward, ATR),通过非对称转移矩阵将过程导向得分聚合,结合李雅普诺夫稳定性理论证明其作为能量耗散算子可保证策略无环且单调收敛;二是列集匹配奖励(Column-Set Matching Reward, CSMR),在每一步执行查询并基于部分正确性将二值反馈转化为密集 [0,1] 奖励信号,从而缓解稀疏奖励问题。该方案显著提升了多轮 Text-to-SQL 的性能,在 BIRD 和 Spider 2.0 数据集上超越当前最优模型。
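CSMR 将 0/1 执行反馈细化为 [0,1] 稠密奖励的思路,可用预测列集与标准列集的 F1 打分粗略示意(论文的具体匹配与打分细节未公开,以下仅为按摘要含义的假设性实现):

```python
# 极简示意:把 SQL 执行的二值反馈细化为基于列集合部分正确性的稠密奖励。
# 此处以预测列集与标准列集的 F1 值近似"部分正确"打分,属假设性实现。

def column_set_reward(pred_cols, gold_cols):
    pred, gold = set(pred_cols), set(gold_cols)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

r_partial = column_set_reward(["name", "age"], ["name", "age", "city"])  # 部分正确
r_exact = column_set_reward(["name"], ["name"])                          # 完全正确
r_miss = column_set_reward(["id"], ["name"])                             # 完全错误
```

部分正确的中间步骤由此获得介于 0 与 1 之间的奖励,而不是被一律记 0。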

链接: https://arxiv.org/abs/2603.16161
作者: Long Li,Zhijian Zhou,Jiangxuan Long,Peiyang Liu,Weidi Xu,Zhe Wang,Shirui Pan,Chao Qu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages

点击查看摘要

Abstract:Agentic Reinforcement Learning (RL) shows promise for complex tasks, but Text-to-SQL remains mostly restricted to single-turn paradigms. A primary bottleneck is the credit assignment problem. In traditional paradigms, rewards are determined solely by the final-turn feedback, which ignores the intermediate process and leads to ambiguous credit evaluation. To address this, we propose Agentic SQL, a framework featuring a universal two-tiered reward mechanism designed to provide effective trajectory-level evaluation and dense step-level signals. First, we introduce Aggregated Trajectory Reward (ATR) to resolve multi-turn credit assignment. Using an asymmetric transition matrix, ATR aggregates process-oriented scores to incentivize continuous improvement. Leveraging Lyapunov stability theory, we prove ATR acts as an energy dissipation operator, guaranteeing a cycle-free policy and monotonic convergence. Second, Column-Set Matching Reward (CSMR) provides immediate step-level rewards to mitigate sparsity. By executing queries at each turn, CSMR converts binary (0/1) feedback into dense [0, 1] signals based on partial correctness. Evaluations on BIRD show a 5% gain over binary-reward GRPO. Notably, our approach outperforms SOTA Arctic-Text2SQL-R1-7B on BIRD and Spider 2.0 using identical models, propelling Text-to-SQL toward a robust multi-turn agent paradigm.

[AI-57] DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

【速读】:该论文旨在解决同策略(on-policy)强化学习(Reinforcement Learning, RL)在大语言模型推理中因样本效率低下而导致的训练瓶颈问题,尤其是现有经验回放方法虽能重用历史数据提升准确性,但常引发计算开销过大和模式崩溃(mode collapse)等副作用。其核心解决方案是提出一种动态 Jensen-Shannon 回放机制(Dynamic Jensen-Shannon Replay, DyJR),关键创新在于:(1) 设计了一个时间敏感的动态缓冲区(Time-Sensitive Dynamic Buffer),通过先进先出(FIFO)和自适应容量调整机制保留近期轨迹中的样本,以匹配模型演化的时序特性;(2) 引入 Jensen-Shannon 散度正则化(Jensen-Shannon Divergence Regularization),将直接梯度更新替换为分布约束,从而在保持策略性能的同时有效维持动作空间的多样性,避免过拟合导致的模式崩溃。实验表明,DyJR 在数学推理与 Text-to-SQL 任务上显著优于 GRPO 及 RLEP、Ex-GRPO 等基线方法,且训练效率接近原始 GRPO。
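时间敏感 FIFO 缓冲区与 JS 散度正则两个组件,可用如下玩具代码直观说明(缓冲区容量与分布表示均为示意性假设,非论文原始实现):

```python
import math
from collections import deque

# 极简示意:FIFO 缓冲区只保留最近 K 条轨迹;
# JS 散度衡量当前策略分布与缓冲区参考分布的偏离,可作为正则项。

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

buffer = deque(maxlen=3)                     # FIFO:超容量时最旧样本自动淘汰
for traj in ["t1", "t2", "t3", "t4"]:
    buffer.append(traj)

policy = [0.7, 0.2, 0.1]                     # 当前策略的动作分布(示意)
reference = [0.5, 0.3, 0.2]                  # 由近期轨迹得到的参考分布(示意)
penalty = js_divergence(policy, reference)   # 偏离越大,正则惩罚越大
```

JS 散度对称且有界(自然对数下不超过 log 2),适合做软性的分布约束。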

链接: https://arxiv.org/abs/2603.16157
作者: Long Li,Zhijian Zhou,Tianyi Wang,Weidi Xu,Zuming Huang,Wei Chu,Zhe Wang,Shirui Pan,Chao Qu,Yuan Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-k token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.

[AI-58] NeuronSpark: A Spiking Neural Network Language Model with Selective State Space Dynamics

【速读】:该论文旨在解决纯脉冲神经网络(Spiking Neural Network, SNN)是否能够从随机初始化直接学习大规模语言建模任务,而无需依赖Transformer架构的蒸馏过程。其核心挑战在于如何在不借助传统深度学习模型结构的情况下,实现高效且稳定的端到端语言建模。解决方案的关键在于提出NeuronSpark-0.9B模型,该模型融合了选择性状态空间脉冲动力学、泄漏电流层间通信机制、PonderNet自适应时间步长策略、优化的Triton PLIF核融合计算以及多种稳定化技术(残差中心化、侧抑制归一化与自然梯度补偿),从而在有限训练预算下(约14亿预训练token和6500步监督微调)实现了3.6的预训练损失,并展现出初步的多轮对话能力,验证了纯SNN架构在该规模语言建模中的可行性。
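论文使用的 PLIF 核含可学习时间常数并由 Triton 融合实现;此处仅给出最基础的 LIF(泄漏积分发放)神经元离散更新,帮助理解脉冲动力学的含义(参数均为示意取值,非论文配置):

```python
# 极简示意:LIF 神经元的离散时间更新——膜电位泄漏 + 输入积分,
# 超过阈值则发放脉冲并复位。仅为概念性简化,非论文的 PLIF 实现。

def lif_step(v, input_current, tau=2.0, v_threshold=1.0, v_reset=0.0):
    v = v + (input_current - v) / tau     # 向输入电流方向泄漏式逼近
    if v >= v_threshold:
        return v_reset, 1                 # 发放脉冲并复位
    return v, 0

v, spikes = 0.0, []
for t in range(6):                        # 恒定输入下形成规律的发放节律
    v, s = lif_step(v, input_current=1.5)
    spikes.append(s)
```

恒定输入下该神经元每两步发放一次,体现"积分—发放—复位"的基本循环。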

链接: https://arxiv.org/abs/2603.16148
作者: Zhengzheng Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, 6 tables. Preprint

点击查看摘要

Abstract:We ask whether a pure spiking backbone can learn large-scale language modeling from random initialization, without Transformer distillation. We introduce NeuronSpark, a 0.9B-parameter SNN language model trained with next-token prediction and surrogate gradients. The model combines selective state-space spiking dynamics, leakage-current inter-layer communication, PonderNet adaptive timesteps, fused Triton PLIF kernels, and stabilization techniques (residual centering, lateral-inhibition normalization, and natural-gradient compensation). Under a constrained budget (about 1.4B pretraining tokens and 6.5K SFT steps), NeuronSpark-0.9B reaches 3.6 pretraining loss and shows early multi-turn dialogue behavior after SFT. These results support the feasibility of end-to-end language modeling with a pure SNN architecture at this scale.

[AI-59] Functorial Neural Architectures from Higher Inductive Types

【速读】:该论文试图解决神经网络在组合泛化(compositional generalization)方面的系统性失败问题,即模型无法正确处理已知组件的新组合。解决方案的关键在于将组合泛化建模为解码器的函子性(functoriality),并通过高阶归纳类型(Higher Inductive Type, HIT)规范编译为神经架构:利用从目标空间路径群胚到参数映射范畴的张量函子,将路径构造转化为生成网络,组合操作变为结构拼接,而群关系的2-细胞则成为学习到的自然变换。由此构建的解码器因结构拼接特性而天然为严格张量函子(compositional by construction),实验验证了其在环面($\mathbb{Z}^2$)、自由群空间($S^1 \vee S^1$)和克莱因瓶($\mathbb{Z} \rtimes \mathbb{Z}$)上的显著性能优势,且softmax自注意力机制在非平凡组合任务中不具备函子性。
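以环面基本群 $\mathbb{Z}^2$ 为例,"结构拼接即严格函子"可以用几行代码体会:每个生成元对应一个固定输出(真实系统中为生成网络的输出),词的解码按段组合,函子性由构造保证(示例为概念性简化,非论文实现):

```python
# 极简示意:生成元 a/b 及其逆映射到 Z^2 中的位移,
# 词的解码 = 各段位移按结构相加,组合性由构造保证。

GENERATORS = {"a": (1, 0), "A": (-1, 0), "b": (0, 1), "B": (0, -1)}

def decode(word):
    x = y = 0
    for g in word:
        dx, dy = GENERATORS[g]
        x, y = x + dx, y + dy
    return (x, y)

w1, w2 = "aab", "bA"
# 函子性检查:先拼接词再解码 == 先分别解码再在目标空间中组合
lhs = decode(w1 + w2)
rhs = tuple(u + v for u, v in zip(decode(w1), decode(w2)))
```

拼接任意两个已知词都能得到正确的新组合,这正是"由构造保证的组合泛化"。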

链接: https://arxiv.org/abs/2603.16123
作者: Karen Sargsyan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT); Category Theory (math.CT)
备注: 20 pages, 10 tables. Code and Cubical Agda formalization: this https URL

点击查看摘要

Abstract:Neural networks systematically fail at compositional generalization – producing correct outputs for novel combinations of known parts. We show that this failure is architectural: compositional generalization is equivalent to functoriality of the decoder, and this perspective yields both guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2-cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), while softmax self-attention is not functorial for any non-trivial compositional task. Both results are formalized in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus (\mathbb{Z}^2), functorial decoders outperform non-functorial ones by 2-2.7x; on S^1 \vee S^1 (F_2), the type-A/B gap widens to 5.5-10x; on the Klein bottle (\mathbb{Z} \rtimes \mathbb{Z}), a learned 2-cell closes a 46% error gap on words exercising the group relation.

[AI-60] VIGIL: Towards Edge-Extended Agentic AI for Enterprise IT Support

【速读】:该论文旨在解决企业IT支持中因设备异构性、策略动态变化以及难以集中处理的长尾故障模式所带来的挑战。解决方案的关键在于提出VIGIL系统——一个扩展至边缘端的代理型人工智能系统,通过在用户终端部署本地化代理(desktop-resident agents),实现情境感知诊断、企业知识库检索与受策略约束的自适应修复操作,且全程获得用户显式授权和端到端可观测性。该设计使系统能够在资源受限设备上高效运行,显著降低交互轮次、提升诊断速度,并支持82%匹配案例的自助修复,同时验证了其在可用性、可信度和认知负荷方面的优越表现。

链接: https://arxiv.org/abs/2603.16110
作者: Sarthak Ahuja,Neda Kordjazi,Evren Yortucboylu,Vishaal Kapoor,Mariam Dundua,Yiming Li,Derek Ho,Vaibhavi Padala,Jennifer Whitted,Rebecca Steinert
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Enterprise IT support is constrained by heterogeneous devices, evolving policies, and long-tail failure modes that are difficult to resolve centrally. We present VIGIL, an edge-extended agentic AI system that deploys desktop-resident agents to perform situated diagnosis, retrieval over enterprise knowledge, and policy-governed remediation directly on user devices with explicit consent and end-to-end observability. In a 10-week pilot of VIGIL’s operational loop on 100 resource-constrained endpoints, VIGIL reduces interaction rounds by 39%, achieves at least 4 times faster diagnosis, and supports self-service resolution in 82% of matched cases. Users report excellent usability, high trust, and low cognitive workload across four validated instruments, with qualitative feedback highlighting transparency as critical for trust. Notably, users rated the system higher when no historical matches were available, suggesting on-device diagnosis provides value independent of knowledge base coverage. This pilot establishes safety and observability foundations for fleet-wide continuous improvement.

[AI-61] RepoReviewer: A Local-First Multi-Agent Architecture for Repository-Level Code Review

【速读】:该论文旨在解决自动化代码审查(code review)在处理 GitHub 仓库级任务时存在的局限性,即现有工具常将项目结构、仓库上下文和文件级实现细节等多维信息压缩为单一处理流程,导致审查相关性降低、重复分析增多以及优先级判断失效。解决方案的关键在于提出 RepoReviewer——一个以本地优先(local-first)为核心的多智能体系统,通过分解审查流程为仓库获取、上下文合成、文件级分析、问题优先排序和摘要生成五个模块化步骤,实现了对复杂仓库语境的精细化处理;其架构基于 Python CLI、FastAPI API 和 LangGraph 协调层,支持可复用的评估与报告基础设施,为未来实证研究提供了技术基础。

链接: https://arxiv.org/abs/2603.16107
作者: Peng Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Repository-level code review requires reasoning over project structure, repository context, and file-level implementation details. Existing automated review workflows often collapse these tasks into a single pass, which can reduce relevance, increase duplication, and weaken prioritization. We present RepoReviewer, a local-first multi-agent system for automated GitHub repository review with a Python CLI, FastAPI API, LangGraph orchestration layer, and this http URL user interface. RepoReviewer decomposes review into repository acquisition, context synthesis, file-level analysis, finding prioritization, and summary generation. We describe the system design, implementation tradeoffs, developer-facing interfaces, and practical failure modes. Rather than claiming benchmark superiority, we frame RepoReviewer as a technical systems contribution: a pragmatic architecture for repository-level automated review, accompanied by reusable evaluation and reporting infrastructure for future empirical study.

[AI-62] Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在机器人操作策略优化中因难以设计通用奖励函数而导致的性能瓶颈问题。解决方案的关键在于将基础视觉语言模型(Vision-Language Model, VLM)在线适配为奖励生成器,构建一个基于大规模多源数据训练的鲁棒且可扩展的奖励模型;该模型不依赖于事后轨迹评估,而是基于当前视觉观测实时生成包含过程、完成度和时间对比三重维度的奖励信号,从而在闭环环境中引导初始模仿学习(Imitation Learning, IL)策略修正次优行为。实验表明,该方法仅需30次RL迭代即可显著提升原始IL策略的成功率,展现出极高的样本效率,并实现了无需人工干预的零样本在线策略精炼。
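过程、完成度与时间对比三类奖励的合成方式可粗略示意如下(三个子奖励在论文中由 VLM 根据当前视觉观测给出,此处直接给定数值;权重为示意性假设):

```python
# 极简示意:将过程奖励、完成度奖励与时间对比奖励按权重合成单一标量。
# 时间对比项奖励"比上一步更接近目标"的进展,进展为负则给予惩罚。

def composite_reward(process, completion, prev_progress, cur_progress,
                     w=(0.4, 0.4, 0.2)):
    temporal = cur_progress - prev_progress
    return w[0] * process + w[1] * completion + w[2] * temporal

# 同样的过程/完成度打分下,进展与倒退得到不同的总奖励
r_advance = composite_reward(0.5, 0.0, prev_progress=0.3, cur_progress=0.6)
r_regress = composite_reward(0.5, 0.0, prev_progress=0.6, cur_progress=0.3)
```

这样闭环中的每一步都能拿到稠密信号,而不必等整条轨迹结束后再事后评估。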

链接: https://arxiv.org/abs/2603.16065
作者: Yanru Wu,Weiduo Yuan,Ang Qi,Vitor Guizilini,Jiageng Mao,Yue Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains strongly bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement by adapting foundation VLMs into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post-hoc, our method leverages the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Initializing with a base policy trained via Imitation Learning (IL), we employ these VLM rewards to guide the model to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results demonstrate that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, demonstrating remarkable sample efficiency. This empirical evidence highlights that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.

[AI-63] ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning

【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的语言模型在提升数学推理能力时存在的局限性,即现有方法通常将每个问题实例独立处理,未能有效利用训练过程中涌现并累积的可复用推理策略。其解决方案的关键在于提出一种分层强化学习框架——ARISE(Agent Reasoning via Intrinsic Skill Evolution),该框架通过引入一个技能管理器(Skills Manager)与工作者(Worker)的双层结构:技能管理器负责在训练中通过结构化总结成功解题轨迹来构建层级化技能库,并基于策略驱动机制检索相关技能以指导后续推理;而工作者则执行具体推理任务。同时,分层奖励设计引导推理能力与技能库质量协同进化,从而实现更高效、泛化能力更强的数学推理。
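技能管理器"按层级存库、按相关性取用"的流程可用如下玩具代码示意(层级划分与标签打分均为依据摘要含义的假设性设计,非论文原始实现):

```python
# 极简示意:分层技能库存放从成功轨迹中总结的技能条目,
# 检索时按与当前问题标签的重合度打分取最优。

SKILL_LIBRARY = {
    "general":  [{"name": "casework", "tags": {"cases", "enumerate"}}],
    "algebra":  [{"name": "complete_square", "tags": {"quadratic", "square"}}],
    "geometry": [{"name": "angle_chase", "tags": {"angle", "triangle"}}],
}

def retrieve_skill(problem_tags):
    best, best_score = None, 0
    for tier in SKILL_LIBRARY.values():       # 逐层级遍历候选技能
        for skill in tier:
            score = len(skill["tags"] & problem_tags)
            if score > best_score:
                best, best_score = skill["name"], score
    return best                               # 无相关技能时返回 None

picked = retrieve_skill({"quadratic", "roots", "square"})
```

检索到的技能条目在论文中被用来条件化后续 rollout,形成"先总结、后复用"的闭环。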

链接: https://arxiv.org/abs/2603.16060
作者: Yu Li,Rui Miao,Zhengling Qi,Tian Lan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement learning framework, in which a shared policy operates both to manage skills at high-level and to generate responses at low-level (denoted as a Skills Manager and a Worker, respectively). The Manager maintains a tiered skill library through a dedicated skill generation rollout that performs structured summarization of successful solution traces (after execution), while employing a policy-driven selection mechanism to retrieve relevant skills to condition future rollouts (before execution). A hierarchical reward design guides the co-evolution of reasoning ability and library quality. Experiments on two base models and seven benchmarks spanning both competition mathematics and Omni-MATH show that ARISE consistently outperforms GRPO-family algorithms and memory-augmented baselines, with particularly notable gains on out-of-distribution tasks. Ablation studies confirm that each component contributes to the observed improvements and that library quality and reasoning performance improve in tandem throughout training. Code is available at \hrefthis https URLthis https URL.

[AI-64] A Context Alignment Pre-processor for Enhancing the Coherence of Human-LLM Dialog

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长期和动态对话中面临的语境错位(contextual misalignment)问题,即当用户省略前提、简化指代或突然切换话题时,模型难以准确捕捉其意图,从而生成机械或离题的回应。解决方案的关键在于提出一种名为“语境对齐预处理器”(Context Alignment Pre-processor, C.A.P.)的计算框架,该框架作为用户输入与生成模块之间的预处理组件,通过三个核心过程实现语境对齐:(1) 语义扩展,将用户指令扩展至包含前提、字面意义及隐含含义的更广语义范围;(2) 时间加权上下文检索,利用时间衰减函数优先提取近期对话历史以模拟人类注意力焦点;(3) 对齐验证与决策分支,通过衡量当前提示与加权历史语境间的语义相似度判断是否偏离主题,并在显著偏差时触发结构化澄清协议,引导人机重新校准对话方向。
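其中时间加权上下文检索可示意为"语义相似度 × 指数时间衰减"(相似度此处直接给定,实际应由嵌入向量计算;衰减系数为假设超参):

```python
import math

# 极简示意:越近的对话轮次权重越高——相似度乘以 exp(-lam * 时间差)。

def time_weighted_scores(history, now, lam=0.1):
    """history: [(轮次时间戳, 与当前提示的语义相似度), ...]"""
    return [sim * math.exp(-lam * (now - t)) for t, sim in history]

# 早期一轮原始相似度最高,但经时间衰减后不敌最近一轮
history = [(0, 0.9), (5, 0.6), (9, 0.6)]
scores = time_weighted_scores(history, now=10)
best = max(range(len(scores)), key=scores.__getitem__)
```

即便早期轮次的原始相似度更高,加权后仍优先取最近的语境,近似人类的对话注意力焦点。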

链接: https://arxiv.org/abs/2603.16052
作者: Ding Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have made remarkable progress in generating fluent text, but they still face a critical challenge of contextual misalignment in long-term and dynamic dialogue. When human users omit premises, simplify references, or shift context abruptly during interactions with LLMs, the models may fail to capture their actual intentions, producing mechanical or off-topic responses that weaken the collaborative potential of dialogue. To address this problem, this paper proposes a computational framework called the Context Alignment Pre-processor (C.A.P.). Rather than operating during generation, C.A.P. functions as a pre-processing module between user input and response generation. The framework includes three core processes: (1) semantic expansion, which extends a user instruction to a broader semantic span including its premises, literal meaning, and implications; (2) time-weighted context retrieval, which prioritizes recent dialogue history through a temporal decay function approximating human conversational focus; and (3) alignment verification and decision branching, which evaluates whether the dialogue remains on track by measuring the semantic similarity between the current prompt and the weighted historical context. When a significant deviation is detected, C.A.P. initiates a structured clarification protocol to help users and the system recalibrate the conversation. This study presents the architecture and theoretical basis of C.A.P., drawing on cognitive science and Common Ground theory in human-computer interaction. We argue that C.A.P. is not only a technical refinement but also a step toward shifting human-computer dialogue from one-way command-execution patterns to two-way, self-correcting, partnership-based collaboration. Finally, we discuss implementation paths, evaluation methods, and implications for the future design of interactive intelligent systems.

[AI-65] POaaS: Minimal-Edit Prompt Optimization as a Service to Lift Accuracy and Cut Hallucinations on On-Device sLLMs

【速读】:该论文旨在解决小语言模型(small language models, sLLMs)在设备端部署时,因用户输入提示(prompt)不完善(如拼写错误、意图模糊或缺少上下文)而导致的事实性错误和幻觉问题。现有自动提示优化(automatic prompt optimization, APO)方法主要面向云端大语言模型(large language models, LLMs),依赖搜索机制生成长且结构化的指令,在设备端受限场景下,由于同一小模型需同时承担优化与推理任务,此类方法易浪费上下文资源并降低准确性。其解决方案的关键在于提出POaaS(Prompt Optimization as a Service),一个最小编辑提示优化层,通过轻量级专家模块(Cleaner、Paraphraser、Fact-Adder)对每个查询进行路由处理,并在严格约束漂移和长度的前提下融合输出,同时采用保守跳过策略避免对已良好构造的提示进行冗余优化。实验表明,在固定模型设置下(Llama-3.2-3B-Instruct 和 Llama-3.1-8B-Instruct),POaaS显著提升任务准确性和事实一致性,而代表性APO基线反而性能下降,且在token删除和混洗扰动下恢复能力达+7.4%。
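按查询特征路由到轻量专家、对良构提示保守跳过的逻辑,可粗略示意如下(路由规则与"构造良好"的判据均为假设,非论文原始实现):

```python
# 极简示意:按表面特征把提示路由到 Cleaner/Paraphraser/Fact-Adder,
# 没有命中任何规则的良构提示保守跳过、不做任何改写。

def route(prompt):
    experts = []
    if "  " in prompt or prompt != prompt.strip():
        experts.append("Cleaner")        # 有多余空白等表面噪声
    if len(prompt.split()) < 4:
        experts.append("Fact-Adder")     # 过短,可能缺少上下文
    if prompt.isupper():
        experts.append("Paraphraser")    # 风格异常,需要改写
    return experts or ["skip"]           # 保守跳过:良构提示不动

r1 = route("fix   typo")
r2 = route("Summarize the attached quarterly report in three bullet points.")
```

保守跳过避免了对本已良好的提示做冗余优化,正是摘要强调的"按查询保守优化"。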

链接: https://arxiv.org/abs/2603.16045
作者: Jungwoo Shim,Dae Won Kim,Sun Wook Kim,Soo Young Kim,Myungcheol Lee,Jae-geun Cha,Hyunhwa Choi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at FEVER 2026. 9 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Small language models (sLLMs) are increasingly deployed on-device, where imperfect user prompts–typos, unclear intent, or missing context–can trigger factual errors and hallucinations. Existing automatic prompt optimization (APO) methods were designed for large cloud LLMs and rely on search that often produces long, structured instructions; when executed under an on-device constraint where the same small model must act as optimizer and solver, these pipelines can waste context and even hurt accuracy. We propose POaaS, a minimal-edit prompt optimization layer that routes each query to lightweight specialists (Cleaner, Paraphraser, Fact-Adder) and merges their outputs under strict drift and length constraints, with a conservative skip policy for well-formed prompts. Under a strict fixed-model setting with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct, POaaS improves both task accuracy and factuality while representative APO baselines degrade them, and POaaS recovers up to +7.4% under token deletion and mixup. Overall, per-query conservative optimization is a practical alternative to search-heavy APO for on-device sLLMs.

[AI-66] Enhancing Linguistic Generalization of VLA: Fine-Tuning OpenVLA via Synthetic Instruction Augmentation

【速读】:该论文旨在解决具身人工智能(Embodied AI)中机器人在面对全新环境时泛化能力不足的问题,尤其针对当前最先进的视觉-语言-动作模型(Vision-Language-Action, VLA)如OpenVLA,在零样本(zero-shot)场景下性能受限的挑战。解决方案的关键在于提出一种参数高效微调策略:利用大语言模型(Large Language Model, LLM)为Bridge Dataset V2中的已有轨迹生成语义等价但结构多样化的自然语言指令,从而扩充语言空间;随后采用低秩适应(Low-Rank Adaptation, LoRA)方法对OpenVLA进行微调,使其更有效地将复杂自然语言意图映射到机器人动作,显著提升了模型在新环境下的鲁棒性与语言泛化能力。
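LoRA 的核心是冻结原权重 W、只训练低秩增量 B·A,前向时按 W + (α/r)·BA 合成,可用纯 Python 小矩阵示意(维度与系数均为示意取值,与 OpenVLA 的实际配置无关):

```python
# 极简示意:LoRA 的权重合成——冻结 W,只训练低秩因子 B (d x r) 与 A (r x d)。

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_weight(W, A, B, alpha, r):
    delta = matmul(B, A)                  # 低秩增量,秩不超过 r
    s = alpha / r                         # LoRA 缩放系数
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]              # 冻结的预训练权重 (2x2)
B = [[1.0], [0.0]]                        # 2x1 可训练因子
A = [[0.0, 2.0]]                          # 1x2 可训练因子,联合秩为 1
W_eff = lora_weight(W, A, B, alpha=2, r=1)
```

可训练参数量只与秩 r 成正比,这正是其参数高效的原因;原权重 W 全程不被修改。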

链接: https://arxiv.org/abs/2603.16044
作者: Dongik Shin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalization remains a core challenge in embodied AI, as robots must adapt to diverse environments. While OpenVLA represents the State-of-the-Art (SOTA) in Vision-Language-Action models by leveraging large-scale pre-training, its zero-shot performance can be limited when encountering completely new environments. This paper proposes a parameter-efficient fine-tuning strategy to enhance the linguistic generalization of OpenVLA by synthesizing a general instruction set for the Bridge Dataset V2. The paper leverages a Large Language Model (LLM) to generate a rich variety of semantically equivalent but structurally diverse commands for existing trajectories. In this experiment, Low-Rank Adaptation (LoRA) is implemented to fine-tune OpenVLA on augmented pairs, allowing the model to bridge the gap between complex natural language intent and robotic actions. Results demonstrate the LoRA-enhanced model’s robustness, suggesting that enriching the linguistic space of specialized datasets is crucial for embodied agents.

[AI-67] IRAM-Omega-Q: A Computational Architecture for Uncertainty Regulation in Artificial Agents

【速读】:该论文旨在解决人工代理在面对随机扰动时,如何实现内部调节(internal regulation)、不确定性管理(uncertainty management)与稳定性(stability)的可计算建模问题。现有方法虽能实现强任务性能,但缺乏对这些关键认知机制的透明性与可控性。解决方案的核心在于提出IRAM-Omega-Q计算架构,其将内部调节建模为对量子态类(quantum-like)状态表示的闭环控制,利用密度矩阵(density matrices)作为抽象状态描述符,直接计算熵、纯度和相干性等指标,而无需依赖物理量子过程;同时引入一个持续自适应更新的增益参数,以维持目标不确定性水平下的稳定运行。该框架通过系统参数扫描与相图分析识别出调节-噪声空间中的可复现临界边界,并揭示感知优先与行动优先控制顺序会诱发不同稳定性区域,从而确立不确定性调节作为人工代理架构设计的实质性原则。
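以密度矩阵本征值直接计算纯度与冯·诺依曼熵的做法可示意如下(为避免数值对角化,示例取对角密度矩阵;仅说明指标含义,与论文的具体状态表示无关):

```python
import math

# 极简示意:对本征值 p_i,纯度 = Tr(rho^2) = sum(p_i^2),
# 冯·诺依曼熵 = -sum(p_i * log p_i)。密度矩阵在此仅作抽象状态描述符。

def purity(eigenvalues):
    return sum(p * p for p in eigenvalues)

def von_neumann_entropy(eigenvalues):
    return -sum(p * math.log(p) for p in eigenvalues if p > 0)

pure_state = [1.0, 0.0]       # 纯态:无不确定性
mixed_state = [0.5, 0.5]      # 最大混合态:不确定性最大

p_pure, s_pure = purity(pure_state), von_neumann_entropy(pure_state)
p_mix, s_mix = purity(mixed_state), von_neumann_entropy(mixed_state)
```

这两个指标给出了"目标不确定性水平"的可计算度量,闭环增益即围绕它们做调节。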

链接: https://arxiv.org/abs/2603.16020
作者: Veronique Ziegler
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figues

点击查看摘要

Abstract:Artificial agents can achieve strong task performance while remaining opaque with respect to internal regulation, uncertainty management, and stability under stochastic perturbation. We present IRAM-Omega-Q, a computational architecture that models internal regulation as closed-loop control over a quantum-like state representation. The framework uses density matrices instrumentally as abstract state descriptors, enabling direct computation of entropy, purity, and coherence-related metrics without invoking physical quantum processes. A central adaptive gain is updated continuously to maintain a target uncertainty regime under noise. Using systematic parameter sweeps, fixed-seed publication-mode simulations, and susceptibility-based phase-diagram analysis, we identify reproducible critical boundaries in regulation-noise space. We further show that alternative control update orderings, interpreted as perception-first and action-first architectures, induce distinct stability regimes under identical external conditions. These results support uncertainty regulation as a concrete architectural principle for artificial agents and provide a formal setting for studying stability, control, and order effects in cognitively inspired AI systems. The framework is presented as a technical model of adaptive regulation dynamics in artificial agents. It makes no claims regarding phenomenological consciousness, and the quantum-like formalism is used strictly as a mathematical representation for structured uncertainty and state evolution.

[AI-68] Selective Memory for Artificial Intelligence: Write-Time Gating with Hierarchical Archiving

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在知识存储与检索过程中存在的两大局限:一是检索增强生成(Retrieval-Augmented Generation, RAG)方法因无差别存储内容而导致噪声累积,进而降低准确性;二是参数化方法虽能压缩知识至权重中,但无法实现选择性更新。其解决方案的关键在于引入“写入时门控”(write-time gating),通过复合显著性评分(包括来源声誉、新颖性和可靠性)对输入的知识对象进行过滤,并维护版本链以保留历史状态,从而模拟生物记忆的门控编码机制。实验表明,在无Oracle标签的情况下,该方法在多项基准测试中达到100%准确率,且在干扰比例高达8:1时仍优于读取时过滤(Self-RAG)方法(后者降至0%准确率),体现出写入时门控在结构上的优势。
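写入时门控与版本链的配合可示意如下(显著性权重与阈值均为示意性假设,非论文原始配置):

```python
# 极简示意:复合显著性得分(来源声誉/新颖性/可靠性的加权和)低于阈值的
# 知识对象被拒绝写入;被取代的旧版本追加进版本链而非删除。

WEIGHTS = {"reputation": 0.4, "novelty": 0.3, "reliability": 0.3}
THRESHOLD = 0.6

def salience(item):
    return sum(WEIGHTS[k] * item[k] for k in WEIGHTS)

store = {}                                # key -> 版本链(最新在末尾)

def write(key, item):
    if salience(item) < THRESHOLD:
        return False                      # 门控拒绝:噪声不入库
    store.setdefault(key, []).append(item)
    return True

good = {"reputation": 0.9, "novelty": 0.7, "reliability": 0.8, "text": "v1"}
better = {"reputation": 0.9, "novelty": 0.8, "reliability": 0.9, "text": "v2"}
spam = {"reputation": 0.1, "novelty": 0.9, "reliability": 0.2, "text": "junk"}

accepted = [write("drug_X", good), write("drug_X", better), write("drug_X", spam)]
```

噪声在写入端即被挡下,读取端无需再做过滤;旧版本保留在版本链中可随时回溯。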

链接: https://arxiv.org/abs/2603.15994
作者: Oliver Zahn,Simran Chana
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 8 figures

点击查看摘要

Abstract:Retrieval-augmented generation stores all content indiscriminately, degrading accuracy as noise accumulates. Parametric approaches compress knowledge into weights, precluding selective updates. Neither mirrors biological memory, which gates encoding based on salience and archives rather than deletes superseded information. We introduce write-time gating that filters incoming knowledge objects using composite salience scores (source reputation, novelty, reliability) while maintaining version chains that preserve prior states. Using real LLM evaluation without oracle access to quality labels, write gating achieves 100 percent accuracy versus 13 percent for ungated stores. The critical finding emerges under distractor scaling: at 8:1 distractor ratios, read-time filtering (Self-RAG) collapses to 0 percent while write gating maintains 100 percent, revealing a structural advantage of write-time over read-time curation. Validation on Wikipedia (20 entities), procedurally generated pharmacology data, and 2026 arXiv papers confirms these findings. The gating advantage scales inversely with parametric memory support: +25pp for Wikipedia, +48pp for post-cutoff arXiv, +65pp for procedural data with zero training knowledge. Signal ablation confirms the method does not depend on oracle-correlated metadata. Write gating matches Self-RAG accuracy at one-ninth the query-time cost.

[AI-69] From Workflow Automation to Capability Closure: A Formal Framework for Safe and Revenue-Aware Customer Service AI

【速读】:该论文旨在解决客户服务中心自动化系统中因多智能体协作而产生的安全漏洞问题。随着从单一脚本化聊天机器人向由多个专业化AI智能体组成的动态能力网络转变,现有平台尚未解决一个关键安全隐患:即使每个智能体单独验证为安全,其组合后可能通过涌现的协同依赖关系(emergent conjunctive dependency)达成单个智能体无法实现的危险目标。解决方案的关键在于识别并防范这种跨智能体的非线性交互风险,从而在系统设计层面建立对多智能体协同行为的安全保障机制。

链接: https://arxiv.org/abs/2603.15978
作者: Cosimo Spera,Garima Agrawal,Riccardo De Maria
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Customer service automation is undergoing a structural transformation. The dominant paradigm is shifting from scripted chatbots and single-agent responders toward networks of specialised AI agents that compose capabilities dynamically across billing, service provision, payments, and fulfilment. This shift introduces a safety gap that no current platform has closed: two agents individually verified as safe can, when combined, reach a forbidden goal through an emergent conjunctive dependency that neither possesses alone.

[AI-70] An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在高性能计算(High Performance Computing, HPC)领域生成科学代码时,传统评测方法仅依赖测试用例匹配所带来的局限性问题。传统基准测试无法全面评估代码在求解器选择、API规范、内存管理及性能等方面的合规性与质量,而这些因素对HPC库(如PETSc)的实用性至关重要。解决方案的关键在于提出petscagent-bench框架,其核心是基于“代理评估代理”(agents-evaluating-agents)范式,部署一个工具增强型评估代理(evaluator agent),通过标准化协议(A2A和MCP)与待测模型代理(model-under-test agent)交互,自动编译、执行并量化代码在正确性、性能、代码质量、算法适配性和库特定规范等五个维度的表现,从而实现对任意编码代理的黑盒化、多维、自动化评估。

链接: https://arxiv.org/abs/2603.15976
作者: Hong Zhang,Barry Smith,Satish Balay,Le Chen,Murat Keceli,Lois Curfman McInnes,Junchao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach insufficient for library code in HPC where solver selection, API conventions, memory management, and performance are just as critical as functional correctness. To address this gap, we introduce petscagent-bench, an agentic framework built on an agents-evaluating-agents paradigm. Instead of relying on static scripts, petscagent-bench deploys a tool-augmented evaluator agent that compiles, executes, and measures code produced by a separate model-under-test agent, orchestrating a 14-evaluator pipeline across five scoring categories: correctness, performance, code quality, algorithmic appropriateness, and library-specific conventions. Because the agents communicate through standardized protocols (A2A and MCP), the framework enables black-box evaluation of any coding agent without requiring access to its source code. We demonstrate the framework on a benchmark suite of realistic problems using the PETSc library for HPC. Our empirical analysis of frontier models reveals that while current models generate readable, well-structured code, they consistently struggle with library-specific conventions that traditional pass/fail metrics completely miss.

[AI-71] Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems

【速读】:该论文旨在解决安全属性在存在合取能力依赖(conjunctive capability dependencies)时的非组合性问题,即两个独立个体无法达到禁止目标,但其协作后可能通过涌现的合取依赖共同达成禁止目标。解决方案的关键在于首次形式化证明了安全性在该场景下不具备组合性质,从而揭示了传统基于模块化分析的安全验证方法在此类依赖关系中可能失效的本质原因。
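合取依赖导致安全性不可组合,可用一个能力闭包的玩具例子直观验证:规则 (A 且 B) → C 下,单独拥有 {A} 或 {B} 的智能体都到不了禁止能力 C,但两者合并后可达(能力名与规则均为为说明而构造的示例,非论文的形式化系统):

```python
# 极简示意:对能力集反复应用合取派生规则直到不动点,
# 分别检查单体与合体能否触及禁止能力。

RULES = [({"A", "B"}, "C")]          # 前提能力集 -> 可派生的新能力
FORBIDDEN = {"C"}

def closure(caps):
    caps = set(caps)
    changed = True
    while changed:
        changed = False
        for premises, derived in RULES:
            if premises <= caps and derived not in caps:
                caps.add(derived)
                changed = True
    return caps

agent1, agent2 = {"A"}, {"B"}
safe_alone = not (closure(agent1) & FORBIDDEN) and not (closure(agent2) & FORBIDDEN)
unsafe_together = bool(closure(agent1 | agent2) & FORBIDDEN)
```

逐个验证每个智能体安全并不蕴含其组合安全,这正是"安全性不可组合"的直观含义。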

链接: https://arxiv.org/abs/2603.15973
作者: Cosimo Spera
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper contains the first formal proof that safety is non-compositional in the presence of conjunctive capability dependencies: two agents each individually incapable of reaching any forbidden capability can, when combined, collectively reach a forbidden goal through an emergent conjunctive dependency.

[AI-72] 100x Cost Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 查询在数据库中因频繁调用大语言模型(LLM)而导致的高昂计算成本与延迟问题,尤其针对大规模数据集上语义过滤(semantic filter)和语义排序(semantic ranking)等操作。其核心解决方案是引入低成本且高精度的代理模型(proxy models),基于嵌入向量(embedding vectors)进行近似计算,从而显著降低对 LLM 的依赖。该方法在保持甚至提升查询准确率的同时,实现了高达 100 倍的成本与延迟优化,并提出了适用于 OLAP 场景的 BigQuery 架构与 HTAP 场景的 AlloyDB 架构,进一步通过离线训练代理模型来压缩在线响应时间。

链接: https://arxiv.org/abs/2603.15970
作者: Yeounoh Chung,Rushabh Desai,Jian He,Yu Xiao,Thibaud Hottelier,Yves-Laurent Kom Samo,Pushkar Kadilkar,Xianshun Chen,Sam Idicula,Fatma Özcan,Alon Halevy,Yannis Papakonstantinou
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low-cost analytics and database applications to benefit from AI queries. The approach delivers 100x cost and latency reduction for the semantic filter (this http URL) operator and also important gains for semantic ranking (this http URL). The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training.
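The savings come from deciding most rows with a cheap proxy over embedding vectors and escalating only uncertain rows to the LLM. A minimal cascade sketch of this idea (the linear proxy, thresholds, and synthetic data are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def proxy_cascade(embeddings, proxy_w, llm_fn, lo=0.2, hi=0.8):
    """Route each row: confident proxy scores decide directly,
    uncertain rows fall back to the expensive LLM call."""
    scores = 1.0 / (1.0 + np.exp(-embeddings @ proxy_w))  # sigmoid proxy score
    labels = np.empty(len(scores), dtype=bool)
    llm_calls = 0
    for i, p in enumerate(scores):
        if p >= hi:
            labels[i] = True
        elif p <= lo:
            labels[i] = False
        else:
            labels[i] = llm_fn(embeddings[i])  # escalate only when unsure
            llm_calls += 1
    return labels, llm_calls

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))        # stand-in embedding vectors
w = np.ones(8)                        # stand-in trained proxy weights
truth = (X @ w) > 0                   # what the "LLM" would say
labels, calls = proxy_cascade(X, w, llm_fn=lambda e: (e @ w) > 0)
print(calls, np.mean(labels == truth))
```

Only the uncertain band pays for an LLM invocation; the 100x figure in the paper corresponds to shrinking that band while preserving accuracy.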

[AI-73] Optimizing Hospital Capacity During Pandemics: A Dual-Component Framework for Strategic Patient Relocation

【速读】:该论文旨在解决新冠疫情下全球医院系统因患者激增而导致的容量压力问题,特别是如何通过优化患者重新安置策略来提升医疗资源分配效率与系统韧性。其解决方案的关键在于构建一个两阶段框架:第一阶段采用时间序列预测模型(time series prediction model)基于历史新冠病例和住院数据,精准预测未来患者流量;第二阶段则利用仿真模型(simulation model)评估多种患者转移策略的影响,综合考虑床位可用性、医护人员能力、运输物流及患者病情严重程度等因素,从而在联网医院间优化患者分布。通过融合预测分析与仿真建模,该框架为医院管理者提供了一个全面的决策支持工具,以实现需求预判、策略模拟与最优政策实施。

链接: https://arxiv.org/abs/2603.15960
作者: Sadaf Tabatabaee,Hicham El Baz,Mohammed Khalil Ghali,Nagendra N. Nagarur
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages. Published in Proceedings of the IISE Annual Conference Expo 2025. DOI: https://doi.org/10.21872/2025IISE_6202

点击查看摘要

Abstract:The COVID-19 pandemic has placed immense strain on hospital systems worldwide, leading to critical capacity challenges. This research proposes a two-part framework to optimize hospital capacity through patient relocation strategies. The first component involves developing a time series prediction model to forecast patient arrival rates. Using historical data on COVID-19 cases and hospitalizations, the model will generate accurate forecasts of future patient volumes. This will enable hospitals to proactively plan resource allocation and patient flow. The second component is a simulation model that evaluates the impact of different patient relocation strategies. The simulation will account for factors such as bed availability, staff capabilities, transportation logistics, and patient acuity to optimize the placement of patients across networked hospitals. Multiple scenarios will be tested, including inter-hospital transfers, use of temporary care facilities, and adaptations to discharge protocols. By combining predictive analytics and simulation modeling, this research aims to provide hospital administrators with a comprehensive decision-support tool. The proposed framework will empower them to anticipate demand, simulate relocation strategies, and implement optimal policies to distribute patients and resources. Ultimately, this work seeks to enhance the resilience of healthcare systems in the face of COVID-19 and future pandemics.

[AI-74] ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

【速读】:该论文旨在解决机器人领域中行为克隆(Behavior Cloning)策略泛化能力和鲁棒性不足的问题,尤其是在缺乏大规模高质量真实世界数据的情况下,如何高效构建可迁移的专家策略。其核心挑战在于:传统依赖人类示范(如遥操作)的数据采集成本高昂,且直接在仿真中训练的策略难以保证真实场景下的性能。解决方案的关键在于提出ExpertGen框架,通过两阶段优化实现从低质量演示到高成功率专家策略的自动化学习:首先利用扩散策略(Diffusion Policy)对不完美示范(可能由大语言模型生成或人工提供)进行初始化,形成行为先验;随后采用强化学习(Reinforcement Learning)在保持预训练扩散策略冻结的前提下,仅优化初始噪声分布以引导探索,从而在稀疏奖励环境下有效提升任务成功率,同时约束探索范围于安全的人类行为流形内。该方法无需奖励工程,在工业装配和长时程操作任务中分别达到90.5%和85%的成功率,并成功实现从仿真到真实机器人硬件的迁移部署。

链接: https://arxiv.org/abs/2603.15956
作者: Zifan Xu,Ran Gong,Maria Vittoria Minniti,Ahmet Salih Gundogdu,Eric Rosen,Kausik Sivakumar,Riedana Yan,Zixing Wang,Di Deng,Peter Stone,Xiaohan Zhang,Karl Schmeckpeper
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model’s initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.

[AI-75] MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

【速读】:该论文旨在解决在资源受限的移动设备上高效部署大语言模型(Large Language Models, LLMs)的问题,以支持实时人工智能体验。其核心挑战在于如何在保证模型性能的同时,满足移动端的延迟约束和硬件兼容性要求。解决方案的关键在于提出一种“软硬件协同架构搜索”方法(hardware-in-the-loop architecture search),通过将硬件延迟评估嵌入搜索过程,联合优化模型结构(层数、维度)与注意力模式,并采用注意力跳过(attention skipping)机制替代专用注意力机制以加速长上下文处理;同时,所有候选模型均基于预训练骨干网络进行剪枝并继承权重,极大减少了微调成本,最终实现了可在标准移动端运行时(如Executorch)直接部署的MobileLLM-Flash系列模型,在保持高准确率的前提下显著提升前向填充(prefill)和解码(decode)速度(最高达1.8倍和1.6倍)。

链接: https://arxiv.org/abs/2603.15954
作者: Hanxian Huang,Igor Fedorov,Andrey Gromov,Bernard Beckerman,Naveen Suda,David Eriksson,Maximilian Balandat,Rylan Conway,Patrick Huber,Chinnadhurai Sankar,Ayushi Dalmia,Zechun Liu,Lemeng Wu,Tarek Elgamal,Adithya Sagar,Vikas Chandra,Raghuraman Krishnamoorthi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like Executorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto-frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
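The staged search culminates in filtering candidates to a latency-quality Pareto frontier. A generic dominance filter sketches this step (the candidate numbers are invented for illustration):

```python
def pareto_frontier(candidates):
    """Keep configs not dominated: a config is dominated if some other
    config is no slower AND no worse in quality, with at least one strict."""
    front = []
    for c in candidates:
        dominated = any(
            o["latency_ms"] <= c["latency_ms"] and o["quality"] >= c["quality"]
            and (o["latency_ms"] < c["latency_ms"] or o["quality"] > c["quality"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["latency_ms"])

cands = [
    {"name": "a", "latency_ms": 40,  "quality": 0.60},
    {"name": "b", "latency_ms": 55,  "quality": 0.58},  # dominated by a
    {"name": "c", "latency_ms": 70,  "quality": 0.71},
    {"name": "d", "latency_ms": 120, "quality": 0.70},  # dominated by c
    {"name": "e", "latency_ms": 150, "quality": 0.80},
]
front = pareto_frontier(cands)
print([c["name"] for c in front])  # → ['a', 'c', 'e']
```

Because latency evaluation is cheap (via the learned latency model), this filter can be run over very large candidate pools before any continued pretraining is spent.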

[AI-76] Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents

【速读】:该论文旨在解决当前机器学习(ML)方法在蛋白质设计中受限于仅能处理标准氨基酸(canonical amino acids)且目标范围狭窄的问题,从而难以实现广泛、灵活的蛋白质设计流程。其解决方案的关键在于构建一个名为Agent Rosetta的大型语言模型(LLM)代理系统,该系统与Rosetta这一领先的基于物理的异聚物(heteropolymer)设计软件相结合,并通过结构化环境实现对复杂科学任务的自主执行。该环境不仅支持非标准残基(non-canonical residues)的设计,还使LLM能够迭代优化设计以满足用户定义的目标,从而在保持通用性的同时达到与专用模型和人类专家相当的性能水平。研究进一步表明,单纯依赖提示工程(prompt engineering)无法有效生成Rosetta操作指令,凸显了环境设计在整合LLM代理与专业科学软件中的核心作用。

链接: https://arxiv.org/abs/2603.15952
作者: Jacopo Teneggi,S.M. Bargeen A. Turzo,Tanya Marwah,Alberto Bietti,P. Douglas Renfrew,Vikram Khipple Mulligan,Siavash Golkar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving an unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics-based heteropolymer design software, capable of modeling non-canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user-defined objectives, combining LLM reasoning with Rosetta’s generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non-canonical residues – where ML approaches fail – achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.

[AI-77] Argumentative Human-AI Decision-Making: Toward AI Agents That Reason With Us Not For Us

【速读】:该论文旨在解决传统计算论证(Computational Argumentation)因依赖领域特定信息和大量特征工程而难以扩展,以及大语言模型(LLM)虽能处理非结构化文本但推理过程不透明、难以验证和信任的问题。其解决方案的关键在于融合计算论证与大语言模型,通过论证框架挖掘(argumentation framework mining)、论证框架合成(argumentation framework synthesis)和论证推理(argumentative reasoning)的协同作用,构建一种新型人机决策模式——即“论辩式人机决策”(Argumentative Human-AI Decision-Making),使AI不仅能够解释决策,还能与人类进行可争辩、可修正的辩证互动,从而实现高风险场景下以人为本、可信的智能决策系统。

链接: https://arxiv.org/abs/2603.15946
作者: Stylianos Loukas Vasileiou,Antonio Rago,Francesca Toni,William Yeoh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computational argumentation offers formal frameworks for transparent, verifiable reasoning but has traditionally been limited by its reliance on domain-specific information and extensive feature engineering. In contrast, LLMs excel at processing unstructured text, yet their opaque nature makes their reasoning difficult to evaluate and trust. We argue that the convergence of these fields will lay the foundation for a new paradigm: Argumentative Human-AI Decision-Making. We analyze how the synergy of argumentation framework mining, argumentation framework synthesis, and argumentative reasoning enables agents that do not just justify decisions, but engage in dialectical processes where decisions are contestable and revisable – reasoning with humans rather than for them. This convergence of computational argumentation and LLMs is essential for human-aware, trustworthy AI in high-stakes domains.

[AI-78] Data-Local Autonomous LLM -Guided Neural Architecture Search for Multiclass Multimodal Time-Series Classification

【速读】:该论文旨在解决在隐私敏感领域(如医疗健康)中,基于时间序列数据的机器学习模型开发所面临的迭代瓶颈问题,尤其在多模态融合场景下,由于各传感器模态需独立预处理并组合,导致人工干预成本高且数据难以迁移至云端。其解决方案的关键在于提出一种数据本地化、由大语言模型(LLM)引导的神经架构搜索(NAS)框架,该框架通过远程控制候选模型管道的评估过程,同时确保所有训练与验证均在本地完成,并遵循固定协议;控制器仅接收试验级别的摘要信息(如管道描述、指标、学习曲线统计和失败日志),而不接触原始样本或中间特征表示,从而在保障数据隐私的前提下实现自动化架构探索,显著降低人工参与度。

链接: https://arxiv.org/abs/2603.15939
作者: Emil Hardarson,Luka Biedebach,Ómar Bessi Ómarsson,Teitur Hrólfsson,Anna Sigridur Islind,María Óskarsdóttir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Applying machine learning to sensitive time-series data is often bottlenecked by the iteration loop: Performance depends strongly on preprocessing and architecture, yet training often has to run on-premise under strict data-local constraints. This is a common problem in healthcare and other privacy-constrained domains (e.g., a hospital developing deep learning models on patient EEG). This bottleneck is particularly challenging in multimodal fusion, where sensor modalities must be individually preprocessed and then combined. LLM-guided neural architecture search (NAS) can automate this exploration, but most existing workflows assume cloud execution or access to data-derived artifacts that cannot be exposed. We present a novel data-local, LLM-guided search framework that handles candidate pipelines remotely while executing all training and evaluation locally under a fixed protocol. The controller observes only trial-level summaries, such as pipeline descriptors, metrics, learning-curve statistics, and failure logs, without ever accessing raw samples or intermediate feature representations. Our framework targets multiclass, multimodal learning via one-vs-rest binary experts per class and modality, a lightweight fusion MLP, and joint search over expert architectures and modality-specific preprocessing. We evaluate our method on two regimes: UEA30 (public multivariate time-series classification dataset) and SleepEDFx sleep staging (heterogeneous clinical modalities such as EEG, EOG, and EMG). The results show that the modular baseline model is strong, and the LLM-guided NAS further improves it. Notably, our method finds models that perform within published ranges across most benchmark datasets. Across both settings, our method reduces manual intervention by enabling unattended architecture search while keeping sensitive data on-premise. 
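The data-local contract is that the remote controller observes only trial-level summaries, never raw samples or features. A hypothetical reduction of a local run into such a summary (the field names are illustrative, not the paper's schema):

```python
def trial_summary(descriptor, curve, failed=False, log_tail=""):
    """Reduce a local training run to the only artifacts a remote
    controller may see: pipeline descriptor, metrics, learning-curve
    statistics, and failure logs -- no raw data leaves the premises."""
    if failed or not curve:
        return {"pipeline": descriptor, "status": "failed", "log_tail": log_tail}
    best = max(curve)
    return {
        "pipeline": descriptor,
        "status": "ok",
        "final_metric": curve[-1],
        "best_metric": best,
        "best_epoch": curve.index(best),
        "improving": curve[-1] >= curve[0],
    }

ok = trial_summary("eeg-cnn+stft", [0.51, 0.64, 0.71, 0.69])
bad = trial_summary("emg-mlp", [], failed=True, log_tail="OOM at epoch 0")
print(ok["best_epoch"], bad["status"])  # → 2 failed
```

The controller proposes the next candidate pipeline from these summaries alone, which is what keeps the search compatible with strict on-premise constraints.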

[AI-79] Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau Equilibrium

【速读】:该论文旨在解决等离子体物理中Vlasov-Maxwell-Landau(VML)系统平衡态特征的严格形式化问题,即通过数学逻辑语言对带电粒子系统的稳态行为进行精确建模与验证。其解决方案的关键在于构建了一个完整的AI辅助数学研究闭环:首先由生成式AI推理模型(Gemini DeepThink)从猜想出发生成证明,再由代理编码工具(Claude Code)将自然语言提示转化为Lean 4代码,随后由专用定理证明器(Aristotle)自动闭合111个引理,最终由Lean内核完成形式化验证。整个流程仅需一位数学家在10天内监督完成,且无需编写任何代码,同时揭示了AI在数学研究中的关键失败模式与有效策略,如假设蔓延、定义对齐错误及代理回避行为,并强调抽象-具体证明分离、对抗性自审和人类对核心定义与定理的审查是成功的核心要素。

链接: https://arxiv.org/abs/2603.15929
作者: Vasily Ilin
机构: 未知
类目: Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Logic (math.LO)
备注: 11 figures

点击查看摘要

Abstract:We present a complete Lean 4 formalization of the equilibrium characterization in the Vlasov-Maxwell-Landau (VML) system, which describes the motion of charged plasma. The project demonstrates the full AI-assisted mathematical research loop: an AI reasoning model (Gemini DeepThink) generated the proof from a conjecture, an agentic coding tool (Claude Code) translated it into Lean from natural-language prompts, a specialized prover (Aristotle) closed 111 lemmas, and the Lean kernel verified the result. A single mathematician supervised the process over 10 days at a cost of $200, writing zero lines of code. The entire development process is public: all 229 human prompts and 213 git commits are archived in the repository. We report detailed lessons on AI failure modes – hypothesis creep, definition-alignment bugs, agent avoidance behaviors – and on what worked: the abstract/concrete proof split, adversarial self-review, and the critical role of human review of key definitions and theorem statements. Notably, the formalization was completed before the final draft of the corresponding math paper was finished.

[AI-80] Evaluating Causal Discovery Algorithms for Path-Specific Fairness and Utility in Healthcare

【速读】:该论文旨在解决健康数据中因果发现(causal discovery)的评估难题,尤其是在缺乏真实因果图(ground truth)的情况下如何有效衡量算法性能。其关键解决方案是通过与领域专家合作构建代理真实因果图(proxy ground-truth graphs),并基于此在合成阿尔茨海默病和心力衰竭临床数据上建立基准测试集,从而实现对Peter-Clark、Greedy Equivalence Search和Fast Causal Inference等算法在结构恢复和路径特异性公平性分解方面的系统评估。该方法突破了传统复合公平性分数的局限,引入细粒度的路径特异性效应分析,揭示了不同算法在公平性-效用比上的差异,强调了在临床应用中进行图感知公平性评估的重要性。

链接: https://arxiv.org/abs/2603.15926
作者: Nitish Nagesh,Elahe Khatibi,Thomas Hughes,Mahdi Bagheri,Pratik Gajane,Amir M. Rahmani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal discovery in health data faces evaluation challenges when ground truth is unknown. We address this by collaborating with experts to construct proxy ground-truth graphs, establishing benchmarks for synthetic Alzheimer’s disease and heart failure clinical records data. We evaluate the Peter-Clark, Greedy Equivalence Search, and Fast Causal Inference algorithms on structural recovery and path-specific fairness decomposition, going beyond composite fairness scores. On synthetic data, Peter-Clark achieved the best structural recovery. On heart failure data, Fast Causal Inference achieved the highest utility. For path-specific effects, ejection fraction contributed 3.37 percentage points to the indirect effect in the ground truth. These differences drove variations in the fairness-utility ratio across algorithms. Our results highlight the need for graph-aware fairness evaluation and fine-grained path-specific analysis when deploying causal discovery in clinical applications.
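Structural recovery against a proxy ground-truth graph is typically scored with edge-set metrics such as the structural Hamming distance (SHD); the abstract does not name its exact metric, so the following is a generic sketch with invented clinical edges:

```python
def shd(true_edges, learned_edges):
    """Structural Hamming distance between directed graphs given as edge
    sets: missing + extra edges, with a reversed edge counted once."""
    missing = true_edges - learned_edges
    extra = learned_edges - true_edges
    # a reversed edge appears once in `missing` and once in `extra`;
    # collapse the pair into a single error
    reversed_pairs = {(a, b) for (a, b) in missing if (b, a) in extra}
    return len(missing) + len(extra) - len(reversed_pairs)

# hypothetical proxy ground truth vs. a learned graph
truth = {("EF", "HF"), ("Age", "HF"), ("Age", "EF")}
learned = {("HF", "EF"), ("Age", "HF")}
print(shd(truth, learned))  # → 2: one reversed edge, one missing edge
```

Pairing a structure score like this with downstream utility and path-specific effect decompositions is what lets the paper compare algorithms on a fairness-utility ratio.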

[AI-81] VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在自主软件工程中缺乏对隐性缺陷进行有效自诊断与修复能力的问题,尤其关注生成式 AI 在“vibe coding”范式下如何实现故障定位与针对性修复。其解决方案的关键在于提出名为 VIBEPASS 的系统性评估框架,首次将两个耦合任务——故障触发测试生成(Fault-Triggering Test Generation, FT-Test)和故障目标程序修复(Fault-targeted Program Repair, FPR)——联合分解并实证分析。研究发现,尽管LLMs在语法正确测试输入生成上表现接近天花板水平,但其在生成具有判别力的故障见证测试用例方面严重不足,且故障假设生成是主要瓶颈;此外,当模型自动生成的测试能成功触发故障时,修复效果优于外部测试引导的修复,而失败的测试则显著劣于无引导基线,揭示出故障目标推理能力才是当前所有前沿模型中制约自主调试的核心短板。

链接: https://arxiv.org/abs/2603.15921
作者: Srijan Bansal,Jiao Fangkai,Yilun Zhou,Austin Xu,Shafiq Joty,Semih Yavuz
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models shift programming toward human-guided "vibe coding", agentic coding tools increasingly rely on models to self-diagnose and repair their own subtle faults – a capability central to autonomous software engineering yet never systematically evaluated. We present VIBEPASS, the first empirical decomposition that jointly evaluates two coupled tasks: Fault-Triggering Test Generation (FT-Test), constructing a discriminative witness that exposes a latent bug, and Fault-targeted Program Repair (FPR), repairing it under varying diagnostic conditions. VIBEPASS pairs competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases, enabling controlled identification of where the diagnostic chain breaks down. Evaluating 12 frontier LLMs, we find that fault-targeted reasoning does not scale with general coding ability. Models produce syntactically valid test inputs at near-ceiling rates yet collapse on discriminative generation, with fault hypothesis generation – not output validation – as the dominant bottleneck. Test-guided repair reveals a complementary insight: when self-generated tests successfully witness a fault, the resulting repair matches or outperforms repair guided by externally provided tests, but tests that fail to witness the fault actively degrade repair below unguided baselines. Together, these results reframe the challenge of autonomous debugging: the binding bottleneck is not code synthesis or test validity but fault-targeted reasoning, a capability that remains deficient across all frontier models.
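A fault-triggering witness, in the benchmark's sense, is simply an input on which a buggy candidate diverges from a reference solution. A toy sketch (the buggy absolute-value function is an invented example, not from the benchmark):

```python
def find_witnesses(inputs, reference, candidate):
    """Return every input on which candidate and reference disagree;
    any such input is a discriminative, fault-triggering test."""
    return [x for x in inputs if reference(x) != candidate(x)]

reference = abs
buggy = lambda x: x  # forgets to negate negative inputs

witnesses = find_witnesses(range(-3, 4), reference, buggy)
print(witnesses)  # negatives expose the bug; non-negatives all pass
```

The hard part the paper isolates is not validating a candidate input (cheap, as above) but hypothesizing where the fault lives so that a discriminative input gets proposed in the first place.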

[AI-82] Auto Researching not hyperparameter tuning: Convergence Analysis of 10000 Experiments

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在自主设计机器学习(Machine Learning, ML)实验时,是否真正执行了架构搜索(architecture search),还是仅在固定架构范围内进行超参数调优(hyperparameter tuning)的问题。其关键解决方案是通过大规模实证分析——对两个LLM代理(Claude Opus 和 Gemini 2.5 Pro)在27天内执行的10,469次实验进行ANOVA分解,发现架构选择解释了94%的性能方差(F = 1324, η² = 0.94),而超参数变化仅解释6%,表明LLM代理确实实现了有效的架构探索;此外,研究进一步验证了其在不同任务上的泛化能力,并揭示了LLM引导搜索能聚焦于高潜力架构区域,显著优于随机或贝叶斯基线方法,在N=50时即达到0.985 AP,较从头随机搜索提升明显,且收敛遵循幂律规律(c = 0.11, R² = 0.93),体现了其高效性与探索深度。

链接: https://arxiv.org/abs/2603.15916
作者: Xiaoyi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When LLM agents autonomously design ML experiments, do they perform genuine architecture search – or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that architectural choices explain 94% of performance variance (F = 1324, η² = 0.94), while hyperparameter variation within a fixed architecture explains only 6%. Cross-task validation on a second collision dataset confirms this finding (75% architecture-explained variance) with a different winning backbone, confirming genuine architecture discovery. The agents’ key contribution is discovering that V-JEPA 2 video features with Zipformer temporal encoders achieve 0.9245 AP – a configuration no human proposed – and concentrating search on productive architectural regions: at N = 50, LLM-guided search reaches AP = 0.985 versus 0.965 for from-scratch random search. Post-bugfix convergence follows a power law (c = 0.11, R² = 0.93); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen–Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
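The effect size reported above, η² = SS_between / SS_total, falls out of a one-way ANOVA decomposition. A minimal sketch on toy data (the numbers below are illustrative, not the paper's experiments):

```python
def eta_squared(groups):
    """One-way ANOVA effect size: eta^2 = SS_between / SS_total,
    i.e. the fraction of total variance explained by group membership."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups
    )
    return ss_between / ss_total

# two "architectures" whose scores separate far more between groups
# than they vary within a group -> eta^2 close to 1
arch_a = [0.90, 0.91, 0.92]
arch_b = [0.60, 0.61, 0.62]
print(round(eta_squared([arch_a, arch_b]), 3))
```

With tight within-architecture spread and a large between-architecture gap, nearly all variance is attributed to the architecture choice, mirroring the paper's η² = 0.94 finding.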

[AI-83] he Agent ic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning

【速读】:该论文旨在解决当前研究人员在数学与机器学习领域中如何有效整合生成式 AI (Generative AI) 工具以提升科研效率的问题,尤其关注这些工具在日常研究实践中具体的应用场景、价值边界及责任约束。其解决方案的关键在于提出一个开源框架,该框架通过一组形式化为代理提示(agent prompts)的方法论规则,将命令行界面(CLI)编码代理(如 Claude Code、Codex CLI、OpenCode)转化为自主的研究助手;该框架运行于沙箱容器内,兼容任意前沿大语言模型(LLM),支持从个人笔记本电脑原型开发到多节点、多GPU计算集群的扩展,且无需人工干预即可执行长时间(超过20小时)的独立实验调度。此设计强调增强而非替代研究人员的角色,从而实现安全、可控且高效的 AI 辅助科研流程。

链接: https://arxiv.org/abs/2603.15914
作者: Max Zimmer,Nico Pelleriti,Christophe Roux,Sebastian Pokutta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI tools and agents are reshaping how researchers work, from proving theorems to training neural networks. Yet for many, it remains unclear how these tools fit into everyday research practice. This paper is a practical guide to AI-assisted research in mathematics and machine learning: We discuss how researchers can use modern AI systems productively, where these systems help most, and what kinds of guardrails are needed to use them responsibly. It is organized into three parts: (I) a five-level taxonomy of AI integration, (II) an open-source framework that, through a set of methodological rules formulated as agent prompts, turns CLI coding agents (e.g., Claude Code, Codex CLI, OpenCode) into autonomous research assistants, and (III) case studies from deep learning and mathematics. The framework runs inside a sandboxed container, works with any frontier LLM through existing CLI agents, is simple enough to install and use within minutes, and scales from personal-laptop prototyping to multi-node, multi-GPU experimentation across compute clusters. In practice, our longest autonomous session ran for over 20 hours, dispatching independent experiments across multiple nodes without human intervention. We stress that our framework is not intended to replace the researcher in the loop, but to augment them. Our code is publicly available at this https URL.

[AI-84] he Internet of Physical AI Agents : Interoperability Longevity and the Cost of Getting It Wrong

【速读】:该论文旨在解决当前物联网(Internet of Things, IoT)在长期可持续性、互操作性、自主性与安全性方面存在的根本局限,特别是在将快速演进的生成式 AI 能力嵌入到寿命较长的物理基础设施时所引入的新架构风险。其解决方案的关键在于将“演化能力”“信任机制”和“互操作性”作为首要设计要求,提出一套包含代理身份标识、安全的代理间通信、语义互操作性、策略驱动的运行时环境以及可观测性驱动治理的系统架构蓝图,从而构建具备韧性、可进化且可信的物理 AI 代理(Physical AI Agents)系统。

链接: https://arxiv.org/abs/2603.15900
作者: Roberto Morabito,Mallik Tatipamula
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: A related version of this work is currently under review for publication in an IEEE magazine

点击查看摘要

Abstract:The Internet has evolved by progressively expanding what humanity connects: first computers, then people, and later billions of devices through the Internet of Things (IoT). While IoT succeeded in digitizing perception at scale, it also exposed fundamental limitations, including fragmentation, weak security, limited autonomy, and poor long-term sustainability. Today, advances in edge hardware, sensing, connectivity, and artificial intelligence enable a new phase: the Internet of Physical AI Agents. Unlike IoT devices that primarily sense and report, Physical AI Agents perceive, reason, and act in real time, operating autonomously and cooperatively across safety-critical domains such as disaster response, healthcare, industrial automation, and mobility. However, embedding fast-evolving AI capabilities into long-lived physical infrastructure introduces new architectural risks, particularly around interoperability, lifecycle management, and premature ossification. This article revisits lessons from IoT and Internet evolution, and articulates design principles for building resilient, evolvable, and trustworthy agentic systems. We present an architectural blueprint encompassing agentic identity, secure agent-to-agent communication, semantic interoperability, policy-governed runtimes, and observability-driven governance. We argue that treating evolution, trust, and interoperability as first-class requirements is essential to avoid hard-coding today’s assumptions into tomorrow’s intelligent infrastructure, and to prevent the high technical and economic cost of getting it wrong.

[AI-85] PhasorFlow: A Python Library for Unit Circle Based Computing

【速读】:该论文旨在解决传统神经网络在建模连续几何梯度和保持全局能量守恒方面存在的局限性,以及量子计算在经典硬件上难以实现的问题。其解决方案的关键在于提出了一种基于单位圆(S¹)的新型计算范式——PhasorFlow,通过将输入编码为复平面上的相位矢量(phasors),利用酉波干涉门进行计算,从而在经典硬件上实现类量子力学的单位性演化(unitary evolution),同时保留连续几何梯度用于预测学习。核心创新包括:1)定义了相位电路模型(Phasor Circuit)及包含22个门操作的完整矩阵仿真库;2)引入变分相位电路(Variational Phasor Circuit, VPC)以优化连续相位参数;3)设计无需参数的DFT驱动的相位变换器(Phasor Transformer),替代传统注意力机制,显著提升效率。该方法在多个任务中验证了其确定性、轻量化与数学严谨性,成为经典神经网络与量子电路之间的有效替代方案。

链接: https://arxiv.org/abs/2603.15886
作者: Dibakar Sigdel,Namuna Panday
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present PhasorFlow, an open-source Python library introducing a computational paradigm operating on the S¹ unit circle. Inputs are encoded as complex phasors z = e^{iθ} on the N-torus (T^N). As computation proceeds via unitary wave interference gates, global norm is preserved while individual components drift into C^N, allowing algorithms to natively leverage continuous geometric gradients for predictive learning. PhasorFlow provides three core contributions. First, we formalize the Phasor Circuit model (N unit circle threads, M gates) and introduce a 22-gate library covering Standard Unitary, Non-Linear, Neuromorphic, and Encoding operations with full matrix algebra simulation. Second, we present the Variational Phasor Circuit (VPC), analogous to Variational Quantum Circuits (VQC), enabling optimization of continuous phase parameters for classical machine learning tasks. Third, we introduce the Phasor Transformer, replacing expensive QKᵀV attention with a parameter-free, DFT-based token mixing layer inspired by FNet. We validate PhasorFlow on non-linear spatial classification, time-series prediction, financial volatility detection, and neuromorphic tasks including neural binding and oscillatory associative memory. Our results establish unit circle computing as a deterministic, lightweight, and mathematically principled alternative to classical neural networks and quantum circuits. It operates on classical hardware while sharing quantum mechanics’ unitary foundations. PhasorFlow is available at this https URL.
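The two core invariants, unit-modulus phasor encoding and norm preservation under a unitary mixing step, can be checked in a few lines of NumPy. This is a generic sketch of the idea, not PhasorFlow's actual API:

```python
import numpy as np

def encode_phasors(theta):
    """Encode real angles as unit-circle phasors z = e^{i*theta}."""
    return np.exp(1j * np.asarray(theta, dtype=float))

def dft_mix(tokens):
    """Parameter-free token mixing via a normalized (unitary) DFT,
    in the spirit of FNet-style mixing; the global norm is preserved."""
    return np.fft.fft(tokens, axis=0, norm="ortho")

z = encode_phasors([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])
mixed = dft_mix(z)
# every encoded component lies on S^1; mixing is norm-preserving,
# though individual components drift off the unit circle into C^N
print(np.linalg.norm(z), np.linalg.norm(mixed))
```

This mirrors the abstract's description: gates are unitary (global norm conserved), while individual components leave S¹ after interference, which is what supplies the continuous gradients used for learning.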

[AI-86] Resilience Meets Autonomy: Governing Embodied AI in Critical Infrastructure

【速读】:该论文旨在解决嵌入式人工智能(Embodied AI)在关键基础设施中面临的韧性不足问题,特别是在面对超出训练假设的级联故障和危机动态时,传统AI系统难以有效应对。其解决方案的关键在于构建一种混合治理架构(hybrid governance architecture),通过限定人工智能的自主性边界,并在机器能力与人类判断之间进行结构化分配,从而提升系统在复杂任务、高风险和严重后果场景下的适应性和可靠性。

链接: https://arxiv.org/abs/2603.15885
作者: Puneet Sharma,Christer Henrik Pursiainen
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 6 pages

点击查看摘要

Abstract:Critical infrastructure increasingly incorporates embodied AI for monitoring, predictive maintenance, and decision support. However, AI systems designed to handle statistically representable uncertainty struggle with cascading failures and crisis dynamics that exceed their training assumptions. This paper argues that Embodied AI’s resilience depends on bounded autonomy within a hybrid governance architecture. We outline four oversight modes and map them to critical infrastructure sectors based on task complexity, risk level, and consequence severity. Drawing on the EU AI Act, ISO safety standards, and crisis management research, we argue that effective governance requires a structured allocation of machine capability and human judgement.

[AI-87] Electrodermal Activity as a Unimodal Signal for Aerobic Exercise Detection in Wearable Sensors

【速读】:该论文旨在解决如何仅使用皮肤电活动(Electrodermal Activity, EDA)信号在无个体依赖性评估下可靠区分静息状态与持续有氧运动状态的问题。其关键解决方案在于利用来自30名健康受试者的公开数据集,通过基准机器学习模型结合留一被试者排除(leave-one-subject-out, LOSO)验证策略,系统评估仅基于EDA特征的分类性能;结果表明,EDA中的时相性时间动态特征和事件发生时机对两类状态的分离具有显著贡献,从而为EDA作为可穿戴设备中单一模态输入在活动状态推断中的作用提供了保守且可信的性能基准。

链接: https://arxiv.org/abs/2603.15880
作者: Rena Mira Krishna,Ramya Sankar,Shadi Ghiasi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electrodermal Activity (EDA) is a non-invasive physiological signal widely available in wearable devices and reflects sympathetic nervous system (SNS) activation. Prior multi-modal studies have demonstrated robust performance in distinguishing stress and exercise states when EDA is combined with complementary signals such as heart rate and accelerometry. However, the ability of EDA to independently distinguish sustained aerobic exercise from low-arousal states under subject-independent evaluation remains insufficiently characterized. This study investigates whether features derived exclusively from EDA can reliably differentiate rest from sustained aerobic exercise. Using a publicly available dataset collected from thirty healthy individuals, EDA features were evaluated using benchmark machine learning models with leave-one-subject-out (LOSO) validation. Across models, EDA-only classifiers achieved moderate subject-independent performance, with phasic temporal dynamics and event timing contributing to class separation. Rather than proposing EDA as a replacement for multimodal sensing, this work provides a conservative benchmark of the discriminative power of EDA alone and clarifies its role as a unimodal input for wearable activity-state inference.
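
摘要中的留一被试者排除(LOSO)验证可以用 scikit-learn 的 `LeaveOneGroupOut` 实现。以下是一个示意草图:特征与标签均为随机生成的替身数据,并非论文所用的 EDA 数据集:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# 合成替身数据:6 名受试者,每人 20 个样本,4 维特征(模拟 EDA 衍生特征)
rng = np.random.default_rng(0)
n_subjects, samples_per_subject = 6, 20
X = rng.normal(size=(n_subjects * samples_per_subject, 4))
y = rng.integers(0, 2, size=n_subjects * samples_per_subject)  # 静息 vs. 运动
groups = np.repeat(np.arange(n_subjects), samples_per_subject)  # 受试者 ID

logo = LeaveOneGroupOut()  # 每折将一名受试者的全部样本留作测试
scores = cross_val_score(LogisticRegression(), X, y, cv=logo, groups=groups)
print(len(scores))  # 每名被留出的受试者对应一个准确率
```

LOSO 的要点在于同一受试者的样本绝不同时出现在训练与测试集中,这正是"subject-independent evaluation"的含义。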

[AI-88] Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning NEURIPS2025

【速读】:该论文旨在解决高维马尔可夫决策过程(Markov Decision Process, MDP)中强化学习策略面临的计算复杂性与政策性能之间的矛盾问题,即在状态空间呈指数级增长时,传统强化学习方法难以实现高效、可扩展的学习。其解决方案的关键在于引入一种基于对抗性动作(counteractive actions)所获取经验的全新理论框架,该方法在不增加额外计算复杂度的前提下,显著加速了训练过程,并提升了样本效率,从而实现了高效、有效且可扩展的学习机制。

链接: https://arxiv.org/abs/2603.15871
作者: Ezgi Korkmaz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Spotlight

点击查看摘要

Abstract:Following the pivotal success of learning strategies to win at tasks, solely by interacting with an environment without any supervision, agents have gained the ability to make sequential decisions in complex MDPs. Yet, reinforcement learning policies face exponentially growing state spaces in high dimensional MDPs resulting in a dichotomy between computational complexity and policy success. In our paper we focus on the agent’s interaction with the environment in a high-dimensional MDP during the learning phase and we introduce a theoretically-founded novel paradigm based on experiences obtained through counteractive actions. Our analysis and method provide a theoretical basis for efficient, effective, scalable and accelerated learning, and further comes with zero additional computational complexity while leading to significant acceleration in training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. The experimental results further verify our theoretical analysis, and our method achieves significant performance increase with substantial sample-efficiency in high-dimensional environments.

[AI-89] Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models ICLR2026

【速读】:该论文旨在解决行为基础模型(Behavioral Foundation Models, BFMs)在零样本强化学习(zero-shot RL)中对状态特征表示的依赖问题,即现有方法仅能生成在预设状态特征空间中可线性表示的奖励函数的近优策略,导致状态特征的选择成为关键瓶颈。为提升BFM的表达能力并减少对复杂表征学习目标的依赖,论文提出一种名为正则化潜空间动态预测(Regularized Latent Dynamics Prediction, RLDP)的新方法,其核心在于在潜空间下一状态预测任务基础上引入简单的正交性正则化项,以维持状态特征的多样性并防止因特征相似性增加而导致的表示空间维度压缩(span reduction),从而在不依赖复杂训练目标的情况下实现与当前最先进方法相当或更优的零样本强化学习性能,并在数据覆盖不足场景下仍保持鲁棒性。

链接: https://arxiv.org/abs/2603.15857
作者: Pranaya Jajoo,Harshit Sikchi,Siddhant Agarwal,Amy Zhang,Scott Niekum,Martha White
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ICLR 2026

点击查看摘要

Abstract:Behavioral Foundation Models (BFMs) produce agents with the capability to adapt to any unknown reward or task. These methods, however, are only able to produce near-optimal policies for the reward functions that are in the span of some pre-existing state features, making the choice of state features crucial to the expressivity of the BFM. As a result, BFMs are trained using a variety of complex objectives and require sufficient dataset coverage, to train task-useful spanning features. In this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, and subsequently reducing span. We propose an approach, Regularized Latent Dynamics Prediction (RLDP), that adds a simple orthogonality regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we empirically show that prior approaches perform poorly in low-coverage scenarios where RLDP still succeeds.
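
摘要中的正交性正则化思想可以写成一个很小的惩罚项。以下草图是我们对该思路的示意性解读(非作者代码):惩罚状态特征维度之间的相似性,以维持表示空间的张成(span):

```python
import numpy as np

def orthogonality_penalty(features):
    """对 (n, d) 批次的状态特征计算 ||F^T F / n - I||_F^2,
    该值越大,说明特征维度之间越相似、张成越小。"""
    n, d = features.shape
    gram = features.T @ features / n
    return float(np.sum((gram - np.eye(d)) ** 2))

rng = np.random.default_rng(0)
diverse = rng.normal(size=(1024, 8))            # 近似正交的多样特征
collapsed = np.tile(diverse[:, :1], (1, 8))     # 所有维度完全相同(张成坍缩)
# 坍缩特征受到远大于多样特征的惩罚
assert orthogonality_penalty(collapsed) > orthogonality_penalty(diverse)
```

实际训练中,该惩罚项会与潜空间下一状态预测损失加权相加后一起优化。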

[AI-90] Algorithmic Trading Strategy Development and Optimisation

【速读】:该论文旨在解决传统算法交易策略在市场预测和收益优化方面存在的局限性,尤其在如何有效融合技术指标与市场情绪信息以提升交易性能的问题。其解决方案的关键在于构建一个集成多种技术指标(如移动平均线、动量、波动率)与基于FinBERT的 earnings call 情绪分析(earnings call sentiment analysis)的增强型算法交易策略,并通过计算优化实现参数调优与策略性能提升。实证结果表明,该策略在总回报、夏普比率和回撤等关键指标上显著优于基线模型,验证了多源信息融合与计算优化在算法交易系统中的有效性。

链接: https://arxiv.org/abs/2603.15848
作者: Owen Nyo Wei Yuan,Victor Tan Jia Xuan,Ong Jun Yao Fabian,Ryan Tan Jun Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 7 figures

点击查看摘要

Abstract:The report presents the development and optimisation of an enhanced algorithmic trading strategy using historical S&P 500 market data and earnings call sentiment analysis. The proposed strategy integrates technical indicators such as moving averages, momentum, and volatility with FinBERT-based sentiment analysis to improve the quality of trades taken. The results show that the enhanced strategy significantly outperforms the baseline model in terms of total return, Sharpe ratio, and drawdown, among other metrics. The findings demonstrate the relevance and effectiveness of combining technical indicators, sentiment analysis, and computational optimisation in algorithmic trading systems.
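
摘要提到的几类技术指标(移动平均线、动量、波动率)都可以用几行 NumPy 计算。以下是玩具示意,窗口等参数为本文假设,与原报告无关:

```python
import numpy as np

def sma(prices, window):
    """简单移动平均(Simple Moving Average)。"""
    return np.convolve(prices, np.ones(window) / window, mode="valid")

def momentum(prices, lag):
    """动量:lag 个周期内的价格变化。"""
    return prices[lag:] - prices[:-lag]

def rolling_volatility(prices, window):
    """波动率:收益率在滚动窗口内的标准差。"""
    returns = np.diff(prices) / prices[:-1]
    return np.array([returns[i:i + window].std()
                     for i in range(len(returns) - window + 1)])

prices = np.array([100.0, 101, 102, 101, 103, 105, 104, 106])
print(sma(prices, 3))       # 平滑后的价格序列
print(momentum(prices, 2))  # 2 步动量
print(rolling_volatility(prices, 3))
```

实际策略会将这些指标与 FinBERT 情绪分数一起作为交易信号的输入特征。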

[AI-91] Informationally Compressive Anonymization: Non-Degrading Sensitive Input Protection for Privacy-Preserving Supervised Machine Learning

【速读】:该论文旨在解决现代机器学习系统在依赖敏感数据时面临的隐私、安全与合规风险问题,现有隐私保护机器学习(ppML)技术如差分隐私(Differential Privacy, DP)和同态加密(Homomorphic Encryption, HE)虽能提供一定保障,但常以性能下降、复杂度上升或计算开销过大为代价。其解决方案的关键在于提出信息压缩匿名化(Informationally Compressive Anonymization, ICA)机制与VEIL架构:ICA通过在可信源环境内嵌入监督式多目标编码器,将原始输入映射为低维、任务对齐的潜在表示,并确保仅导出不可逆的匿名向量至不可信训练与推理环境;该方法从拓扑与信息论角度严格证明编码结构在逻辑上不可逆,即使在理想攻击者假设下也无法重构原始数据,且在实际部署中攻击者对原始数据的条件熵发散,使重建概率趋近于零;同时,ICA通过与下游监督目标对齐的表征学习保留预测效用,避免梯度裁剪、噪声预算或推理时加密,从而实现高吞吐、低延迟的高性能机器学习,且天然契合隐私设计(Privacy-by-Design)监管框架,具备抗后量子威胁的能力。

链接: https://arxiv.org/abs/2603.15842
作者: Jeremy J Samuelson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 25 pages, 17 figures

点击查看摘要

Abstract:Modern machine learning systems increasingly rely on sensitive data, creating significant privacy, security, and regulatory risks that existing privacy-preserving machine learning (ppML) techniques, such as Differential Privacy (DP) and Homomorphic Encryption (HE), address only at the cost of degraded performance, increased complexity, or prohibitive computational overhead. This paper introduces Informationally Compressive Anonymization (ICA) and the VEIL architecture, a privacy-preserving ML framework that achieves strong privacy guarantees through architectural and mathematical design rather than noise injection or cryptography. ICA embeds a supervised, multi-objective encoder within a trusted Source Environment to transform raw inputs into low-dimensional, task-aligned latent representations, ensuring that only irreversibly anonymized vectors are exported to untrusted Training and Inference Environments. The paper rigorously proves that these encodings are structurally non-invertible using topological and information-theoretic arguments, showing that inversion is logically impossible, even under idealized attacker assumptions, and that, in realistic deployments, the attacker’s conditional entropy over the original data diverges, driving reconstruction probability to zero. Unlike prior autoencoder-based ppML approaches, ICA preserves predictive utility by aligning representation learning with downstream supervised objectives, enabling low-latency, high-performance ML without gradient clipping, noise budgets, or encryption at inference time. The VEIL architecture enforces strict trust boundaries, supports scalable multi-region deployment, and naturally aligns with privacy-by-design regulatory frameworks, establishing a new foundation for enterprise ML that is secure, performant, and safe by construction, even in the face of post-quantum threats.
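
摘要中"结构上不可逆"的论证核心是降维映射的多对一性质,可用一个玩具线性编码器直观说明(仅为示意,并非论文的 ICA 编码器):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))   # 玩具编码器:8 维敏感输入 -> 2 维潜在向量

x = rng.normal(size=8)
null_basis = np.linalg.svd(W)[2][2:]   # 编码器零空间的一组基(被丢弃的方向)
x_alt = x + 3.0 * null_basis[0]        # 一个完全不同的原始输入……
assert not np.allclose(x, x_alt)
assert np.allclose(W @ x, W @ x_alt)   # ……却产生完全相同的潜在编码
```

由于无穷多个原始输入共享同一编码,仅凭导出的潜在向量无法唯一重构原始数据;ICA 的正式论证在此基础上进一步给出了拓扑与信息论层面的保证。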

[AI-92] Hypothesis Class Determines Explanation: Why Accurate Models Disagree on Feature Attribution

【速读】:该论文试图解决的问题是:在可解释人工智能(Explainable AI)实践中广泛假设“预测等价模型产生等价解释”的合理性问题。研究表明,这一假设并不成立——即使多个模型在预测性能上完全一致,其生成的特征归因(feature attributions)仍可能显著不同。解决方案的关键在于识别出“假设类(hypothesis class)”作为导致这种差异的核心结构驱动因素,并提出一个名为“Explanation Lottery”的现象来描述该机制。作者进一步理论证明,在数据生成过程中存在交互结构时,这种归因差异(即Agreement Gap)依然存在;并据此设计了一个后验诊断指标——解释可靠性分数(Explanation Reliability Score, R(x)),用于在不额外训练的前提下预测解释在不同架构间的稳定性,从而为模型选择提供基于解释稳定性的决策依据。

链接: https://arxiv.org/abs/2603.15821
作者: Thackshanaramana B
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 1 figure. Submitted to TMLR

点击查看摘要

Abstract:The assumption that prediction-equivalent models produce equivalent explanations underlies many practices in explainable AI, including model selection, auditing, and regulatory evaluation. In this work, we show that this assumption does not hold. Through a large-scale empirical study across 24 datasets and multiple model classes, we find that models with identical predictive behavior can produce substantially different feature attributions. This disagreement is highly structured: models within the same hypothesis class exhibit strong agreement, while cross-class pairs (e.g., tree-based vs. linear) trained on identical data splits show substantially reduced agreement, consistently near or below the lottery threshold. We identify hypothesis class as the structural driver of this phenomenon, which we term the Explanation Lottery. We theoretically show that the resulting Agreement Gap persists under interaction structure in the data-generating process. This structural finding motivates a post-hoc diagnostic, the Explanation Reliability Score R(x), which predicts when explanations are stable across architectures without additional training. Our results demonstrate that model selection is not explanation-neutral: the hypothesis class chosen for deployment can determine which features are attributed responsibility for a decision.
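
衡量两个模型的特征归因是否一致,一个常见做法是比较最重要的 k 个特征的重合度。以下草图体现了这一思路;论文实际使用的一致性度量可能不同:

```python
import numpy as np

def topk_agreement(attr_a, attr_b, k):
    """两个归因向量中(按绝对值)最重要的 k 个特征的重合比例。"""
    top_a = set(np.argsort(-np.abs(attr_a))[:k])
    top_b = set(np.argsort(-np.abs(attr_b))[:k])
    return len(top_a & top_b) / k

# 假想的两个预测等价模型的归因向量(例如线性模型 vs. 树模型)
linear_attr = np.array([0.9, 0.1, 0.05, 0.8, 0.0])
tree_attr = np.array([0.2, 0.7, 0.6, 0.1, 0.0])
print(topk_agreement(linear_attr, tree_attr, k=2))  # 两者前 2 特征完全不重合
```

当这一重合度在跨假设类的模型对之间系统性偏低时,就出现了摘要所说的"Explanation Lottery"现象。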

[AI-93] Prose2Policy (P2P): A Practical LLM Pipeline for Translating Natural-Language Access Policies into Executable Rego

【速读】:该论文旨在解决将人类可读的访问控制策略(Natural-Language Access Control Policies, NLACPs)高效、准确地转换为机器可执行的策略即代码(Policy-as-Code, PaC)的问题,尤其是在零信任(Zero Trust)和合规驱动场景中对策略可靠性和可审计性的高要求。解决方案的关键在于提出 Prose2Policy (P2P),一个基于大语言模型(Large Language Model, LLM)的模块化端到端工具链,其核心能力包括策略检测、组件提取、模式验证、代码格式化、编译、自动测试生成与执行,从而实现从自然语言到 Open Policy Agent (OPA) 的 Rego 代码的高保真转换,最终在 ACRE 数据集上实现了 95.3% 的编译成功率和高达 82.2% 的正向测试通过率,验证了其语法鲁棒性与行为一致性。

链接: https://arxiv.org/abs/2603.15799
作者: Vatsal Gupta,Darshan Sreenivasamurthy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prose2Policy (P2P) is a LLM-based practical tool that translates natural-language access control policies (NLACPs) into executable Rego code (the policy language of Open Policy Agent, OPA). It provides a modular, end-to-end pipeline that performs policy detection, component extraction, schema validation, linting, compilation, automatic test generation and execution. Prose2Policy is designed to bridge the gap between human-readable access requirements and machine-enforceable policy-as-code (PaC) while emphasizing deployment reliability and auditability. We evaluated Prose2Policy on the ACRE dataset and demonstrated a 95.3% compile rate for accepted policies, with automated testing achieving a 82.2% positive-test pass rate and a 98.9% negative-test pass rate. These results indicate that Prose2Policy produces syntactically robust and behaviorally consistent Rego policies suitable for Zero Trust and compliance-driven environments.

[AI-94] CUBE: A Standard for Unifying Agent Benchmarks

【速读】:该论文旨在解决当前代理基准测试(agent benchmark)领域因碎片化而导致的研究效率低下问题,即每个新基准都需要大量定制化集成,形成所谓的“集成税”(integration tax),限制了全面评估和跨平台复用。其解决方案的关键在于提出一种通用协议标准 CUBE(Common Unified Benchmark Environments),基于 MCP 和 Gym 构建,通过将任务(task)、基准(benchmark)、包(package)和注册表(registry)等关注点分离为独立的 API 层,实现一次封装即可在任意兼容平台上使用,从而支持评估、强化学习(RL)训练及数据生成而无需额外定制开发。

链接: https://arxiv.org/abs/2603.15798
作者: Alexandre Lacoste,Nicolas Gontier,Oleh Shliazhko,Aman Jaiswal,Kusha Sareen,Shailesh Nanisetty,Joan Cabezas,Manuel Del Verme,Omar G. Younis,Simone Baratta,Matteo Avalle,Imene Kerboua,Xing Han Lù,Elron Bandel,Michal Shmueli-Scheuer,Asaf Yehudai,Leshem Choshen,Jonathan Lebensold,Sean Hughes,Massimo Caccia,Alexandre Drouin,Siva Reddy,Tao Yu,Yu Su,Graham Neubig,Dawn Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Position paper. 10 pages. Reference implementation: this https URL

点击查看摘要

Abstract:The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an “integration tax” that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.
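
"包装一次、处处可用"的分层思想可以用一个 Gym 风格的适配器类来示意。注意:以下类名与方法均为本文的假设,并非 CUBE 标准的实际接口定义:

```python
class UnifiedBenchmarkEnv:
    """围绕任意任务的 Gym 风格统一适配器(reset/step 接口)。"""

    def __init__(self, task):
        self.task = task        # 任务描述:指令与期望输出
        self.history = []

    def reset(self):
        """开始新的评测回合,返回初始观测(任务指令)。"""
        self.history = []
        return self.task["instruction"]

    def step(self, action):
        """接收智能体动作,返回 (观测, 奖励, 是否结束)。"""
        self.history.append(action)
        done = action == self.task["expected"]
        reward = 1.0 if done else 0.0
        return action, reward, done

env = UnifiedBenchmarkEnv({"instruction": "echo hello", "expected": "hello"})
obs = env.reset()
_, reward, done = env.step("hello")
assert obs == "echo hello" and reward == 1.0 and done
```

一旦基准被包进这样的统一接口,评估、RL 训练与数据生成平台就都能直接消费它,而无需逐一定制集成。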

[AI-95] OMNIFLOW: A Physics-Grounded Multimodal Agent for Generalized Scientific Reasoning

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理由偏微分方程(Partial Differential Equations, PDEs)描述的连续时空动力学时,常出现非物理解释(non-physical hallucinations)的问题,且现有方法依赖昂贵的领域特定微调,限制了跨域泛化能力和可解释性。其解决方案的关键在于提出一种神经符号架构 OMNIFLOW,通过引入语义-符号对齐机制(Semantic-Symbolic Alignment),将高维流场张量映射为拓扑语言描述,使模型感知物理结构而非原始像素;同时构建物理引导的思维链流程(Physics-Guided Chain-of-Thought, PG-CoT),通过动态约束注入(如质量守恒)和迭代反思验证实现可解释的科学推理,从而在零样本泛化与少样本适应任务中显著优于传统深度学习基线,并提供透明、物理一致的推理报告。

链接: https://arxiv.org/abs/2603.15797
作者: Hao Wu,Yongheng Zhang,Yuan Gao,Fan Xu,Fan Zhang,Ruobing Xie,Ruijian Gou,Yuxuan Liang,Xiaomeng Huang,Xian Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional logical reasoning capabilities but frequently struggle with the continuous spatiotemporal dynamics governed by Partial Differential Equations (PDEs), often resulting in non-physical hallucinations. Existing approaches typically resort to costly, domain-specific fine-tuning, which severely limits cross-domain generalization and interpretability. To bridge this gap, we propose OMNIFLOW, a neuro-symbolic architecture designed to ground frozen multimodal LLMs in fundamental physical laws without requiring domain-specific parameter updates. OMNIFLOW introduces a novel \textitSemantic-Symbolic Alignment mechanism that projects high-dimensional flow tensors into topological linguistic descriptors, enabling the model to perceive physical structures rather than raw pixel values. Furthermore, we construct a Physics-Guided Chain-of-Thought (PG-CoT) workflow that orchestrates reasoning through dynamic constraint injection (e.g., mass conservation) and iterative reflexive verification. We evaluate OMNIFLOW on a comprehensive benchmark spanning microscopic turbulence, theoretical Navier-Stokes equations, and macroscopic global weather forecasting. Empirical results demonstrate that OMNIFLOW significantly outperforms traditional deep learning baselines in zero-shot generalization and few-shot adaptation tasks. Crucially, it offers transparent, physically consistent reasoning reports, marking a paradigm shift from black-box fitting to interpretable scientific reasoning.
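
摘要中"动态约束注入(如质量守恒)"的一个具体例子,是检验二维不可压缩流场的离散散度是否为零。以下草图的网格与容差均为本文假设:

```python
import numpy as np

def divergence(u, v, dx=1.0, dy=1.0):
    """二维速度场 (u, v) 的中心差分散度:du/dx + dv/dy。"""
    du_dx = np.gradient(u, dx, axis=1)
    dv_dy = np.gradient(v, dy, axis=0)
    return du_dx + dv_dy

# 刚体旋转流场 u = -y, v = x 是无散度(质量守恒)的
y, x = np.mgrid[-2.0:2.0:9j, -2.0:2.0:9j]   # 9x9 网格,格距 0.5
u, v = -y, x
assert np.allclose(divergence(u, v, dx=0.5, dy=0.5), 0.0, atol=1e-10)
```

在 PG-CoT 这类流程中,此类物理检查可作为反思验证步骤:若模型生成的流场散度明显非零,则触发重新推理。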

[AI-96] CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving

【速读】:该论文旨在解决学习型自动驾驶规划器缺乏显式自我修正能力的问题,即一旦生成不安全动作便无法进行纠正,从而导致潜在碰撞风险。解决方案的关键在于提出一种自回归式规划器 CorrectionPlanner,其通过“提议-评估-修正”循环实现动态自我修正:在每个规划步骤中,策略生成一个运动标记(motion token),由预训练的碰撞评判器预测该动作是否会在短时视野内引发碰撞;若判定为不安全,则保留历史不安全运动标记序列作为自修正轨迹(self-correction trace),并基于此条件生成下一个运动标记,直至生成安全动作或满足安全准则。该机制将规划过程建模为运动标记的生成过程,其中自修正轨迹类似于语言模型中的推理轨迹,显著提升了规划安全性与鲁棒性。

链接: https://arxiv.org/abs/2603.15771
作者: Yihong Guo,Dongqiangzi Ye,Sijia Chen,Anqi Liu,Xianming Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous driving requires safe planning, but most learning-based planners lack explicit self-correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self-correction trace, generate the next motion token conditioned on it, and repeat this process until a safe motion token is proposed or the safety criterion is met. This self-correction trace, consisting of all unsafe motion tokens, represents the planner’s correction process in motion-token space, analogous to a reasoning trace in language models. We train the planner with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model that realistically models agents’ reactive behaviors. Closed-loop evaluations show that CorrectionPlanner reduces collision rate by over 20% on Waymax and achieves state-of-the-art planning scores on nuPlan.
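
摘要所述"提议-评估-修正"循环的控制流可以写成一个短小的 Python 草图。策略与碰撞评判器均为确定性替身函数,并非论文中的学习模型:

```python
def plan_with_correction(propose, is_unsafe, max_corrections=8):
    """不断生成运动标记,直到评判器判定安全为止;
    被拒绝的标记序列即自修正轨迹(self-correction trace)。"""
    trace = []   # 历史不安全运动标记序列
    token = None
    for _ in range(max_corrections):
        token = propose(trace)      # 以自修正轨迹为条件生成下一个标记
        if not is_unsafe(token):    # 碰撞评判器判定安全,则接受该标记
            return token, trace
        trace.append(token)
    return token, trace             # 达到修正上限时退回最后一次提议

# 确定性替身:按 0,1,2,... 依次提议;小于 3 的标记视为不安全
propose = lambda trace: len(trace)
is_unsafe = lambda token: token < 3
token, trace = plan_with_correction(propose, is_unsafe)
assert token == 3 and trace == [0, 1, 2]
```

这里的 `trace` 对应语言模型中的推理轨迹:它记录了规划器在运动标记空间中的修正过程。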

[AI-97] Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

【速读】:该论文旨在解决机器人领域中仿真到现实(sim-to-real)迁移的核心挑战,即仿真环境与真实世界动力学不匹配导致的性能下降问题。现有强化学习方法在真实场景下的微调过程中面临探索困难和长时程信用分配难题,尤其是在数据稀缺的情况下。解决方案的关键在于提出Simulation Distillation (SimDist) 框架,其核心机制是将模拟器中的结构先验知识蒸馏到一个潜在世界模型中,并通过在线规划与监督式动力学微调实现快速真实世界适应。SimDist 直接从仿真迁移奖励和价值模型,从而在部署阶段无需额外的价值学习即可提供密集的规划信号,使真实世界适应转化为短时程系统辨识问题,避免了复杂的长期信用分配,显著提升了数据效率、稳定性和最终性能。

链接: https://arxiv.org/abs/2603.15759
作者: Jacob Levy,Tyler Westenbroek,Kevin Huang,Fernando Palafox,Patrick Yin,Shayegan Omidshafiei,Dong-Ki Kim,Abhishek Gupta,David Fridovich-Keil
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project website: this https URL

点击查看摘要

Abstract:Simulation-to-real transfer remains a central challenge in robotics, as mismatches between simulated and real-world dynamics often lead to failures. While reinforcement learning offers a principled mechanism for adaptation, existing sim-to-real finetuning methods struggle with exploration and long-horizon credit assignment in the low-data regimes typical of real-world robotics. We introduce Simulation Distillation (SimDist), a sim-to-real framework that distills structural priors from a simulator into a latent world model and enables rapid real-world adaptation via online planning and supervised dynamics finetuning. By transferring reward and value models directly from simulation, SimDist provides dense planning signals from raw perception without requiring value learning during deployment. As a result, real-world adaptation reduces to short-horizon system identification, avoiding long-horizon credit assignment and enabling fast, stable improvement. Across precise manipulation and quadruped locomotion tasks, SimDist substantially outperforms prior methods in data efficiency, stability, and final performance. Project website and code: this https URL

[AI-98] Youve Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

【速读】:该论文旨在解决预训练生成式机器人策略(pretrained generative robot policy)在实际应用中性能受限的问题,特别是当其依赖随机初始噪声采样时,难以稳定提升下游任务奖励表现。解决方案的关键在于:使用一个固定且经过优化的常量初始噪声输入(称为“黄金票券”(golden ticket))替代传统的高斯分布采样,从而显著改善策略性能。该方法通过蒙特卡洛策略评估进行搜索,在不微调预训练模型的前提下找到最优初始噪声,适用于所有扩散模型(diffusion)和流匹配(flow matching)策略(包括多种视觉语言动作模型 VLA),无需额外训练或基础设施,仅需注入初始噪声并计算稀疏任务奖励即可实现部署。实验证明,该方法在43个任务中有38个取得性能提升,模拟任务成功率最高提升58%,真实机器人任务在50次搜索内提升达60%。

链接: https://arxiv.org/abs/2603.15757
作者: Omkar Patil,Ondrej Biza,Thomas Weng,Karl Schmeckpeper,Wil Thomason,Xiaohan Zhang,Robin Walters,Nakul Gopalan,Sebastian Castro,Eric Rosen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:What happens when a pretrained generative robot policy is provided a constant initial noise as input, rather than repeatedly sampling it from a Gaussian? We demonstrate that the performance of a pretrained, frozen diffusion or flow matching policy can be improved with respect to a downstream reward by swapping the sampling of initial noise from the prior distribution (typically isotropic Gaussian) with a well-chosen, constant initial noise input – a golden ticket. We propose a search method to find golden tickets using Monte-Carlo policy evaluation that keeps the pretrained policy frozen, does not train any new networks, and is applicable to all diffusion/flow matching policies (and therefore many VLAs). Our approach to policy improvement makes no assumptions beyond being able to inject initial noise into the policy and calculate (sparse) task rewards of episode rollouts, making it deployable with no additional infrastructure or models. Our method improves the performance of policies in 38 out of 43 tasks across simulated and real-world robot manipulation benchmarks, with relative improvements in success rate by up to 58% for some simulated tasks, and 60% within 50 search episodes for real-world tasks. We also show unique benefits of golden tickets for multi-task settings: the diversity of behaviors from different tickets naturally defines a Pareto frontier for balancing different objectives (e.g., speed, success rates); in VLAs, we find that a golden ticket optimized for one task can also boost performance in other related tasks. We release a codebase with pretrained policies and golden tickets for simulation benchmarks using VLAs, diffusion policies, and flow matching policies.
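
"黄金票券"搜索的核心逻辑(对候选常量噪声做蒙特卡洛评估、保留最优者)可以用几行代码示意。以下奖励函数是玩具替代品,并非论文中基于真实 rollout 的回报:

```python
import numpy as np

def find_golden_ticket(evaluate, dim, n_candidates=50, seed=0):
    """在冻结策略下评估 n_candidates 个候选初始噪声,返回得分最高者。"""
    rng = np.random.default_rng(seed)
    best_noise, best_reward = None, -np.inf
    for _ in range(n_candidates):
        noise = rng.standard_normal(dim)  # 候选的固定初始噪声向量
        reward = evaluate(noise)          # 蒙特卡洛策略评估(此处为玩具函数)
        if reward > best_reward:
            best_noise, best_reward = noise, reward
    return best_noise, best_reward

# 玩具奖励:初始噪声越接近某个目标向量,策略表现越好
target = np.array([0.5, -0.25, 1.0])
evaluate = lambda z: -float(np.sum((z - target) ** 2))
_, r1 = find_golden_ticket(evaluate, dim=3, n_candidates=1)
_, r50 = find_golden_ticket(evaluate, dim=3, n_candidates=50)
assert r50 >= r1  # 相同种子下,更多搜索只会找到相同或更好的票券
```

方法的吸引力在于:整个搜索过程不训练任何网络、不改动预训练策略,只需能注入初始噪声并计算回合奖励。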

[AI-99] Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在文本到图像(Text-to-Image, T2I)生成中,现有测试时缩放(Test-Time Scaling, TTS)方法仅能实现实例级改进、难以从先前推理中学习并累积跨相似提示的知识这一问题。解决方案的关键在于提出一种元认知测试时强化学习框架(Meta-TTRL),其通过利用UMMs内在的元知识(meta-knowledge)生成监控信号,指导测试时参数优化,从而实现模型自我提升和能力层级提升。实验表明,Meta-TTRL在多个代表性UMMs上具有良好的泛化性能,并揭示了有效测试时强化学习(Test-Time Reinforcement Learning, TTRL)的核心机制——元认知协同效应(metacognitive synergy),即监控信号与模型优化策略的一致性,可驱动模型自适应改进。

链接: https://arxiv.org/abs/2603.15724
作者: Lit Sin Tan,Junzhe Chen,Xiaolong Fu,Lichen Ma,Junshi Huang,Jianzhong Shi,Yan Li,Lijie Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation primarily rely on search or sampling strategies that produce only instance-level improvements, limiting the ability to learn from prior inferences and accumulate knowledge across similar prompts. To overcome these limitations, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving self-improvement and capability-level improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs, including Janus-Pro-7B, BAGEL, and Qwen-Image, achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis to investigate the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model’s optimization regime to enable self-improvement.

[AI-100] Context-Length Robustness in Question Answering Models: A Comparative Empirical Study

【速读】:该论文旨在解决大语言模型在长且嘈杂上下文中的鲁棒性问题,特别是在问答任务中随着上下文长度增加导致性能下降的现象。其核心问题是:当前对模型在复杂信息环境下的可靠性评估不足,尤其是多跳推理(multi-hop reasoning)任务是否更易受上下文稀释影响尚不明确。解决方案的关键在于设计了一个受控的实证研究,通过系统性地向SQuAD和HotpotQA两个基准数据集添加无关上下文,同时保持答案信号不变,从而隔离出上下文长度变化对模型准确率的影响。实验结果表明,模型性能随上下文长度增长而持续下降,且多跳推理任务(如HotpotQA)的准确率降幅显著高于单跨度抽取任务(如SQuAD),揭示了任务类型对上下文长度鲁棒性的关键调节作用。

链接: https://arxiv.org/abs/2603.15723
作者: Trishita Dhara,Siddhesh Sheth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for Oral Presentation at Math AI 2026 conference

点击查看摘要

Abstract:Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this, robustness to growing context length remains poorly understood across different question answering tasks. In this work, we present a controlled empirical study of context-length robustness in large language models using two widely used benchmarks: SQuAD and HotpotQA. We evaluate model accuracy as a function of total context length by systematically increasing the amount of irrelevant context while preserving the answer-bearing signal. This allows us to isolate the effect of context length from changes in task difficulty. Our results show a consistent degradation in performance as context length increases, with substantially larger drops observed on multi-hop reasoning tasks compared to single-span extraction tasks. In particular, HotpotQA exhibits nearly twice the accuracy degradation of SQuAD under equivalent context expansions. These findings highlight task-dependent differences in robustness and suggest that multi-hop reasoning is especially vulnerable to context dilution. We argue that context-length robustness should be evaluated explicitly when assessing model reliability, especially for applications involving long documents or retrieval-augmented generation.
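
摘要描述的评测协议——固定含答案段落、通过添加无关干扰段落扩展总上下文长度——可以用下面的草图示意(`build_padded_context` 为本文假设的辅助函数,非论文代码):

```python
def build_padded_context(answer_passage, distractors, target_len):
    """在保留含答案段落的前提下,不断向前拼接干扰段落,
    直到上下文达到 target_len 个词,从而只改变长度、不改变信号。"""
    context = answer_passage.split()
    i = 0
    while len(context) < target_len and distractors:
        filler = distractors[i % len(distractors)].split()
        context = filler + context   # 含答案段落始终保留在上下文中
        i += 1
    return " ".join(context)

answer = "The Eiffel Tower is located in Paris ."
distractors = ["Bananas are rich in potassium .",
               "The stock market closed higher ."]
ctx = build_padded_context(answer, distractors, target_len=30)
assert answer in ctx and len(ctx.split()) >= 30
```

对一系列递增的 `target_len` 重复该过程并测量模型准确率,即可得到摘要中"准确率随上下文长度变化"的曲线。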

[AI-101] A Framework and Prototype for a Navigable Map of Datasets in Engineering Design and Systems Engineering

【速读】:该论文旨在解决工程设计与系统工程(Engineering Design and Systems Engineering, EDSE)领域中数据资源碎片化、难获取的问题,这些问题阻碍了方法验证、限制了研究可复现性并拖慢了科研进展。其解决方案的关键在于提出一个系统的“EDSE数据集地图”框架,该框架基于多维分类体系(涵盖领域、生命周期阶段、数据类型和格式)实现数据的分面发现,并构建了一个以知识图谱为数据模型的交互式发现工具原型,用以捕获数据集、工具与文献之间的丰富语义关系,从而促进数据资源的高效组织与利用。

链接: https://arxiv.org/abs/2603.15722
作者: H. Sinan Bank,Daniel R. Herber
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Databases (cs.DB); Digital Libraries (cs.DL)
备注: 10 pages, 3 figures, Submitted to ASME IDETC 2026-DAC22

点击查看摘要

Abstract:The proliferation of data across the system lifecycle presents both a significant opportunity and a challenge for Engineering Design and Systems Engineering (EDSE). While this "digital thread" has the potential to drive innovation, the fragmented and inaccessible nature of existing datasets hinders method validation, limits reproducibility, and slows research progress. Unlike fields such as computer vision and natural language processing, which benefit from established benchmark ecosystems, engineering design research often relies on small, proprietary, or ad-hoc datasets. This paper addresses this challenge by proposing a systematic framework for a "Map of Datasets in EDSE." The framework is built upon a multi-dimensional taxonomy designed to classify engineering datasets by domain, lifecycle stage, data type, and format, enabling faceted discovery. An architecture for an interactive discovery tool is detailed and demonstrated through a working prototype, employing a knowledge graph data model to capture rich semantic relationships between datasets, tools, and publications. An analysis of the current data landscape reveals underrepresented areas ("data deserts") in early-stage design and system architecture, as well as relatively well-represented areas ("data oases") in predictive maintenance and autonomous systems. The paper identifies key challenges in curation and sustainability and proposes mitigation strategies, laying the groundwork for a dynamic, community-driven resource to accelerate data-centric engineering research.

[AI-102] How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

【速读】:该论文旨在解决生成式 AI (Generative AI) 代理在高风险场景中因处理外部数据源(如邮件、文档和代码仓库)而面临的间接提示注入攻击(indirect prompt injection attacks)问题,特别是攻击者通过隐蔽手段在不被用户察觉的情况下操纵代理行为的威胁。解决方案的关键在于通过大规模公开红队测试竞赛(red teaming competition),系统性评估不同代理设置(工具调用、编码、计算机使用)下的攻击成功率,并识别出跨模型家族和行为模式的通用攻击策略,从而揭示当前指令遵循架构的根本性脆弱性。研究发现所有前沿模型均易受攻击,且攻击成功率与模型能力之间无强相关性,表明现有防御机制存在严重不足,亟需持续迭代的红队测试以推动鲁棒性提升。

链接: https://arxiv.org/abs/2603.15714
作者: Mateusz Dziemian,Maxwell Lin,Xiaohan Fu,Micha Nowak,Nick Winter,Eliot Jones,Andy Zou,Lama Ahmad,Kamalika Chaudhuri,Sahana Chennabasappa,Xander Davies,Lauren Deason,Benjamin L. Edelman,Tanner Emek,Ivan Evtimov,Jim Gust,Maia Hamin,Kat He,Klaudia Krawiecka,Riccardo Patana,Neil Perry,Troy Peterson,Xiangyu Qi,Javier Rando,Zifan Wang,Zihan Wang,Spencer Whitman,Eric Winsor,Arman Zharmagambetov,Matt Fredrikson,Zico Kolter
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 38 pages, 16 figures. Newer version to cover Q1 competition results on latest models in progress. Code at this https URL Partial Dataset at this https URL

点击查看摘要

Abstract:LLM based agents are increasingly deployed in high stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent’s final response, an attack can conceal its existence by presenting no clue of compromise in the final user facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large scale public red teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition attracted 464 participants who submitted 272000 attack attempts against 13 frontier models, yielding 8648 successful attacks across 41 scenarios. All models proved vulnerable, with attack success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). We identify universal attack strategies that transfer across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction following architectures. Capability and robustness showed weak correlation, with Gemini 2.5 Pro exhibiting both high capability and high vulnerability. To address benchmark saturation and obsoleteness, we will endeavor to deliver quarterly updates through continued red teaming competitions. We open source the competition environment for use in evaluations, along with 95 successful attacks against Qwen that did not transfer to any closed source model. We share model-specific attack data with respective frontier labs and the full dataset with the UK AISI and US CAISI to support robustness research.

[AI-103] Survey of Various Fuzzy and Uncertain Decision-Making Methods

【速读】:该论文旨在解决现实应用中决策过程常受模糊性(vagueness)、信息不完整、异构数据及专家意见冲突等因素影响的问题,提出一种面向不确定性的多准则决策(uncertainty-aware multi-criteria decision-making, MCDM)的系统性综述。其解决方案的关键在于构建一个任务导向的分类体系,涵盖问题设置(如离散型、群体共识、动态多阶段等)、权重获取机制(主观与客观方法在模糊/语言输入下的应用),以及准则间结构与因果建模;同时对比了补偿性评分法、参考点距离与妥协方案、非补偿性排序框架等求解策略,并引入基于规则/证据和序列决策模型以提升可解释性。该综述明确了典型输入、核心计算步骤与主要输出,为依据鲁棒性、可解释性和数据可用性选择合适方法提供指导,并指出未来方向包括可解释的不确定性融合、稳定性保障及大规模动态环境下的可扩展性。

链接: https://arxiv.org/abs/2603.15709
作者: Takaaki Fujita,Florentin Smarandache
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Book. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-1-59973-883-3. 446 pages

点击查看摘要

Abstract:Decision-making in real applications is often affected by vagueness, incomplete information, heterogeneous data, and conflicting expert opinions. This survey reviews uncertainty-aware multi-criteria decision-making (MCDM) and organizes the field into a concise, task-oriented taxonomy. We summarize problem-level settings (discrete, group/consensus, dynamic, multi-stage, multi-level, multiagent, and multi-scenario), weight elicitation (subjective and objective schemes under fuzzy/linguistic inputs), and inter-criteria structure and causality modelling. For solution procedures, we contrast compensatory scoring methods, distance-to-reference and compromise approaches, and non-compensatory outranking frameworks for ranking or sorting. We also outline rule/evidence-based and sequential decision models that produce interpretable rules or policies. The survey highlights typical inputs, core computational steps, and primary outputs, and provides guidance on choosing methods according to robustness, interpretability, and data availability. It concludes with open directions on explainable uncertainty integration, stability, and scalability in large-scale and dynamic decision environments.
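上文提到的"参考点距离与妥协方案"类方法(如经典的 TOPSIS)的核心计算可以用一个极简示例说明。以下代码为示意性实现,决策矩阵、权重与准则方向均为虚构,并非该综述中的具体案例:

```python
import numpy as np

def topsis(X, weights, benefit):
    """按与理想解的相对贴近度对备选方案排序(TOPSIS 示意实现)。

    X: (m 个备选方案 x n 个准则) 决策矩阵
    weights: 各准则权重(和为 1)
    benefit: 效益型准则为 True(越大越好),成本型为 False
    """
    # 对每个准则列做向量归一化,再加权
    V = weights * X / np.linalg.norm(X, axis=0)
    # 逐准则取正理想解与负理想解
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)  # 贴近度系数,取值在 [0, 1]

# 三个备选方案、三个准则(成本、质量、鲁棒性)
X = np.array([[250., 7., 0.6], [200., 6., 0.8], [300., 9., 0.7]])
scores = topsis(X, np.array([0.3, 0.4, 0.3]), np.array([False, True, True]))
print(scores.argsort()[::-1])  # 按得分从高到低输出方案索引
```

得分越接近 1 表示越贴近理想解;实际应用中权重可由综述中讨论的主观或客观赋权方法给出。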

[AI-104] Mastering the Minority: An Uncertainty-guided Multi-Expert Framework for Challenging-tailed Sequence Learning

【速读】:该论文旨在解决序列学习中因数据分布不均衡导致的少数类识别困难问题,即模型倾向于识别频繁类别而难以有效检测少数类。其解决方案的关键在于提出一种基于不确定性的多专家融合网络(Uncertainty-based Multi-Expert fusion network, UME)框架,包含三项核心创新:一是采用Ensemble LoRA实现参数高效建模,显著降低可训练参数量;二是引入基于Dempster-Shafer理论(DST)的顺序专业化机制,提升对长尾类别的专家特异性;三是设计不确定性引导的融合机制,利用DST的置信度度量动态加权专家意见,通过优先选择最自信专家来解决预测冲突,从而提升最终预测的可靠性。

链接: https://arxiv.org/abs/2603.15708
作者: Ye Wang,Zixuan Wu,Lifeng Shen,Jiang Xie,Xiaoling Wang,Hong Yu,Guoyin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imbalanced data distribution remains a critical challenge in sequential learning, leading models to easily recognize frequent categories while failing to detect minority classes adequately. The Mixture-of-Experts model offers a scalable solution, yet its application is often hindered by parameter inefficiency, poor expert specialization, and difficulty in resolving prediction conflicts. To Master the Minority classes effectively, we propose the Uncertainty-based Multi-Expert fusion network (UME) framework. UME is designed with three core innovations: First, we employ Ensemble LoRA for parameter-efficient modeling, significantly reducing the trainable parameter count. Second, we introduce Sequential Specialization guided by Dempster-Shafer Theory (DST), which ensures effective specialization on the challenging-tailed classes. Finally, an Uncertainty-Guided Fusion mechanism uses DST’s certainty measures to dynamically weigh expert opinions, resolving conflicts by prioritizing the most confident expert for reliable final predictions. Extensive experiments across four public hierarchical text classification datasets demonstrate that UME achieves state-of-the-art performance. We achieve a performance gain of up to 17.97% over the best baseline on individual categories, while reducing trainable parameters by up to 10.32%. The findings highlight that uncertainty-guided expert coordination is a principled strategy for addressing challenging-tailed sequence learning. Our code is available at this https URL.
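UME 中基于 Dempster-Shafer 理论(DST)的置信度量可以用经典的 Dempster 组合规则来直观理解。下面是一个示意性玩具实现(非论文原始代码),展示两位"专家"在单例类别加全集(完全无知)上的质量函数如何融合、冲突质量如何被归一化:

```python
def dempster_combine(m1, m2, classes):
    """按 Dempster 组合规则融合两个质量函数。
    质量只分配在单例类别与全集 'theta'(完全无知)上。"""
    combined, conflict = {}, 0.0
    keys = classes + ["theta"]
    for a in keys:
        for b in keys:
            w = m1.get(a, 0.0) * m2.get(b, 0.0)
            if a == b:
                combined[a] = combined.get(a, 0.0) + w
            elif a == "theta":            # theta 与任意焦元的交为该焦元
                combined[b] = combined.get(b, 0.0) + w
            elif b == "theta":
                combined[a] = combined.get(a, 0.0) + w
            else:                          # 两个不同单例:证据冲突
                conflict += w
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}, conflict

# 专家 1 较确信类别 "head";专家 2 不确定性更高
m1 = {"head": 0.7, "tail": 0.1, "theta": 0.2}
m2 = {"head": 0.4, "tail": 0.3, "theta": 0.3}
fused, conflict = dempster_combine(m1, m2, ["head", "tail"])
print(fused, conflict)  # 冲突质量为 0.7*0.3 + 0.1*0.4 = 0.25
```

融合后分配给 theta 的质量即残余不确定性;UME 正是利用这类确定性度量来动态加权各专家意见。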

[AI-105] SEMAG: Self-Evolutionary Multi-Agent Code Generation

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在处理复杂编程任务时,因依赖人工模型选择和固定工作流而难以适应任务难度变化的问题。其解决方案的关键在于提出一种自进化多智能体代码生成框架(Self-Evolutionary Multi-Agent code Generation, SEMAG),该框架模仿人类编码实践,将编程任务分解为规划、编码、调试与讨论等阶段,并根据任务复杂度动态调整工作流;同时,其自进化代理能够实时访问最新模型并自动升级主干模型(backbone model),从而实现对LLM能力演进的自适应利用。实验表明,SEMAG在多个基准测试上达到新的最先进性能,尤其在CodeContests数据集上相较于先前方法提升3.3%的Pass@1准确率,且结合自进化模型选择后进一步达到52.6%的准确率,验证了其框架有效性与适应性。

链接: https://arxiv.org/abs/2603.15707
作者: Yulin Peng,Haowen Hou,Xinxin Zhu,Ying Tiffany He,F. Richard Yu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.
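摘要中报告的 Pass@1 通常按 Chen 等人(2021)提出的无偏 pass@k 估计器计算:从 n 个生成样本(其中 c 个正确)中抽取 k 个,至少有一个通过的概率。以下是该标准估计器的示意实现,仅用于说明指标含义,与 SEMAG 本身无关:

```python
from math import comb

def pass_at_k(n, c, k):
    """无偏 pass@k 估计:1 - C(n-c, k) / C(n, k)。
    k=1 时退化为普通准确率 c / n。"""
    if n - c < k:       # 错误样本不足 k 个,必然抽到正确解
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 例:10 次生成中有 3 次通过全部测试用例
print(pass_at_k(10, 3, 1))  # 0.3,等于普通准确率
print(pass_at_k(10, 3, 5))  # 1 - C(7,5)/C(10,5) ≈ 0.917
```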

[AI-106] This Is Taking Too Long - Investigating Time as a Proxy for Energy Consumption of LLMs

【速读】:该论文旨在解决API调用的大型语言模型(Large Language Models, LLMs)在推理阶段能耗不透明的问题,尤其针对终端用户难以获取实际能源消耗信息的困境。其解决方案的关键在于利用推理时间测量作为代理指标,来近似估算API接口下LLM的能源成本;通过与本地部署模型的实际能耗数据对比验证,证明推理时间可有效推断出GPU型号并支撑能源成本估计,从而为用户提供一种可行的能效评估手段。

链接: https://arxiv.org/abs/2603.15699
作者: Lars Krupp,Daniel Geißler,Francisco M. Calatrava-Nicolas,Vishal Banwari,Paul Lukowicz,Jakob Karolus
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: This work was accepted at PerCom 2026

点击查看摘要

Abstract:The energy consumption of Large Language Models (LLMs) is raising growing concerns due to their adverse effects on environmental stability and resource use. Yet, these energy costs remain largely opaque to users, especially when models are accessed through an API - a black box in which all information depends on what providers choose to disclose. In this work, we investigate inference time measurements as a proxy to approximate the associated energy costs of API-based LLMs. We ground our approach by comparing our estimations with actual energy measurements from locally hosted equivalents. Our results show that time measurements allow us to infer GPU models for API-based LLMs, grounding our energy cost estimations. Our work aims to create means for understanding the associated energy costs of API-based LLMs, especially for end users.
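论文的思路可以概括为"能量 ≈ 平均功率 × 推理时间"的一阶近似。下面是一个极简示意:300 W 的功率值为假设值,仅作演示;论文中的实际估计需先由时间特征推断 GPU 型号,并用本地部署的实测能耗校准:

```python
import time

def estimate_energy_joules(fn, p_avg_watts, *args, **kwargs):
    """对单次推理调用做粗略能耗估计:E ≈ P_avg × t。
    唯一真正观测到的量是墙钟时间;p_avg_watts 是假设的平均功率。"""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    return result, elapsed, p_avg_watts * elapsed

# 用 sleep 模拟一次 API 调用;真实客户端在此处发起 HTTP 请求
def fake_llm_call():
    time.sleep(0.05)
    return "response"

_, t, joules = estimate_energy_joules(fake_llm_call, p_avg_watts=300.0)
print(f"{t:.3f}s -> ~{joules:.1f} J(假设平均功率 300 W)")
```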

[AI-107] Tackling Over-smoothing on Hypergraphs: A Ricci Flow-guided Neural Diffusion Approach

【速读】:该论文旨在解决现有超图神经网络(Hypergraph Neural Networks, HGNNs)在深度增加时普遍存在的过平滑(over-smoothing)问题,以及缺乏对节点间信息传递过程的有效控制。其解决方案的关键在于引入离散里奇流(discrete Ricci flow)理论,构建一个基于偏微分方程(PDE)系统的信息扩散机制,通过在几何层面自适应调节信息传播速率,从而有效抑制节点特征同质化,提升节点表征质量。此方法被称为里奇流引导的超图神经扩散(Ricci Flow-guided Hypergraph Neural Diffusion, RFHND),在多个基准数据集上表现出显著优于现有方法的性能和鲁棒性。

链接: https://arxiv.org/abs/2603.15696
作者: Mengyao Zhou,Zhiheng Zhou,Xiao Han,Xingqin Qi,Guanghui Wang,Guiying Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hypergraph neural networks (HGNNs) have demonstrated strong capabilities in modeling complex higher-order relationships. However, existing HGNNs often suffer from over-smoothing as the number of layers increases and lack effective control over message passing among nodes. Inspired by the theory of Ricci flow in differential geometry, we theoretically establish that introducing discrete Ricci flow into hypergraph structures can effectively regulate node feature evolution and thereby alleviate over-smoothing. Building on this insight, we propose Ricci Flow-guided Hypergraph Neural Diffusion (RFHND), a novel message passing paradigm for hypergraphs guided by discrete Ricci flow. Specifically, RFHND is based on a PDE system that describes the continuous evolution of node features on hypergraphs and adaptively regulates the rate of information diffusion at the geometric level, preventing feature homogenization and producing high-quality node representations. Experimental results show that RFHND significantly outperforms existing methods across multiple benchmark datasets and demonstrates strong robustness, while also effectively mitigating over-smoothing.
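RFHND 的离散里奇流作用于超图并通过 PDE 系统演化,完整实现较为复杂。作为概念草图,下面在普通图上用组合式的 Forman 曲率(计算极简,仅作为更精细曲率定义的替代)演示"曲率调节扩散速率"这一思想;该代码并非论文实现:

```python
import numpy as np

def forman_curvature(A):
    """无权图上每条边的 Forman-Ricci 曲率:F(u,v) = 4 - deg(u) - deg(v)。
    连接高度数节点的边曲率为负。"""
    deg = A.sum(axis=1)
    F = np.zeros_like(A, dtype=float)
    for u, v in zip(*np.nonzero(A)):
        F[u, v] = 4.0 - deg[u] - deg[v]
    return F

def curvature_guided_step(A, X, tau=0.1):
    """一步显式欧拉特征扩散:在强负曲率边上压低扩散权重,
    从而减缓驱动过平滑的特征同质化。"""
    F = forman_curvature(A)
    W = A * np.exp(F / 4.0)  # 曲率越负,权重越小
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    return X + tau * (W @ X - X)  # 对应 dX/dt = (W - I) X

A = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
X1 = curvature_guided_step(A, X)
print(X1.shape)
```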

[AI-108] BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator

【速读】:该论文旨在解决文本领域中后门攻击(backdoor attack)的防御问题,尤其是针对自然语言处理(Natural Language Processing, NLP)场景下触发器(trigger)难以定位与消除的挑战。现有基于噪声的触发器生成方法因文本的离散特性无法直接适用。其解决方案的关键在于利用大语言模型(Large Language Model, LLM)中蕴含的丰富语义知识,设计一种由LLM驱动的触发器生成器(LLM Trigger Generator, BadLLM-TG),并通过提示引导的强化学习进行优化,以目标模型的反馈损失作为奖励信号,从而生成有效的对抗性触发器用于后续的对抗训练,实现对后门攻击的有效缓解。

链接: https://arxiv.org/abs/2603.15692
作者: Ruyi Zhang,Heng Gao,Songlei Jian,Yusong Tan,Haifang Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Backdoor attacks compromise model reliability by using triggers to manipulate outputs. Trigger inversion can accurately locate these triggers via a generator and is therefore critical for backdoor defense. However, the discrete nature of text prevents existing noise-based trigger generators from being applied to natural language processing (NLP). To overcome the limitations, we employ the rich knowledge embedded in large language models (LLMs) and propose a Backdoor defender powered by LLM Trigger Generator, termed BadLLM-TG. It is optimized through prompt-driven reinforcement learning, using the victim model’s feedback loss as the reward signal. The generated triggers are then employed to mitigate the backdoor via adversarial training. Experiments show that our method reduces the attack success rate by 76.2% on average, outperforming the second-best defender by 13.7%.

[AI-109] Loosely-Structured Software: Engineering Context Structure and Evolution Entropy in Runtime-Rewired Multi-Agent Systems

【速读】:该论文旨在解决大规模基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS)在自治增强背景下所面临的复杂性失控问题,包括上下文压力加剧、协作错误频发以及系统漂移(system drift)。传统方法如提示调优或提升模型智能无法根本缓解此类问题,因此论文提出以软件工程架构视角重构MAS设计范式。其解决方案的关键在于引入一种新型软件形态——松散结构软件(Loosely-Structured Software, LSS),通过三层次工程框架实现对运行时熵(runtime entropy)的可控管理:视图/上下文工程(View/Context Engineering)用于维护任务相关的动态视图;结构工程(Structure Engineering)实现对智能体与资源间动态绑定的组织;演化工程(Evolution Engineering)则管控自重写构件的生命周期。LSS通过语义驱动的自组织机制和设计模式作为语义控制块,在保持智能体适应性的同时稳定推理介导的交互流,从而显著提升系统的可设计性(designability)、可扩展性(scalability)与可演化性(evolvability)。

链接: https://arxiv.org/abs/2603.15690
作者: Weihao Zhang,Yitong Zhou,Huanyu Qu,Hongyi Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As LLM-based multi-agent systems (MAS) become more autonomous, their free-form interactions increasingly dominate system behavior. However, scaling the number of agents often amplifies context pressure, coordination errors, and system drift. It is well known that building robust MAS requires more than prompt tuning or increased model intelligence. It necessitates engineering discipline focused on architecture to manage complexity under uncertainty. We characterize agentic software by a core property: runtime generation and evolution under uncertainty. Drawing upon and extending software engineering experience, especially object-oriented programming, this paper introduces Loosely-Structured Software (LSS), a new class of software systems that shifts the engineering focus from constructing deterministic logic to managing the runtime entropy generated by View-constructed programming, semantic-driven self-organization, and endogenous evolution. To make this entropy governable, we introduce design principles under a three-layer engineering framework: View/Context Engineering to manage the execution environment and maintain task-relevant Views, Structure Engineering to organize dynamic binding over artifacts and agents, and Evolution Engineering to govern the lifecycle of self-rewriting artifacts. Building on this framework, we develop LSS design patterns as semantic control blocks that stabilize fluid, inference-mediated interactions while preserving agent adaptability. Together, these abstractions improve the designability, scalability, and evolvability of agentic infrastructure. We provide basic experimental validation of key mechanisms, demonstrating the effectiveness of LSS.

[AI-110] Evidential Domain Adaptation for Remaining Useful Life Prediction with Incomplete Degradation

【速读】:该论文旨在解决在缺乏目标域标注数据的情况下,剩余使用寿命(Remaining Useful Life, RUL)预测中因目标域退化轨迹不完整(尤其是缺少晚期退化阶段)所导致的外推挑战。现有领域自适应(Domain Adaptation, DA)方法在处理此类问题时存在两个主要局限:一是仅关注全局对齐,易将源域的晚期退化阶段与目标域的早期退化阶段错误匹配;二是未考虑不同工况下同一退化阶段内特征差异,导致简单特征匹配无法实现有效对齐。为克服上述问题,论文提出了一种新颖的证据自适应方法(EviAdapt),其关键在于:首先基于退化速率对源域和目标域数据进行分段,实现阶段级对齐以确保对应退化阶段样本准确匹配;其次引入证据不确定性对齐技术,利用证据学习估计不确定性并跨匹配阶段对齐不确定性分布,从而提升模型在不完整退化轨迹下的泛化能力。

链接: https://arxiv.org/abs/2603.15687
作者: Yubo Hou,Mohamed Ragab,Yucheng Wang,Min Wu,Abdulla Alseiari,Chee-Keong Kwoh,Xiaoli Li,Zhenghua Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate Remaining Useful Life (RUL) prediction without labeled target domain data is a critical challenge, and domain adaptation (DA) has been widely adopted to address it by transferring knowledge from a labeled source domain to an unlabeled target domain. Despite its success, existing DA methods struggle significantly when faced with incomplete degradation trajectories in the target domain, particularly due to the absence of late degradation stages. This missing data introduces a key extrapolation challenge. When applied to such incomplete RUL prediction tasks, current DA methods encounter two primary limitations. First, most DA approaches primarily focus on global alignment, which can misalign the late degradation stage of the source domain with the early degradation stage of the target domain. Second, due to varying operating conditions in RUL prediction, degradation patterns may differ even within the same degradation stage, resulting in different learned features. As a result, even if degradation stages are partially aligned, simple feature matching cannot fully align the two domains. To overcome these limitations, we propose a novel evidential adaptation approach called EviAdapt, which leverages evidential learning to enhance domain adaptation. The method first segments the source and target domain data into distinct degradation stages based on degradation rate, enabling stage-wise alignment that ensures samples from corresponding stages are accurately matched. To address the second limitation, we introduce an evidential uncertainty alignment technique that estimates uncertainty using evidential learning and aligns the uncertainty across matched stages.
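EviAdapt 的第一步是按退化速率把轨迹切分为不同退化阶段。下面用一个玩具健康指标序列演示"滚动斜率幅值超过阈值即视为进入晚期退化阶段"的朴素切分思路;窗口与阈值均为虚构的演示值,并非论文设定:

```python
import numpy as np

def segment_stages(signal, window=5, threshold=0.005):
    """按滚动最小二乘斜率切分早期/晚期退化阶段(示意实现)。"""
    x = np.asarray(signal, dtype=float)
    t = np.arange(window) - (window - 1) / 2  # 中心化时间索引
    # 每个窗口的最小二乘斜率:t·x / t·t
    slopes = np.array([t @ x[i:i + window] / (t @ t)
                       for i in range(len(x) - window + 1)])
    crossing = np.nonzero(np.abs(slopes) > threshold)[0]
    split = int(crossing[0]) + window - 1 if len(crossing) else len(x)
    return x[:split], x[split:]

# 前 50 步健康指标平稳,随后以每步 0.01 的速率线性退化
hi = np.concatenate([np.ones(50), 1.0 - 0.01 * np.arange(50)])
early, late = segment_stages(hi)
print(len(early), len(late))
```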

[AI-111] State-Dependent Safety Failures in Multi-Turn Language Model Interaction

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全对齐评估中忽视多轮对话情境的问题,即现有方法主要基于孤立查询进行测评,而真实应用场景具有明显的多轮交互特性。研究表明,许多多轮越狱攻击并非源于单个提示的漏洞,而是由对话状态的结构化演化所驱动。为此,作者提出STAR(State-oriented Diagnostic Framework),其核心在于将对话历史建模为状态转移算子,从而在自回归条件下的交互轨迹上系统性地分析模型的安全行为边界穿越机制。不同于传统以最大化攻击强度为目标的方法,STAR提供了一种原则性的诊断工具,揭示了即使在静态测试下表现稳健的模型,在结构化多轮交互中也可能发生快速且可复现的安全崩溃,关键发现包括拒绝相关表征的单调漂移和角色条件上下文引发的突变相变现象,这推动了将语言模型安全视为依赖于对话轨迹的动态过程的理解。

链接: https://arxiv.org/abs/2603.15684
作者: Pengcheng Li,Jie Zhang,Tianwei Zhang,Han Qiu,Zhang kejun,Weiming Zhang,Nenghai Yu,Wenbo Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety alignment in large language models is typically evaluated under isolated queries, yet real-world use is inherently multi-turn. Although multi-turn jailbreaks are empirically effective, the structure of conversational safety failure remains insufficiently understood. In this work, we study safety failures from a state-space perspective and show that many multi-turn failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities. We introduce STAR, a state-oriented diagnostic framework that treats dialogue history as a state transition operator and enables controlled analysis of safety behavior along interaction trajectories. Rather than optimizing attack strength, STAR provides a principled probe of how aligned models traverse the safety boundary under autoregressive conditioning. Across multiple frontier language models, we find that systems that appear robust under static evaluation can undergo rapid and reproducible safety collapse under structured multi-turn interaction. Mechanistic analysis reveals monotonic drift away from refusal-related representations and abrupt phase transitions induced by role-conditioned context. Together, these findings motivate viewing language model safety as a dynamic, state-dependent process defined over conversational trajectories.

[AI-112] Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales

【速读】:该论文旨在解决大规模Transformer模型在训练过程中参数更新轨迹的高维复杂性与低维结构之间的矛盾问题,即尽管模型参数量巨大(如51M至124M),其优化路径却主要沿少数几个相干方向演化。解决方案的关键在于提出谱边缘动力学(Spectral Edge Dynamics, SED):通过滑动窗口奇异值分解(SVD)识别出参数更新中的“谱边缘”——即相干优化方向与随机噪声之间的清晰边界,该边界由最大连续奇异值比 σk/σk+1\sigma_k/\sigma_{k+1} 确定。SED揭示了通用的三阶段演化模式(上升、平台、崩溃)、任务复杂度对有效秩 kk^* 的调节作用,并发现谱几何与验证损失之间存在随窗口大小变化的耦合反转(称为“滞后翻转”),这反映了轨迹积分的时间尺度特性。此外,通过Johnson–Lindenstrauss投影将高维参数空间压缩至d=10Wd=10W维度后仍能保留94.3%以上的谱间隙,使得该框架可扩展至任意规模模型。

链接: https://arxiv.org/abs/2603.15678
作者: Yongzhong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures

点击查看摘要

Abstract:Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce Spectral Edge Dynamics (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary – the spectral edge – between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio σ_k/σ_{k+1}. Across a 51M-parameter TinyStories model (4 seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity (k^* = 2 at 51M, k^* = 3 at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size – a "lag flip" reflecting the timescale of trajectory integration. Johnson–Lindenstrauss projection to d = 10W dimensions (e.g., d = 100 for W = 10) preserves the spectral gap within 5.7%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking – predicting generalization 600–1,700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.
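谱边缘本身的计算很直接:对参数更新矩阵做 SVD,取相邻奇异值比值最大的位置。下面用合成的"两条相干方向叠加小幅噪声"的更新矩阵做示意;这不是论文代码,论文中还需结合滑动窗口与 Johnson–Lindenstrauss 随机投影降维:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_edge(update, k_max=8):
    """谱边缘:相邻奇异值比值 sigma_k / sigma_{k+1} 最大处的 k(从 1 起)。"""
    s = np.linalg.svd(update, compute_uv=False)
    ratios = s[:k_max] / s[1:k_max + 1]
    k = int(np.argmax(ratios)) + 1
    return k, float(ratios[k - 1])

def unit(v):
    return v / np.linalg.norm(v)

# 合成更新矩阵:两条相干优化方向(尺度 50 与 30)加小幅高斯噪声
n, d = 64, 200
signal = (50.0 * np.outer(unit(rng.normal(size=n)), unit(rng.normal(size=d)))
          + 30.0 * np.outer(unit(rng.normal(size=n)), unit(rng.normal(size=d))))
update = signal + 0.3 * rng.normal(size=(n, d))
k, gap = spectral_edge(update)
print(k, round(gap, 1))  # 对这一构造,谱边缘落在 k = 2
```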

[AI-113] Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)应用在发布治理中因输出非确定性和模型行为动态演化而导致传统测试方法失效的问题。其解决方案的关键在于提出一个自动化自测框架,通过五个实证驱动的质量门控维度(任务成功率、研究上下文保留度、P95延迟、安全通过率和证据覆盖率)实现基于证据的发布决策(PROMOTE/HOLD/ROLLBACK)。该框架在长期案例研究中验证了其有效性,尤其发现“证据覆盖率”是识别严重回归的核心指标,且运行时开销可预测地随测试套件规模增长,同时多模态评估(LLM作为裁判)揭示了结构异常与内容质量缺陷的互补性,从而支撑了对复杂AI系统质量保障的全面覆盖。

链接: https://arxiv.org/abs/2603.15676
作者: Alexandre Cristovão Maiorano
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 19 pages, 4 figures, 11 tables. Under submission to Empirical Software Engineering (EMSE)

点击查看摘要

Abstract:LLM applications are AI systems whose non-deterministic outputs and evolving model behavior make traditional testing insufficient for release governance. We present an automated self-testing framework that introduces quality gates with evidence-based release decisions (PROMOTE/HOLD/ROLLBACK) across five empirically grounded dimensions: task success rate, research context preservation, P95 latency, safety pass rate, and evidence coverage. We evaluate the framework through a longitudinal case study of an internally deployed multi-agent conversational AI system with specific marketing capabilities in active development, covering 38 evaluation runs across 20+ internal releases. The gate identified two ROLLBACK-grade builds in early runs and supported stable quality evolution over a four-week staging lifecycle while exercising persona-grounded, multi-turn, adversarial, and evidence-required scenarios. Statistical analysis (Mann-Kendall trends, Spearman correlations, bootstrap confidence intervals), gate ablation, and overhead scaling indicate that evidence coverage is the primary severe-regression discriminator and that runtime scales predictably with suite size. A human calibration study (n=60 stratified cases, two independent evaluators, LLM-as-judge cross-validation) reveals complementary multi-modal coverage: LLM-judge disagreements with the system gate (kappa=0.13) are attributable to structural failure modes such as latency violations and routing errors that are invisible in response text alone, while the judge independently surfaces content quality failures missed by structural checks, validating the multi-dimensional gate design. The framework, supplementary pseudocode, and calibration artifacts are provided to support AI-system quality assurance and independent replication.
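五维质量门到 PROMOTE/HOLD/ROLLBACK 三种发布决策的映射,可以用如下草图表达。阈值与"证据覆盖率/安全不达标即回滚"的判定结构均为示意性假设,并非论文标定的数值与规则:

```python
def gate_decision(metrics, thresholds):
    """把五个质量门维度映射为发布决策(示意实现,阈值为虚构值)。"""
    # 假设:证据覆盖率或安全通过率不达标视为严重回归,直接回滚
    hard_fail = (
        metrics["evidence_coverage"] < thresholds["evidence_coverage_min"]
        or metrics["safety_pass_rate"] < thresholds["safety_min"]
    )
    if hard_fail:
        return "ROLLBACK"
    soft_ok = (
        metrics["task_success_rate"] >= thresholds["task_success_min"]
        and metrics["context_preservation"] >= thresholds["context_min"]
        and metrics["p95_latency_s"] <= thresholds["p95_latency_max_s"]
    )
    return "PROMOTE" if soft_ok else "HOLD"

thresholds = {"evidence_coverage_min": 0.8, "safety_min": 0.95,
              "task_success_min": 0.85, "context_min": 0.9,
              "p95_latency_max_s": 6.0}
build = {"task_success_rate": 0.91, "context_preservation": 0.93,
         "p95_latency_s": 4.2, "safety_pass_rate": 0.98,
         "evidence_coverage": 0.88}
print(gate_decision(build, thresholds))  # PROMOTE
```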

[AI-114] Theoretical Foundations of Latent Posterior Factors: Formal Guarantees for Multi-Evidence Reasoning

【速读】:该论文旨在解决高风险领域中多源异构证据(multi-evidence)融合的可信人工智能(Trustworthy AI)问题,尤其是在医疗诊断、金融风控、法律分析等场景下,现有方法或缺乏形式化保证,或无法在架构上有效处理多证据推理。其解决方案的核心是提出Latent Posterior Factors (LPF) 框架,通过变分自编码器将每项证据映射为高斯潜变量后验(Gaussian latent posterior),再利用蒙特卡洛边缘化转化为软因子(soft factors),并通过精确的Sum-Product Network(LPF-SPN)或可学习神经聚合器(LPF-Learned)进行因子聚合。该框架提供七项严格的理论保障,包括校准保持性(ECE = ε + C/√K_eff)、蒙特卡洛误差衰减率 O(1/√M)、非平凡的PAC-Bayes界、信息论下界接近度、抗干扰鲁棒性(88%性能保留于半数证据被对抗替换时)以及精确的内生不确定性(epistemic)与偶然不确定性(aleatoric)分解,从而为安全关键应用中的多证据概率推理提供了理论完备且可验证的建模基础。

链接: https://arxiv.org/abs/2603.15674
作者: Aliyu Agboola Alege
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 30 pages, 8 figures, 10 tables. Theoretical characterization of the Latent Posterior Factors (LPF) framework for multi-evidence probabilistic reasoning, with formal guarantees and empirical validation

点击查看摘要

Abstract:We present a complete theoretical characterization of Latent Posterior Factors (LPF), a principled framework for aggregating multiple heterogeneous evidence items in probabilistic prediction tasks. Multi-evidence reasoning arises pervasively in high-stakes domains including healthcare diagnosis, financial risk assessment, legal case analysis, and regulatory compliance, yet existing approaches either lack formal guarantees or fail to handle multi-evidence scenarios architecturally. LPF encodes each evidence item into a Gaussian latent posterior via a variational autoencoder, converting posteriors to soft factors through Monte Carlo marginalization, and aggregating factors via exact Sum-Product Network inference (LPF-SPN) or a learned neural aggregator (LPF-Learned). We prove seven formal guarantees spanning the key desiderata for trustworthy AI: calibration preservation (ECE = ε + C/√K_eff); Monte Carlo error decaying as O(1/√M); a non-vacuous PAC-Bayes bound with a train-test gap of 0.0085 at N = 4200; operation within 1.12x of the information-theoretic lower bound; graceful degradation as O(εδ√K) under corruption, maintaining 88% performance with half of the evidence adversarially replaced; O(1/√K) calibration decay with R^2 = 0.849; and exact epistemic-aleatoric uncertainty decomposition with error below 0.002%. All theorems are empirically validated on controlled datasets spanning up to 4,200 training examples. Our theoretical framework establishes LPF as a foundation for trustworthy multi-evidence AI in safety-critical applications.
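定理中的蒙特卡洛误差 O(1/√M) 衰减率可以用一个独立的数值实验直观验证:样本量扩大 4 倍,均方根误差约减半。以下实验与论文的具体设定无关,仅演示该收敛速率:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_error(M, trials=2000):
    """估计 E[z^2](z ~ N(0,1),真值为 1)的蒙特卡洛 RMSE,
    理论上 RMSE = sqrt(2/M),即 O(1/sqrt(M))。"""
    z = rng.normal(size=(trials, M))
    estimates = (z ** 2).mean(axis=1)
    return float(np.sqrt(((estimates - 1.0) ** 2).mean()))

e1, e2 = mc_error(100), mc_error(400)
print(round(e1 / e2, 2))  # 约等于 2,符合 1/sqrt(M) 速率
```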

[AI-115] DRCY: Agentic Hardware Design Reviews

【速读】:该论文旨在解决硬件设计中因连接语义错误(semantic errors)导致的物理重制(physical respins)问题,这类错误在传统电子设计自动化(EDA)工具中难以被发现,因为现有工具仅能验证结构连通性,而无法根据器件数据手册(datasheets)检查引脚配置、电压调节器反馈电阻等关键参数是否符合制造商规格。解决方案的关键在于提出 DRCY——首个可投入生产的多智能体大语言模型(LLM)系统,其通过五智能体流水线架构实现自动化的首版原理图连接审查:包括自主获取器件数据手册、逐引脚比对提取的规格信息,并将结果以内联注释形式提交至设计评审流程;同时采用自评估的数据手册检索机制与多轮运行共识机制提升安全性分析的可靠性,已在 AllSpice Hub 平台作为 CI/CD 动作部署,服务于从自动驾驶到太空探索等高价值硬件项目。

链接: https://arxiv.org/abs/2603.15672
作者: Kyle Dumont,Nicholas Herbert,Hayder Tirmazi,Shrikanth Upadhayaya
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Hardware design errors discovered after fabrication require costly physical respins that can delay products by months. Existing electronic design automation (EDA) tools enforce structural connectivity rules. However, they cannot verify that connections are semantically correct with respect to component datasheets. For example, that a symbol’s pinout matches the manufacturer’s specification, or that a voltage regulator’s feedback resistors produce the intended output. We present DRCY, the first production-ready multi-agent LLM system that automates first-pass schematic connection review by autonomously fetching component datasheets, performing pin-by-pin analysis against extracted specifications, and posting findings as inline comments on design reviews. DRCY is deployed in production on AllSpice Hub, a collaborative hardware design platform, where it runs as a CI/CD action triggered on design review submissions. DRCY is used regularly by major hardware companies for use-cases ranging from multi-agent vehicle design to space exploration. We describe DRCY’s five-agent pipeline architecture, its agentic datasheet retrieval system with self-evaluation, and its multi-run consensus mechanism for improving reliability on safety-critical analyses.

[AI-116] I Know What I Don't Know: Latent Posterior Factor Models for Multi-Evidence Probabilistic Reasoning

【速读】:该论文旨在解决现实世界决策任务中多源噪声证据聚合的问题,尤其在面对未结构化数据时,现有方法要么缺乏显式的不确定性量化(如神经聚合方法),要么依赖人工设计的离散谓词(如概率逻辑框架),限制了其可扩展性。解决方案的关键在于提出Latent Posterior Factors (LPF) 框架,该框架将变分自编码器(Variational Autoencoder, VAE)的潜在后验分布转化为适用于求和-乘积网络(Sum-Product Network, SPN)推理的软似然因子,从而实现对未结构化证据的可 tractable(可计算)概率推理,并保持校准的不确定性估计。通过构建 LPF-SPN(基于结构化因子的推理)与 LPF-Learned(端到端学习聚合)两种架构,作者实现了在统一不确定性表示下对显式概率推理与学习聚合范式的系统比较,显著优于证据深度学习(Evidential Deep Learning)、大语言模型(LLM)及图神经网络基线方法。

链接: https://arxiv.org/abs/2603.15670
作者: Aliyu Agboola Alege
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 202 pages, 52 figures, 105 tables. Comprehensive presentation of the Latent Posterior Factors (LPF) framework for multi-evidence probabilistic reasoning, including theoretical analysis, algorithmic design, and extensive empirical evaluation across synthetic and real-world benchmarks

点击查看摘要

Abstract:Real-world decision-making, from tax compliance assessment to medical diagnosis, requires aggregating multiple noisy and potentially contradictory evidence sources. Existing approaches either lack explicit uncertainty quantification (neural aggregation methods) or rely on manually engineered discrete predicates (probabilistic logic frameworks), limiting scalability to unstructured data. We introduce Latent Posterior Factors (LPF), a framework that transforms Variational Autoencoder (VAE) latent posteriors into soft likelihood factors for Sum-Product Network (SPN) inference, enabling tractable probabilistic reasoning over unstructured evidence while preserving calibrated uncertainty estimates. We instantiate LPF as LPF-SPN (structured factor-based inference) and LPF-Learned (end-to-end learned aggregation), enabling a principled comparison between explicit probabilistic reasoning and learned aggregation under a shared uncertainty representation. Across eight domains (seven synthetic and the FEVER benchmark), LPF-SPN achieves high accuracy (up to 97.8%), low calibration error (ECE 1.4%), and strong probabilistic fit, substantially outperforming evidential deep learning, LLMs and graph-based baselines over 15 random seeds. Contributions: (1) A framework bridging latent uncertainty representations with structured probabilistic reasoning. (2) Dual architectures enabling controlled comparison of reasoning paradigms. (3) Reproducible training methodology with seed selection. (4) Evaluation against EDL, BERT, R-GCN, and large language model baselines. (5) Cross-domain validation. (6) Formal guarantees in a companion paper.

[AI-117] Quantum-Secure-By-Construction (QSC): A Paradigm Shift For Post-Quantum Agentic Intelligence

【速读】:该论文旨在解决在分布式、长期运行的智能体人工智能(Agentic AI)系统中,如何确保通信安全且符合政策要求的问题,尤其是在量子计算时代,传统加密假设可能失效的背景下。其核心挑战在于,现有AI部署中的密码学基础难以适应未来数十年的运行周期,从而带来潜在的安全风险。解决方案的关键在于提出“量子安全原生”(Quantum Secure by Construction, QSC)的设计范式,将量子安全通信作为系统架构的核心属性而非事后补丁;通过运行时自适应安全模型,融合后量子密码学(Post-Quantum Cryptography)、量子随机数生成(Quantum Random Number Generation)与量子密钥分发(Quantum Key Distribution),实现跨云、边缘及组织边界的自主智能体间动态安全交互,并基于策略驱动的治理层对全生命周期(包括会话初始化、协作、工具调用和内存访问)进行细粒度的加密保护,从而降低引入量子安全机制的运营复杂性和成本。

链接: https://arxiv.org/abs/2603.15668
作者: Arit Kumar Bishwas,Mousumi Sen,Albert Nieto-Morales,Joel Jacob Varghese
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:As agentic artificial intelligence systems scale across globally distributed and long lived infrastructures, secure and policy compliant communication becomes a fundamental systems challenge. This challenge grows more serious in the quantum era, where the cryptographic assumptions built into today’s AI deployments may not remain valid over their operational lifetime. Here, we introduce quantum secure by construction, or QSC, as a design paradigm that treats quantum secure communication as a core architectural property of agentic AI systems rather than an upgrade added later. We realize QSC through a runtime adaptive security model that combines post quantum cryptography, quantum random number generation, and quantum key distribution to secure interactions among autonomous agents operating across heterogeneous cloud, edge, and inter organizational environments. The approach is cryptographically pluggable and guided by policy, allowing the system to adjust its security posture according to infrastructure availability, regulatory constraints, and performance needs. QSC contributes a governance aware orchestration layer that selects and combines link specific cryptographic protections across the full agent lifecycle, including session bootstrap, inter agent coordination, tool invocation, and memory access. Through system level analysis and empirical evaluation, we examine the trade offs between classical and quantum secure mechanisms and show that QSC can reduce the operational complexity and cost of introducing quantum security into deployed agentic AI systems. These results position QSC as a foundational paradigm for post quantum agentic intelligence and establish a principled pathway for designing globally interoperable, resilient, and future ready intelligent systems.
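下面用一段极简 Python 示意上文“策略驱动、加密可插拔”的链路选型逻辑:按各链路的基础设施可用性与合规约束,在 QKD、后量子密码(PQC)与经典机制之间做门控选择。函数名、策略字段与回退规则均为示意性假设,并非论文实现:

```python
# 示意:按链路的策略约束与基础设施可用性选择量子安全机制
def select_mechanism(link):
    # link 示例字段:qkd_available / pqc_required / latency_sensitive(均为假设)
    if link.get("qkd_available") and not link.get("latency_sensitive"):
        return "QKD"          # 物理层密钥分发,安全性最强但部署受限
    if link.get("pqc_required", True):
        return "PQC"          # 后量子密码,纯软件、可插拔
    return "CLASSICAL"        # 无特殊约束时回退到经典机制

def secure_session(links):
    # 为代理生命周期中的各链路(会话初始化、协作、工具调用、内存访问)分别选型
    return {name: select_mechanism(cfg) for name, cfg in links.items()}
```

实际系统中,这一选型还需结合性能预算与监管策略在运行时动态调整,这正是论文中策略治理层的职责。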

[AI-118] A Dynamic Survey of Fuzzy, Intuitionistic Fuzzy, Neutrosophic, Plithogenic and Extensional Sets

【速读】:该论文旨在解决多类不确定性建模框架(如模糊集、直觉模糊集、中立模糊集及普利托吉尼集)在理论发展与应用实践中存在的碎片化问题,其核心挑战在于这些模型虽各有特色,但彼此间存在大量重复的概念、结构与方法。解决方案的关键在于通过系统性综述与统一表述,梳理出四类主要不确定性模型的共性结构模式与演化脉络,从而为跨领域研究提供整合性的理论基础,并激发新的概念扩展与应用场景。

链接: https://arxiv.org/abs/2603.15667
作者: Takaaki Fujita,Florentin Smarandache
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: Book. Peer-reviewed. Publisher: Neutrosophic Science International Association (NSIA). ISBN: 978-1-59973-842-0

点击查看摘要

Abstract:Real-world phenomena often exhibit vagueness, partial truth, and incomplete information. To model such uncertainty in a mathematically rigorous way, many generalized set-theoretic frameworks have been introduced, including Fuzzy Sets [1], Intuitionistic Fuzzy Sets [2], Neutrosophic Sets [3,4], Vague Sets [5], Hesitant Fuzzy Sets [6], Picture Fuzzy Sets [7], Quadripartitioned Neutrosophic Sets [8], Penta-Partitioned Neutrosophic Sets [9], Plithogenic Sets [10], HyperFuzzy Sets [11], and HyperNeutrosophic Sets [12]. Within these frameworks, a wide range of notions has been proposed and studied, particularly in the settings of fuzzy, intuitionistic fuzzy, neutrosophic, and plithogenic set theories. This extensive literature underscores both the significance of these theories and the breadth of their application areas. As a result, many ideas, constructions, and structural patterns recur across these four major families of uncertainty-oriented models. In this book, we provide a comprehensive, large-scale survey of Fuzzy, Intuitionistic Fuzzy, Neutrosophic, and Plithogenic Sets. Our goal is to give readers a systematic overview of existing developments and, through a unified exposition, to stimulate new insights, further conceptual extensions, and additional applications across a wide range of disciplines.

[AI-119] Compiled Memory: Not More Information but More Precise Instructions for Language Agents

【速读】:该论文旨在解决语言智能体(language agent)中记忆效用(memory utility)问题,即如何判断哪些经验值得保留,并有效改变代理行为,而非仅关注记忆管理(如检索与分页信息)。其核心解决方案是提出Atlas记忆内核,通过无监督的“记忆蒸馏”机制将任务经验转化为代理指令结构——不依赖微调、检索增强生成(RAG)或人工干预。关键创新在于:记忆的本质是知识提炼(distillation),而非存储;知识传递方式为系统提示词重写(instruction rewriting),而非上下文注入。具体实现上,利用代理成功与失败中的事实信息,经三步晋升门控验证后,以学习到的子项目符号形式嵌入系统提示,从而动态优化代理行为。实验表明,该方法在CUAD合同分析和HotpotQA多跳问答任务中显著提升性能,且编译的知识具有任务特异性而非模型特异性,证明了训练信号约束机制的有效性。

链接: https://arxiv.org/abs/2603.15666
作者: James Rhodes,George Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing memory systems for language agents address memory management: how to retrieve and page more information within a context budget. We address a complementary problem – memory utility: what experience is worth keeping, and how it should change agent behavior. We present Atlas, a memory kernel that compiles accumulated task experience into an agent’s instruction structure – without fine-tuning, RAG, or human intervention. Memory is distillation, not storage; delivery is instruction rewriting, not context injection. Facts extracted from agent failures and successes are verified through a three-step promotion gate and delivered by rewriting the agent’s system prompt with learned sub-bullets. On CUAD contract analysis, the evolved prompt improves GPT-4o token-level F1 by +8.7 pp and precision by +12.5 pp. On HotpotQA multi-hop QA, joint F1 improves +3.16 pp. An ablation isolates the mechanism’s defining property – the training signal constraint: the evolved prompt learns exactly what it is taught, and nothing more. Applied to Claude Sonnet~4.5 using the same evolved prompt – compiled from GPT-4o errors, unchanged – joint F1 improves +2.31 pp, with gains concentrating where Claude’s stronger baseline leaves the most room – confirming that the compiled knowledge is task-shaped, not model-shaped.
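上文的“三步晋升门控 + 系统提示重写”流程可以用如下极简 Python 勾勒;三项检查的条件、阈值与字段名均为示意性假设,并非论文实现:

```python
# 示意:事实需连续通过三道检查,才被以子项目符号形式写回系统提示
def promotion_gate(fact):
    return (
        fact["support"] >= 2            # 步骤1:至少两条任务轨迹支持
        and fact["contradictions"] == 0  # 步骤2:与既有记忆无冲突
        and fact["utility"] > 0.5        # 步骤3:效用评分超过阈值
    )

def rewrite_prompt(base_prompt, facts):
    promoted = [f for f in facts if promotion_gate(f)]
    bullets = "\n".join(f"  - {f['text']}" for f in promoted)
    return base_prompt + ("\n" + bullets if bullets else "")
```

这对应摘要中“记忆是蒸馏而非存储、交付方式是指令重写而非上下文注入”的设计取向。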

[AI-120] QV May Be Enough: Toward the Essence of Attention in LLMs

【速读】:该论文旨在解决当前Transformer架构中Query-Key-Value (QKV) 机制的内在原理不清晰问题,尤其从语言学的角度出发,通过词性(Part-of-Speech, POS)和句法分析揭示其本质功能。解决方案的关键在于构建一个统一的理论解释框架,用于阐明多查询注意力(MQA)、分组查询注意力(GQA)和多头注意力(MLA)等现代架构的有效性及其权衡关系,并在此基础上提出QV范式(QV paradigm)与QV-Ka优化方案,从而为大语言模型架构的未来演进提供可解释且可验证的理论支撑。

链接: https://arxiv.org/abs/2603.15665
作者: Zhang Edward
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Starting from first principles and a linguistic perspective centered on part-of-speech (POS) and syntactic analysis, this paper explores and derives the underlying essence of the Query-Key-Value (QKV) mechanism within the Transformer architecture. Based on this theoretical foundation, we provide a unified explanatory framework for the efficacy of contemporary architectures, including MQA, GQA, and MLA, while identifying their inherent trade-offs and potential optimization trajectories. We introduce the QV paradigm and provide empirical evidence for its validity. Building upon this, we propose the QV-Ka optimization scheme, which is further substantiated through experimental validation. The interpretable theoretical analysis of the QKV mechanism presented in this work establishes a robust foundation for the future evolution of large language model architectures.
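作为对照示意,下面用纯 Python 实现标准缩放点积注意力,并给出一种“去掉独立的 K、让 V 同时充当键与值”的 QV 式简化;论文中 QV 范式与 QV-Ka 的具体构造以原文为准,此处仅为帮助理解的示意:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, ks, vs):
    # 标准缩放点积注意力:score_i = q·k_i / sqrt(d),softmax 后对 V 加权求和
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, vs)) for j in range(len(vs[0]))]

def attention_qv(q, vs):
    # “QV”式简化示意:不再维护独立的 K,V 同时充当键与值(非论文原始构造)
    return attention(q, vs, vs)
```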

[AI-121] DynaTrust: Defending Multi-Agent Systems Against Sleeper Agents via Dynamic Trust Graphs

【速读】:该论文旨在解决大型语言模型驱动的多智能体系统(Large Language Model-based Multi-Agent Systems, LLM-MAS)中“睡眠代理”(sleeper agent,即潜伏代理)带来的安全威胁问题。此类代理在常规运行中表现正常,逐步积累信任,仅在特定触发条件下才暴露恶意行为,传统防御方法如静态图优化或层级数据管理难以适应此类动态攻击,且常因僵化拦截策略导致高误报率(False Positive Rate, FPR)。解决方案的关键在于提出一种名为DynaTrust的新防御机制,其核心是将MAS建模为动态信任图(Dynamic Trust Graph, DTG),将信任视为一个持续演化的过程而非静态属性;通过融合历史行为与专家代理的置信度动态更新各代理的信任值,并采用自主重构图结构的方式隔离受损代理、恢复任务连通性,从而在保障安全性的同时显著降低FPR,维持系统可用性。

链接: https://arxiv.org/abs/2603.15661
作者: Yu Li,Qiang Hu,Yao Zhang,Lili Quan,Jiongchi Yu,Junjie Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable collaborative reasoning capabilities but introduce new attack surfaces, such as the sleeper agent, which behave benignly during routine operation and gradually accumulate trust, only revealing malicious behaviors when specific conditions or triggers are met. Existing defense works primarily focus on static graph optimization or hierarchical data management, often failing to adapt to evolving adversarial strategies or suffering from high false-positive rates (FPR) due to rigid blocking policies. To address this, we propose DynaTrust, a novel defense method against sleeper agents. DynaTrust models MAS as a dynamic trust graph~(DTG), and treats trust as a continuous, evolving process rather than a static attribute. It dynamically updates the trust of each agent based on its historical behaviors and the confidence of selected expert agents. Instead of simply blocking, DynaTrust autonomously restructures the graph to isolate compromised agents and restore task connectivity to ensure the usability of MAS. To assess the effectiveness of DynaTrust, we evaluate it on mixed benchmarks derived from AdvBench and HumanEval. The results demonstrate that DynaTrust outperforms the state-of-the-art method AgentShield by increasing the defense success rate by 41.7%, achieving rates exceeding 86% under adversarial conditions. Furthermore, it effectively balances security with utility by significantly reducing FPR, ensuring uninterrupted system operations through graph adaptation.
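DynaTrust 的两个核心动作——按历史行为滑动更新信任值、将低信任代理从图中隔离并重连其邻居以恢复连通性——可以用如下示意代码表达(平滑系数与阈值为示意性假设,并非论文实现):

```python
# 示意:动态信任更新 + 低信任节点隔离与图重构
def update_trust(trust, signal, alpha=0.3):
    # 指数滑动平均:历史信任与最新行为信号的加权融合
    return (1 - alpha) * trust + alpha * signal

def restructure(graph, trust, threshold=0.4):
    # 移除低信任代理,并将其邻居两两相连,以恢复任务连通性
    bad = {n for n, t in trust.items() if t < threshold}
    new = {n: [m for m in nbrs if m not in bad]
           for n, nbrs in graph.items() if n not in bad}
    for n in bad:
        nbrs = [m for m in graph[n] if m not in bad]
        for a in nbrs:
            for b in nbrs:
                if a != b and b not in new[a]:
                    new[a].append(b)
    return new
```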

[AI-122] A federated learning framework with knowledge graph and temporal transformer for early sepsis prediction in multi-center ICUs

【速读】:该论文旨在解决重症监护病房(ICU)中脓毒症早期预测的难题,其核心挑战在于医疗数据在不同医疗机构间的碎片化分布、临床时序数据的复杂性以及严格的隐私保护要求。解决方案的关键在于提出一种融合联邦学习(Federated Learning, FL)、医学知识图谱(Medical Knowledge Graph)与时序Transformer模型(Temporal Transformer)的新型框架,并引入模型无关元学习(Model-Agnostic Meta-Learning, MAML)策略,实现跨机构协作建模而无需共享原始患者数据,从而在保障隐私的前提下显著提升预测性能。

链接: https://arxiv.org/abs/2603.15651
作者: Yue Chang,Guangsen Lin,Jyun Jie Chuang,Shunqi Liu,Xinkui Li,Yaozheng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The early prediction of sepsis in intensive care unit (ICU) patients is crucial for improving survival rates. However, the development of accurate predictive models is hampered by data fragmentation across healthcare institutions and the complex, temporal nature of medical data, all under stringent privacy constraints. To address these challenges, we propose a novel framework that uniquely integrates federated learning (FL) with a medical knowledge graph and a temporal transformer model, enhanced by meta-learning capabilities. Our approach enables collaborative model training across multiple hospitals without sharing raw patient data, thereby preserving privacy. The model leverages a knowledge graph to incorporate structured medical relationships and employs a temporal transformer to capture long-range dependencies in clinical time-series data. A model-agnostic meta-learning (MAML) strategy is further incorporated to facilitate rapid adaptation of the global model to local data distributions. Evaluated on the MIMIC-IV and eICU datasets, our method achieves an area under the curve (AUC) of 0.956, which represents a 22.4% improvement over conventional centralized models and a 12.7% improvement over standard federated learning, demonstrating strong predictive capability for sepsis. This work presents a reliable and privacy-preserving solution for multi-center collaborative early warning of sepsis.
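联邦学习部分的核心是服务器端聚合:各医院只上传本地模型权重而非患者数据。下面是按样本量加权的 FedAvg 聚合极简示意(将模型权重简化为一维向量,非论文的具体实现):

```python
# 示意:FedAvg——各客户端(医院)权重按本地样本量加权平均
def fedavg(local_weights, sizes):
    total = sum(sizes)
    dim = len(local_weights[0])
    return [sum(w[j] * n for w, n in zip(local_weights, sizes)) / total
            for j in range(dim)]
```

论文在此基础上叠加 MAML 式元学习,使聚合得到的全局模型能快速适配各院本地数据分布。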

[AI-123] Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署和推理阶段因静态对齐策略(如RLHF或DPO)导致的安全适应性不足问题,即固定权重无法应对不断演化的越狱行为(jailbreak behaviors)以及多元、动态变化的安全规范。其解决方案的关键在于提出Consensus Clustering LinUCB Bandit(CCLUB)框架,通过系统提示路由实现推理时的自适应社会对齐(adaptive social alignment),核心机制为保守共识聚类:仅在效用相似图与安全相似图的交集区域内聚合数据,从而有效避免在语义相近但风险差异显著的上下文中发生不安全泛化。理论分析表明CCLUB具有次线性遗憾保证,实验验证其在累积奖励上提升10.98%,平均次优差距降低14.42%。

链接: https://arxiv.org/abs/2603.15647
作者: Zeyu Zhang,Xiangxiang Dai,Ziyi Han,Xutong Liu,John C.S. Lui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear regret guarantee, demonstrating near-optimal performance of CCLUB. Extensive experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
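CCLUB 建立在 LinUCB 之上:打分等于估计收益加上置信宽度,选择得分最高的系统提示。下面以一维特征为例示意基础 LinUCB 的打分与更新;CCLUB 额外的“效用图与安全图交集内才汇集数据”的共识聚类约束未在此示意中体现:

```python
# 示意:一维特征下的 LinUCB——score = 估计收益 + 置信宽度
class LinUCB1D:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.A = 1.0   # x*x 的累积(初值 1 作为单位正则)
        self.b = 0.0   # x*reward 的累积

    def score(self, x):
        theta = self.b / self.A          # 岭回归意义下的收益估计
        width = (x * x / self.A) ** 0.5  # 置信宽度,随观测增多而收缩
        return theta * x + self.alpha * width

    def update(self, x, reward):
        self.A += x * x
        self.b += x * reward
```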

[AI-124] XLinear: Frequency-Enhanced MLP with CrossFilter for Robust Long-Range Forecasting

【速读】:该论文旨在解决传统多层感知机(MLP)模型在长序列预测中难以捕捉长期依赖关系的问题,同时保持其对噪声的高鲁棒性。解决方案的关键在于提出XLinear架构:首先将时间序列分解为趋势和季节成分;针对含长期特征的趋势分量,设计增强频域注意力(Enhanced Frequency Attention, EFA),通过频域操作有效建模长程依赖;对于季节分量,则引入CrossFilter Block以维持模型对噪声的鲁棒性,避免注意力机制带来的脆弱性。此方法在保持MLP轻量化优势的同时显著提升了长程预测性能。

链接: https://arxiv.org/abs/2603.15645
作者: Xiang Ao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures. Accepted and published in 2025 5th International Conference on Artificial Intelligence, Automation and High Performance Computing (AIAHPC)

点击查看摘要

Abstract:Time series forecasters are widely used across various domains. Among them, MLP (multi-layer perceptron)-based forecasters have been proven to be more robust to noise compared to Transformer-based forecasters. However, MLP struggles to capture complex features, resulting in limitations on capturing long-range dependencies. To address this challenge, we propose XLinear, an MLP-based forecaster for long-range forecasting. Firstly, we decompose the time series into trend and seasonal components. For the trend component which contains long-range characteristics, we design Enhanced Frequency Attention (EFA) to capture long-term dependencies by leveraging frequency-domain operations. Additionally, a CrossFilter Block is proposed for the seasonal component to maintain the model’s robustness to noise, avoiding the problems of low robustness often caused by attention mechanisms. Experimental results demonstrate that XLinear achieves state-of-the-art performance on test datasets. While keeping the lightweight architecture and high robustness of MLP-based models, our forecaster outperforms other MLP-based forecasters in capturing long-range dependencies.
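XLinear 的第一步是趋势/季节分解。下面用滑动平均给出一个常见的极简实现作为示意(窗口大小为示意性假设;论文中 EFA 与 CrossFilter Block 的频域细节以原文为准):

```python
# 示意:移动平均分解——趋势 = 滑窗均值,季节 = 原序列减趋势的残差
def decompose(series, window=3):
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))  # 边界处用截断窗口
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal
```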

[AI-125] GSI Agent: Domain Knowledge Enhancement for Large Language Models in Green Stormwater Infrastructure

【速读】:该论文旨在解决绿色雨水基础设施(Green Stormwater Infrastructure, GSI)维护中专业领域知识分散、非专家用户难以获取可靠指导的问题。现有大型语言模型(Large Language Models, LLMs)虽具备强大的通用推理与生成能力,但在工程场景下常因缺乏领域知识而产生错误或幻觉输出,限制其在专业基础设施任务中的直接应用。解决方案的关键在于提出GSI Agent框架,通过三种互补策略实现领域增强:(1) 基于人工标注的GSI指令数据集进行监督微调(Supervised Fine-Tuning, SFT),(2) 利用从市政文档构建的内部知识库进行检索增强生成(Retrieval-Augmented Generation, RAG),以及(3) 设计基于代理的推理流程以协调检索、上下文整合与结构化响应生成。实验表明,该框架显著提升了GSI相关任务的性能(BLEU-4从0.090提升至0.307),同时保持了对通用知识任务的稳定性。

链接: https://arxiv.org/abs/2603.15643
作者: Shaohuang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Green Stormwater Infrastructure (GSI) systems, such as permeable pavement, rain gardens, and bioretention facilities, require continuous inspection and maintenance to ensure long-term performance. However, domain knowledge about GSI is often scattered across municipal manuals, regulatory documents, and inspection forms. As a result, non-expert users and maintenance staff may struggle to obtain reliable and actionable guidance from field observations. Although Large Language Models (LLMs) have demonstrated strong general reasoning and language generation capabilities, they often lack domain-specific knowledge and may produce inaccurate or hallucinated answers in engineering scenarios. This limitation restricts their direct application to professional infrastructure tasks. In this paper, we propose GSI Agent, a domain-enhanced LLM framework designed to improve performance in GSI-related tasks. Our approach integrates three complementary strategies: (1) supervised fine-tuning (SFT) on a curated GSI instruction dataset, (2) retrieval-augmented generation (RAG) over an internal GSI knowledge base constructed from municipal documents, and (3) an agent-based reasoning pipeline that coordinates retrieval, context integration, and structured response generation. We also construct a new GSI Dataset aligned with real-world GSI inspection and maintenance scenarios. Experimental results show that our framework significantly improves domain-specific performance while maintaining general knowledge capability. On the GSI dataset, BLEU-4 improves from 0.090 to 0.307, while performance on the common knowledge dataset remains stable (0.304 vs. 0.305). These results demonstrate that systematic domain knowledge enhancement can effectively adapt general-purpose LLMs to professional infrastructure applications.
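框架中的 RAG 环节可以用词重叠检索做一个极简示意:从内部知识库中选出与问题最相关的 top-k 文档并拼入提示。打分方式与提示模板均为示意性假设,并非论文实现:

```python
# 示意:基于词重叠的极简检索增强生成(RAG)流水线
def retrieve(query, docs, k=2):
    q = set(query.lower().split())
    # 以查询词与文档词的交集大小作为相关性分数
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def build_prompt(query, docs):
    ctx = "\n".join(retrieve(query, docs))
    return f"已知资料:\n{ctx}\n\n问题:{query}"
```

实际系统中通常以稠密向量检索替代词重叠打分,再交由代理流水线整合上下文并生成结构化回答。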

[AI-126] CraniMem: Cranial Inspired Gated and Bounded Memory for Agentic Systems

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在长时间运行工作流中难以稳定保持用户与任务状态的问题。现有代理记忆系统通常采用类外部数据库的机制,依赖非结构化的读写规则,导致记忆保留不稳定、整合能力有限且易受干扰内容影响。其解决方案的关键在于提出CraniMem——一种受神经认知启发的、门控且有界多阶段记忆架构:通过目标条件门控(goal-conditioned gating)和效用标记(utility tagging)结合一个有限容量的情景缓冲区(episodic buffer),实现短期连续性;同时构建结构化的长期知识图谱(knowledge graph)以支持持久语义回忆;并通过调度式整合循环将高价值记忆轨迹回放至知识图谱并剔除低价值项,有效控制记忆增长并降低干扰。

链接: https://arxiv.org/abs/2603.15642
作者: Pearl Mody,Mihir Panchal,Rishit Kar,Kiran Bhowmick,Ruhina Karani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: International Conference on Learning Representations 2026 Workshop on Memory for LLM-Based Agentic Systems (MemAgents)

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly deployed in long running workflows, where they must preserve user and task state across many turns. Many existing agent memory systems behave like external databases with ad hoc read/write rules, which can yield unstable retention, limited consolidation, and vulnerability to distractor content. We present CraniMem, a neurocognitively motivated, gated and bounded multi-stage memory design for agentic systems. CraniMem couples goal conditioned gating and utility tagging with a bounded episodic buffer for near term continuity and a structured long-term knowledge graph for durable semantic recall. A scheduled consolidation loop replays high utility traces into the graph while pruning low utility items, keeping memory growth in check and reducing interference. On long horizon benchmarks evaluated under both clean inputs and injected noise, CraniMem is more robust than a Vanilla RAG and Mem0 baseline and exhibits smaller performance drops under distraction. Our code is available at this https URL and the accompanying PyPI package at this https URL.
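CraniMem 的“门控写入、有界缓冲、按效用整合”可以用如下示意代码表达(容量、门控阈值与整合阈值均为示意性假设,并非论文实现):

```python
# 示意:有界情景缓冲——低效用拒写、超容量淘汰、高效用整合进长期记忆
class EpisodicBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.items = []   # (utility, text) 元组列表

    def write(self, utility, text, gate=0.2):
        if utility < gate:                      # 目标条件门控:低效用直接拒绝
            return
        self.items.append((utility, text))
        if len(self.items) > self.capacity:
            self.items.remove(min(self.items))  # 淘汰效用最低的条目

    def consolidate(self, long_term, threshold=0.8):
        # 调度式整合:高效用轨迹回放进长期知识库,其余留在缓冲中
        keep = []
        for u, t in self.items:
            if u >= threshold:
                long_term.append(t)
            else:
                keep.append((u, t))
        self.items = keep
```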

[AI-127] Form Follows Function: Recursive Stem Model

【速读】:该论文旨在解决递归推理模型(如TRM)在训练过程中依赖深度监督和长序列展开所导致的计算成本高、中间行为趋于贪婪策略的问题。其核心解决方案是提出Recursive Stem Model (RSM),通过改变训练机制实现更高效且稳定的训练:RSM在训练时完全剥离隐藏状态的历史信息,将早期迭代视为“预热”步骤,仅在最终步骤施加损失函数;同时独立扩展外层递归深度 H 和内层计算深度 L,并引入随机外层转移策略(stochastic depth over H)以缓解深度增加带来的不稳定性。这一设计使得RSM在训练速度上比TRM快约20倍,同时将误差率降低约5倍,并支持测试时任意扩展推理步数(高达20,000步),从而实现无需重训的“额外思考”能力。

链接: https://arxiv.org/abs/2603.15641
作者: Navid Hakimi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Recursive reasoning models such as Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM) show that small, weight-shared networks can solve compute-heavy and NP puzzles by iteratively refining latent states, but their training typically relies on deep supervision and/or long unrolls that increase wall-clock cost and can bias the model toward greedy intermediate behavior. We introduce Recursive Stem Model (RSM), a recursive reasoning approach that keeps the TRM-style backbone while changing the training contract so the network learns a stable, depth-agnostic transition operator. RSM fully detaches the hidden-state history during training, treats early iterations as detached “warm-up” steps, and applies loss only at the final step. We further grow the outer recursion depth H and inner compute depth L independently and use a stochastic outer-transition scheme (stochastic depth over H ) to mitigate instability when increasing depth. This yields two key capabilities: (i) 20\times faster training than TRM while improving accuracy ( \approx 5\times reduction in error rate), and (ii) test-time scaling where inference can run for arbitrarily many refinement steps ( H_\text{test} \sim 20{,}000 \gg H_\text{train} ), enabling additional “thinking” without retraining. On Sudoku-Extreme, RSM reaches 97.5% exact accuracy with test-time compute (within ~1 hour of training on a single A100), and on Maze-Hard ( 30 \times 30 ) it reaches ~80% exact accuracy in ~40 minutes using attention-based instantiation. Finally, because RSM implements an iterative settling process, convergence behavior provides a simple, architecture-native reliability signal: non-settling trajectories warn that the model has not reached a viable solution and can be a guard against hallucination, while stable fixed points can be paired with domain verifiers for practical correctness checks.
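RSM 把收敛行为当作可靠性信号:迭代若在预算内“稳定下来”则输出可信,否则提示答案可能不可靠。下面用一个简单压缩映射代替网络 f 做示意(收敛容差与步数上限为示意性假设):

```python
# 示意:迭代稳定化循环——收敛与否本身作为可靠性信号
def recursive_settle(f, state, max_steps=1000, tol=1e-8):
    for step in range(max_steps):
        nxt = f(state)
        if abs(nxt - state) < tol:
            return nxt, True, step + 1   # 已稳定:输出可信,可交由验证器复核
        state = nxt
    return state, False, max_steps       # 未稳定:提示答案可能不可靠
```

压缩映射 f(x) = 0.5x + 1 的不动点为 2.0,可验证该循环在预算内稳定收敛。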

[AI-128] The Comprehension-Gated Agent Economy: A Robustness-First Architecture for AI Economic Agency

【速读】:该论文旨在解决当前AI代理(AI agent)在经济活动中被赋予自主权(如执行交易、管理预算、协商合同等)时,其权限分配缺乏与实际运行稳健性相关联的评估机制的问题。现有框架依赖能力基准测试来限制经济权限,但这些指标与实际操作中的鲁棒性无显著关联,导致潜在风险不可控。解决方案的关键在于提出一种名为“理解门控代理经济”(Comprehension-Gated Agent Economy, CGAE)的形式化架构,通过基于对抗鲁棒性审计的验证理解函数对代理的经济权限进行上限控制。该机制在三个正交鲁棒性维度上运作:约束合规性(CDCT)、认知完整性(DDFT)和行为一致性(AGT),并以内在幻觉率作为跨维度诊断指标;同时采用最弱环节门控函数将鲁棒性向量映射为离散经济层级,并证明了系统具备有限经济暴露、激励兼容的鲁棒性投资以及单调安全增长等核心性质,从而首次建立了实证AI鲁棒性评估与经济治理之间的形式桥梁。

链接: https://arxiv.org/abs/2603.15639
作者: Rahul Baxi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 3 theorems, 1 proposition

点击查看摘要

Abstract:AI agents are increasingly granted economic agency (executing trades, managing budgets, negotiating contracts, and spawning sub-agents), yet current frameworks gate this agency on capability benchmarks that are empirically uncorrelated with operational robustness. We introduce the Comprehension-Gated Agent Economy (CGAE), a formal architecture in which an agent’s economic permissions are upper-bounded by a verified comprehension function derived from adversarial robustness audits. The gating mechanism operates over three orthogonal robustness dimensions: constraint compliance (measured by CDCT), epistemic integrity (measured by DDFT), and behavioral alignment (measured by AGT), with intrinsic hallucination rates serving as a cross-cutting diagnostic. We define a weakest-link gate function that maps robustness vectors to discrete economic tiers, and prove three properties of the resulting system: (1) bounded economic exposure, ensuring maximum financial liability is a function of verified robustness; (2) incentive-compatible robustness investment, showing rational agents maximize profit by improving robustness rather than scaling capability alone; and (3) monotonic safety scaling, demonstrating that aggregate system safety does not decrease as the economy grows. The architecture includes temporal decay and stochastic re-auditing mechanisms that prevent post-certification drift. CGAE provides the first formal bridge between empirical AI robustness evaluation and economic governance, transforming safety from a regulatory burden into a competitive advantage.
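CGAE 的最弱环节门控可以直接写成:经济权限层级由三个鲁棒性维度的最小值决定,任何一维薄弱都会压低整体权限。分档阈值与层级含义为示意性假设,并非论文给出的具体数值:

```python
# 示意:最弱环节门控——经济层级 = f(min(各鲁棒性维度))
def gate_tier(robustness):
    # robustness: {"compliance": CDCT, "epistemic": DDFT, "behavioral": AGT},取值 [0, 1]
    weakest = min(robustness.values())
    if weakest >= 0.9:
        return 3   # 完整经济代理权
    if weakest >= 0.7:
        return 2   # 受限额度
    if weakest >= 0.5:
        return 1   # 仅沙箱交易
    return 0       # 不授予经济权限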

[AI-129] AIDABench: AI Data Analytics Benchmark

【速读】:该论文旨在解决当前AI驱动的文档理解与处理工具在真实应用场景中缺乏系统性、端到端评估标准的问题。现有基准测试多聚焦于单一能力或简化场景,无法反映实际任务中的全流程有效性。为此,作者提出AIDABench——一个涵盖600余项复杂数据分析任务的综合性基准,覆盖问答(question answering)、数据可视化(data visualization)和文件生成(file generation)三大核心能力维度,任务设计基于异构数据类型(如电子表格、数据库、财务报告等)及跨行业、跨岗位的真实分析需求。其关键创新在于构建了具有高难度且贴近现实的工作流任务集,使得即使是人类专家借助AI工具也需1–2小时完成每题,从而有效揭示当前模型在复杂场景下的局限性。通过在11个先进模型上的实证评估,发现最佳模型仅达到59.43%的pass-at-1准确率,凸显了未来研究亟需突破的方向。

链接: https://arxiv.org/abs/2603.15636
作者: Yibo Yang,Fei Lei,Yixuan Sun,Yantao Zeng,Chengguang Lv,Jiancao Hong,Jiaojiao Tian,Tianyu Qiu,Xin Wang,Yanbing Chen,Yanjie Li,Zheng Pan,Xiaochen Zhou,Guanzhou Chen,Haoran Lv,Yuning Xu,Yue Ou,Haodong Liu,Shiqi He,Anya Jia,Yulei Xin,Huan Wu,Liang Liu,Jiaye Ge,Jianxin Dong,Dahua Lin,Wenxiu Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages (including appendix), 9 figures, 4 tables. Code: this https URL . Dataset: this https URL

点击查看摘要

Abstract:As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark’s difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at this https URL.

[AI-130] Neural-Symbolic Logic Query Answering in Non-Euclidean Space

【速读】:该论文旨在解决知识图谱上复杂一阶逻辑(First-Order Logic, FOL)查询推理中符号方法可解释性高但难以处理不完整图谱、神经方法泛化能力强但缺乏透明性的矛盾问题。其解决方案的关键在于提出HYQNET模型,该模型通过将FOL查询分解为关系投影与模糊集合上的逻辑运算以增强可解释性,并采用基于双曲空间的图神经网络(Hyperbolic GNN)进行知识图谱补全,从而在保持查询树递归结构和依赖关系的同时,利用双曲空间对逻辑推理中的层次结构进行更有效的建模,相较于欧几里得空间方法显著提升了推理性能。

链接: https://arxiv.org/abs/2603.15633
作者: Lihui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Answering complex first-order logic (FOL) queries on knowledge graphs is essential for reasoning. Symbolic methods offer interpretability but struggle with incomplete graphs, while neural approaches generalize better but lack trans- parency. Neural-symbolic models aim to integrate both strengths but often fail to capture the hierarchical structure of logical queries, limiting their effectiveness. We propose HYQNET, a neural-symbolic model for logic query reasoning that fully leverages hyperbolic space. HYQNET decomposes FOL queries into relation projections and logical operations over fuzzy sets, enhancing interpretability. To address missing links, it employs a hyperbolic GNN-based approach for knowledge graph completion in hyperbolic space, effectively embedding the re- cursive query tree while preserving structural dependencies. By utilizing hyperbolic representations, HYQNET captures the hierarchical nature of logical projection reasoning more effectively than Euclidean-based approaches. Experiments on three benchmark datasets demonstrate that HYQNET achieves strong performance, highlighting the advantages of reasoning in hyperbolic space.
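双曲空间适合建模层次结构的直觉来自其度量:在 Poincaré 球模型中,靠近球面边界的点之间距离迅速增大,树状层次因此能以低失真嵌入。下面给出该模型中双曲距离的标准公式实现作为示意(HYQNET 的具体嵌入与投影算子以原文为准):

```python
import math

# 示意:Poincaré 球模型中的双曲距离
# d(x, y) = arcosh(1 + 2·||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))
def poincare_distance(x, y):
    sq = lambda v: sum(c * c for c in v)
    diff = sq([a - b for a, b in zip(x, y)])
    den = (1.0 - sq(x)) * (1.0 - sq(y))
    return math.acosh(1.0 + 2.0 * diff / den)
```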

[AI-131] One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers ICLR2026

【速读】:该论文旨在解决当前神经偏微分方程(PDE)求解器在边界条件变化时泛化能力不足的问题。研究表明,标准神经算子训练实际上隐式学习了一个边界索引的算子族,而非单一的边界无关算子,且其映射关系本质上依赖于训练过程中所见的边界条件分布。解决方案的关键在于将算子学习重新形式化为基于边界条件的条件风险最小化问题,并由此揭示了在训练边界分布支持范围之外存在不可识别性(non-identifiability)现象。这说明仅在强迫项或分辨率上实现泛化,并不能保证跨边界条件的泛化性能,从而强调了在构建PDE基础模型时必须显式建模边界信息的重要性。

链接: https://arxiv.org/abs/2603.01406
作者: Lennon J. Shikhman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: Accepted to the ICLR 2026 Workshop on AI PDEs. 10 pages, 5 figures

点击查看摘要

Abstract:Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.
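论文的核心观察可以用一维 Poisson 问题直观体会:求解 -u'' = f 并配 Dirichlet 边界 (a, b) 时,解显式依赖边界值,边界改变即对应算子族中不同的成员。下面用 Thomas 算法求解该边值问题作为示意(网格规模等为示意性假设):

```python
# 示意:一维 Poisson 边值问题 -u'' = f,Dirichlet 边界 u(0)=a, u(1)=b
def solve_poisson_1d(f, a, b, n=64):
    h = 1.0 / (n + 1)
    # 离散化:-u_{i-1} + 2u_i - u_{i+1} = h^2 f_i,边界值并入右端项
    rhs = [h * h * f((i + 1) * h) for i in range(n)]
    rhs[0] += a
    rhs[-1] += b
    # Thomas 算法(三对角追赶法)前向消元
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = -0.5, rhs[0] / 2.0
    for i in range(1, n):
        m = 2.0 + cp[i - 1]
        cp[i] = -1.0 / m
        dp[i] = (rhs[i] + dp[i - 1]) / m
    # 回代
    u = [0.0] * n
    u[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        u[i] = dp[i] - cp[i] * u[i + 1]
    return u
```

例如 f = 0 且边界为 (0, 1) 时,解是线性插值;只要改动 (a, b),同一求解器就给出另一条解曲线——这正是“边界索引算子族”的含义。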

[AI-132] Deep Learning-Driven Black-Box Doherty Power Amplifier with Pixelated Output Combiner and Extended Efficiency Range

【速读】:该论文旨在解决传统Doherty功率放大器(PA)中输出组合网络设计复杂、效率优化困难的问题,尤其针对多端口像素化组合网络的高效逆向设计挑战。解决方案的关键在于构建一个基于深度卷积神经网络(CNN)的电磁(EM)代理模型,用以快速准确预测像素化无源网络的S参数,并将其嵌入黑箱Doherty框架与遗传算法(GA)优化器中,实现对复杂组合网络的自动化合成。该方法显著提升了背离效率范围,同时通过实测验证了其在高功率效率和线性度上的优越性能,为高性能射频功率放大器的设计提供了新范式。

Link: https://arxiv.org/abs/2603.16565
Authors: Han Zhou,Haojie Chang,David Widen
Affiliation: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:This article presents a deep learning-driven inverse design methodology for Doherty power amplifiers (PA) with multi-port pixelated output combiner networks. A deep convolutional neural network (CNN) is developed and trained as an electromagnetic (EM) surrogate model to accurately and rapidly predict the S-parameters of pixelated passive networks. By leveraging the CNN-based surrogate model within a blackbox Doherty framework and a genetic algorithm (GA)-based optimizer, we effectively synthesize complex Doherty combiners that enable an extended back-off efficiency range using fully symmetrical devices. As a proof of concept, we designed and fabricated two Doherty PA prototypes incorporating three-port pixelated combiners, implemented with GaN HEMT transistors. In measurements, both prototypes demonstrate a maximum drain efficiency exceeding 74% and deliver an output power surpassing 44.1 dBm at 2.75 GHz. Furthermore, a measured drain efficiency above 52% is maintained at the 9-dB back-off power level for both prototypes at the same frequency. To evaluate linearity and efficiency under realistic signal conditions, both prototypes are tested using a 20-MHz 5G new radio (NR)-like waveform exhibiting a peak-to-average power ratio (PAPR) of 9.0 dB. After applying digital predistortion (DPD), each design achieves an average power added efficiency (PAE) above 51%, while maintaining an adjacent channel leakage ratio (ACLR) better than -60.8 dBc.

[AI-133] Explainable machine learning workflows for radio astronomical data processing

[Quick Read]: This paper addresses the infeasibility of manually configuring data processing pipelines given the surging data rates of modern radio telescopes, as well as the lack of explainability in existing machine learning (ML)-based pipelines, which are mostly black boxes. The key to the solution is combining fuzzy rule-based inference with deep learning, specifically using a Takagi-Sugeno-Kang (TSK) fuzzy system for ML-aided decision making, which markedly improves pipeline explainability without compromising processing quality or accuracy.

Link: https://arxiv.org/abs/2603.16350
Authors: S. Yatawatta,A. Ahmadi,B. Asabere,M. Iacobelli,N. Peters,M. Veldhuis
Affiliation: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Radio astronomy relies heavily on efficient and accurate processing pipelines to deliver science ready data. With the increasing data flow of modern radio telescopes, manual configuration of such data processing pipelines is infeasible. Machine learning (ML) is already emerging as a viable solution for automating data processing pipelines. However, almost all existing ML enabled pipelines are of black-box type, where the decisions made by the automating agents are not easily deciphered by astronomers. In order to improve the explainability of the ML aided data processing pipelines in radio astronomy, we propose the joint use of fuzzy rule based inference and deep learning. We consider one application in radio astronomy, i.e., calibration, to showcase the proposed approach of ML aided decision making using a Takagi-Sugeno-Kang (TSK) fuzzy system. We provide results based on simulations to illustrate the increased explainability of the proposed approach, not compromising on the quality or accuracy.
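The TSK inference mentioned above has a compact closed form: fuzzy rule firing strengths weight linear consequents, and the output is their normalized sum. Below is a minimal first-order TSK sketch with made-up rules and parameters; it illustrates only the inference mechanism, not the paper's calibration application or its rule base.

```python
import math

def gaussian_mf(x, center, sigma):
    """Gaussian membership function: degree to which x belongs to a fuzzy set."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

def tsk_infer(x, rules):
    """First-order TSK inference: weighted average of linear consequents.
    Each rule is (center, sigma, a, b) with consequent y_i = a*x + b."""
    weights = [gaussian_mf(x, c, s) for (c, s, _, _) in rules]
    outputs = [a * x + b for (_, _, a, b) in rules]
    return sum(w * y for w, y in zip(weights, outputs)) / sum(weights)

# Two illustrative rules: "x is LOW -> y = 0.1x + 1" and "x is HIGH -> y = 2x - 3".
rules = [(0.0, 1.0, 0.1, 1.0), (5.0, 1.0, 2.0, -3.0)]
print(tsk_infer(0.0, rules))  # dominated by the LOW rule
print(tsk_infer(5.0, rules))  # dominated by the HIGH rule
```

Because each rule's antecedent and consequent are explicit, the decision at any input can be traced back to which rules fired and how strongly, which is the explainability property the paper leverages.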

[AI-134] Robust Generative Audio Quality Assessment: Disentangling Quality from Spurious Correlations ICME2026

[Quick Read]: This paper addresses spurious correlations in automatic Mean Opinion Score (MOS) prediction for AI-generated content (AIGC), where data scarcity leads models to learn dataset-specific acoustic signatures rather than generalizable quality features. The key to the solution is domain adversarial training (DAT): by systematically exploring domain definition strategies ranging from explicit metadata-driven labels to implicit data-driven clusters, the authors find that the optimal strategy is highly dependent on the specific MOS aspect being evaluated. The resulting aspect-specific domain strategy effectively mitigates acoustic biases, significantly improves correlation with human ratings, and generalizes better to unseen generative scenarios.

Link: https://arxiv.org/abs/2603.16201
Authors: Kuan-Tang Huang,Chien-Chun Wang,Cheng-Yeh Yang,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
Affiliation: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Comments: Accepted to IEEE ICME 2026

Click to view abstract

Abstract:The rapid proliferation of AI-Generated Content (AIGC) has necessitated robust metrics for perceptual quality assessment. However, automatic Mean Opinion Score (MOS) prediction models are often compromised by data scarcity, predisposing them to learn spurious correlations-- such as dataset-specific acoustic signatures-- rather than generalized quality features. To address this, we leverage domain adversarial training (DAT) to disentangle true quality perception from these nuisance factors. Unlike prior works that rely on static domain priors, we systematically investigate domain definition strategies ranging from explicit metadata-driven labels to implicit data-driven clusters. Our findings reveal that there is no “one-size-fits-all” domain definition; instead, the optimal strategy is highly dependent on the specific MOS aspect being evaluated. Experimental results demonstrate that our aspect-specific domain strategy effectively mitigates acoustic biases, significantly improving correlation with human ratings and achieving superior generalization on unseen generative scenarios.

[AI-135] Structure-Aware Multimodal LLM Framework for Trustworthy Near-Field Beam Prediction

[Quick Read]: This paper addresses the inefficiency of conventional beam codebooks in near-field extremely large-scale MIMO (XL-MIMO) systems caused by spherical wavefront propagation, especially in complex 3D low-altitude environments where beam variations are deeply coupled with both user positions and the physical surroundings, so precise beam alignment demands strong environmental understanding. The key to the solution is a large language model (LLM)-driven multimodal framework that fuses historical GPS data, RGB images, LiDAR point clouds, and tailored task-specific textual prompts, exploiting the LLM's emergent reasoning and generalization capabilities to learn complex spatio-temporal dynamics for better environmental comprehension and beam management.

Link: https://arxiv.org/abs/2603.16143
Authors: Mengyuan Li,Qianfan Lu,Jiachen Tian,Hongjun Hu,Yu Han,Xiao Li,Chao-kai Wen,Shi Jin
Affiliation: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In near-field extremely large-scale multiple-input multiple-output (XL-MIMO) systems, spherical wavefront propagation expands the traditional beam codebook into the joint angular-distance domain, rendering conventional beam training prohibitively inefficient, especially in complex 3-dimensional (3D) low-altitude environments. Furthermore, since near-field beam variations are deeply coupled not only with user positions but also with the physical surroundings, precise beam alignment demands profound environmental understanding capabilities. To address this, we propose a large language model (LLM)-driven multimodal framework that fuses historical GPS data, RGB image, LiDAR data, and strategically designed task-specific textual prompts. By utilizing the powerful emergent reasoning and generalization capabilities of the LLM, our approach learns complex spatial dynamics to achieve superior environmental comprehension…

[AI-136] Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech INTERSPEECH2026

[Quick Read]: This paper addresses the difficulty of objective modeling in dysarthric speech quality assessment (DSQA) caused by scarce labeled data and the high cost of subjective evaluation. The key to the solution is a three-stage framework: a teacher model first generates pseudo-labels for unlabeled dysarthric speech; weakly supervised pretraining with a label-aware contrastive learning strategy then exposes the model to diverse speakers and acoustic conditions; finally, the pretrained model is fine-tuned for the downstream DSQA task. The method is markedly more robust across multiple etiologies and languages: the Whisper-based baseline outperforms state-of-the-art methods such as SpICE, and the full framework achieves an average Spearman correlation (SRCC) of 0.761 on unseen test sets.

Link: https://arxiv.org/abs/2603.15988
Authors: Jaesung Bae,Xiuwen Zheng,Minje Kim,Chang D. Yoo,Mark Hasegawa-Johnson
Affiliation: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to Interspeech 2026

Click to view abstract

Abstract:Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms SOTA DSQA predictors such as SpICE, and the full framework achieves an average SRCC of 0.761 across unseen test datasets.

[AI-137] Standardizing Medical Images at Scale for AI

[Quick Read]: This paper addresses performance degradation of deep learning in medical image analysis caused by the heterogeneity of clinical data, in particular domain shift arising from differences in imaging hardware, staining protocols, and acquisition conditions. The key to the solution is PhyCV (Physics-Inspired Computer Vision), a physics-based preprocessing framework that applies deterministic transformations modeled on optical physics (virtual diffractive propagation followed by coherent phase detection) to suppress non-semantic variability such as color and illumination differences while preserving diagnostically relevant texture and structure. Because the transform is physically interpretable, parameterizable, and differentiable, it can be deployed as a fixed preprocessing stage or embedded in end-to-end learning; on cross-institution breast cancer classification it raises accuracy from 70.8% to 90.9% at negligible computational cost, improving the robustness, interpretability, and reproducibility of clinical AI systems.

Link: https://arxiv.org/abs/2603.15980
Authors: Callen MacPhee,Yiming Zhou,Koichiro Kishima,Bahram Jalali
Affiliation: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Optics (physics.optics)
Comments:

Click to view abstract

Abstract:Deep learning has achieved remarkable success in medical image analysis, yet its performance remains highly sensitive to the heterogeneity of clinical data. Differences in imaging hardware, staining protocols, and acquisition conditions produce substantial domain shifts that degrade model generalization across institutions. Here we present a physics-based data preprocessing framework based on the PhyCV (Physics-Inspired Computer Vision) family of algorithms, which standardizes medical images through deterministic transformations derived from optical physics. The framework models images as spatially varying optical fields that undergo a virtual diffractive propagation followed by coherent phase detection. This process suppresses non-semantic variability such as color and illumination differences while preserving diagnostically relevant texture and structural features. When applied to histopathological images from the Camelyon17-WILDS benchmark, PhyCV preprocessing improves out-of-distribution breast-cancer classification accuracy from 70.8% (Empirical Risk Minimization baseline) to 90.9%, matching or exceeding data-augmentation and domain-generalization approaches at negligible computational cost. Because the transform is physically interpretable, parameterizable, and differentiable, it can be deployed as a fixed preprocessing stage or integrated into end-to-end learning. These results establish PhyCV as a generalizable data refinery for medical imaging-one that harmonizes heterogeneous datasets through first-principles physics, improving robustness, interpretability, and reproducibility in clinical AI systems.
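The "virtual diffractive propagation followed by coherent phase detection" idea can be sketched generically: transform the image to the spectral domain, apply a propagation phase kernel, transform back, and read out the phase of the complex field. The NumPy snippet below is only an illustrative sketch of that idea, with an assumed quadratic phase kernel and a made-up `strength` parameter; it is not PhyCV's actual API or parameterization.

```python
import numpy as np

def phase_transform(img, strength=0.5):
    """Generic physics-inspired transform sketch: propagate the image
    through a virtual diffractive medium (quadratic spectral phase),
    then read out the phase of the resulting complex field."""
    img = img.astype(float)
    ky = np.fft.fftfreq(img.shape[0])[:, None]
    kx = np.fft.fftfreq(img.shape[1])[None, :]
    kernel = np.exp(-1j * strength * (kx ** 2 + ky ** 2))  # virtual diffraction
    field = np.fft.ifft2(np.fft.fft2(img) * kernel)
    return np.angle(field)  # coherent phase detection

rng = np.random.default_rng(0)
out = phase_transform(rng.random((32, 32)))
print(out.shape)  # phase map, values confined to (-pi, pi]
```

The output is a bounded phase map regardless of the input's color or illumination scale, which is loosely why such transforms can suppress non-semantic intensity variability.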

[AI-138] LLM -Driven Discovery of High-Entropy Catalysts via Retrieval-Augmented Generation

[Quick Read]: This paper addresses the long development cycles (10-20 years) and heavy reliance on expert knowledge that bottleneck the discovery of catalysts for CO2 reduction. The key to the solution is a retrieval-augmented generation (RAG) framework that couples GPT-4 with a database of 50,000+ known materials, enabling the LLM to explore chemical space efficiently under physical constraints. The method handles multi-objective optimization (thermodynamic stability, cost below $100/kg, metallic conductivity, mechanical stability); volcano plot analysis confirms that 78% of the generated candidates cluster near the theoretical activity optimum (with an 82% thermodynamic stability rate), and the system achieves a 200x computational efficiency gain over traditional high-throughput screening, substantially accelerating the catalyst design workflow.

Link: https://arxiv.org/abs/2603.15712
Authors: AI Scientists,Xinyi Lin,Danqing Yin,Ying Guo
Affiliation: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:CO2 reduction requires efficient catalysts, yet materials discovery remains bottlenecked by 10-20 year development cycles requiring deep domain expertise. This paper demonstrates how large language models can assist the catalyst discovery process by helping researchers explore chemical spaces and interpret results when augmented with retrieval-based grounding. We introduce a retrieval-augmented generation framework that enables GPT-4 to navigate chemical space by accessing a database of 50,000+ known materials, adapting general-purpose language understanding for high-throughput materials design. Our approach generated over 250 catalyst candidates with an 82% thermodynamic stability rate while addressing multi-objective constraints: 68% achieved <$100/kg cost with metallic conductivity (band gap < 0.1 eV) and mechanical stability (B/G > 1.75). The best-performing Fe0.2Co0.2Ni0.2Ir0.1Ru0.3 achieves 0.285V limiting potential (25% improvement over IrO2), while Cr0.2Fe0.2Co0.3Ni0.2Mo0.1 optimally balances performance-cost trade-offs at $18/kg. Volcano plot analysis confirms that 78% of LLM-generated catalysts cluster near the theoretical activity optimum, while our system achieves 200x computational efficiency compared to traditional high-throughput screening. By demonstrating that retrieval-augmented generation can ground AI creativity in physical constraints without sacrificing exploration, this work demonstrates an approach where natural language interfaces can streamline materials discovery workflows, enabling researchers to explore chemical spaces more efficiently while the LLM assists in result interpretation and hypothesis generation.

[AI-139] Quantum Amplitude Estimation for Catastrophe Insurance Tail-Risk Pricing: Empirical Convergence and NISQ Noise Analysis

[Quick Read]: This paper addresses the slow convergence of classical Monte Carlo methods for pricing catastrophe insurance tail risk: their O(1/\sqrt{N}) rate makes it hard to resolve upper-tail percentiles of the loss distribution within practical simulation budgets, degrading risk estimates precisely where insolvency risk is greatest. The key to the solution is Quantum Amplitude Estimation (QAE), which, via Grover amplification, converges at close to O(1/N) in oracle queries, a theoretical quadratic speedup. The study validates this advantage on a Qiskit Aer simulator: a complete pipeline encodes fitted lognormal catastrophe distributions into quantum oracles, exploits small readout probabilities for safe Grover amplification (up to 16 iterations), and demonstrates the potential of the quantum approach for tail estimation, while identifying discretisation error, rather than estimation accuracy, as the current bottleneck.

Link: https://arxiv.org/abs/2603.15664
Authors: Alexis Kirke
Affiliation: Unknown
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Classical Monte Carlo methods for pricing catastrophe insurance tail risk converge at order O(1/\sqrt{N}), requiring large simulation budgets to resolve upper-tail percentiles of the loss distribution. This sample-sparsity problem can lead to AI models trained on impoverished tail data, producing poorly calibrated risk estimates where insolvency risk is greatest. Quantum Amplitude Estimation (QAE), following Montanaro, achieves convergence approaching order O(1/N) in oracle queries - a quadratic speedup that, at scale, would enable high-resolution tail estimation within practical budgets. We validate this advantage empirically using a Qiskit Aer simulator with genuine Grover amplification. A complete pipeline encodes fitted lognormal catastrophe distributions into quantum oracles via amplitude encoding, producing small readout probabilities that enable safe Grover amplification with up to k=16 iterations. Seven experiments on synthetic and real (NOAA Storm Events, 58,028 records) data yield three main findings: an oracle-model advantage, that strong classical baselines win when analytical access is available, and that discretisation, not estimation, is the current bottleneck.
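The classical O(1/\sqrt{N}) baseline that QAE is compared against is easy to reproduce: the RMS error of a Monte Carlo tail-probability estimate shrinks by roughly 4x when the sample budget grows 16x. A small stdlib-Python sketch on a synthetic Gaussian tail (not the paper's lognormal catastrophe model; the threshold and trial counts are illustrative):

```python
import random
import statistics

def mc_tail_error(n_samples, threshold=2.3263, trials=100, seed=0):
    """RMS error of a Monte Carlo estimate of P(Z > threshold), Z ~ N(0,1).
    threshold ~ 2.3263 is the 99th percentile, so the true tail mass is ~0.01."""
    rng = random.Random(seed)
    true_p = 0.01
    sq_errs = []
    for _ in range(trials):
        hits = sum(rng.gauss(0, 1) > threshold for _ in range(n_samples))
        sq_errs.append((hits / n_samples - true_p) ** 2)
    return statistics.mean(sq_errs) ** 0.5

e_small = mc_tail_error(500)
e_large = mc_tail_error(8000)   # 16x the samples
print(e_small / e_large)        # roughly 4, consistent with O(1/sqrt(N))
```

QAE's claimed advantage is that the same error reduction would require only ~4x more oracle queries instead of 16x more samples.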

Machine Learning

[LG-0] Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Link: https://arxiv.org/abs/2603.16857
Authors: Mayur Patil,Qadeer Ahmed,Shawn Midlam-Mohler,Stephanie Marik,Allen Sheldon,Rajeev Chhajer,Nithin Santhanam
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reliable multi-horizon traffic forecasting is challenging because network conditions are stochastic, incident disruptions are intermittent, and effective spatial dependencies vary across time-of-day patterns. This study is conducted on the Ohio Department of Transportation (ODOT) traffic count data and corresponding ODOT crash records. This work utilizes a Spatio-Temporal Transformer (STT) model with Adaptive Conformal Prediction (ACP) to produce multi-horizon forecasts with calibrated uncertainty. We propose a piecewise Coefficient of Variation (CV) strategy that models hour-to-hour traveltime variability using a log-normal distribution, enabling the construction of a per-hour dynamic adjacency matrix. We further perturb edge weights using incident-related severity signals derived from the ODOT crash dataset that comprises incident clearance time, weather conditions, speed violations, work zones, and roadway functional class, to capture localized disruptions and peak/off-peak transitions. This dynamic graph construction replaces a fixed-CV assumption and better represents changing traffic conditions within the forecast window. For validation, we generate extended trips via multi-hour loop runs on the Columbus, Ohio, network in SUMO simulations and apply a Monte Carlo simulation to obtain travel-time distributions for a Vehicle Under Test (VUT). Experiments demonstrate improved long-horizon accuracy and well-calibrated prediction intervals compared to other baseline methods.
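The Adaptive Conformal Prediction component above builds on the basic split-conformal recipe: calibrate a residual quantile on held-out data, then widen each forecast by that quantile to obtain intervals with the target coverage. A minimal sketch on synthetic data (not the STT model or the ODOT data; the noise model and sizes are made up):

```python
import random

def split_conformal(cal_pred, cal_true, alpha=0.1):
    """Split conformal: the (1 - alpha) empirical quantile of calibration
    residuals gives a half-width q such that [pred - q, pred + q] covers
    new outcomes with probability ~ 1 - alpha, assuming exchangeability."""
    scores = sorted(abs(p, ) if False else abs(p - y) for p, y in zip(cal_pred, cal_true))
    k = min(len(scores) - 1, int((1 - alpha) * (len(scores) + 1)))
    return scores[k]

rng = random.Random(0)
truth = [rng.uniform(0, 10) for _ in range(2000)]
pred = [t + rng.gauss(0, 1) for t in truth]          # imperfect forecasts
q = split_conformal(pred[:1000], truth[:1000], alpha=0.1)
covered = sum(abs(p - t) <= q for p, t in zip(pred[1000:], truth[1000:]))
print(covered / 1000)  # close to the 0.90 target
```

Adaptive variants additionally update `alpha` online as coverage errors accumulate, which is what makes the intervals track non-stationary traffic conditions.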

[LG-1] GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Link: https://arxiv.org/abs/2603.16849
Authors: Mattia Rigotti,Nicholas Thumiger,Thomas Frick
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Adapting transformer positional encoding to meshes and graph-structured data presents significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and can inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate methods sacrifice gauge symmetry by design. Both failure modes cause catastrophic generalization failures in inductive learning, where models trained with one set of numerical choices fail when encountering different spectral decompositions of similar graphs or discretizations of the same mesh. We propose GIST (Gauge-Invariant Spectral Transformers), a new graph transformer architecture that resolves this challenge by achieving end-to-end \mathcal{O}(N) complexity through random projections while algorithmically preserving gauge invariance via inner-product-based attention on the projected embeddings. We prove GIST achieves discretization-invariant learning with bounded mismatch error, enabling parameter transfer across arbitrary mesh resolutions for neural operator applications. Empirically, GIST matches state-of-the-art on standard graph benchmarks (e.g., achieving 99.50% micro-F1 on PPI) while uniquely scaling to mesh-based Neural Operator benchmarks with up to 750K nodes, achieving state-of-the-art aerodynamic prediction on the challenging DrivAerNet and DrivAerNet++ datasets.

[LG-2] Dynamic Meta-Layer Aggregation for Byzantine-Robust Federated Learning

Link: https://arxiv.org/abs/2603.16846
Authors: Reek Das,Biplab Kanti Sen
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 3 figures

Click to view abstract

Abstract:Federated Learning (FL) is increasingly applied in sectors like healthcare, finance, and IoT, enabling collaborative model training while safeguarding user privacy. However, FL systems are susceptible to Byzantine adversaries that inject malicious updates, which can severely compromise global model performance. Existing defenses tend to focus on specific attack types and fail against untargeted strategies, such as multi-label flipping or combinations of noise and backdoor patterns. To overcome these limitations, we propose FedAOT, a novel defense mechanism that counters multi-label flipping and untargeted poisoning attacks using a metalearning-inspired adaptive aggregation framework. FedAOT dynamically weights client updates based on their reliability, suppressing adversarial influence without relying on predefined thresholds or restrictive attack assumptions. Notably, FedAOT generalizes effectively across diverse datasets and a wide range of attack types, maintaining robust performance even in previously unseen scenarios. Experimental results demonstrate that FedAOT substantially improves model accuracy and resilience while maintaining computational efficiency, offering a scalable and practical solution for secure federated learning.
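For readers unfamiliar with Byzantine-robust aggregation, a classical baseline is the coordinate-wise median, which bounds the influence of a minority of poisoned updates. The sketch below illustrates the threat model targeted above with toy numbers; it is a standard baseline, not FedAOT's adaptive weighting scheme.

```python
import statistics

def coordinate_median(updates):
    """Aggregate client updates by taking the median of each coordinate:
    a classical Byzantine-robust baseline (not FedAOT's adaptive weighting)."""
    return [statistics.median(coords) for coords in zip(*updates)]

# Three honest clients near the true update, plus one poisoned client.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1]]
poisoned = [[100.0, 100.0]]                     # malicious large update
agg_mean = [sum(c) / len(c) for c in zip(*(honest + poisoned))]
agg_med = coordinate_median(honest + poisoned)
print(agg_mean)  # dragged far toward the attacker
print(agg_med)   # stays near the honest updates
```

Threshold-free schemes like the one described above aim to keep this robustness while adapting the weighting to attacks the median handles poorly.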

[LG-3] Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

Link: https://arxiv.org/abs/2603.16842
Authors: Jello Zhou,Vudtiwat Ngampruetikorn,David J. Schwab
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Systems and Control (eess.SY); Biological Physics (physics.bio-ph)
Comments: 18 pages, 17 figures

Click to view abstract

Abstract:Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.

[LG-4] RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation WWW2026

Link: https://arxiv.org/abs/2603.16800
Authors: Yixuan Huang,Jiawei Chen,Shengfan Zhang,Zongsheng Cao
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 5 figures. Accepted at WWW 2026

Click to view abstract

Abstract:Collaborative filtering (CF) recommendation has been significantly advanced by integrating Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and degrade semantic consistency across augmented views, and (ii) data sparsity hampers the propagation of collaborative signals, limiting generalization. To tackle these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model to capture global structure and a relation-aware denoising model to refine noisy edges. RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling to maintain semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which employs progressive noise injection and denoising for enhanced robustness; and (3) relation-aware edge refinement, dynamically adjusting edge weights based on latent node semantics. Extensive experiments on three public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.

[LG-5] High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Link: https://arxiv.org/abs/2603.16798
Authors: Ilias Diakonikolas,Daniel M. Kane,Thanasis Pittas
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We study mean estimation for a Gaussian distribution with identity covariance in \mathbb{R}^d under a missing data scheme termed the realizable \epsilon-contamination model. In this model an adversary can choose a function r(x) between 0 and \epsilon and each sample x goes missing with probability r(x). Recent work Ma et al., 2024 proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR) – where missingness is independent of the data – and Missing Not At Random (MNAR) – where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under \epsilon-realizable contamination.

[LG-6] Conservative Continuous-Time Treatment Optimization

Link: https://arxiv.org/abs/2603.16789
Authors: Nora Schneider,Georg Manten,Niki Kilbertus
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:We develop a conservative continuous-time stochastic control framework for treatment optimization from irregularly sampled patient trajectories. The unknown patient dynamics are modeled as a controlled stochastic differential equation with treatment as a continuous-time control. Naive model-based optimization can exploit model errors and propose out-of-support controls, so optimizing the estimated dynamics may not optimize the true dynamics. To limit extrapolation, we add a consistent signature-based MMD regularizer on path space that penalizes treatment plans whose induced trajectory distribution deviates from observed trajectories. The resulting objective minimizes a computable upper bound on the true cost. Experiments on benchmark datasets show improved robustness and performance compared to non-conservative baselines.

[LG-7] pADAM: A Plug-and-Play All-in-One Diffusion Architecture for Multi-Physics Learning

Link: https://arxiv.org/abs/2603.16757
Authors: Amirhossein Mollaali,Bongseok Kim,Christian Moya,Guang Lin
Subjects: Machine Learning (cs.LG)
Comments: 36 pages, 10 figures

Click to view abstract

Abstract:Generalizing across disparate physical laws remains a fundamental challenge for artificial intelligence in science. Existing deep-learning solvers are largely confined to single-equation settings, limiting transfer across physical regimes and inference tasks. Here we introduce pADAM, a unified generative framework that learns a shared probabilistic prior across heterogeneous partial differential equation families. Through a learned joint distribution of system states and, where applicable, physical parameters, pADAM supports forward prediction and inverse inference within a single architecture without retraining. Across benchmarks ranging from scalar diffusion to nonlinear Navier–Stokes equations, pADAM achieves accurate inference even under sparse observations. Combined with conformal prediction, it also provides reliable uncertainty quantification with coverage guarantees. In addition, pADAM performs probabilistic model selection from only two sparse snapshots, identifying governing laws through its learned generative representation. These results highlight the potential of generative multi-physics modeling for unified and uncertainty-aware scientific inference.

[LG-8] A Practical Algorithm for Feature-Rich Non-Stationary Bandit Problems

Link: https://arxiv.org/abs/2603.16755
Authors: Wei Min Loh,Sajib Kumer Sinha,Ankur Agarwal,Pascal Poupart
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation is maintained. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce conditionally coupled contextual (C3) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling that allows online learning without retraining. Empirical results show that C3 outperforms the next best algorithm with 5.7% lower average cumulative regret on four OpenML tabular datasets as well as demonstrating a 12.4% click lift on Microsoft News Dataset (MIND) compared to other algorithms.
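The Thompson-sampling backbone for Bernoulli bandits is Beta-Bernoulli posterior sampling: draw a success-rate estimate from each arm's Beta posterior, pull the argmax, and update. C3's contextual coupling and Nadaraya-Watson component are not reproduced here; this is only the minimal stationary sketch, with made-up arm rates.

```python
import random

def thompson_bernoulli(true_rates, steps, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a plausible success rate
    from each arm's Beta posterior, pull the best-looking arm, update."""
    rng = random.Random(seed)
    wins = [1] * len(true_rates)     # Beta(1, 1) uniform priors
    losses = [1] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(steps):
        samples = [rng.betavariate(w, l) for w, l in zip(wins, losses)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.5, 0.8], steps=2000)
print(pulls)  # the 0.8 arm should receive most pulls
```

Because the posterior is updated in closed form, no retraining is needed between pulls, which is the "online learning without retraining" property the abstract refers to.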

[LG-9] Bayesian Inference of Psychometric Variables From Brain and Behavior in Implicit Association Tests

Link: https://arxiv.org/abs/2603.16741
Authors: Christian A. Kothe,Sean Mullen,Michael V. Bronstein,Grant Hanada,Marcelo Cicconet,Aaron N. McInnes,Tim Mullen,Marc Aafjes,Scott R. Sponheim,Alik S. Widge
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: 43 pages, 7 figures, 6 tables, submitted to: Journal of Neural Engineering

Click to view abstract

Abstract:Objective. We establish a principled method for inferring mental health related psychometric variables from neural and behavioral data using the Implicit Association Test (IAT) as the data generation engine, aiming to overcome the limited predictive performance (typically under 0.7 AUC) of the gold-standard D-score method, which relies solely on reaction times. Approach. We propose a sparse hierarchical Bayesian model that leverages multi-modal data to predict experiences related to mental illness symptoms in new participants. The model is a multivariate generalization of the D-score with trainable parameters, engineered for parameter efficiency in the small-cohort regime typical of IAT studies. Data from two IAT variants were analyzed: a suicidality-related E-IAT (n=39) and a psychosis-related PSY-IAT (n=34). Main Results. Our approach overcomes a high inter-individual variability and low within-session effect size in the dataset, reaching AUCs of 0.73 (E-IAT) and 0.76 (PSY-IAT) in the best modality configurations, though corrected 95% confidence intervals are wide (\pm 0.18) and results are marginally significant after FDR correction (q=0.10). Restricting the E-IAT to MDD participants improves AUC to 0.79 [0.62, 0.97] (significant at q=0.05). Performance is on par with the best reference methods (shrinkage LDA and EEGNet) for each task, even when the latter were adapted to the task, while the proposed method was not. Accuracy was substantially above near-chance D-scores (0.50-0.53 AUC) in both tasks, with more consistent cross-task performance than any single reference method. Significance. Our framework shows promise for enhancing IAT-based assessment of experiences related to entrapment and psychosis, and potentially other mental health conditions, though further validation on larger and independent cohorts will be needed to establish clinical utility.

[LG-10] Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets

Link: https://arxiv.org/abs/2603.16731
Authors: Kristi Topollai,Anna Choromanska
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Quantizing optimizer states is becoming an important ingredient of memory-efficient large-scale pre-training, but the resulting optimizer dynamics remain only partially understood. We study low-precision exponential moving average (EMA) optimizer states and show how quantization can cause many nominal updates to round back to the same stored value, making the state effectively stale and slowing adaptation beyond what the nominal decay would suggest. We then develop a simple predictive model of stalling that estimates one-step stalling probabilities and characterizes how stalling builds up over time after the initialization. This perspective provides a mechanistic explanation for why optimizer-state resets help in low precision: once a quantized EMA becomes effectively stale, resetting it can temporarily restore responsiveness. Motivated by this picture, we derive a simple theory-guided method for choosing useful reset periods, showing that in low precision the key question is not only whether resets help, but when they should be applied. Experiments in controlled simulations and LLM pre-training show that suitable reset schedules recover the performance lost to low-precision state storage while substantially reducing optimizer-state memory.
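The stalling mechanism described above can be illustrated with a minimal numerical sketch (illustrative only, not the authors' implementation): when the nominal EMA update is smaller than half a quantization step, the stored state rounds back to its previous value and stops adapting, no matter how many updates are applied.

```python
import numpy as np

def quantize(x, step):
    """Round to the nearest point of a uniform grid (a stand-in for
    low-precision optimizer-state storage)."""
    return np.round(x / step) * step

def ema_steps(grad, beta=0.99, step=0.1, n=50):
    """Run n EMA updates of a quantized first-moment state and count how
    many nominal updates round back to the same stored value (stalls)."""
    m, stalls = 0.0, 0
    for _ in range(n):
        m_new = quantize(beta * m + (1 - beta) * grad, step)
        if m_new == m:
            stalls += 1
        m = m_new
    return m, stalls

# With a small per-step increment and a coarse grid, every update stalls:
# the state stays at 0.0 although a full-precision EMA would drift to 0.5.
m, stalls = ema_steps(grad=0.5, beta=0.99, step=0.1, n=50)
```

Under this picture, a state reset (reinitializing `m` from the current gradient) temporarily breaks the rounding trap, which is the intuition behind the reset schedules the paper derives.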

[LG-11] GeMA: Learning Latent Manifold Frontiers for Benchmarking Complex Systems

链接: https://arxiv.org/abs/2603.16729
作者: Jia Ming Li,Anupriya,Daniel J. Graham
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Econometrics (econ.EM); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Latent manifold frontiers for benchmarking complex production systems, and applications to national rail operators, wind farms, and macroeconomic productivity are presented

点击查看摘要

Abstract:Benchmarking the performance of complex systems such as rail networks, renewable generation assets and national economies is central to transport planning, regulation and macroeconomic analysis. Classical frontier methods, notably Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA), estimate an efficient frontier in the observed input-output space and define efficiency as distance to this frontier, but rely on restrictive assumptions on the production set and only indirectly address heterogeneity and scale effects. We propose Geometric Manifold Analysis (GeMA), a latent manifold frontier framework implemented via a productivity-manifold variational autoencoder (ProMan-VAE). Instead of specifying a frontier function in the observed space, GeMA represents the production set as the boundary of a low-dimensional manifold embedded in the joint input-output space. A split-head encoder learns latent variables that capture technological structure and operational inefficiency. Efficiency is evaluated with respect to the learned manifold, endogenous peer groups arise as clusters in latent technology space, a quotient construction supports scale-invariant benchmarking, and a local certification radius, derived from the decoder Jacobian and a Lipschitz bound, quantifies the geometric robustness of efficiency scores. We validate GeMA on synthetic data with non-convex frontiers, heterogeneous technologies and scale bias, and on four real-world case studies: global urban rail systems (COMET), British rail operators (ORR), national economies (Penn World Table) and a high-frequency wind-farm dataset. Across these domains GeMA behaves comparably to established methods when classical assumptions hold, and provides additional insight in settings with pronounced heterogeneity, non-convexity or size-related bias. 

[LG-12] The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models

链接: https://arxiv.org/abs/2603.16728
作者: Robert Welch,Emir Konuk,Kevin Smith
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model’s own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.
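The agreement-based consistency estimate the abstract favors can be sketched in a few lines (a generic illustration of the idea, not the authors' exact protocol): sample several answers to the same question and use the majority fraction as the confidence score, which avoids conditioning on any single reasoning trace.

```python
from collections import Counter

def agreement_confidence(answers):
    """Agreement-based consistency: the prediction is the majority answer
    among sampled generations, and its confidence is the majority share.
    Unlike token probabilities after a CoT trace, this score is not
    conditioned on a single line of reasoning."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    best, count = Counter(answers).most_common(1)[0]
    return best, count / len(answers)

# Five sampled answers to one question: majority "B" with share 3/5.
pred, conf = agreement_confidence(["B", "B", "A", "B", "C"])  # ("B", 0.6)
```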

[LG-13] Novelty-Driven Target-Space Discovery in Automated Electron and Scanning Probe Microscopy

链接: https://arxiv.org/abs/2603.16715
作者: Utkarsh Pratiush,Kamyar Barakati,Boris N. Slautin,Catherine C. Bodinger,Christopher D. Lowe,Brandi M. Cossairt,Sergei V. Kalinin
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Modern automated microscopy faces a fundamental discovery challenge: in many systems, the most important scientific information does not reside in the immediately visible image features, but in the target space of sequentially acquired spectra or functional responses, making it essential to develop strategies that can actively search for new behaviors rather than simply optimize known objectives. Here, we developed a deep-kernel-learning BEACON framework that is explicitly designed to guide discovery in the target space by learning structure-property relationships during the experiment and using that evolving model to seek diverse response regimes. We first established the method through demonstration workflows built on pre-acquired ground-truth datasets, which enabled direct benchmarking against classical acquisition strategies and allowed us to define a set of monitoring functions for comparing exploration quality, target-space coverage, and surrogate-model behavior in a transparent and reproducible manner. This benchmarking framework provides a practical basis for evaluating discovery-driven algorithms, not just optimization performance. We then operationalized and deployed the workflow on STEM, showing that the approach can transition from offline validation to real experimental implementation. To support adoption and extension by the broader community, the associated notebooks are available, allowing users to reproduce the workflows, test the benchmarks, and adapt the method to their own instruments and datasets.

[LG-14] Learning Lineage-guided Geodesics with Finsler Geometry

链接: https://arxiv.org/abs/2603.16708
作者: Aaron Zweig,Mingxuan Zhang,David A. Knowles,Elham Azizi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Trajectory inference investigates how to interpolate paths between observed timepoints of dynamical systems, such as temporally resolved population distributions, with the goal of inferring trajectories at unseen times and better understanding system dynamics. Previous work has focused on continuous geometric priors, utilizing data-dependent spatial features to define a Riemannian metric. In many applications, there exists discrete, directed prior knowledge over admissible transitions (e.g. lineage trees in developmental biology). We introduce a Finsler metric that combines geometry with classification and incorporate both types of priors in trajectory inference, yielding improved performance on interpolation tasks in synthetic and real-world data.
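For intuition, a standard family of asymmetric (direction-dependent) metrics that a Finsler structure permits is the Randers form; the following is a textbook example, not necessarily the construction used in the paper:

$$F(x, v) \;=\; \sqrt{v^\top A(x)\, v} \;+\; b(x)^\top v,$$

where the symmetric positive-definite field $A(x)$ plays the role of a Riemannian metric and the one-form $b(x)$ introduces a directional preference; $F(x,v) > 0$ for all $v \neq 0$ provided $\|b(x)\|_{A(x)^{-1}} < 1$. In a trajectory-inference setting, a drift term of this kind could encode directed prior knowledge such as admissible lineage transitions, which is the type of discrete directed prior the abstract describes.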

[LG-15] Grid-World Representations in Transformers Reflect Predictive Geometry

链接: https://arxiv.org/abs/2603.16689
作者: Sasha Brenner,Thomas R. Knösche,Nico Scherf
类目: Machine Learning (cs.LG)
*备注: 20 pages, 3 figures

点击查看摘要

Abstract:Next-token predictors often appear to develop internal representations of the latent world and its rules. The probabilistic nature of these models suggests a deep connection between the structure of the world and the geometry of probability distributions. In order to understand this link more precisely, we use a minimal stochastic process as a controlled setting: constrained random walks on a two-dimensional lattice that must reach a fixed endpoint after a predetermined number of steps. Optimal prediction of this process solely depends on a sufficient vector determined by the walker’s position relative to the target and the remaining time horizon; in other words, the probability distributions are parametrized by the world’s geometry. We train decoder-only transformers on prefixes sampled from the exact distribution of these walks and compare their hidden activations to the analytically derived sufficient vectors. Across models and layers, the learned representations align strongly with the ground-truth predictive vectors and are often low-dimensional. This provides a concrete example in which world-model-like representations can be directly traced back to the predictive geometry of the data itself. Although demonstrated in a simplified toy system, the analysis suggests that geometric representations supporting optimal prediction may provide a useful lens for studying how neural networks internalize grammatical and other structural constraints.

[LG-16] Self-Aware Markov Models for Discrete Reasoning

链接: https://arxiv.org/abs/2603.16661
作者: Gregor Kornhardt,Jannis Chemseddine,Christian Wald,Gabriele Steidl
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Standard masked discrete diffusion models face limitations in reasoning tasks due to their inability to correct their own mistakes on the masking path. Since they rely on a fixed number of denoising steps, they are unable to adjust their computation to the complexity of a given problem. To address these limitations, we introduce a method based on learning a Markov transition kernel that is trained on its own outputs. This design enables tokens to be remasked, allowing the model to correct its previous mistakes. Furthermore, we do not need a fixed time schedule but use a trained stopping criterion. This allows for adaptation of the number of function evaluations to the difficulty of the reasoning problem. Our adaptation adds two lightweight prediction heads, enabling reuse and fine-tuning of existing pretrained models. On the Sudoku-Extreme dataset we clearly outperform other flow-based methods with a validity of 95%. On Countdown-4 we need only 10 steps on average to solve almost 96% of problems correctly, and many can be solved in as few as 2 steps.
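The shape of the resulting inference loop can be sketched as follows (a hedged toy illustration: `step_fn` and `stop_fn` stand in for the paper's learned transition kernel and stopping head, and the "reveal one token per step" kernel is ours, chosen only to make the loop runnable):

```python
MASK = "_"

def solve(seq, step_fn, stop_fn, max_steps=64):
    """Adaptive denoising loop: repeatedly apply a transition kernel
    (step_fn), which may fill in or re-mask tokens, and stop as soon as
    a stopping criterion (stop_fn) fires, so easier inputs consume fewer
    function evaluations."""
    for t in range(1, max_steps + 1):
        seq = step_fn(seq)
        if stop_fn(seq):
            return seq, t
    return seq, max_steps

# Toy kernel: reveal the leftmost masked token of a known target per step.
target = list("421")

def toy_step(seq):
    seq = list(seq)
    if MASK in seq:
        i = seq.index(MASK)
        seq[i] = target[i]
    return seq

out, steps = solve([MASK] * 3, toy_step, lambda s: MASK not in s)
# out == ["4", "2", "1"] after steps == 3 kernel applications
```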

[LG-17] Simplex-to-Euclidean Bijection for Conjugate and Calibrated Multiclass Gaussian Process

链接: https://arxiv.org/abs/2603.16621
作者: Bernardo Williams,Harsha Vardhan Tetali,Arto Klami,Marcelo Hartmann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a conjugate and calibrated Gaussian process (GP) model for multi-class classification by exploiting the geometry of the probability simplex. Our approach uses Aitchison geometry to map simplex-valued class probabilities to an unconstrained Euclidean representation, turning classification into a GP regression problem with fewer latent dimensions than standard multi-class GP classifiers. This yields conjugate inference and reliable predictive probabilities without relying on distributional approximations in the model construction. The method is compatible with standard sparse GP regression techniques, enabling scalable inference on larger datasets. Empirical results show well-calibrated and competitive performance across synthetic and real-world datasets.
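The Aitchison-geometry bijection the abstract relies on is typically realized as the isometric log-ratio (ILR) transform. A minimal sketch follows (the Helmert basis below is one common choice of orthonormal basis, not necessarily the paper's):

```python
import numpy as np

def helmert_basis(d):
    """Orthonormal basis of the (d-1)-dim hyperplane orthogonal to the
    all-ones vector, used to build the ILR transform."""
    H = np.zeros((d - 1, d))
    for i in range(1, d):
        H[i - 1, :i] = 1.0 / i
        H[i - 1, i] = -1.0
        H[i - 1] /= np.linalg.norm(H[i - 1])
    return H

def ilr(p):
    """Map a point of the open simplex to unconstrained R^(d-1)."""
    clr = np.log(p) - np.mean(np.log(p))       # centered log-ratio
    return helmert_basis(len(p)) @ clr

def ilr_inv(z, d):
    """Inverse map: R^(d-1) back to the simplex (softmax of the clr)."""
    clr = helmert_basis(d).T @ z
    e = np.exp(clr)
    return e / e.sum()

p = np.array([0.5, 0.3, 0.2])
assert np.allclose(ilr_inv(ilr(p), 3), p)      # round-trip bijection
```

The payoff mirrored in the abstract: after this map, class probabilities live in a plain Euclidean space with one fewer dimension than the number of classes, so classification reduces to GP regression.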

[LG-18] Trajectory-Optimized Time Reparameterization for Learning-Compatible Reduced-Order Modeling of Stiff Dynamical Systems

链接: https://arxiv.org/abs/2603.16583
作者: Joe Standridge,Daniel Livescu,Paul Cizmas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stiff dynamical systems present a challenge for machine-learning reduced-order models (ML-ROMs), as explicit time integration becomes unstable in stiff regimes while implicit integration within learning loops is computationally expensive and often degrades training efficiency. Time reparameterization (TR) offers an alternative by transforming the independent variable so that rapid physical-time transients are spread over a stretched-time coordinate, enabling stable explicit integration on uniformly sampled grids. Although several TR strategies have been proposed, their effect on learnability in ML-ROMs remains incompletely understood. This work investigates time reparameterization as a stiffness-mitigation mechanism for neural ODE reduced-order modeling and introduces a trajectory-optimized TR (TOTR) formulation. The proposed approach casts time reparameterization as an optimization problem in arc-length coordinates, in which a traversal-speed profile is selected to penalize acceleration in stretched time. By targeting the smoothness of the training dynamics, this formulation produces reparameterized trajectories that are better conditioned and easier to learn than existing TR methods. TOTR is evaluated on three stiff problems: a parameterized stiff linear system, the van der Pol oscillator, and the HIRES chemical kinetics model. Across all cases, the proposed approach yields smoother reparameterizations and improved physical-time predictions compared to other TR approaches under identical training regimens. Quantitative results demonstrate loss reductions of one to two orders of magnitude compared to benchmark algorithms. These results highlight that effective stiffness mitigation in ML-ROMs depends critically on the regularity and learnability of the time map itself, and that optimization-based TR provides a robust framework for explicit reduced-order modeling of multiscale dynamical systems.

[LG-19] Deep Tabular Representation Corrector

链接: https://arxiv.org/abs/2603.16569
作者: Hangting Ye,Peng Wang,Wei Fan,Xiaozhuang Song,He Zhao,Dandan Gun,Yi Chang
类目: Machine Learning (cs.LG)
*备注: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

点击查看摘要

Abstract:Tabular data play an important role in diverse real-world fields, such as healthcare, engineering, finance, etc. The recent success of deep learning has fostered many tabular learning methods based on deep networks (e.g., Transformer, ResNet). Generally, existing deep tabular machine learning methods follow one of two paradigms, i.e., in-learning and pre-learning. In-learning methods need to train networks from scratch or impose extra constraints to regulate the representations, which nonetheless trains multiple tasks simultaneously and makes learning more difficult, while pre-learning methods design several pretext tasks for pre-training and then conduct task-specific fine-tuning, which however requires much extra training effort and prior knowledge. In this paper, we introduce a novel deep Tabular Representation Corrector, TRC, to enhance any trained deep tabular model's representations without altering its parameters, in a model-agnostic manner. Specifically, targeting the representation shift and representation redundancy that hinder prediction, we propose two tasks: (i) Tabular Representation Re-estimation, which trains a shift estimator to calculate the inherent shift of tabular representations and subsequently mitigate it, thereby re-estimating the representations; and (ii) Tabular Space Mapping, which transforms the above re-estimated representations into a light-embedding vector space via a coordinate estimator while preserving crucial predictive information to minimize redundancy. The two tasks jointly enhance the representations of deep tabular models without touching the original models, thus enjoying high efficiency. Finally, we conduct extensive experiments on state-of-the-art deep tabular machine learning models coupled with TRC on various tabular benchmarks, which show consistent superiority.

[LG-20] SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds

链接: https://arxiv.org/abs/2603.16535
作者: Viktor Stein,Wuchen Li,Gabriele Steidl
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 24 pages, 2 figures, 3 tables, comments welcome!

点击查看摘要

Abstract:Transformers owe much of their empirical success in natural language processing to the self-attention blocks. Recent perspectives interpret attention blocks as interacting particle systems, whose mean-field limits correspond to gradient flows of interaction energy functionals on probability density spaces equipped with Wasserstein-2-type metrics. We extend this viewpoint by introducing accelerated attention blocks derived from inertial Nesterov-type dynamics on density spaces. In our proposed architecture, tokens carry both spatial (feature) and velocity variables. The time discretization and the approximation of accelerated density dynamics yield Hamiltonian momentum attention blocks, which constitute the proposed accelerated attention architectures. In particular, for linear self-attention, we show that the attention blocks approximate a Stein variational gradient flow, using a bilinear kernel, of a potential energy. In this setting, we prove that elliptically contoured probability distributions are preserved by the accelerated attention blocks. We present implementable particle-based algorithms and demonstrate that the proposed accelerated attention blocks converge faster than the classical attention blocks while preserving the number of oracle calls.

[LG-21] Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function

链接: https://arxiv.org/abs/2603.16481
作者: Amon Lahr,Anna Scampicchio,Johannes Köhler,Melanie N. Zeilinger
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Non-conservative uncertainty bounds are essential for making reliable predictions about latent functions from noisy data, and thus a key enabler for safe learning-based control. In this domain, kernel methods such as Gaussian process regression are established techniques, thanks to their inherent uncertainty quantification mechanism. Still, existing bounds either pose strong assumptions on the underlying noise distribution, are conservative, do not scale well in the multi-output case, or are difficult to integrate into downstream tasks. This paper addresses these limitations by presenting a tight, distribution-free bound for multi-output kernel-based estimates. It is obtained through an unconstrained, duality-based formulation, which shares the same structure as classic Gaussian process confidence bounds and can thus be straightforwardly integrated into downstream optimization pipelines. We show that the proposed bound generalizes many existing results and illustrate its application using an example inspired by quadrotor dynamics learning.

[LG-22] DISCOVER: A Solver for Distributional Counterfactual Explanations

链接: https://arxiv.org/abs/2603.16436
作者: Yikai Gu,Lele Cao,Bo Zhao,Lei Lei,Lei You
类目: Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Counterfactual explanations (CE) explain model decisions by identifying input modifications that lead to different predictions. Most existing methods operate at the instance level. Distributional Counterfactual Explanations (DCE) extend this setting by optimizing an optimal transport objective that balances proximity to a factual input distribution and alignment to a target output distribution, with statistical certification via chance-constrained bounds. However, DCE relies on gradient-based optimization, while many real-world tabular pipelines are dominated by non-differentiable models. We propose DISCOVER, a model-agnostic solver for distributional counterfactual explanations. DISCOVER preserves the original DCE objective and certification while replacing gradient descent with a sparse propose-and-select search paradigm. It exploits a sample-wise decomposition of the transport objective to compute per-row impact scores and enforce a top-k intervention budget, focusing edits on the most influential samples. To guide candidate generation without predictor gradients, DISCOVER introduces an OT-guided cone sampling primitive driven by input-side transport geometry. Experiments on multiple tabular datasets demonstrate strong joint alignment of input and output distributions, extending distributional counterfactual reasoning to modern black-box learning pipelines. A code repository is available at this https URL.
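The top-k intervention budget can be sketched as follows (illustrative only: in the paper the impact scores come from a sample-wise decomposition of the transport objective, which is abstracted into a plain score vector here):

```python
import numpy as np

def topk_edit(factual, candidate, impact, k):
    """Propose-and-select step: accept the proposed edit only for the k
    rows with the highest impact scores; all other rows keep their
    factual values (sparse intervention budget)."""
    idx = np.argsort(impact)[::-1][:k]
    mask = np.zeros(len(impact), dtype=bool)
    mask[idx] = True
    return np.where(mask[:, None], candidate, factual), mask

factual = np.zeros((4, 2))
candidate = np.ones((4, 2))
impact = np.array([0.1, 0.9, 0.3, 0.7])
edited, mask = topk_edit(factual, candidate, impact, k=2)
# only rows 1 and 3 (the highest-impact rows) move to the candidate values
```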

[LG-23] Controlling Fish Schools via Reinforcement Learning of Virtual Fish Movement

链接: https://arxiv.org/abs/2603.16384
作者: Yusuke Nishii,Hiroaki Kawashima
类目: Robotics (cs.RO); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注: English translation of the author’s 2018 bachelor’s thesis. Keywords: fish schooling, reinforcement learning, collective behavior, artificial agents, swarm-machine interaction

点击查看摘要

Abstract:This study investigates a method to guide and control fish schools using virtual fish trained with reinforcement learning. We utilize 2D virtual fish displayed on a screen to overcome technical challenges such as durability and movement constraints inherent in physical robotic agents. To address the lack of detailed behavioral models for real fish, we adopt a model-free reinforcement learning approach. First, simulation results show that reinforcement learning can acquire effective movement policies even when simulated real fish frequently ignore the virtual stimulus. Second, real-world experiments with live fish confirm that the learned policy successfully guides fish schools toward specified target directions. Statistical analysis reveals that the proposed method significantly outperforms baseline conditions, including the absence of stimulus and a heuristic “stay-at-edge” strategy. This study provides an early demonstration of how reinforcement learning can be used to influence collective animal behavior through artificial agents.

[LG-24] Prior-Informed Neural Network Initialization: A Spectral Approach for Function Parameterizing Architectures

链接: https://arxiv.org/abs/2603.16376
作者: David Orlando Salazar Torres,Diyar Altinses,Andreas Schwung
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural network architectures designed for function parameterization, such as the Bag-of-Functions (BoF) framework, bridge the gap between the expressivity of deep learning and the interpretability of classical signal processing. However, these models are inherently sensitive to parameter initialization, as traditional data-agnostic schemes fail to capture the structural properties of the target signals, often leading to suboptimal convergence. In this work, we propose a prior-informed design strategy that leverages the intrinsic spectral and temporal structure of the data to guide both network initialization and architectural configuration. A principled methodology is introduced that uses the Fast Fourier Transform to extract dominant seasonal priors, informing model depth and initial states, and a residual-based regression approach to parameterize trend components. Crucially, this structural alignment enables a substantial reduction in encoder dimensionality without compromising reconstruction fidelity. A supporting theoretical analysis provides guidance on trend estimation under finite-sample regimes. Extensive experiments on synthetic and real-world benchmarks demonstrate that embedding data-driven priors significantly accelerates convergence, reduces performance variability across trials, and improves computational efficiency. Overall, the proposed framework enables more compact and interpretable architectures while outperforming standard initialization baselines, without altering the core training procedure.
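The FFT step for extracting dominant seasonal priors can be sketched as follows (an illustrative sketch: the function name and the amplitude convention are ours, not taken from the paper):

```python
import numpy as np

def dominant_frequencies(x, k=2):
    """Use the FFT to extract the k strongest periodic components of a
    signal; their frequencies and amplitudes can seed the initial states
    of a function-parameterizing network instead of a random init."""
    n = len(x)
    spec = np.fft.rfft(x - x.mean())           # drop the DC component
    freqs = np.fft.rfftfreq(n, d=1.0)
    top = np.argsort(np.abs(spec))[::-1][:k]   # strongest bins first
    # For a pure on-grid sine of amplitude A, |spec| = A * n / 2.
    return [(freqs[i], 2 * np.abs(spec[i]) / n) for i in top]

# Two on-grid sines: periods 100 and 20 samples, amplitudes 3 and 1.
t = np.arange(1000)
x = 3 * np.sin(2 * np.pi * 0.01 * t) + np.sin(2 * np.pi * 0.05 * t)
priors = dominant_frequencies(x, k=2)  # ~[(0.01, 3.0), (0.05, 1.0)]
```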

[LG-25] Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy

链接: https://arxiv.org/abs/2603.16368
作者: Adrien Jacquet Crétides,Mouad Abrini,Hamed Rahimi,Mohamed Chetouani
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to the 18th International Conference on Social Robotics (ICSR 2026)

点击查看摘要

Abstract:Striking a balance between efficiency and transparent motion is a core challenge in human-robot collaboration, as highly expressive movements often incur unnecessary time and energy costs. In collaborative environments, legibility allows a human observer a better understanding of the robot’s actions, increasing safety and trust. However, these behaviors result in sub-optimal and exaggerated trajectories that are redundant in low-ambiguity scenarios where the robot’s goal is already obvious. To address this trade-off, we propose Style-Conditioned Diffusion Policy (SCDP), a modular framework that constrains the trajectory generation of a pre-trained diffusion model toward either legibility or efficiency based on the environment’s configuration. Our method utilizes a post-training pipeline that freezes the base policy and trains a lightweight scene encoder and conditioning predictor to modulate the diffusion process. At inference time, an ambiguity detection module activates the appropriate conditioning, prioritizing expressive motion only for ambiguous goals and reverting to efficient paths otherwise. We evaluate SCDP on manipulation and navigation tasks, and results show that it enhances legibility in ambiguous settings while preserving optimal efficiency when legibility is unnecessary, all without retraining the base policy.

[LG-26] Decoding the Critique Mechanism in Large Reasoning Models

链接: https://arxiv.org/abs/2603.16331
作者: Hoang Phan,Quang H. Nguyen,Hung T. Q. Le,Xiusi Chen,Heng Ji,Khoa D. Doan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong “critique” ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating through the chain-of-thought (CoT), resulting in an incorrect intermediate conclusion, the model still reaches the correct final answer. This recovery implies that the model must possess an internal mechanism to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model’s error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs’ critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at this https URL.
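Steering with such a vector follows the standard activation-addition pattern; a minimal sketch (the critique direction and the scale `alpha` below are placeholders, not values from the paper):

```python
import numpy as np

def steer(hidden, critique_vec, alpha=4.0):
    """Activation steering: shift a hidden state along a unit-normalized
    'critique' direction to amplify the corresponding behavior at
    inference time, with no extra training."""
    v = critique_vec / np.linalg.norm(critique_vec)
    return hidden + alpha * v

h = np.zeros(4)                                   # toy hidden state
h2 = steer(h, np.array([0.0, 2.0, 0.0, 0.0]), alpha=4.0)
# the component along the critique direction increases by exactly alpha
```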

[LG-27] Laya: A LeJEPA Approach to EEG via Latent Prediction over Reconstruction

链接: https://arxiv.org/abs/2603.16281
作者: Saarang Panchavati,Uddhav Panchavati,Corey Arnold,William Speier
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) is a widely used tool for studying brain function, with applications in clinical neuroscience, diagnosis, and brain-computer interfaces (BCIs). Recent EEG foundation models trained on large unlabeled corpora aim to learn transferable representations, but their effectiveness remains unclear; reported improvements over smaller task-specific models are often modest, sensitive to downstream adaptation and fine-tuning strategies, and limited under linear probing. We hypothesize that one contributing factor is the reliance on signal reconstruction as the primary self-supervised learning (SSL) objective, which biases representations toward high-variance artifacts rather than task-relevant neural structure. To address this limitation, we explore an SSL paradigm based on Joint Embedding Predictive Architectures (JEPA), which learn by predicting latent representations instead of reconstructing raw signals. While earlier JEPA-style methods often rely on additional heuristics to ensure training stability, recent advances such as LeJEPA provide a more principled and stable formulation. We introduce Laya, the first EEG foundation model based on LeJEPA. Across a range of EEG benchmarks, Laya demonstrates improved performance under linear probing compared to reconstruction-based baselines, suggesting that latent predictive objectives offer a promising direction for learning transferable, high-level EEG representations.

[LG-28] Physics-integrated neural differentiable modeling for immersed boundary systems

链接: https://arxiv.org/abs/2603.16277
作者: Chenglin Li,Hang Xu,Jianting Chen,Yanfei Zhang
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 22 pages, 15 figures

点击查看摘要

Abstract:Accurately, efficiently, and stably computing complex fluid flows and their evolution near solid boundaries over long horizons remains challenging. Conventional numerical solvers require fine grids and small time steps to resolve near-wall dynamics, resulting in high computational costs, while purely data-driven surrogate models accumulate rollout errors and lack robustness under extrapolative conditions. To address these issues, this study extends existing neural PDE solvers by developing a physics-integrated differentiable framework for long-horizon prediction of immersed-boundary flows. A key design aspect of the framework includes an important improvement, namely the structural integration of physical principles into an end-to-end differentiable architecture incorporating a PDE-based intermediate velocity module and a multi-direct forcing immersed boundary module, both adhering to the pressure-projection procedure for incompressible flow computation. The computationally expensive pressure projection step is substituted with a learned implicit correction using ConvResNet blocks to reduce cost, and a sub-iteration strategy is introduced to separate the embedded physics module’s stability requirement from the surrogate model’s time step, enabling stable coarse-grid autoregressive rollouts with large effective time increments. The framework uses only single-step supervision for training, eliminating long-horizon backpropagation and reducing training time to under one hour on a single GPU. Evaluations on benchmark cases of flow past a stationary cylinder and a rotationally oscillating cylinder at Re=100 show the proposed model consistently outperforms purely data-driven, physics-loss-constrained, and coarse-grid numerical baselines in flow-field fidelity and long-horizon stability, while achieving an approximately 200-fold inference speedup over the high-resolution solver.

[LG-29] Neural Pushforward Samplers for the Fokker-Planck Equation on Embedded Riemannian Manifolds

链接: https://arxiv.org/abs/2603.16239
作者: Andrew Qing He,Wei Cai
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 13 pages, 2 figures, 1 table, 1 algorithm

点击查看摘要

Abstract:We extend the Weak Adversarial Neural Pushforward (WANPF) Method to the Fokker–Planck equation posed on a compact, smoothly embedded Riemannian manifold M in R^n. The key observation is that the weak formulation of the Fokker–Planck equation, together with the ambient-space representation of the Laplace–Beltrami operator via the tangential projection P(x) and the mean-curvature vector H(x), permits all integrals to be evaluated as expectations over samples lying on M, using test functions defined globally on R^n. A neural pushforward map is constrained to map the support of a base distribution into M at all times through a manifold retraction, so that probability conservation and manifold membership are enforced by construction. Adversarial ambient plane-wave test functions are chosen, and their Laplace–Beltrami operators are derived in closed form, enabling autodiff-free, mesh-free training. We present both a steady-state and a time-dependent formulation, derive explicit Laplace–Beltrami formulae for the sphere S^{n-1} and the flat torus T^n, and demonstrate the method numerically on a double-well steady-state Fokker–Planck equation on S^2.
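The ambient-space ingredients named in the abstract are easy to make concrete on the unit sphere. A minimal sketch (ours, not the authors' code) of the retraction that enforces manifold membership and of the tangential projector P(x) = I - x xᵀ:

```python
import numpy as np

def sphere_retraction(x):
    """Retract an ambient point onto the unit sphere S^{n-1}; the paper
    constrains its neural pushforward map through such a retraction so
    that samples stay on the manifold by construction."""
    x = np.asarray(x, float)
    return x / np.linalg.norm(x)

def tangential_projection(x):
    """Tangential projector P(x) = I - x x^T at a point x on the unit
    sphere, used to express the Laplace-Beltrami operator in ambient
    coordinates."""
    x = np.asarray(x, float)
    return np.eye(len(x)) - np.outer(x, x)
```

The projector annihilates the normal direction and leaves tangent vectors untouched, which is exactly what makes the ambient-coordinate Laplace–Beltrami representation work.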

[LG-30] Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

链接: https://arxiv.org/abs/2603.16223
作者: Kaixuan Du,Meng Cao,Hang Zhang,Yukun Wang,Xiangzhou Huang,Ni Li
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby becoming trapped in a dominant mode that limits further improvement. Building on this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method capable of generating more reliable learning signals through a two-stage consensus mechanism. The model initially acts as an anchor, producing dominant responses; then it serves as an explorer, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.
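The two-stage vote can be sketched as follows. The harmonic-mean combination is named in the abstract, but the concrete scoring below (consensus = vote fraction per candidate answer) is our assumption:

```python
from collections import Counter

def consensus_scores(answers):
    """Fraction of sampled answers agreeing with each candidate answer."""
    counts = Counter(answers)
    total = len(answers)
    return {a: c / total for a, c in counts.items()}

def dual_consensus_reward(anchor_answers, explorer_answers, eps=1e-8):
    """Hypothetical sketch of DCRL's two-stage vote: combine the anchor's
    (dominant) consensus and the explorer's (diverse) consensus via their
    harmonic mean, so an answer must be supported by both to score well."""
    s1 = consensus_scores(anchor_answers)
    s2 = consensus_scores(explorer_answers)
    rewards = {}
    for ans in set(s1) | set(s2):
        a, b = s1.get(ans, 0.0), s2.get(ans, 0.0)
        rewards[ans] = 2 * a * b / (a + b + eps)  # harmonic mean
    return rewards
```

Note that an answer favored only by the dominant anchor mode but never produced by the explorer receives zero reward, which is how a harmonic mean suppresses a spurious majority.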

[LG-31] Online Semi-infinite Linear Programming: Efficient Algorithms via Function Approximation

链接: https://arxiv.org/abs/2603.16200
作者: Yiming Zong,Jiashuo Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the dynamic resource allocation problem where the decision space is finite-dimensional, yet the solution must satisfy a large or even infinite number of constraints revealed via streaming data or oracle feedback. We model this challenge as an Online Semi-infinite Linear Programming (OSILP) problem and develop a novel LP formulation to solve it approximately. Specifically, we employ function approximation to reduce the number of constraints to a constant q. This addresses a key limitation of traditional online LP algorithms, whose regret bounds typically depend on the number of constraints, leading to poor performance in this setting. We propose a dual-based algorithm to solve our new formulation, which offers broad applicability through the selection of appropriate potential functions. We analyze this algorithm under two classical input models, stochastic input and random permutation, establishing regret bounds of O(q\sqrt{T}) and O((q + q\log T)\sqrt{T}) respectively. Note that both regret bounds are independent of the number of constraints, which demonstrates the potential of our approach to handle a large or infinite number of constraints. Furthermore, we investigate the potential to improve upon the O(q\sqrt{T}) regret and propose a two-stage algorithm, achieving O(q\log T + q/\epsilon) regret under more stringent assumptions. We also extend our algorithms to the general function setting. A series of experiments validates that our algorithms outperform existing methods when confronted with a large number of constraints.

[LG-32] The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

链接: https://arxiv.org/abs/2603.16177
作者: Christina Baek,Ricardo Pio Monti,David Schwab,Amro Abbas,Rishabh Adiga,Cody Blakeney,Maximilian Böther,Paul Burstein,Aldo Gael Carranza,Alvin Deng,Parth Doshi,Vineeth Dorna,Alex Fang,Tony Jiang,Siddharth Joshi,Brett W. Larsen,Jason Chan Lee,Katherine L. Mentzer,Luke Merrick,Haakon Mongstad,Fan Pan,Anshuman Suri,Darren Teh,Jason Telanoff,Jack Urbanek,Zhengping Wang,Josh Wills,Haoli Yin,Aditi Raghunathan,J. Zico Kolter,Bogdan Gaza,Ari Morcos,Matthew Leavitt,Pratyush Maini
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the finetuner’s fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.

[LG-33] Execution-Grounded Credit Assignment for GRPO in Code Generation ICLR2026

链接: https://arxiv.org/abs/2603.16158
作者: Abhijit Kumar,Natalya Kumar,Shikhar Gupta
类目: Machine Learning (cs.LG)
*备注: Accepted at SPOT ICLR 2026 ( this https URL )

点击查看摘要

Abstract:Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.
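A minimal sketch of the credit-localization step (function and argument names are ours, not the paper's): the single group-relative outcome advantage is written only into the token span covering the earliest semantic divergence, leaving all other tokens, including everything downstream of the span, with zero credit in the GRPO update:

```python
import numpy as np

def egca_advantages(seq_len, outcome_advantage, divergence_span):
    """Hypothetical sketch of EGCA credit assignment: instead of spreading
    a single outcome signal uniformly over the whole program, assign it
    only to the half-open token span [start, end) where execution traces
    of the candidate and the reference first diverge; downstream tokens
    are masked out (zero advantage)."""
    adv = np.zeros(seq_len)
    start, end = divergence_span
    adv[start:end] = outcome_advantage
    return adv
```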

[LG-34] Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

链接: https://arxiv.org/abs/2603.16140
作者: Yuxuan Zhu,Daniel Kang
类目: Machine Learning (cs.LG)
*备注: 16 pages, 17 figures

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is “contaminated” with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving similar performance to that of the basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world, human annotation errors cause 5-12% lower accuracy than clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.

[LG-35] A Depth-Aware Comparative Study of Euclidean and Hyperbolic Graph Neural Networks on Bitcoin Transaction Systems

链接: https://arxiv.org/abs/2603.16080
作者: Ankit Ghimire,Saydul Akbar Murad,Nick Rahimi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bitcoin transaction networks are large-scale socio-technical systems in which activities are represented through multi-hop interaction patterns. Graph Neural Networks (GNNs) have become a widely adopted tool for analyzing such systems, supporting tasks such as entity detection and transaction classification. Large-scale datasets like Elliptic have enabled a rise in the analysis of these systems and in tasks such as fraud detection. In these settings, the amount of transactional context available to each node is determined by the neighborhood aggregation and sampling strategies, yet the interaction between these receptive fields and embedding geometry has received limited attention. In this work, we conduct a controlled comparison of Euclidean and tangent-space hyperbolic GNNs for node classification on a large Bitcoin transaction graph. By explicitly varying the neighborhood depth while keeping the model architecture and dimensionality fixed, we analyze the differences between the two embedding spaces. We further examine optimization behavior and observe that the joint selection of learning rate and curvature plays a critical role in stabilizing high-dimensional hyperbolic embeddings. Overall, our findings provide practical insights into the role of embedding geometry and neighborhood depth when modeling large-scale transaction networks, informing the deployment of hyperbolic GNNs for computational social systems.
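For context, tangent-space hyperbolic GNNs typically move Euclidean features into hyperbolic space through the exponential map at the origin of the Poincaré ball. An illustrative sketch of that standard map (not the paper's code):

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||).
    Tangent-space hyperbolic GNNs apply this after Euclidean aggregation to
    place features inside the ball (norm strictly below 1/sqrt(c))."""
    v = np.asarray(v, float)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-15)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)
```

Near the origin the map is approximately the identity, while large tangent vectors are compressed toward the ball's boundary, which is what gives hyperbolic embeddings their extra capacity for hierarchical structure.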

[LG-36] MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models MDM

链接: https://arxiv.org/abs/2603.16077
作者: Chen-Hao Chao,Wei-Fang Sun,Junwei Qua,Chun-Yi Lee,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. Second, we find that the functional form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8× more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves 7.77 perplexity on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When extending the model size to 1.1B parameters, our model further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.
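The "Binary Encoding" presumably expands each token id into ceil(log2 V) binary sub-tokens so masking can act below the token level. A sketch of such an encoding (our construction, not necessarily the paper's exact scheme):

```python
def binary_subtokens(token_id, vocab_size):
    """Expand a token id into its fixed-width binary sub-tokens
    (most-significant bit first), width = bit length of vocab_size - 1."""
    n_bits = max(1, (vocab_size - 1).bit_length())
    return [(token_id >> i) & 1 for i in reversed(range(n_bits))]

def decode_subtokens(bits):
    """Invert binary_subtokens: reassemble the token id from its bits."""
    tid = 0
    for b in bits:
        tid = (tid << 1) | b
    return tid
```

With a GPT-2-sized vocabulary (V = 50257) each token becomes 16 binary sub-tokens, and the diffusion process can then mask individual bits rather than whole tokens.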

[LG-37] Adaptive regularization parameter selection for high-dimensional inverse problems: A Bayesian approach with Tucker low-rank constraints

链接: https://arxiv.org/abs/2603.16066
作者: Qing-Mei Yang,Da-Qing Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel variational Bayesian method that integrates Tucker decomposition for efficient high-dimensional inverse problem solving. The method reduces computational complexity by transforming variational inference from a high-dimensional space to a lower-dimensional core tensor space via Tucker decomposition. A key innovation is the introduction of per-mode precision parameters, enabling adaptive regularization for anisotropic structures. For instance, in directional image deblurring, learned parameters align with physical anisotropy, applying stronger regularization to critical directions (e.g., row vs. column axes). The method further estimates noise levels from data, eliminating reliance on prior knowledge of noise parameters (unlike conventional benchmarks such as the discrepancy principle (DP)). Experimental evaluations across 2D deblurring, 3D heat conduction, and Fredholm integral equations demonstrate consistent improvements in quantitative metrics (PSNR, SSIM) and qualitative visualizations (error maps, precision parameter trends) compared to L-curve criterion, generalized cross-validation (GCV), unbiased predictive risk estimator (UPRE), and DP. The approach scales to problems with 110,000 variables and outperforms existing methods by 0.73-2.09 dB in deblurring tasks and 6.75 dB in 3D heat conduction. Limitations include sensitivity to rank selection in Tucker decomposition and the need for theoretical analysis. Future work will explore automated rank selection and theoretical guarantees. This method bridges Bayesian theory and scalable computation, offering practical solutions for large-scale inverse problems in imaging, remote sensing, and scientific computing.

[LG-38] The Importance of Being Smoothly Calibrated

链接: https://arxiv.org/abs/2603.16015
作者: Parikshit Gopalan,Konstantinos Stavropoulos,Kunal Talwar,Pranay Tankala
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Recent work has highlighted the centrality of smooth calibration [Kakade and Foster, 2008] as a robust measure of calibration error. We generalize, unify, and extend previous results on smooth calibration, both as a robust calibration measure, and as a step towards omniprediction, which enables predictions with low regret for downstream decision makers seeking to optimize some proper loss unknown to the predictor. We present a new omniprediction guarantee for smoothly calibrated predictors, for the class of all bounded proper losses. We smooth the predictor by adding some noise to it, and compete against smoothed versions of any benchmark predictor on the space, where we add some noise to the predictor and then post-process it arbitrarily. The omniprediction error is bounded by the smooth calibration error of the predictor and the earth mover’s distance from the benchmark. We exhibit instances showing that this dependence cannot, in general, be improved. We show how this unifies and extends prior results [Foster and Vohra, 1998; Hartline, Wu, and Yang, 2025] on omniprediction from smooth calibration. We present a crisp new characterization of smooth calibration in terms of the earth mover’s distance to the closest perfectly calibrated joint distribution of predictions and labels. This also yields a simpler proof of the relation to the lower distance to calibration from [Blasiok, Gopalan, Hu, and Nakkiran, 2023]. We use this to show that the upper distance to calibration cannot be estimated within a quadratic factor with sample complexity independent of the support size of the predictions. This is in contrast to the distance to calibration, where the corresponding problem was known to be information-theoretically impossible: no finite number of samples suffice [Blasiok, Gopalan, Hu, and Nakkiran, 2023]. 
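For reference, the smooth calibration error of a predictor f (following Kakade and Foster [2008] and Błasiok et al. [2023]) is the worst-case correlation of the residual with a bounded Lipschitz test function of the prediction:

```latex
\mathrm{smCE}(f) \;=\; \sup_{w \in \mathcal{W}} \; \mathbb{E}\!\left[\, w(f(x)) \,\bigl(y - f(x)\bigr) \,\right],
\qquad
\mathcal{W} \;=\; \bigl\{\, w : [0,1] \to [-1,1] \ \text{1-Lipschitz} \,\bigr\}.
```

A perfectly calibrated predictor satisfies E[y - f(x) | f(x)] = 0, making every such expectation vanish, and replacing the Lipschitz family with arbitrary bounded functions would recover the (non-robust) expected calibration error.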

[LG-39] W2T: LoRA Weights Already Know What They Can Do

链接: https://arxiv.org/abs/2603.15990
作者: Xiaolong Han,Ferrante Neri,Zijian Jiang,Fang Wu,Yanfang Ye,Lu Yin,Zehong Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Each LoRA checkpoint compactly stores task-specific updates in low-rank weight matrices, offering an efficient way to adapt large language models to new tasks and domains. In principle, these weights already encode what the adapter does and how well it performs. In this paper, we ask whether this information can be read directly from the weights, without running the base model or accessing training data. A key obstacle is that a single LoRA update can be factorized in infinitely many ways. Without resolving this ambiguity, models trained on the factors may fit the particular factorization rather than the underlying update. To this end, we propose W2T, which maps each LoRA update to a provably canonical form via QR decomposition followed by SVD, so that all equivalent factorizations share the same representation. The resulting components are then tokenized and processed by a Transformer to produce a weight-space embedding. Across language and vision LoRA collections, W2T achieves strong results on attribute classification, performance prediction, and adapter retrieval, demonstrating that LoRA weights reliably indicate model behavior once factorization ambiguity is removed. Code is available at this https URL.
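The canonicalization can be sketched in a few lines of NumPy. Conventions such as the QR ordering are our assumptions (in practice one would also fix the signs of singular vectors), but the invariance property the abstract claims is easy to verify: every factorization (B M, M⁻¹ A) of the same update yields the same spectrum and reconstructs the same ΔW:

```python
import numpy as np

def canonical_lora(B, A):
    """Sketch of the QR-then-SVD canonicalization described in the W2T
    abstract. For a LoRA update dW = B @ A with B (d_out x r) and
    A (r x d_in): orthogonalize both factors, then SVD the small r x r
    core so equivalent factorizations map to one representation."""
    Qb, Rb = np.linalg.qr(B)            # B = Qb @ Rb, Qb has orthonormal cols
    Qa, Ra = np.linalg.qr(A.T)          # A = Ra.T @ Qa.T
    U, s, Vt = np.linalg.svd(Rb @ Ra.T) # SVD of the r x r core
    # canonical left basis, singular spectrum, canonical right basis
    return Qb @ U, s, Vt @ Qa.T
```

Since dW = Qb (Rb Raᵀ) Qaᵀ and Qb, Qa are isometries on their column spaces, the returned spectrum is exactly the singular spectrum of the update itself.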

[LG-40] Determinism in the Undetermined: Deterministic Output in Charge-Conserving Continuous-Time Neuromorphic Systems with Temporal Stochasticity

链接: https://arxiv.org/abs/2603.15987
作者: Jing Yan,Kang You,Zhezhi He,Yaoyu Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Achieving deterministic computation results in asynchronous neuromorphic systems remains a fundamental challenge due to the inherent temporal stochasticity of continuous-time hardware. To address this, we develop a unified continuous-time framework for spiking neural networks (SNNs) that couples the Law of Charge Conservation with minimal neuron-level constraints. This integration ensures that the terminal state depends solely on the aggregate input charge, providing a unique cumulated output invariant to temporal stochasticity. We prove that this mapping is strictly invariant to spike timing in acyclic networks, whereas recurrent connectivity can introduce temporal sensitivity. Furthermore, we establish an exact representational correspondence between these charge-conserving SNNs and quantized artificial neural networks, bridging the gap between static deep learning and event-driven dynamics without approximation errors. These results establish a rigorous theoretical basis for designing continuous-time neuromorphic systems that harness the efficiency of asynchronous processing while maintaining algorithmic determinism.
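The invariant is easy to see for a non-leaky (charge-conserving) integrate-and-fire neuron: the cumulative spike count depends only on the total injected charge, not on arrival order or timing. A toy sketch:

```python
def spike_count(charges, threshold=1.0):
    """Toy charge-conserving integrate-and-fire neuron: accumulate input
    charge, emit a spike and subtract the threshold each time it is
    crossed. Because no charge leaks, the cumulative output depends only
    on sum(charges), so any temporal reordering of the inputs yields the
    same terminal spike count."""
    v, spikes = 0.0, 0
    for q in charges:
        v += q
        while v >= threshold:
            v -= threshold
            spikes += 1
    return spikes
```

This is the single-neuron version of the paper's claim; the paper shows the same aggregate-charge determinism propagates through acyclic networks, while recurrence can reintroduce timing sensitivity.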

[LG-41] Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

链接: https://arxiv.org/abs/2603.15958
作者: Egor Shulgin,Dimitri von Rütte,Tianyue H. Zhang,Niccolò Ajroldi,Bernhard Schölkopf,Antonio Orvieto
类目: Machine Learning (cs.LG)
*备注: v1: Preprint based on a short version published as a conference paper at SciForDL Workshop, 2nd edition

点击查看摘要

Abstract:Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size fixed, recovers most insights and observations from the literature under a unified and principled perspective, with clear directions open for future research. Our results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.
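A toy instance of the recipe (this is the textbook SGD bound, not the LMO-based bounds the paper actually minimizes): treat a convergence bound as a proxy and minimize it in closed form, which immediately produces a power-law schedule in the horizon:

```python
import math

def optimal_lr_from_bound(D, sigma, T):
    """Illustrative only: for the classic SGD-style proxy bound
        B(eta) = D**2 / (2 * eta * T) + sigma**2 * eta / 2,
    setting dB/deta = 0 gives eta* = D / (sigma * sqrt(T)),
    i.e. a power law eta* ~ T**(-1/2) in the iteration/token budget T."""
    return D / (sigma * math.sqrt(T))
```

The paper's contribution is doing this exercise with modern LMO-based bounds covering normalized SGD, signSGD, and Muon, which yields joint schedules for learning rate, momentum, and batch size rather than this single-knob example.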

[LG-42] GASP: Guided Asymmetric Self-Play For Coding LLMs ICLR2026

链接: https://arxiv.org/abs/2603.15957
作者: Swadesh Jana,Cansu Sancaktar,Tomáš Daniš,Georg Martius,Antonio Orvieto,Pavel Kolev
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026 Workshop on AI with Recursive Self-Improvement (RSI 2026) as Spotlight, and ICLR 2026 Workshop on Lifelong Agents (LLA 2026)

点击查看摘要

Abstract:Asymmetric self-play has emerged as a promising paradigm for post-training large language models, where a teacher continually generates questions for a student to solve at the edge of the student’s learnability. Although these methods promise open-ended data generation bootstrapped from no human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative to improve the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic with no real grounding. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions that are identified to pose a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. Doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we manage to solve hard goalpost questions that remain out of reach for all baselines.

[LG-43] Discovery of interaction and diffusion kernels in particle-to-mean-field multi-agent systems

链接: https://arxiv.org/abs/2603.15927
作者: Giacomo Albi,Alessandro Alla,Elisa Calzola
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We propose a data-driven framework to learn interaction kernels in stochastic multi-agent systems. Our approach aims at identifying the functional form of nonlocal interaction and diffusion terms directly from trajectory data, without any a priori knowledge of the underlying interaction structure. Starting from a discrete stochastic binary-interaction model, we formulate the inverse problem as a sequence of sparse regression tasks in structured finite-dimensional spaces spanned by compactly supported basis functions, such as piecewise linear polynomials. In particular, we assume that pairwise interactions between agents are not directly observed and that only limited trajectory data are available. To address these challenges, we propose two complementary identification strategies. The first is based on random-batch sampling, which compensates for latent interactions while preserving the statistical structure of the full dynamics in expectation. The second is based on a mean-field approximation, where the empirical particle density reconstructed from the data defines a continuous nonlocal regression problem. Numerical experiments demonstrate the effectiveness and robustness of the proposed framework, showing accurate reconstruction of both interaction and diffusion kernels even from partially observed data. The method is validated on benchmark models, including bounded-confidence and attraction-repulsion dynamics, where the two proposed strategies achieve comparable levels of accuracy.
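The structured finite-dimensional spaces can be built from piecewise-linear "hat" functions. A sketch (plain least squares standing in for the paper's sparse regression) of evaluating the basis and fitting a kernel from distance/value observations:

```python
import numpy as np

def hat_basis(x, knots):
    """Compactly supported piecewise-linear ('hat') basis: column j is the
    tent function equal to 1 at knots[j] and 0 at the neighboring knots.
    np.interp of the j-th indicator vector over the knots is exactly
    that tent function."""
    x = np.asarray(x, float)
    n = len(knots)
    return np.stack([np.interp(x, knots, np.eye(n)[j]) for j in range(n)], axis=-1)

def fit_kernel(r_samples, phi_samples, knots):
    """Least-squares sketch of the regression step: recover the basis
    coefficients of an interaction kernel phi from (distance, value)
    observations."""
    coeffs, *_ = np.linalg.lstsq(hat_basis(r_samples, knots), phi_samples, rcond=None)
    return coeffs
```

Because hat functions interpolate exactly at the knots, a kernel that is itself piecewise linear on the grid is recovered with its nodal values as coefficients.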

[LG-44] Generative Inverse Design with Abstention via Diagonal Flow Matching

链接: https://arxiv.org/abs/2603.15925
作者: Miguel de Campos,Werner Krebs,Hanno Gottschalk
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inverse design aims to find design parameters x achieving target performance y^* . Generative approaches learn bidirectional mappings between designs and labels, enabling diverse solution sampling. However, standard conditional flow matching (CFM), when adapted to inverse problems by pairing labels with design parameters, exhibits strong sensitivity to their arbitrary ordering and scaling, leading to unstable training. We introduce Diagonal Flow Matching (Diag-CFM), which resolves this through a zero-anchoring strategy that pairs design coordinates with noise and labels with zero, making the learning problem provably invariant to coordinate permutations. This yields order-of-magnitude improvements in round-trip accuracy over CFM and invertible neural network baselines across design dimensions up to P=100 . We develop two architecture-intrinsic uncertainty metrics, Zero-Deviation and Self-Consistency, that enable three practical capabilities: selecting the best candidate among multiple generations, abstaining from unreliable predictions, and detecting out-of-distribution targets; consistently outperforming ensemble and general-purpose alternatives across all tasks. We validate on airfoil, gas turbine combustor, and an analytical benchmark with scalable design dimension.
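The zero-anchoring pairing can be sketched as follows (our construction of the training tuple, inferred from the abstract, not the authors' code):

```python
import numpy as np

def diag_cfm_pair(x_design, y_label, rng):
    """Hypothetical sketch of Diag-CFM's zero-anchoring: the source point
    z0 pairs design coordinates with Gaussian noise and label coordinates
    with zero; the target z1 stacks the true design and label. Along the
    linear path z_t = (1 - t) z0 + t z1, the flow-matching velocity target
    is z1 - z0."""
    x = np.asarray(x_design, float)
    y = np.asarray(y_label, float)
    z0 = np.concatenate([rng.standard_normal(x.shape), np.zeros_like(y)])
    z1 = np.concatenate([x, y])
    t = rng.uniform()
    return t, (1 - t) * z0 + t * z1, z1 - z0
```

Because every label coordinate is anchored at zero in z0 regardless of where it sits in the vector, the construction is insensitive to how design and label coordinates are ordered, which is the permutation invariance the abstract emphasizes.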

[LG-45] Game-Theory-Assisted Reinforcement Learning for Border Defense: Early Termination based on Analytical Solutions

链接: https://arxiv.org/abs/2603.15907
作者: Goutam Das,Michael Dorothy,Kyle Volle,Daigo Shishika
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 7 pages, ACC 2026

点击查看摘要

Abstract:Game theory provides the gold standard for analyzing adversarial engagements, offering strong optimality guarantees. However, these guarantees often become brittle when assumptions such as perfect information are violated. Reinforcement learning (RL), by contrast, is adaptive but can be sample-inefficient in large, complex domains. This paper introduces a hybrid approach that leverages game-theoretic insights to improve RL training efficiency. We study a border defense game with limited perceptual range, where defender performance depends on both search and pursuit strategies, making classical differential game solutions inapplicable. Our method employs the Apollonius Circle (AC) to compute equilibrium in the post-detection phase, enabling early termination of RL episodes without learning pursuit dynamics. This allows RL to concentrate on learning search strategies while guaranteeing optimal continuation after detection. Across single- and multi-defender settings, this early termination method yields 10-20% higher rewards, faster convergence, and more efficient search trajectories. Extensive experiments validate these findings and demonstrate the overall effectiveness of our approach.
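The Apollonius circle for evader position E, pursuer position P, and speed ratio gamma = v_E / v_P < 1 is the locus |X - E| = gamma |X - P|, and it has a closed-form center and radius. That closed form is what allows an episode to be terminated at detection time: the post-detection pursuit outcome is already determined. A sketch of the standard construction:

```python
import numpy as np

def apollonius_circle(evader, pursuer, gamma):
    """Apollonius circle: locus of points X with
    |X - evader| = gamma * |X - pursuer| for speed ratio gamma < 1.
    Standard closed form: center = (E - gamma^2 P) / (1 - gamma^2),
    radius = gamma * |E - P| / (1 - gamma^2). Points inside the circle
    are reachable by the evader before the (faster) pursuer."""
    E = np.asarray(evader, float)
    P = np.asarray(pursuer, float)
    g2 = gamma ** 2
    center = (E - g2 * P) / (1.0 - g2)
    radius = gamma * np.linalg.norm(E - P) / (1.0 - g2)
    return center, radius
```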

[LG-46] Evaluating Black-Box Vulnerabilities with Wasserstein-Constrained Data Perturbations

链接: https://arxiv.org/abs/2603.15867
作者: Adriana Laurindo Monteiro,Jean-Michel Loubes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The massive use of Machine Learning (ML) tools in industry comes with critical challenges, such as the lack of explainable models and the use of black-box algorithms. We address this issue by applying Optimal Transport theory to the analysis of how ML models respond to variations in the distribution of input variables. We find the closest distribution, in the Wasserstein sense, that satisfies a given constraint and examine its impact on model behavior. Furthermore, we establish convergence results for this projected distribution and demonstrate our approach using examples and real-world datasets in both regression and classification settings.
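In one dimension with equal-size samples, the Wasserstein-1 distance underlying this kind of projection has a particularly simple form: the optimal coupling matches sorted values. A sketch of this simplest empirical case (the paper works with general distributions):

```python
import numpy as np

def w1_empirical(x, y):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples:
    the optimal transport plan pairs order statistics, so W1 reduces to the
    mean absolute difference of the sorted values."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "equal sample sizes assumed in this sketch"
    return float(np.abs(x - y).mean())
```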

[LG-47] Longitudinal Risk Prediction in Mammography with Privileged History Distillation

链接: https://arxiv.org/abs/2603.15814
作者: Banafsheh Karimian,Alexis Guichemerre,Soufiane Belharbi,Natacha Gillet,Luke McCaffrey,Mohammadhadi Shateri,Eric Granger
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Breast cancer remains a leading cause of cancer-related mortality worldwide. Longitudinal mammography risk prediction models improve multi-year breast cancer risk prediction based on prior screening exams. However, in real-world clinical practice, longitudinal histories are often incomplete, irregular, or unavailable due to missed screenings, first-time examinations, heterogeneous acquisition schedules, or archival constraints. The absence of prior exams degrades the performance of longitudinal risk models and limits their practical applicability. While substantial longitudinal history is available during training, prior exams are commonly absent at test time. In this paper, we address missing history at inference time and propose a longitudinal risk prediction method that uses mammography history as privileged information during training and distills its prognostic value into a student model that only requires the current exam at inference time. The key idea is a privileged multi-teacher distillation scheme with horizon-specific teachers: each teacher is trained on the full longitudinal history to specialize in one prediction horizon, while the student receives only a reconstructed history derived from the current exam. This allows the student to inherit horizon-dependent longitudinal risk cues without requiring prior screening exams at deployment. Our new Privileged History Distillation (PHD) method is validated on a large longitudinal mammography dataset with multi-year cancer outcomes, CSAW-CC, comparing full-history and no-history baselines to their distilled counterparts. Using time-dependent AUC across horizons, our privileged history distillation method markedly improves the performance of long-horizon prediction over no-history models and is comparable to that of full-history models, while using only the current exam at inference time.
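A hypothetical sketch of the student objective (the loss form, the MSE-to-teacher term, and the weighting alpha are our assumptions, not the paper's): each horizon's student risk fits the ground-truth outcome while being pulled toward the horizon-specific teacher trained with full privileged history:

```python
import numpy as np

def phd_distill_loss(student_risks, teacher_risks, labels, alpha=0.5):
    """Hypothetical privileged-distillation objective: binary cross-entropy
    of the current-exam-only student against outcome labels, plus a squared
    penalty pulling each per-horizon student risk toward the matching
    full-history teacher's prediction."""
    student = np.asarray(student_risks, float)
    teacher = np.asarray(teacher_risks, float)
    y = np.asarray(labels, float)
    s = np.clip(student, 1e-7, 1 - 1e-7)
    bce = -(y * np.log(s) + (1 - y) * np.log(1 - s)).mean()
    distill = ((student - teacher) ** 2).mean()  # match privileged teachers
    return alpha * bce + (1 - alpha) * distill
```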

[LG-48] Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs

链接: https://arxiv.org/abs/2603.15803
作者: Linrui Ma,Yufei Cui,Kai Han,Yunhe Wang
类目: Machine Learning (cs.LG)
*备注: Ongoing work

点击查看摘要

Abstract:Discrete diffusion models offer global context awareness and flexible parallel generation. However, uniform random noise schedulers in standard DLLM training overlook the highly non-uniform information density inherent in real-world sequences. This wastes optimization resources on low-density structural glues while leaving high-density logical pivot points severely under-optimized. To address this, we propose an Information Density Driven Smart Noise Scheduler. By extracting information-dense hubs and applying Complementary Priority Masking, our method decouples a single training instance into mutually reinforcing reasoning and syntax samples, forcing the model to master both logical deduction and foundational sequence structure. Experiments demonstrate that our approach improves average accuracy by ~4% across four Code and Math reasoning benchmarks, significantly outperforming uniform baselines. Mechanistic analyses further reveal that probabilistic priority masking effectively mitigates contextual collapse during block diffusion training. Overall, this density-aware strategy efficiently unlocks the reasoning potential of diffusion language models at minimal annotation cost, emerging as a promising new masked data training paradigm for Diffusion LLMs. Our processed dataset can be found at this https URL.
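Complementary Priority Masking can be sketched as follows (the mask fraction and naming are our assumptions): rank tokens by information density, mask the densest tokens for the "reasoning" view, and mask the complement for the "syntax" view:

```python
import numpy as np

def complementary_priority_masks(density, mask_frac=0.5):
    """Hypothetical sketch of Complementary Priority Masking: split one
    training instance into two views. The 'reasoning' view masks the
    highest information-density tokens (forcing the model to infer logical
    pivots), while the complementary 'syntax' view masks the rest
    (exercising structural glue)."""
    density = np.asarray(density, float)
    k = int(round(mask_frac * len(density)))
    order = np.argsort(-density)        # highest density first
    reasoning_mask = np.zeros(len(density), bool)
    reasoning_mask[order[:k]] = True
    syntax_mask = ~reasoning_mask
    return reasoning_mask, syntax_mask
```

The two masks partition the sequence, so together the pair of views covers every token exactly once, which is what lets the two samples reinforce each other during training.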

[LG-49] Time-Aware Prior Fitted Networks for Zero-Shot Forecasting with Exogenous Variables

链接: https://arxiv.org/abs/2603.15802
作者: Andres Potapczynski,Ravi Kiran Selvam,Tatiana Konstantinova,Shankar Ramasubramanian,Malcolm Wolff,Kin G. Olivares,Ruijun Ma,Mengfei Cao,Michael W. Mahoney,Andrew Gordon Wilson,Boris N. Oreshkin,Dmitry Efimov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many time series forecasting settings, the target time series is accompanied by exogenous covariates, such as promotions and prices in retail demand; temperature in energy load; calendar and holiday indicators for traffic or sales; and grid load or fuel costs in electricity pricing. Ignoring these exogenous signals can substantially degrade forecasting accuracy, particularly when they drive spikes, discontinuities, or regime and phase changes in the target series. Most current time series foundation models (e.g., Chronos, Sundial, TimesFM, TimeMoE, TimeLLM, and LagLlama) ignore exogenous covariates and make forecasts solely from the numerical time series history, thereby limiting their performance. In this paper, we develop ApolloPFN, a prior-data fitted network (PFN) that is time-aware (unlike prior PFNs) and that natively incorporates exogenous covariates (unlike prior univariate forecasters). Our design introduces two major advances: (i) a synthetic data generation procedure tailored to resolve the failure modes that arise when tabular (non-temporal) PFNs are applied to time series; and (ii) time-aware architectural modifications that embed inductive biases needed to exploit the time series context. We demonstrate that ApolloPFN achieves state-of-the-art results across benchmarks, such as M5 and electric price forecasting, that contain exogenous information.

[LG-50] PulmoVec: A Two-Stage Stacking Meta-Learning Architecture Built on the HeAR Foundation Model for Multi-Task Classification of Pediatric Respiratory Sounds

链接: https://arxiv.org/abs/2603.15688
作者: Izzet Turkalp Akbasli,Oguzhan Serin
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: 14 pages, 2 figures, 4 tables; supplementary material included (4 tables, 3 multi-panel figures)

点击查看摘要

Abstract:Background: Respiratory diseases are a leading cause of childhood morbidity and mortality, yet lung auscultation remains subjective and limited by inter-listener variability, particularly in pediatric populations. Existing AI approaches are further constrained by small datasets and single-task designs. We developed PulmoVec, a multi-task framework built on the Health Acoustic Representations (HeAR) foundation model for classification of pediatric respiratory sounds. Methods: In this retrospective analysis of the SPRSound database, 24,808 event-level annotated segments from 1,652 pediatric patients were analyzed. Three task-specific classifiers were trained for screening, sound-pattern recognition, and disease-group prediction. Their out-of-fold probability outputs were combined with demographic metadata in a LightGBM stacking meta-model, and event-level predictions were aggregated to the patient level using ensemble voting. Results: At the event level, the screening model achieved an ROC-AUC of 0.96 (95% CI, 0.95-0.97), the sound-pattern recognition model a macro ROC-AUC of 0.96 (95% CI, 0.96-0.97), and the disease-group prediction model a macro ROC-AUC of 0.94 (95% CI, 0.93-0.94). At the patient level, disease-group classification yielded an accuracy of 0.74 (95% CI, 0.71-0.77), a weighted F1-score of 0.73, and a macro ROC-AUC of 0.91 (95% CI, 0.90-0.93). Stacking improved performance across all tasks compared with base models alone. Conclusions: PulmoVec links event-level acoustic phenotyping with patient-level clinical classification, supporting the potential of foundation-model-based digital auscultation in pediatric respiratory medicine. Multi-center external validation across devices and real-world conditions remains essential.
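论文的两阶段 stacking 依赖 out-of-fold(OOF)概率输出:每个样本的基模型预测都来自没见过它的折,元模型因此在无泄漏的输入上训练。以下为纯 Python 的极简示意,基学习器与接口均为假设:

```python
def kfold_indices(n, k):
    """生成 k 折交叉验证的 (train, val) 索引划分。"""
    folds = [list(range(i, n, k)) for i in range(k)]
    for j in range(k):
        val = folds[j]
        hold = set(val)
        train = [i for i in range(n) if i not in hold]
        yield train, val

def out_of_fold_probs(fit, predict_proba, X, y, k=3):
    """OOF 概率:每个样本由未见过它的模型打分,供元模型无泄漏地训练。"""
    oof = [None] * len(X)
    for train, val in kfold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in val:
            oof[i] = predict_proba(model, X[i])
    return oof

# 玩具基学习器:预测训练集中正类的频率(仅作演示)
fit = lambda X, y: sum(y) / len(y)
proba = lambda model, x: model
oof = out_of_fold_probs(fit, proba, X=[0] * 6, y=[0, 1, 0, 1, 0, 1], k=3)
```

实际流程中,多个任务的 OOF 概率会与人口学元数据拼接后输入 LightGBM 元模型,再把事件级预测经投票聚合到患者级。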

[LG-51] Flood Risk Follows Valleys Not Grids: Graph Neural Networks for Flash Flood Susceptibility Mapping in Himachal Pradesh with Conformal Uncertainty Quantification

链接: https://arxiv.org/abs/2603.15681
作者: Paras Sharma,Swastika Sharma
类目: Machine Learning (cs.LG)
*备注: 28 pages, 10 figures, 2 tables. Code and data at this https URL

点击查看摘要

Abstract:Flash floods are the most destructive natural hazard in Himachal Pradesh (HP), India, causing over 400 fatalities and 1.2 billion in losses in the 2023 monsoon season alone. Existing risk maps treat every pixel independently, ignoring the basic fact that flooding upstream raises risk downstream. We address this with a Graph Neural Network (GraphSAGE) trained on a watershed connectivity graph (460 sub-watersheds, 1,700 directed edges), built from a six-year Sentinel-1 SAR flood inventory (2018-2023, 3,000 events) and 12 environmental variables at 30 m resolution. Four pixel-based ML models (RF, XGBoost, LightGBM, stacking ensemble) serve as baselines. All models are evaluated with leave-one-basin-out spatial cross-validation to avoid the 5-15% AUC inflation of random splits. Conformal prediction produces the first HP susceptibility maps with statistically guaranteed 90% coverage intervals. The GNN achieved AUC = 0.978 +/- 0.017, outperforming the best baseline (AUC = 0.881) and the published HP benchmark (AUC = 0.88). The +0.097 gain confirms that river connectivity carries predictive signal that pixel-based models miss. High-susceptibility zones overlap 1,457 km of highways (including 217 km of the Manali-Leh corridor), 2,759 bridges, and 4 major hydroelectric installations. Conformal intervals achieved 82.9% empirical coverage on the held-out 2023 test set; lower coverage in high-risk zones (45-59%) points to SAR label noise as a target for future work.
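摘要中“统计上有保证的 90% 覆盖区间”来自 split conformal prediction:在校准集上取非一致性分数的有限样本修正分位数,即可得到对可交换测试点具有边际覆盖保证的阈值。以下为示意实现,分数的具体定义为假设:

```python
import math

def conformal_threshold(scores, alpha=0.1):
    """Split conformal:取校准分数中第 ceil((n+1)(1-alpha)) 小的值,
    对可交换的测试点给出 1 - alpha 的边际覆盖保证。"""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(rank, n) - 1]

# 玩具校准集:非一致性分数 = |预测概率 - 真实标签|
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80]
q90 = conformal_threshold(cal_scores, alpha=0.1)  # 90% 覆盖阈值
q80 = conformal_threshold(cal_scores, alpha=0.2)  # 80% 覆盖阈值
```

论文中 2023 测试集 82.9% 的经验覆盖低于名义 90%,正说明该保证依赖可交换性,而标签噪声与分布偏移会侵蚀它。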

[LG-52] Quantum Key Distribution Secured Federated Learning for Channel Estimation and Radar Spectrum Sensing in 6G Networks

链接: https://arxiv.org/abs/2603.15649
作者: Ferhat Ozgur Catak,Murat Kuzlu,Jungwon Seo,Umit Cali
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:This paper presents a federated learning framework secured by quantum key distribution (QKD) for wireless channel estimation and radar spectrum sensing in next-generation networks (NextG or Beyond 6G). A BB84-style protocol abstraction and pairwise additive masking are utilized: clients train local models (a CNN for channel estimation, a U-Net for radar segmentation) and upload only masked model updates. The server aggregates without observing plain parameters; an eavesdropper without QKD keys cannot recover individual updates. Experiments show that secure FL achieves an NMSE of 0.216 for channel estimation and 92.1% accuracy with 0.72 mIoU for radar sensing. When an eavesdropper is present, the QBER rises to ~25% and all rounds abort as intended; reconstruction error remains below 10^{-5}, confirming correct aggregation.
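摘要中的 pairwise additive masking 可以这样理解:每对客户端用共享密钥(真实系统中由 QKD 分发)生成同一随机掩码,一方加、另一方减;服务器求和时掩码两两抵消,却看不到任何单个客户端的明文更新。以下为极简示意,接口与种子分发方式均为假设:

```python
import random

def pairwise_masks(client_ids, dim, shared_seeds):
    """每对客户端 (i, j) 用共享种子生成同一掩码:i 加、j 减,
    因此所有掩码在服务器端求和时恰好抵消。"""
    masks = {c: [0.0] * dim for c in client_ids}
    for (i, j), seed in shared_seeds.items():
        rng = random.Random(seed)
        m = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        masks[i] = [a + b for a, b in zip(masks[i], m)]
        masks[j] = [a - b for a, b in zip(masks[j], m)]
    return masks

clients = ["A", "B", "C"]
updates = {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]}
seeds = {("A", "B"): 11, ("A", "C"): 22, ("B", "C"): 33}  # 假设来自 QKD 会话密钥
masks = pairwise_masks(clients, 2, seeds)
masked = {c: [u + m for u, m in zip(updates[c], masks[c])] for c in clients}
# 服务器只看到 masked;求和后掩码抵消,恢复真实聚合值
total = [sum(masked[c][d] for c in clients) for d in range(2)]
```

没有对应种子的窃听者只能看到被掩码的更新;而一旦 QKD 检测到窃听(QBER 升高),对应轮次即中止。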

[LG-53] Conditional Distributional Treatment Effects: Doubly Robust Estimation and Testing

链接: https://arxiv.org/abs/2603.16829
作者: Saksham Jain,Alex Luedtke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Beyond conditional average treatment effects, treatments may impact the entire outcome distribution in covariate-dependent ways, for example, by altering the variance or tail risks for specific subpopulations. We propose a novel estimand to capture such conditional distributional treatment effects, and develop a doubly robust estimator that is minimax optimal in the local asymptotic sense. Using this, we develop a test for the global homogeneity of conditional potential outcome distributions that accommodates discrepancies beyond the maximum mean discrepancy (MMD), has provably valid type 1 error, and is consistent against fixed alternatives – the first test, to our knowledge, with such guarantees in this setting. Furthermore, we derive exact closed-form expressions for two natural discrepancies (including the MMD), and provide a computationally efficient, permutation-free algorithm for our test.
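论文检验所对照的基准差异之一是最大均值差异(MMD)。其无偏估计量只需核矩阵三块的平均(去掉对角项),以下为一维 RBF 核的纯 Python 示意,数据与带宽均为假设:

```python
import math

def rbf(x, y, gamma=1.0):
    """一维 RBF 核 exp(-gamma * (x - y)^2)。"""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2_unbiased(xs, ys, gamma=1.0):
    """xs、ys 两组样本之间 MMD^2 的无偏估计(去掉对角项,可为负)。"""
    m, n = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for i, a in enumerate(xs)
              for j, b in enumerate(xs) if i != j) / (m * (m - 1))
    kyy = sum(rbf(a, b, gamma) for i, a in enumerate(ys)
              for j, b in enumerate(ys) if i != j) / (n * (n - 1))
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy

same = mmd2_unbiased([0.0, 0.1, 0.2, 0.3], [0.05, 0.15, 0.25, 0.35])
far = mmd2_unbiased([0.0, 0.1, 0.2, 0.3], [3.0, 3.1, 3.2, 3.3])
```

论文的检验正是要超越这类全局 MMD 差异:同均值但方差或尾部不同的条件分布也能被检出。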

[LG-54] Data-driven forced response analysis with min-max representations of nonlinear restoring forces

链接: https://arxiv.org/abs/2603.16746
作者: Akira Saito,Hiromu Fujita
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:This paper discusses a novel data-driven nonlinearity identification method for mechanical systems with nonlinear restoring forces such as polynomial, piecewise-linear, and general displacement-dependent nonlinearities. The proposed method is built upon the universal approximation theorem that states that a nonlinear function can be approximated by a linear combination of activation functions in artificial neural network framework. The proposed approach utilizes piecewise linear springs with initial gaps to act as the activation functions of the neurons of artificial neural networks. A library of piecewise linear springs with initial gaps are constructed, and the contributions of the springs on the nonlinear restoring force are determined by solving the linear regression problems. The piecewise linear springs are realized by combinations of min and max functions with biases. The proposed method is applied to a Duffing oscillator with cubic stiffness, and a piecewise linear oscillator with a gap and their nonlinearities are successfully determined from their free responses. The obtained models are then used for conducting forced response analysis and the results match well with those of the original system. The method is then applied to experimentally-obtained free response data of a cantilevered plate that is subjected to magnetic restoring force, and successfully finds the piecewise linear representation of the magnetic force. It is also shown that the obtained model is capable of accurately capturing the steady-state response of the system subject to harmonic base excitation.
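论文把带初始间隙的分段线性弹簧当作“激活函数”,用线性回归确定各弹簧对非线性恢复力的贡献。max(0, x - g) 型基函数加最小二乘即可复现这一思路,以下为示意,间隙位置与玩具恢复力均为假设:

```python
import numpy as np

def gap_spring_features(x, gaps):
    """带初始间隙的分段线性弹簧基:线性项加上每个间隙 g 的 max(0, x - g)。"""
    x = np.asarray(x, dtype=float)
    cols = [x] + [np.maximum(0.0, x - g) for g in gaps]
    return np.stack(cols, axis=1)

# 玩具恢复力:线性弹簧在 x = 1 处接触间隙、刚度增加 5
x = np.linspace(-2.0, 2.0, 81)
f_true = 2.0 * x + 5.0 * np.maximum(0.0, x - 1.0)

A = gap_spring_features(x, gaps=[-1.0, 0.0, 1.0])
coef, *_ = np.linalg.lstsq(A, f_true, rcond=None)  # 线性回归确定各弹簧贡献
f_fit = A @ coef
```

回归应恢复出线性刚度 2 与间隙 1 处的附加刚度 5;辨识出的模型随后即可用于受迫响应分析。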

[LG-55] High-dimensional estimation with missing data: Statistical and computational limits

链接: https://arxiv.org/abs/2603.16712
作者: Kabir Aladin Verchand,Ankit Pensia,Saminul Haque,Rohith Kuditipudi
类目: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an \epsilon fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in \ell_2 norm, we show that in order to obtain error at most \rho, for any constant contamination \epsilon \in (0, 1), (roughly) n \gtrsim d e^{1/\rho^2} samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) n \gtrsim d^{1/\rho^2} and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

[LG-56] Deep Adaptive Model-Based Design of Experiments

链接: https://arxiv.org/abs/2603.16146
作者: Arno Strouwen,Sebastian Micluţa-Câmpeanu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Model-based design of experiments (MBDOE) is essential for efficient parameter estimation in nonlinear dynamical systems. However, conventional adaptive MBDOE requires costly posterior inference and design optimization between each experimental step, precluding real-time applications. We address this by combining Deep Adaptive Design (DAD), which amortizes sequential design into a neural network policy trained offline, with differentiable mechanistic models. For dynamical systems with known governing equations but uncertain parameters, we extend sequential contrastive training objectives to handle nuisance parameters and propose a transformer-based policy architecture that respects the temporal structure of dynamical systems. We demonstrate the approach on four systems of increasing complexity: a fed-batch bioreactor with Monod kinetics, a Haldane bioreactor with uncertain substrate inhibition, a two-compartment pharmacokinetic model with nuisance clearance parameters, and a DC motor for real-time deployment.

[LG-57] Safe Distributionally Robust Feature Selection under Covariate Shift

链接: https://arxiv.org/abs/2603.16062
作者: Hiroyuki Hanada,Satoshi Akahane,Noriaki Hashimoto,Shion Takeno,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In practical machine learning, the environments encountered during the model development and deployment phases often differ, especially when a model is used by many users in diverse settings. Learning models that maintain reliable performance across plausible deployment environments is known as distributionally robust (DR) learning. In this work, we study the problem of distributionally robust feature selection (DRFS), with a particular focus on sparse sensing applications motivated by industrial needs. In practical multi-sensor systems, a shared subset of sensors is typically selected prior to deployment based on performance evaluations using many available sensors. At deployment, individual users may further adapt or fine-tune models to their specific environments. When deployment environments differ from those anticipated during development, this strategy can result in systems lacking sensors required for optimal performance. To address this issue, we propose safe-DRFS, a novel approach that extends safe screening from conventional sparse modeling settings to a DR setting under covariate shift. Our method identifies a feature subset that encompasses all subsets that may become optimal across a specified range of input distribution shifts, with finite-sample theoretical guarantees of no false feature elimination.

[LG-58] Shuffling the Stochastic Mirror Descent via Dual Lipschitz Continuity and Kernel Conditioning

链接: https://arxiv.org/abs/2603.16042
作者: Junwen Qiu,Leilei Mei,Junyu Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 3 figures

点击查看摘要

Abstract:The global Lipschitz smoothness condition underlies most convergence and complexity analyses via two key consequences: the descent lemma and the gradient Lipschitz continuity. How to study the performance of optimization algorithms in the absence of Lipschitz smoothness remains an active area. The relative smoothness framework from Bauschke-Bolte-Teboulle (2017) and Lu-Freund-Nesterov (2018) provides an extended descent lemma, ensuring convergence of Bregman-based proximal gradient methods and their vanilla stochastic counterparts. However, many widely used techniques (e.g., momentum schemes, random reshuffling, and variance reduction) additionally require the Lipschitz-type bound for gradient deviations, leaving their analysis under relative smoothness an open area. To resolve this issue, we introduce the dual kernel conditioning (DKC) regularity condition to regulate the local relative curvature of the kernel functions. Combined with the relative smoothness, DKC provides a dual Lipschitz continuity for gradients: even though the gradient mapping is not Lipschitz in the primal space, it preserves Lipschitz continuity in the dual space induced by a mirror map. We verify that DKC is widely satisfied by popular kernels and is closed under affine composition and conic combination. With these novel tools, we establish the first complexity bounds as well as the iterate convergence of random reshuffling mirror descent for constrained nonconvex relative smooth problems.
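镜像下降中“对偶空间”的直观,可以用单纯形上的负熵镜像映射来说明:更新在对偶(对数)空间做梯度步,回到原空间就是乘性权重更新,迭代点天然留在单纯形上。以下为示意,目标函数与步长均为假设:

```python
import math

def entropy_mirror_step(x, grad, lr):
    """负熵镜像映射下的一步镜像下降:对偶空间减去 lr * grad,
    原空间即乘性权重更新并归一化,迭代点始终留在概率单纯形上。"""
    w = [xi * math.exp(-lr * g) for xi, g in zip(x, grad)]
    z = sum(w)
    return [wi / z for wi in w]

# 在单纯形上最小化线性目标 <c, x>,c = [3, 1, 2]:最优解是第 2 个顶点
x = [1.0 / 3.0] * 3
for _ in range(50):
    x = entropy_mirror_step(x, grad=[3.0, 1.0, 2.0], lr=0.5)
```

论文的 DKC 条件正是要在这种原空间梯度不 Lipschitz 的情形下,保证梯度在镜像映射诱导的对偶空间中仍保持 Lipschitz 连续。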

[LG-59] Power Analysis for Prediction-Powered Inference

链接: https://arxiv.org/abs/2603.16041
作者: Yiqun T. Chen,Moran Guo,Shengy Li
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern studies increasingly leverage outcomes predicted by machine learning and artificial intelligence (AI/ML) models, and recent work, such as prediction-powered inference (PPI), has developed valid downstream statistical inference procedures. However, classical power and sample size formulas do not readily account for these predictions. In this work, we tackle a simple yet practical question: given a new AI/ML model with high predictive power, how many labeled samples are needed to achieve a desired level of statistical power? We derive closed-form power formulas by characterizing the asymptotic variance of the PPI estimator and applying Wald test inversion to obtain the required labeled sample size. Our results cover widely used settings including two-sample comparisons and risk measures in 2x2 tables. We find that a useful rule of thumb is that the reduction in required labeled samples relative to classical designs scales roughly with the R^2 between the predictions and the ground truth. Our analytical formulas are validated using Monte Carlo simulations, and we illustrate the framework in three contemporary biomedical applications spanning single-cell transcriptomics, clinical blood pressure measurement, and dermoscopy imaging. We provide our software as an R package and online calculators at this https URL.
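摘要给出的经验法则是:相对经典设计,所需标注样本的缩减量大致随预测与真值之间的 R^2 增长。据此可以写一个粗略的样本量计算示意(以均值偏移的双侧 z 检验为例;论文的精确公式基于 PPI 估计量的渐近方差与 Wald 检验反演,此处仅为 1 - R^2 缩放的近似,并非论文原公式):

```python
import math
from statistics import NormalDist

def classical_n(delta, sigma, alpha=0.05, power=0.8):
    """经典双侧 z 检验检出均值偏移 delta 所需的标注样本量。"""
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2
    return math.ceil(n)

def ppi_n(delta, sigma, r2, alpha=0.05, power=0.8):
    """经验法则:标注样本需求按 (1 - R^2) 缩放(近似,非论文精确公式)。"""
    return math.ceil(classical_n(delta, sigma, alpha, power) * (1.0 - r2))

n_classical = classical_n(delta=0.2, sigma=1.0)  # 无 AI/ML 预测
n_ppi = ppi_n(delta=0.2, sigma=1.0, r2=0.8)      # 预测 R^2 = 0.8 时
```

R^2 = 0.8 时标注需求约降至五分之一,与摘要的经验法则一致。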

[LG-60] Learning to Recall with Transformers Beyond Orthogonal Embeddings ICLR2026

链接: https://arxiv.org/abs/2603.15923
作者: Nuri Mert Vural,Alberto Bietti,Mahdi Soltanolkotabi,Denny Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: ICLR 2026

点击查看摘要

Abstract:Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training and retrieve it at inference. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. We address this gap by analyzing a single-layer transformer with random embeddings trained with (empirical) gradient descent on a simple token-retrieval task, where the model must identify an informative token within a length-L sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the “early phase” of gradient descent and yields explicit formulas for the model’s storage capacity – revealing a multiplicative dependence between sample size N, embedding dimension d, and sequence length L. We validate these scalings numerically and further complement them with a lower bound for the underlying statistical problem, demonstrating that this multiplicative scaling is intrinsic under non-orthogonal embeddings.

[LG-61] Learnability with Partial Labels and Adaptive Nearest Neighbors

链接: https://arxiv.org/abs/2603.15781
作者: Nicolas A. Errandonea,Santiago Mazuelas,Jose A. Lozano,Sanjoy Dasgupta
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prior work on partial-label learning (PLL) has shown that learning is possible even when each instance is associated with a bag of labels, rather than a single accurate but costly label. However, the necessary conditions for learning with partial labels remain unclear, and existing PLL methods are effective only in specific scenarios. In this work, we mathematically characterize the settings in which PLL is feasible. In addition, we present PLA-kNN, an adaptive nearest-neighbors algorithm for PLL that is effective in general scenarios and enjoys strong performance guarantees. Experimental results corroborate that PLA-kNN can outperform state-of-the-art methods in general PLL scenarios.
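部分标签学习中,每个训练样本带的是一个候选标签袋而非单一标签。最朴素的近邻做法是让每个近邻把一票均分给袋中各候选标签。以下为一维玩具示意;注意这只是普通 kNN 的 PLL 投票版,并非论文提出的自适应算法:

```python
from collections import Counter

def pll_knn_predict(train, x, k):
    """部分标签 kNN:k 个最近邻各出一票,均分给其候选标签袋。"""
    neighbors = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    votes = Counter()
    for _, bag in neighbors:
        for label in bag:
            votes[label] += 1.0 / len(bag)
    return votes.most_common(1)[0][0]

# (位置, 候选标签袋):真标签在袋中,但袋可能含干扰标签
train = [(0.0, {"a", "b"}), (0.1, {"a"}), (2.0, {"b"})]
pred = pll_knn_predict(train, x=0.05, k=2)
```

论文的改进在于自适应地选取邻域规模并给出性能保证,而非固定 k 的简单投票。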

[LG-62] Life cycle assessment for all organic chemicals

链接: https://arxiv.org/abs/2603.15686
作者: Shaohan Chen,Tim Langhorst,Julian Nöhl,Christopher Oberschelp,Martin Pillich,Johannes Schilling,André Bardow
类目: Chemical Physics (physics.chem-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 24 pages, 9 figures

点击查看摘要

Abstract:Chemicals are embedded in nearly every aspect of modern society, yet their production poses substantial sustainability concerns. Achieving a sustainable chemical industry requires detailed Life Cycle Assessment (LCA); however, current assessments face many unknowns due to limited, partly inconsistent, and untransparent data coverage since existing Life Cycle Inventory (LCI) databases account for only a tiny fraction of traded chemicals. Here, we introduce the Chemical RetrosYnthesiS for Transparent Assessment of Life-cycles (CRYSTAL) framework, which automatically generates consistent and transparent LCI data for organic chemicals based on their molecular structure using retrosynthesis and machine-learned gate-to-gate inventories. Using the predictive power of CRYSTAL, we create a consistent database for more than 70000 organic chemicals, comprising over 110000 transparent LCI datasets that quantify both feedstock and energy demands, together with associated auxiliary materials, biosphere flows, and waste flows. From this comprehensive database, we identify 50 key environmental hotspots driving high impacts of organic chemical production across multiple environmental categories and pivotal hub chemicals that are most critical for downstream chemical production. In providing this comprehensive data foundation, the CRYSTAL framework offers systematic guidance for targeted engineering and policy interventions. Its transparent, modular nature is designed to shift chemical LCA from a reliance on “unknown unknowns” to a collaboratively improvable mapping of “known unknowns”.

[LG-63] Beyond Distance: Quantifying Point Cloud Dynamics with Persistent Homology and Dynamic Optimal Transport

链接: https://arxiv.org/abs/2603.15683
作者: Yixin Wang,Ting Gao,Jinqiao Duan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages, 15 figures

点击查看摘要

Abstract:We introduce a framework for analyzing topological tipping in time-evolutionary point clouds by extending the recently proposed Topological Optimal Transport (TpOT) distance. While TpOT unifies geometric, homological, and higher-order relations into one metric, its global scalar distance can obscure transient, localized structural reorganizations during dynamic phase transitions. To overcome this limitation, we present a hierarchical dynamic evaluation framework driven by a novel topological and hypergraph reconstruction strategy. Instead of directly interpolating abstract network parameters, our method interpolates the underlying spatial geometry and rigorously recomputes the valid topological structures, ensuring physical fidelity. Along this geodesic, we introduce a set of multi-scale indicators: macroscopic metrics (Topological Distortion and Persistence Entropy) to capture global shifts, and a novel mesoscopic dual-perspective Hypergraph Entropy (node-perspective and edge-perspective) to detect highly sensitive, asynchronous local rewirings. We further propagate the cycle-level entropy change onto individual vertices to form a point-level topological field. Extensive evaluations on physical dynamical systems (Rayleigh-Van der Pol limit cycles, Double-Well cluster fusion), high-dimensional biological aggregation (D’Orsogna model), and longitudinal stroke fMRI data demonstrate the utility of combining transport-based alignment with multi-scale entropy diagnostics for dynamic topological analysis.

[LG-64] Machine Learning Based Identification of Solvents from Post-Desiccation Patterns

链接: https://arxiv.org/abs/2603.15660
作者: Jesús Israel Morán-Cortés,Felipe Pacheco-Vázquez
类目: oft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 11 pages, 8 figures, article

点击查看摘要

Abstract:We introduce an optimized protocol of fracture pattern classification using an artificial neural network to identify the solvent involved in the desiccation cracking process of starch-liquid slurries, even after it has been completely evaporated. For this purpose, image analysis techniques were used to characterize patterns obtained from drying suspensions using single solvents (water, ethanol, acetone) and two-component solvents (water-ethanol mixtures at different concentrations). Frequency histograms were generated based on nine morphological features, taking into account their size, shape, geometry and orientational ordering. Subsequently, we used these histograms as input data into artificial neural network variants to determine the set of features that lead to the higher accuracy in solvent identification. We obtained an average accuracy of 96(\pm 1)% considering all solvents in the analysis. The highest accuracy was obtained with sets of features that include the crack area distribution. The proposed protocol can help to determine the combination of features that optimize pattern recognition in other fields of science and engineering.

[LG-65] An Efficient Global Optimization Algorithm with Adaptive Estimates of the Local Lipschitz Constants

链接: https://arxiv.org/abs/2211.04129
作者: Danny D’Agostino
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted in Journal of Global Optimization, Springer

点击查看摘要

Abstract:In this work, we present a new deterministic partition-based global optimization algorithm, HALO (Hybrid Adaptive Lipschitzian Optimization), which uses estimates of the local Lipschitz constants associated with different sub-regions of the objective function’s domain to compute lower bounds and guide the search toward global minimizers. These estimates are obtained by adaptively balancing the global and local information collected by the algorithm, based on absolute slopes. HALO is hyperparameter-free, eliminating the need for manual tuning, and it highlights the most important variables to help interpret the optimization problem. We also introduce a coupling strategy with local optimization algorithms, both gradient-based and derivative-free, to accelerate convergence. We compare HALO with popular global optimization algorithms on hundreds of test functions. The numerical results are very promising and demonstrate that HALO can expand our arsenal of efficient procedures for challenging real-world black-box optimization problems. The Python code of HALO is publicly available on GitHub: this https URL
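HALO 这类基于 Lipschitz 常数的分割式全局优化,其下界的来源可以用最简单的全局 Lipschitz 版本(Piyavskii-Shubert 型)说明:每个已评估点 x_i 都保证 f(x) >= f(x_i) - L|x - x_i|,合法的组合下界取各点中最紧的一个。HALO 的贡献在于用基于绝对斜率的自适应局部 L 估计替代这里的全局 L;以下仅为全局 L 的示意:

```python
def lipschitz_lower_bound(samples, L, x):
    """由已评估点给出 f(x) 的下界:每个 (x_i, f_i) 保证
    f(x) >= f_i - L * |x - x_i|,组合下界取各点中的最大值。"""
    return max(f - L * abs(x - xi) for xi, f in samples)

# 玩具示例:f(x) = x^2 在 [-2, 2] 上,Lipschitz 常数取 L = 4
samples = [(-2.0, 4.0), (0.0, 0.0), (2.0, 4.0)]
lb = lipschitz_lower_bound(samples, L=4.0, x=1.0)  # 下界 0 <= f(1) = 1
```

分割式算法据此对每个子区域计算下界,优先细分下界最有希望低于当前最优值的区域。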

附件下载

点击下载今日全部论文列表