This post contains the latest paper listing retrieved from Arxiv.org on 2026-04-03, updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org and updated automatically around 12:30 each day.
Tip: if a given day is not updated in time, either Arxiv published no new papers that day, or the update script failed; failures are fixed the same day whenever possible.
Table of Contents
Overview (2026-04-03)
A total of 641 papers were updated today, including:
- Natural Language Processing: 89 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 188 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 150 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 143 papers (Machine Learning, cs.LG)
- Multiagent Systems: 9 papers (Multiagent Systems, cs.MA)
- Information Retrieval: 10 papers (Information Retrieval, cs.IR)
- Human-Computer Interaction: 18 papers (Human-Computer Interaction, cs.HC)
Multi-Agent Systems
[MA-0] he Self Driving Portfolio: Agent ic Architecture for Institutional Asset Management
【速读】:该论文旨在解决传统投资组合构建中人类投资者角色受限于分析执行层面的问题,即如何通过自动化与智能化手段提升资产配置的效率与适应性。其解决方案的关键在于设计了一个由约50个专业化智能体(agent)组成的代理式战略资产配置流程:这些智能体分别负责生成资本市场假设、运用20余种不同方法构建投资组合,并相互批判与投票;同时引入研究者智能体(researcher agent)提出尚未被涵盖的新构建方法,以及元智能体(meta-agent)基于历史预测与实际收益对比,持续优化智能体代码和提示词(prompt),从而实现自我进化能力。整个系统受投资政策声明(Investment Policy Statement, IPS)约束,确保自主运行符合既定合规框架,使AI具备类人决策逻辑的同时保持可控性和透明度。
链接: https://arxiv.org/abs/2604.02279
作者: Andrew Ang,Nazym Azimbayev,Andrey Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); General Finance (q-fin.GN); Portfolio Management (q-fin.PM)
备注: 31 pages, 11 exhibits
Abstract:Agentic AI shifts the investor’s role from analytical execution to oversight. We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other’s output. A researcher agent proposes new portfolio construction methods not yet represented, and a meta-agent compares past forecasts against realized returns and rewrites agent code and prompts to improve future performance. The entire pipeline is governed by the Investment Policy Statement: the same document that guides human portfolio managers can now constrain and direct autonomous agents.
[MA-1] Multi-Agent Video Recommenders: Evolution Patterns and Open Challenges WSDM
[Quick Read]: This paper addresses the limitations of traditional single-model recommenders in meeting the dynamic demands of modern video platforms: as user behavior diversity, content complexity, and personalization requirements grow, static optimization metrics fall short of the need for real-time adaptation and precise explanation. The key to the solution is introducing multi-agent architectures, in which a group of coordinated specialized agents covering video understanding, reasoning, memory, and feedback modules enables more flexible, explainable, and adaptive recommendation. The approach combines ideas from multi-agent recommender systems, foundation models, and conversational AI, with particular emphasis on the emerging trend of LLM-powered multi-agent video recommender systems (MAVRS), driving recommenders from static policies toward a dynamic, collaborative paradigm.
Link: https://arxiv.org/abs/2604.02211
Authors: Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das
Affiliations: Google LLC
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted for publication in The Nineteenth ACM International Conference on Web Search and Data Mining (WSDM Companion 2026)
Abstract:Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization and self-improving recommender systems.
[MA-2] PRO-SPECT: Probabilistically Safe Scalable Planning for Energy-Aware Coordinated UAV-UGV Teams in Stochastic Environments
[Quick Read]: This paper addresses energy-aware path planning for coordinated UAV-UGV missions in stochastic environments: the UAV must visit a set of designated waypoints in minimum time while satisfying energy constraints, supported by the UGV acting as a mobile charging station. Unlike conventional approaches that assume deterministic travel times or use fixed robustness margins, travel times are modeled here as random variables, and a probabilistic constraint bounds the probability of mission failure from energy depletion to a user-specified risk level. The key to the solution is PRO-SPECT, a polynomial-time algorithm over a Mixed-Integer Program formulation that generates risk-bounded feasible routes for both offline planning and online re-planning, adapting to disturbances while maintaining strict risk control.
Link: https://arxiv.org/abs/2604.02142
Authors: Roger Fowler, Cahit Ikbal Er, Benjamin Johnsenberg, Yasin Yazicioglu
Affiliations: Northeastern University; The Charles Stark Draper Laboratory, Inc.
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:
Abstract:We consider energy-aware planning for an unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) team operating in a stochastic environment. The UAV must visit a set of air points in minimum time while respecting energy constraints, relying on the UGV as a mobile charging station. Unlike prior work that assumed deterministic travel times or used fixed robustness margins, we model travel times as random variables and bound the probability of failure (energy depletion) across the entire mission to a user-specified risk level. We formulate the problem as a Mixed-Integer Program and propose PRO-SPECT, a polynomial-time algorithm that generates risk-bounded plans. The algorithm supports both offline planning and online re-planning, enabling the team to adapt to disturbances while preserving the risk bound. We provide theoretical results on solution feasibility and time complexity. We also demonstrate the performance of our method via numerical comparisons and simulations.
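The chance constraint above (bounding the mission-wide probability of energy depletion by a user-specified risk level) can be illustrated with a toy Monte Carlo check. This sketch is not PRO-SPECT itself, which solves a Mixed-Integer Program; the Gaussian travel-time model and all parameter values below are illustrative assumptions:

```python
import random

def mission_failure_prob(leg_time_means, leg_time_std, battery_capacity,
                         drain_rate, recharge_after, n_trials=20000, seed=0):
    """Monte Carlo estimate of the probability that the UAV depletes its
    battery during a mission whose leg travel times are stochastic.

    leg_time_means: mean travel time of each mission leg, in order.
    recharge_after: set of leg indices after which the UAV fully
                    recharges (i.e. rendezvous with the UGV).
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        charge = battery_capacity
        for i, mu in enumerate(leg_time_means):
            t = max(0.0, rng.gauss(mu, leg_time_std))  # stochastic leg time
            charge -= drain_rate * t
            if charge < 0:
                failures += 1          # energy depleted mid-mission
                break
            if i in recharge_after:
                charge = battery_capacity
    return failures / n_trials

# A plan is acceptable only if its estimated failure probability stays
# below the user-specified risk level (here 0.05):
p = mission_failure_prob([10, 10, 10, 10], 2.0, 50.0, 1.0, recharge_after={1})
print(p <= 0.05)
```

A planner in this spirit would search over routes and rendezvous points while rejecting any candidate whose estimated failure probability exceeds the risk budget.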
[MA-3] Systematic Analyses of Reinforcement Learning Controllers in Signalized Urban Corridors
[Quick Read]: This paper addresses coordinated control of multi-junction traffic networks, in particular how to improve overall throughput and average travel time (ATT) in urban corridor networks. The core challenge is designing effective reinforcement learning (RL) controller architectures that, across different levels of decision-making (centralized, fully decentralized, and parameter-sharing decentralized), maximize the capacity region while optimally balancing system performance. The key to the solution lies in training and comparing these three classes of RL controllers and finding that the parameter-sharing decentralized policy not only performs well on its original training network but also generalizes to larger networks, and that even without explicit coordination, traffic can spontaneously self-organize into 'green waves', markedly improving network-wide efficiency.
Link: https://arxiv.org/abs/2604.02025
Authors: Xiaofei Song, Kerstin Eder, Jonathan Lawry, R. Eddie Wilson
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:
Abstract:In this work, we extend our systematic capacity region perspective to multi-junction traffic networks, focussing on the special case of an urban corridor network. In particular, we train and evaluate centralized, fully decentralized, and parameter-sharing decentralized RL controllers, and compare their capacity regions and ATTs together with a classical baseline MaxPressure controller. Further, we show how the parameter-sharing controller may be generalised to be deployed on a larger network than it was originally trained on. In this setting, we show some initial findings that suggest that even though the junctions are not formally coordinated, traffic may self-organise into 'green waves'.
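The classical MaxPressure baseline mentioned in the abstract has a simple core rule: at each decision step, activate the phase whose permitted movements have the largest total upstream-minus-downstream queue difference. A minimal sketch, assuming unit saturation rates and invented lane names:

```python
def max_pressure_phase(phases, queues):
    """Select the signal phase with the largest pressure, where the
    pressure of a phase is the sum over its permitted movements of
    (upstream queue - downstream queue). Lanes that exit the network
    have no queue entry and are treated as 0."""
    def pressure(movements):
        return sum(queues.get(up, 0) - queues.get(down, 0)
                   for up, down in movements)
    return max(phases, key=lambda name: pressure(phases[name]))

# Two phases at one junction: each maps to the (upstream, downstream)
# lane pairs it gives green to.
phases = {
    "NS": [("n_in", "s_out"), ("s_in", "n_out")],
    "EW": [("e_in", "w_out"), ("w_in", "e_out")],
}
queues = {"n_in": 8, "s_in": 6, "e_in": 2, "w_in": 1}
print(max_pressure_phase(phases, queues))  # prints "NS": heavy north-south demand
```

The real controller also accounts for saturation flow rates and minimum green times; this sketch keeps only the pressure-maximization step.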
[MA-4] Optimizing Interventions for Agent-Based Infectious Disease Simulations
[Quick Read]: This paper addresses how to automatically optimize non-pharmaceutical interventions (NPIs) in agent-based models so as to control infectious disease transmission with minimal societal disruption when pharmaceutical options are unavailable. Because NPIs can target multiple individual attributes, affect hierarchical group structures (such as schools, workplaces, and families), and be combined arbitrarily, the search space is enormous or even infinite, making traditional methods inefficient. The key to the solution is ADIOS (Agent-based Infectious Disease Intervention Optimization System), built on Grammar-Guided Genetic Programming (GGGP): its core is a domain-specific language (DSL) for NPIs that structures the intervention search space through a context-free grammar and introduces semantic constraints that reduce the generation of invalid patterns, improving optimization efficiency; through an interface, ADIOS couples to agent-based models (such as the German Epidemic Micro-Simulation System, GEMS) to perform simulation-based optimization.
Link: https://arxiv.org/abs/2604.02016
Authors: Anja Wolpers, Johannes Ponge, Adelinde M. Uhrmacher
Affiliations: University of Rostock; University of Münster; Stanford University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Non-pharmaceutical interventions (NPIs) are commonly used tools for controlling infectious disease transmission when pharmaceutical options are unavailable. Yet, identifying effective interventions that minimize societal disruption remains challenging. Agent-based simulation is a popular tool for analyzing the impact of possible interventions in epidemiology. However, automatically optimizing NPIs using agent-based simulations poses a complex problem because, in agent-based epidemiological models, interventions can target individuals based on multiple attributes, affect hierarchical group structures (e.g., schools, workplaces, and families), and be combined arbitrarily, resulting in a very large or even infinite search space. We aim to support decision-makers with our Agent-based Infectious Disease Intervention Optimization System (ADIOS) that optimizes NPIs for infectious disease simulations using Grammar-Guided Genetic Programming (GGGP). The core of ADIOS is a domain-specific language for expressing NPIs in agent-based simulations that structures the intervention search space through a context-free grammar. To make optimization more efficient, the search space can be further reduced by defining constraints that prevent the generation of semantically invalid intervention patterns. Using this constrained language and an interface that enables coupling with agent-based simulations, ADIOS adopts the GGGP approach for simulation-based optimization. Using the German Epidemic Micro-Simulation System (GEMS) as a case study, we demonstrate the potential of our approach to generate optimal interventions for realistic epidemiological models.
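A minimal illustration of the grammar-guided idea: a context-free grammar over intervention expressions, sampled recursively, with a depth bound standing in for the paper's semantic constraints. The grammar below is invented for illustration and is far simpler than the actual ADIOS DSL:

```python
import random

# Illustrative context-free grammar for intervention expressions.
# Nonterminals are angle-bracketed; each maps to a list of productions.
GRAMMAR = {
    "<npi>":      [["close", "<setting>", "for", "<duration>"],
                   ["isolate", "<group>", "for", "<duration>"],
                   ["<npi>", "and", "<npi>"]],
    "<setting>":  [["schools"], ["workplaces"]],
    "<group>":    [["households"], ["age>65"]],
    "<duration>": [["7d"], ["14d"], ["28d"]],
}

def sample(symbol="<npi>", rng=None, depth=0, max_depth=4):
    """Randomly expand a grammar symbol into a terminal string.
    Past max_depth, recursive productions (those reusing the symbol)
    are filtered out, a stand-in for ADIOS's semantic constraints."""
    rng = rng or random.Random(0)
    if symbol not in GRAMMAR:
        return symbol                      # terminal token
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        options = [o for o in options if symbol not in o]
    choice = rng.choice(options)
    return " ".join(sample(s, rng, depth + 1, max_depth) for s in choice)

print(sample())  # e.g. "close schools for 14d and isolate households for 7d"
```

GGGP would then apply crossover and mutation to the derivation trees of such sentences, scoring each candidate by running the epidemic simulation.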
[MA-5] Free Information Disrupts Even Bayesian Crowds
[Quick Read]: The question this paper asks is whether, in information networks such as social media platforms, granting users free and unconstrained information exchange actually helps the accuracy of collective beliefs. The study shows that even in an idealized group of truth-seeking, cooperative agents with perfect information-processing abilities, unconstrained information exchange can still push the group's beliefs away from the truth. The key to the proposed solution is to re-examine mechanisms that constrain information flow: when designing communication networks with substantial societal impact, appropriate limits on information flow should be carefully considered to improve the accuracy of collective decision-making and cognition.
Link: https://arxiv.org/abs/2604.01838
Authors: Jonas Stein, Shannon Cruz, Davide Grossi, Martina Testori
Affiliations: University of Groningen; The Pennsylvania State University; University of Amsterdam; University of Greenwich
Categories: Multiagent Systems (cs.MA); Theoretical Economics (econ.TH); Physics and Society (physics.soc-ph)
Comments:
Abstract:A core tenet underpinning the conception of contemporary information networks, such as social media platforms, is that users should not be constrained in the amount of information they can freely and willingly exchange with one another about a given topic. By means of a computational agent-based model, we show how even in groups of truth-seeking and cooperative agents with perfect information-processing abilities, unconstrained information exchange may lead to detrimental effects on the correctness of the group’s beliefs. If unconstrained information exchange can be detrimental even among such idealized agents, it is prudent to assume it can also be so in practice. We therefore argue that constraints on information flow should be carefully considered in the design of communication networks with substantial societal impact, such as social media platforms.
[MA-6] A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies
[Quick Read]: This paper addresses the false positives, misclassifications, and omissions, and especially the reliability problems caused by model hallucination, that current large language model (LLM) approaches to information extraction (IE) in the healthy food policy domain suffer due to the structural diversity and inconsistency of policy documents. The key to the solution is a role-based LLM framework that assigns dedicated roles to distinct professional functions: a policy analyst (metadata and mechanism classification), a legal strategy specialist (identifying complex legal approaches), and a food systems expert (categorizing food system stages), and embeds explicit domain definitions of legal mechanisms and classification criteria into the role prompts, thereby mimicking expert analysis workflows and improving accuracy and transparency on complex reasoning tasks.
Link: https://arxiv.org/abs/2604.01529
Authors: Congjing Zhang, Ruoxuan Bao, Jingyu Li, Yoav Ackerman, Shuai Huang, Yanfang Su
Affiliations: University of Washington; Shanghai University; Georgia Institute of Technology
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Current Large Language Model (LLM) approaches for information extraction (IE) in the healthy food policy domain are often hindered by various factors, including misinformation, specifically hallucinations, misclassifications, and omissions that result from the structural diversity and inconsistency of policy documents. To address these limitations, this study proposes a role-based LLM framework that automates the IE from unstructured policy data by assigning specialized roles: an LLM policy analyst for metadata and mechanism classification, an LLM legal strategy specialist for identifying complex legal approaches, and an LLM food system expert for categorizing food system stages. This framework mimics expert analysis workflows by incorporating structured domain knowledge, including explicit definitions of legal mechanisms and classification criteria, into role-specific prompts. We evaluate the framework using 608 healthy food policies from the Healthy Food Policy Project (HFPP) database, comparing its performance against zero-shot, few-shot, and chain-of-thought (CoT) baselines using Llama-3.3-70B. Our proposed framework demonstrates superior performance in complex reasoning tasks, offering a reliable and transparent methodology for automating IE from health policies.
[MA-7] Computational Foundations for Strategic Coopetition: Formalizing Sequential Interaction and Reciprocity
[Quick Read]: This paper addresses how strategic coopetition can persist in multi-stakeholder systems without binding contracts, where the core challenge is understanding how cooperative behavior evolves dynamically over time. The key to the solution is a computational foundation that bridges conceptual modeling (the i* framework) with game-theoretic reciprocity analysis, whose innovations include: (1) bounded reciprocity response functions that map cooperation deviations to finite conditional responses; (2) memory-windowed history tracking that models cognitive limits by recalling only the last k periods of interaction; (3) structural reciprocity sensitivity, derived from interdependence matrices, quantifying how structural dependencies amplify behavioral responses; and (4) trust-gated reciprocity, in which trust levels modulate the strength of reciprocity. The framework is validated across 15,625 parameter configurations, with all six key behavioral targets clearly met, and achieves 84.3% agreement in empirical validation on the Apple iOS App Store ecosystem (2008-2024), with statistical significance at p < 0.001 (Cohen's d = 1.57), providing computable, verifiable theoretical support for sustained cooperation in contract-free settings.
Link: https://arxiv.org/abs/2604.01240
Authors: Vik Pant, Eric Yu
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Software Engineering (cs.SE)
Comments: 81 pages, 19 figures. Fourth technical report in research program; should be read with companion arXiv:2510.18802, arXiv:2510.24909, and arXiv:2601.16237. Adapts and extends complex actor material from Pant (2021) doctoral dissertation, University of Toronto
Abstract:Strategic coopetition in multi-stakeholder systems requires understanding how cooperation persists through time without binding contracts. This technical report extends computational foundations for strategic coopetition to sequential interaction dynamics, bridging conceptual modeling (i* framework) with game-theoretic reciprocity analysis. We develop: (1) bounded reciprocity response functions mapping partner deviations to finite conditional responses, (2) memory-windowed history tracking capturing cognitive limitations over k recent periods, (3) structural reciprocity sensitivity derived from interdependence matrices where behavioral responses are amplified by structural dependencies, and (4) trust-gated reciprocity where trust modulates reciprocity responses. The framework applies to both human stakeholder interactions and multi-agent computational systems. Comprehensive validation across 15,625 parameter configurations demonstrates robust reciprocity effects, with all six behavioral targets exceeding thresholds: cooperation emergence (97.5%), defection punishment (100%), forgiveness dynamics (87.9%), asymmetric differentiation (100%), trust-reciprocity interaction (100%), and bounded responses (100%). Empirical validation using the Apple iOS App Store ecosystem (2008-2024) achieves 43/51 applicable points (84.3%), reproducing documented cooperation patterns across five ecosystem phases. Statistical significance confirmed at p < 0.001 with Cohen’s d = 1.57. This report concludes the Foundations Series (TR-1 through TR-4) adopting uniaxial treatment where agents choose cooperation levels along a single continuum. Companion work on interdependence (arXiv:2510.18802), trust (arXiv:2510.24909), and collective action (arXiv:2601.16237) has been prepublished. Extensions Series (TR-5 through TR-8) introduces biaxial treatment where cooperation and competition are independent dimensions.
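The four ingredients listed in the abstract (bounded responses, a memory window of k periods, structural sensitivity, and trust gating) can be sketched in one small function. The functional form below is an illustrative guess, not the paper's exact formulation:

```python
def reciprocity_response(deviations, trust, sensitivity, k=3, bound=1.0):
    """Bounded, trust-gated reciprocity response (illustrative sketch).

    deviations:  history of the partner's cooperation deviations,
                 most recent last (positive = partner under-cooperated).
    trust:       in [0, 1]; higher trust dampens retaliation.
    sensitivity: structural amplification from the interdependence matrix.
    k:           memory window; only the last k periods are recalled.
    bound:       cap on the magnitude of the conditional response.
    """
    window = deviations[-k:]                       # memory-windowed history
    raw = sensitivity * sum(window) / max(len(window), 1)
    gated = (1.0 - trust) * raw                    # trust-gated reciprocity
    return max(-bound, min(bound, gated))          # bounded response
```

With this shape, a fully trusting agent (trust = 1) never retaliates, and even a maximally provoked agent responds by at most `bound`, which is what keeps tit-for-tat spirals finite.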
[MA-8] DarwinNet: An Evolutionary Network Architecture for Agent-Driven Protocol Synthesis
[Quick Read]: This paper addresses the protocol ossification and structural fragility of traditional network architectures, which stem from their reliance on static, human-defined rules that struggle to adapt to the emergent edge cases and probabilistic reasoning of modern autonomous agents. The key to the solution is DarwinNet, a bio-inspired self-evolving network architecture that shifts communication protocols from a design-time static paradigm to a runtime growth paradigm for dynamic adaptability; its core innovation is a tri-layered framework, comprising an immutable physical anchor (L0), a WebAssembly-based fluid cortex (L1), and an LLM-driven Darwin cortex (L2), that translates high-level business intents into executable bytecode via a dual-loop Intent-to-Bytecode (I2B) mechanism, together with a Protocol Solidification Index (PSI) that quantifies the system's evolutionary maturity, so that under zero-trust sandbox security the network achieves anti-fragility and converges toward physical performance limits.
Link: https://arxiv.org/abs/2604.01236
Authors: Jinliang Xu, Bingqi Li
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Traditional network architectures suffer from severe protocol ossification and structural fragility due to their reliance on static, human-defined rules that fail to adapt to the emergent edge cases and probabilistic reasoning of modern autonomous agents. To address these limitations, this paper proposes DarwinNet, a bio-inspired, self-evolving network architecture that transitions communication protocols from a design-time static paradigm to a runtime growth paradigm. DarwinNet utilizes a tri-layered framework comprising an immutable physical anchor (L0), a WebAssembly-based fluid cortex (L1), and an LLM-driven Darwin cortex (L2) to synthesize high-level business intents into executable bytecode through a dual-loop Intent-to-Bytecode (I2B) mechanism. We introduce the Protocol Solidification Index (PSI) to quantify the evolutionary maturity of the system as it collapses from high-latency intelligent reasoning (Slow Thinking) toward near-native execution (Fast Thinking). Validated through a reliability growth framework based on the Crow-AMSAA model, experimental results demonstrate that DarwinNet achieves anti-fragility by treating environmental anomalies as catalysts for autonomous evolution. Our findings confirm that DarwinNet can effectively converge toward physical performance limits while ensuring endogenous security through zero-trust sandboxing, providing a viable path for the next generation of intelligent, self-optimizing networks.
Natural Language Processing
[NLP-0] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
[Quick Read]: This paper addresses a performance bottleneck that arises when language models (LMs) are extended with new vocabulary due to poor initial embeddings: the standard mean-initialization strategy collapses new vocabulary embeddings into a degenerate subspace, erasing inter-token distinctions and hindering subsequent fine-tuning from effectively learning the new tokens' semantics. The key to the solution is the proposed Grounded Token Initialization Hypothesis and the lightweight GTI (Grounded Token Initialization) method designed around it: before fine-tuning, paired linguistic supervision is used to map new tokens to distinct, semantically meaningful locations in the pretrained embedding space, improving representation quality and generalization on new-vocabulary domains. Experiments show that GTI clearly outperforms conventional mean initialization and existing auxiliary-task adaptation methods, and that the resulting embedding structure remains rich after fine-tuning, confirming that initialization quality is a key bottleneck in vocabulary extension.
Link: https://arxiv.org/abs/2604.02324
Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
Affiliations: University of Wisconsin-Madison; LinkedIn Corporation; Northeastern University; University of California, Davis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
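The collapse caused by mean initialization is easy to reproduce numerically: every new token starts at the same point, so all inter-token distances are exactly zero. In the sketch below, picking distinct existing rows is only a stand-in for GTI's paired linguistic supervision:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 64))        # pretend pretrained embedding table

# Standard practice: every new token starts at the mean embedding.
mean_init = np.tile(vocab.mean(axis=0), (50, 1))

# Grounded alternative (sketch): anchor each new token at a distinct
# existing embedding, standing in for a linguistically chosen anchor.
grounded_init = vocab[rng.choice(1000, size=50, replace=False)]

def pairwise_spread(E):
    """Mean pairwise Euclidean distance between rows: 0 => collapsed."""
    diffs = E[:, None, :] - E[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).mean())

print(pairwise_spread(mean_init))      # 0.0: all 50 new tokens coincide
print(pairwise_spread(grounded_init))  # clearly positive: tokens stay distinct
```

Fine-tuning then has to pull the mean-initialized tokens apart from a single point, whereas the grounded tokens already occupy distinct, meaningful regions of the space.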
[NLP-1] Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
[Quick Read]: This paper addresses the excessive token consumption of large language models (LLMs) that use Chain-of-Thought (CoT) reasoning, which significantly inflates inference cost. Existing approaches such as explicit length penalties, difficulty estimation, or staged curricula tend to degrade reasoning quality or require complex training pipelines. The key to the solution is Batched Contextual Reinforcement (BCR), a minimalist single-stage training paradigm: the model solves N problems simultaneously within a shared context window and is rewarded purely by per-instance accuracy, which implicitly imposes a token budget. This mechanism not only yields substantial token-efficiency gains (15.8%-62.6% reduction at standard single-problem inference) while maintaining or improving accuracy across several mathematical benchmarks, but also avoids the adversarial gradients and optimization collapse triggered by explicit length penalties, demonstrating stable, constraint-driven length control.
Link: https://arxiv.org/abs/2604.02322
Authors: Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu
Affiliations: University of Illinois at Urbana-Champaign
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 43 pages, 5 figures, 24 tables
Abstract:Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a “free lunch” phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
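The reward signal described above is structurally simple: one shared-context completion is parsed into N answers, and each is scored for accuracy, with no length term anywhere. The `<answer i>` tag convention below is a hypothetical formatting protocol, since the abstract does not specify one:

```python
import re

def parse_batched_answers(completion, n):
    """Extract N final answers from one shared-context completion.
    Assumes a hypothetical convention where the model emits
    '<answer i> ... </answer i>' once per problem."""
    answers = []
    for i in range(1, n + 1):
        m = re.search(rf"<answer {i}>(.*?)</answer {i}>", completion, re.S)
        answers.append(m.group(1).strip() if m else None)
    return answers

def batched_rewards(completion, golds):
    """Per-instance accuracy rewards for a batched rollout. The token
    budget is implicit in the shared context window; there is no
    explicit length penalty in the reward."""
    preds = parse_batched_answers(completion, len(golds))
    return [float(p == g) for p, g in zip(preds, golds)]
```

Because all N answers compete for the same context window, verbose reasoning on one problem crowds out the others and lowers total reward, which is the implicit budget the paper exploits.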
[NLP-2] No Single Best Model for Diversity: Learning a Router for Sample Diversity
[Quick Read]: This paper addresses the problem of generating diverse valid answers to open-ended prompts, i.e., how to systematically obtain a more comprehensive set of responses from multiple large language models (LLMs) to satisfy a wide range of user needs. The core of the solution is a new evaluation metric, diversity coverage, which measures the quality of each unique answer in the predicted answer set relative to the best possible answer set of the same size; with this metric, the authors find that no single model excels across all prompts, yet for every prompt there is one model that significantly outperforms the others. They therefore build a router that dynamically selects the best-suited model for each input prompt, achieving 26.3% diversity coverage on the NB-Wildchat dataset versus 23.8% for the single best-model baseline, with generalization to an out-of-domain dataset and adaptability to different prompting strategies.
Link: https://arxiv.org/abs/2604.02319
Authors: Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi
Affiliations: New York University; Stanford University; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: under review at COLM 2026
Abstract:When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality scores assigned to each unique answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, per each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays foundation for studying generating comprehensive answers when we have access to a suite of models.
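A plausible formalization of the diversity coverage metric as described above: sum the quality of the unique predicted answers, then normalize by the best achievable total for an answer set of the same size. The paper's exact definition may differ in details:

```python
def diversity_coverage(predicted, quality):
    """Sketch of a diversity-coverage score.

    predicted: list of generated answers (duplicates earn no extra credit).
    quality:   dict mapping each known valid answer to its quality score.
    Returns achieved quality over unique answers, divided by the best
    possible total for an answer set of the same size.
    """
    unique = set(predicted)
    achieved = sum(quality.get(a, 0.0) for a in unique)
    best = sum(sorted(quality.values(), reverse=True)[:len(predicted)])
    return achieved / best if best else 0.0
```

Deduplication is what makes the metric reward diversity: repeating a high-quality answer scores no better than saying it once, while the denominator still charges for the slot.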
[NLP-3] go-mHC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
[Quick Read]: This paper addresses the exact and efficient parameterization of doubly stochastic matrices, which is crucial for learning mixing across residual streams. Among existing methods, exact ones scale factorially with the number of streams d, while Kronecker-factorized ones are efficient but expressivity-limited. The key to the solution is a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, with O(d^3) computational cost and a single hyperparameter s that continuously tunes expressivity, smoothly interpolating between a computationally efficient boundary and the fully expressive Birkhoff polytope. Built on the Manifold-Constrained Hyper-Connections (mHC) framework as go-mHC, the method remains compatible with Kronecker-factorized approaches while markedly improving expressivity, filling the Birkhoff polytope more completely at the same FLOP cost, and its advantages are validated on synthetic tasks and a GPT-style language model.
Link: https://arxiv.org/abs/2604.02309
Authors: Torque Dandachi, Sophia Diggs-Galligan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 29 pages, 30 figures, 9 tables. Includes supplementary material
Abstract:Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams d, while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as O(d^3) and exposes a single hyperparameter s which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections (mHC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-mHC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-mHC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-mHC achieves the minimum theoretical loss while converging up to 10× faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-mHC offer a practical avenue for scaling d as a new dimension of model capacity.
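The orthostochastic construction the paper generalizes rests on a classical fact: the entrywise square of any orthogonal matrix is doubly stochastic, because every row and column of an orthogonal matrix has unit Euclidean norm. (For d >= 3, plain orthostochastic matrices cover only part of the Birkhoff polytope, which is what the generalized parameterization addresses.) A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# QR factorization of a random Gaussian matrix yields an orthogonal Q.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Entrywise square: B[i, j] = Q[i, j] ** 2. Rows and columns of Q have
# unit Euclidean norm, so every row and column of B sums to exactly 1,
# i.e. B is doubly stochastic (and entrywise non-negative).
B = Q ** 2

print(np.allclose(B.sum(axis=0), 1.0))  # True: column sums are 1
print(np.allclose(B.sum(axis=1), 1.0))  # True: row sums are 1
```

This also shows why the parameterization is friendly to gradient-based learning: any unconstrained parameterization of orthogonal matrices immediately yields a valid doubly stochastic mixing matrix by squaring.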
[NLP-4] De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
[Quick Read]: This paper addresses the automatic extraction of structured rules from dense, hierarchically structured regulatory documents, a process that traditionally depends on manual annotation and domain-specific prompting, making it costly and hard to scale. The key to the solution is De Jure, a fully automated, domain-agnostic pipeline with four stages: normalization of source documents into structured Markdown, LLM-driven semantic decomposition into rule units, multi-criteria LLM-as-a-judge evaluation across 19 dimensions, and iterative repair of low-scoring extractions within a bounded regeneration budget. Requiring no human annotation or gold data, the method substitutes explicit, interpretable evaluation criteria for human judgment, achieves high consistency and generalization across regulatory domains including finance, healthcare, and AI governance, and significantly outperforms prior work on downstream compliance question answering, confirming that extraction fidelity translates directly into practical utility.
Link: https://arxiv.org/abs/2604.02276
Authors: Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar, Lovedeep Gondara
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
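The extract, judge, and repair stages can be arranged as a small control loop. The function signatures below are illustrative, not the paper's API; in De Jure the extractor, judge, and repairer would each be LLM calls:

```python
def iterative_repair(extract, judge, repair, document,
                     max_rounds=3, threshold=0.8):
    """Skeleton of a judge-guided extract/evaluate/repair loop in the
    spirit of the pipeline above (illustrative signatures).

    extract(document)    -> list of rule units
    judge(unit)          -> dict of criterion name -> score in [0, 1]
    repair(unit, scores) -> regenerated rule unit
    """
    units = extract(document)
    for _ in range(max_rounds):            # bounded regeneration budget
        scored = [(u, judge(u)) for u in units]
        low = [(i, u, s) for i, (u, s) in enumerate(scored)
               if min(s.values()) < threshold]
        if not low:
            break                          # every unit passes every criterion
        for i, u, s in low:
            units[i] = repair(u, s)        # regenerate only failing units
    return units
```

The bounded `max_rounds` matters in practice: without it, a unit the judge can never be satisfied with would trigger unbounded LLM regeneration cost.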
[NLP-5] VISTA: Visualization of Token Attribution via Efficient Analysis
[Quick Read]: This paper addresses the limitations of existing attention-visualization methods for generative AI models, in particular their strong dependence on specific architectures (such as Transformers), the heavy compute required by backpropagation (nearly double the GPU memory), and their lack of generality. The key to the solution is a lightweight, model-agnostic token-importance visualization method that combines perturbation strategies with a three-matrix analytical framework, namely the Angular Deviation Matrix, the Magnitude Deviation Matrix, and the Dimensional Importance Matrix, to quantify each token's contribution to model predictions along three complementary dimensions (semantic direction, intensity, and vector dimensions), producing mathematically grounded, fine-grained relevance maps without additional computational overhead.
Link: https://arxiv.org/abs/2604.02217
Authors: Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures
Abstract:Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this “black box,” attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at this https URL
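The leave-one-out scheme described above can be sketched against any black-box embedding function: remove each token, re-embed, and score the angular, magnitude, and per-dimension changes. The equal-weight composite below is an assumption; the paper's exact weighting may differ:

```python
import numpy as np

def token_attribution(tokens, embed):
    """Leave-one-out token importance in the spirit of the three-matrix
    scheme above (illustrative sketch).

    embed: callable mapping a list of tokens to one output vector;
           treated as a black box, so the method is model-agnostic
           and needs no backpropagation.
    """
    base = embed(tokens)
    scores = []
    for i in range(len(tokens)):
        pert = embed(tokens[:i] + tokens[i + 1:])     # drop token i
        denom = np.linalg.norm(base) * np.linalg.norm(pert) + 1e-12
        cos = np.dot(base, pert) / denom
        angular = np.arccos(np.clip(cos, -1.0, 1.0))   # direction shift
        magnitude = abs(np.linalg.norm(base) - np.linalg.norm(pert))
        dimensional = np.abs(base - pert).mean()       # per-dimension change
        scores.append(angular + magnitude + dimensional)  # composite score
    return scores
```

Only one extra forward pass per token is needed, which is why the approach avoids the near-doubled GPU memory of gradient-based attribution.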
[NLP-6] CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech
[Quick Read]: This paper targets the performance bottleneck of Arabic speech Named Entity Recognition (Speech NER) in low-resource settings, in particular the challenges posed by Arabic's morphological complexity, the absence of short-vowel marks, and the scarcity of annotated data. The key contribution is CV-18 NER, the first publicly available dataset for NER from Arabic speech, built by manually annotating the Arabic Common Voice 18 corpus with the fine-grained Wojood schema (21 entity types). Benchmarking pipeline systems (ASR + text NER) against end-to-end (E2E) models based on Whisper and AraBEST-RQ shows that E2E approaches substantially outperform the best pipeline configuration (37.0% CoER and 38.0% CVER on the test set). Further analysis highlights the importance of Arabic-specific self-supervised pretraining for ASR performance and the effectiveness of multilingual weak supervision for joint speech-to-entity learning.
Link: https://arxiv.org/abs/2604.02209
Authors: Youssef Saidi, Haroun Elleuch, Fethi Bougares
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at OSACT 2026
Abstract:End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech this https URL.
[NLP-7] Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
[Quick Read]: This paper examines the validity of LLM-based quality evaluation for medical report translation, specifically whether English-to-Japanese translations of chest CT reports are suitable for radiology education. The key of the study is to compare independent, blinded ratings of the same translations by radiologists and by LLM judges (LLM-as-a-judge) to test the reliability of automated evaluation. The results show that the LLM judges strongly favored the machine-generated translations (70%-99% preference), yet their ratings showed near-zero agreement with the two radiologists (QWK = -0.04 to 0.15), and inter-radiologist agreement was also poor (QWK = 0.01 to 0.06), indicating that LLM-only evaluation of translation quality is insufficient for educational use and that expert human review remains essential.
Link: https://arxiv.org/abs/2604.02207
Authors: Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 4 figures
Abstract:Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in 93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
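The agreement statistic used throughout this study, quadratic weighted kappa (QWK), can be computed directly from two raters' ordinal labels. The sketch below is a standard textbook implementation, not the paper's code:

```python
import numpy as np

def quadratic_weighted_kappa(r1, r2, n_categories):
    """QWK between two raters' ordinal labels in {0, ..., n_categories-1}."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    O = np.zeros((n_categories, n_categories))   # observed co-rating counts
    for a, b in zip(r1, r2):
        O[a, b] += 1
    h1 = np.bincount(r1, minlength=n_categories)
    h2 = np.bincount(r2, minlength=n_categories)
    E = np.outer(h1, h2) / len(r1)               # expected counts under independence
    idx = np.arange(n_categories)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2
    return 1.0 - (W * O).sum() / ((W * E).sum() + 1e-12)

# Three-way labels as in the pairwise study (e.g. human-better / tie / LLM-better).
k = quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], 3)
```

Perfect agreement yields kappa = 1, chance-level agreement yields 0, and systematic disagreement goes negative, which is why the reported values near zero (-0.04 to 0.15) indicate negligible agreement.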
[NLP-8] Towards Position-Robust Talent Recommendation via Large Language Models
[Quick Read]: This paper addresses two core problems in existing LLM-based talent recommendation systems: the pointwise paradigm fails to capture relationships among candidates in a list, leading to high token consumption and suboptimal recommendations, and LLMs suffer from position bias and the lost-in-the-middle problem when processing multiple documents. The key of the proposed L3TR framework is an implicit strategy that exploits the LLM's latent output, together with a block attention mechanism and a local positional encoding method that enhance inter-document interaction while mitigating position bias and concurrent token bias. An ID sampling method resolves the mismatch between candidate-set sizes at training and inference time, and training-free debiasing methods detect and correct residual bias, yielding consistent performance gains.
Link: https://arxiv.org/abs/2604.02200
Authors: Silin Du, Hongyan Liu
Affiliation: Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM’s potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issue. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.
[NLP-9] Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
[Quick Read]: This paper tackles the performance degradation of Retrieval-Augmented Language Models (RALMs) when faced with irrelevant or noisy retrieved contexts. Existing approaches improve robustness via coarse-grained parameter updates at the layer or module level, overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). The key of the proposed Neuro-RIT (Neuron-guided Robust Instruction Tuning) framework is to shift adaptation from dense tuning to precision-driven neuron-level alignment: attribution-based neuron mining first disentangles the neurons responsible for relevant versus irrelevant contexts, and a two-stage instruction-tuning strategy then achieves direct noise suppression by functionally deactivating neurons that respond only to irrelevant contexts, while optimizing targeted layers for evidence distillation, substantially improving robustness in noisy settings.
Link: https://arxiv.org/abs/2604.02194
Authors: Jaemin Kim, Jae O Lee, Sumyeong Ahn, Seo Yeon Park
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.
[NLP-10] The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
[Quick Read]: This paper studies the interpretability of Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs), asking whether their sparsity makes them easier to interpret than dense feed-forward networks (FFNs). The authors find that expert neurons in MoEs are consistently less polysemantic, with the gap widening as routing becomes sparser, suggesting that sparsity pushes both individual neurons and entire experts toward monosemanticity. The key of the solution is to move the unit of analysis from the neuron to the expert level and use this more effective granularity to automatically interpret hundreds of experts. The analysis shows that experts are neither broad domain specialists nor simple token-level processors but fine-grained task experts specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX), demonstrating that MoEs are inherently interpretable at the expert level and offering a clearer path toward large-scale model interpretability.
Link: https://arxiv.org/abs/2604.02178
Authors: Jeremy Herbst, Jae Hee Lee, Stefan Wermter
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using k -sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: this https URL
[NLP-11] Adams Law: Textual Frequency Law on Large Language Models
[Quick Read]: This paper addresses the underexplored role of textual frequency in LLM training and inference, i.e., the lack of systematic use of high-frequency textual data to improve model performance. Although textual frequency is known to correlate with human reading speed and cognition, its relevance to LLMs has rarely been studied. The key of the solution is a three-part framework: Textual Frequency Law (TFL), Textual Frequency Distillation (TFD), and Curriculum Textual Frequency Training (CTFT). TFL holds that high-frequency text should be preferred for both prompting and fine-tuning; TFD calibrates initial frequency estimates using LLM-generated corpus extensions; and CTFT fine-tunes the model in increasing order of sentence-level frequency, systematically strengthening the model's sensitivity to, and generalization over, frequent expressions.
Link: https://arxiv.org/abs/2604.02176
Authors: Hongyuan Adam Lu, Z.L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam
Affiliation: FaceMind Corporation; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.
[NLP-12] Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions
[Quick Read]: This paper addresses cross-document coreference resolution of software mentions, i.e., identifying mentions across documents that refer to the same software entity. The key of the solution is two fine-tuning-free methods: Fuzzy Matching (FM) and Context Aware Representations (CAR). CAR, which combines mention-level and document-level embeddings, maintains high performance while being more robust and scalable: it degrades less under noisy input and scales approximately linearly on large corpora, compared with FM's superlinear growth. These findings suggest that system choice should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus.
Link: https://arxiv.org/abs/2604.02171
Authors: Atilla Kaan Alkan, Felix Grezes, Jennifer Lynn Bartlett, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages
Abstract:We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
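An FM-style lexical baseline can be sketched as greedy single-link clustering over string similarity. The normalization and threshold below are illustrative assumptions, not the shared-task system's actual settings:

```python
from difflib import SequenceMatcher

def fuzzy_cluster(mentions, threshold=0.85):
    """Greedy single-link clustering of software mentions by string similarity.

    A minimal sketch of a lexical FM-style approach; the real system's exact
    normalization and threshold are not specified here.
    """
    clusters = []  # list of lists of mention strings
    for m in mentions:
        key = m.lower().strip()
        placed = False
        for c in clusters:
            if any(SequenceMatcher(None, key, e.lower()).ratio() >= threshold for e in c):
                c.append(m)
                placed = True
                break
        if not placed:
            clusters.append([m])
    return clusters

clusters = fuzzy_cluster(["NumPy", "numpy", "Numpy 1.21", "TensorFlow", "tensorflow"])
```

The quadratic pairwise comparisons in the inner loop illustrate why a purely lexical method scales superlinearly with corpus size, whereas an embedding-based method like CAR can embed each mention once and scale roughly linearly.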
[NLP-13] Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
[Quick Read]: This paper investigates the non-monotonic relationship between chain-of-thought (CoT) reasoning length and task accuracy for language agents in structured tool-calling settings, where over-thinking can actively hurt performance. The central finding is that brief reasoning (32 tokens) markedly improves accuracy (from 44.0% to 64.0%), while extended reasoning (256 tokens) degrades it sharply (to 25.0%), with error decomposition showing that long reasoning induces wrong function selection and hallucinated functions. The key of the solution is Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase and forces commitment to a valid function name at the start, achieving accuracy on par with free-form brief CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
Link: https://arxiv.org/abs/2604.02155
Authors: Xuan Qi
Affiliation: IIIS, Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments: 21 pages
Abstract:How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0–512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8–16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as “Function: [name] / Key args: […],” forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
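The FR-CoT idea of forcing early commitment to a valid function name can be sketched as a prompt template plus a validating parser. The template wording and parser below are illustrative; the abstract only specifies the "Function: [name] / Key args: [...]" format:

```python
import re

# Illustrative FR-CoT prompt fragment (wording assumed, format from the paper).
FR_COT_TEMPLATE = (
    "Before answering, reason in exactly this form:\n"
    "Function: [name] / Key args: [...]\n"
    "Choose [name] from: {candidates}\n"
)

def parse_fr_cot(reasoning, candidates):
    """Extract the committed function name; reject anything not in the candidate set."""
    m = re.search(r"Function:\s*([A-Za-z_][\w.]*)", reasoning)
    if m is None:
        return None
    name = m.group(1)
    # Structural guarantee: a hallucinated name can never reach the tool call.
    return name if name in candidates else None

cands = {"get_weather", "get_stock_price"}
ok = parse_fr_cot("Function: get_weather / Key args: [city]", cands)
bad = parse_fr_cot("Function: fetch_weather / Key args: [city]", cands)
```

Because the parser only accepts names from the candidate set, hallucinated functions are rejected by construction rather than by tuning the reasoning budget.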
[NLP-14] MTI: A Behavior-Based Temperament Profiling System for AI Agents
[Quick Read]: This paper addresses the lack of a standardized instrument for measuring the dispositional behavioral differences between AI models of equivalent capability. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from LLMs' actual behavior) or treat behavioral variation as a defect rather than a trait. The key of the solution is the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament along four independent axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). MTI uses a two-stage design that separates capability from disposition and, grounded in the Four Shell Model from Model Medicine, measures what agents do rather than what they say about themselves, enabling objective, comparable temperament profiling.
Link: https://arxiv.org/abs/2604.02145
Authors: Jihoon Jeong
Affiliation: Daegu Gyeongbuk Institute of Science and Technology (DGIST); ModuLabs; Meta; Mistral AI; LG AI Research; Alibaba; Google; Microsoft; HuggingFace; DeepSeek
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 29 pages, 6 figures, 12 tables. Paper #3 in the Model Medicine Series (Paper #1: arXiv:2603.04722)
Abstract:AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed – Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.
[NLP-15] GaelEval: Benchmarking LLM Performance for Scottish Gaelic LREC2026
[Quick Read]: This paper addresses the uneven and under-measured performance of multilingual large language models (LLMs) on morphosyntactically rich minority languages without official support, such as Scottish Gaelic, where existing translation benchmarks fail to capture structural competence and thus misstate real-world performance. The key of the solution is GaelEval, the first multi-dimensional benchmark for Gaelic, comprising three components: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge QA task. Evaluating 19 LLMs on this benchmark shows that frontier models now surpass the fluent-speaker baseline on the grammar task, confirms a small but stable gain from in-language (Gaelic) prompting, and reveals a consistent advantage of proprietary over open-weight models.
Link: https://arxiv.org/abs/2604.02135
Authors: Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, to be published in Proceedings of LLMs4SSH (workshop co-located with LREC 2026; Mallorca, Spain; May 2026)
Abstract:Multilingual large language models (LLMs) often exhibit emergent ‘shadow’ capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge QA task. Evaluating 19 LLMs against a fluent-speaker human baseline ( n=30 ), we find that Gemini 3 Pro Preview achieves 83.3% accuracy on the linguistic task, surpassing the human baseline ( 78.1% ). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+ 2.4% ). On the cultural task, leading models exceed 90% accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.
[NLP-16] LLM-as-a-Judge for Time Series Explanations
[Quick Read]: This paper addresses how to objectively and reliably evaluate the factual correctness of LLM-generated time series explanations when no reference text or task-specific rules are available. Existing methods either rely on human-authored reference explanations (similarity-based metrics) or operate purely on numerical signals (traditional time series analysis), so they cannot directly verify whether a natural language explanation is faithful to the underlying data. The key of the solution is to use LLMs as both generators and evaluators in a reference-free setting, assigning a ternary correctness label along three dimensions: pattern identification, numeric accuracy, and answer faithfulness, which enables principled scoring and ranking. On a synthetic benchmark of 350 cases spanning seven query types, experiments show that although generation quality varies sharply (accuracy of only 0.00-0.12 on Seasonal Drop and Volatility Shift), LLMs remain stable on independent scoring and ranking, supporting their viability as evaluators of data-grounded reasoning.
Link: https://arxiv.org/abs/2604.02118
Authors: Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar
Affiliation: BITS Pilani; Birla AI Labs; KIIT Birla AI Labs
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under Review
Abstract:Evaluating factual correctness of LLM generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free form textual reasoning. Thus, no general purpose method exists to directly verify whether an explanation is faithful to underlying time series data without predefined references or task specific rules. We study large language models as both generators and evaluators of time series explanations in a reference free setting, where given a time series, question, and candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi anomaly detection. Results show a clear asymmetry: generation is highly pattern dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate feasibility of data grounded LLM based evaluation for time series explanations and highlight their potential as reliable evaluators of data grounded reasoning in the time series domain.
[NLP-17] Reliable Control-Point Selection for Steering Reasoning in Large Language Models
[Quick Read]: This paper addresses how to steer reasoning behaviors in large language models with training-free steering vectors. Existing methods detect behavioral boundaries via keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal; in practice most do not: across 541 keyword-detected boundaries, 93.3% fail to reproduce the target behavior under re-generation, i.e., they are behaviorally unstable noise. The key of the solution is a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, together with stability filtering, which retains only boundaries where the behavior reproduces consistently, combined with a content-subspace projection that removes residual question-specific noise, substantially improving the effectiveness and transferability of the resulting steering vectors.
Link: https://arxiv.org/abs/2604.02113
Authors: Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
Affiliation: University of Notre Dame
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model’s hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors – such as self-reflection – emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at this https URL.
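The stability-filtering step can be sketched as re-generating from each detected boundary and keeping only boundaries where the behavior reproduces often enough. The `resample` and `detect` stand-ins, and the k and tau values, are illustrative assumptions; the paper's actual detector works on chain-of-thought traces from the model:

```python
def stability_filter(boundaries, resample, detect, k=8, tau=0.75):
    """Keep only boundaries whose target behavior reproduces under re-generation.

    `resample(prefix)` draws a fresh continuation from the same prefix and
    `detect(text)` checks for the behavior (e.g. self-reflection keywords).
    Both are stand-ins for model calls; k and tau are illustrative choices.
    """
    stable = []
    for prefix in boundaries:
        hits = sum(detect(resample(prefix)) for _ in range(k))
        if hits / k >= tau:
            stable.append(prefix)
    return stable

# Toy demo: a prefix ending in "Wait" deterministically triggers reflection.
def resample(prefix):
    return "let me re-check" if prefix.endswith("Wait") else "therefore the answer is"

detect = lambda text: "re-check" in text

stable = stability_filter(["... Wait", "... So"], resample, detect)
```

Only the stable boundaries' hidden states would then be averaged into the steering vector, which is how filtering prevents the 93.3% of unstable boundaries from diluting the signal.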
[NLP-18] Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations INTERSPEECH2026
[Quick Read]: This paper addresses the lack of a direct measure of self-supervised speech models' (S3Ms) sensitivity to prosodic contrasts. The ABX discrimination task has been used to assess phonemic contrast in S3M representations but had not been extended to prosody. The key of the solution is prosodic ABX, an extension of the ABX framework that quantifies a model's sensitivity to prosodic differences with only a handful of examples and no explicit labels. The authors build and release minimal-pair datasets covering English stress, Japanese pitch accent, and Mandarin tone, and cross-lingual experiments show that model and layer rankings are often preserved across experimental conditions, making the method practical for efficiently evaluating S3M prosodic representations in low-resource settings.
Link: https://arxiv.org/abs/2604.02102
Authors: Haitong Sun, Stephen McIntosh, Kwanghee Choi, Eunjung Yeo, Daisuke Saito, Nobuaki Minematsu
Affiliation: The University of Tokyo; University of Texas at Austin
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to Interspeech 2026; 6 pages, 4 figures
Abstract:Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.
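The ABX score itself is simple to state: over triples (A, B, X) where X shares its category with A, count how often X's representation is closer to A than to B. The mean-pooled vectors and cosine distance below are illustrative choices, not necessarily the paper's exact distance:

```python
import numpy as np

def abx_score(triples, dist=None):
    """ABX discrimination: fraction of (A, B, X) triples where X (same
    category as A) lies closer to A than to B. Inputs are fixed-size
    representations (e.g. mean-pooled S3M frames); cosine distance by default."""
    if dist is None:
        dist = lambda u, v: 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    correct = sum(dist(x, a) < dist(x, b) for a, b, x in triples)
    return correct / len(triples)

# Toy minimal pair: two "stress pattern 1" vectors cluster away from "pattern 2".
a1, a2 = np.array([1.0, 0.1]), np.array([1.0, 0.2])   # category 1 (A and X)
b = np.array([0.1, 1.0])                               # category 2 (B)
score = abx_score([(a1, b, a2)])
```

A score near 1.0 means the representation separates the prosodic categories; chance level is 0.5, which is what label-free discrimination is measured against.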
[NLP-19] Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
[Quick Read]: This paper addresses a core limitation of recursive transformers: because every depth step reuses the same transformation, the model cannot compose distinct operations across depth or react dynamically to the input state. The key of the solution is a lightweight Controller hypernetwork that observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent; gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm keep deep iteration stable. The approach adds only 9.2M trainable parameters yet markedly reduces training loss and outperforms equivalently sized static per-step LoRA.
Link: https://arxiv.org/abs/2604.02051
Authors: Jaber Jaber, Osama Jaber
Affiliation: RightNow AI
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 5 tables, 1 figure, 1 algorithm. Code: this https URL
Abstract:Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: this https URL
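The Controller-modulated recurrence can be sketched at toy scale. Everything here is an illustrative stand-in: `W`, `A`, `B`, and `Wc` are random placeholders for the pretrained block, the frozen SVD-initialized LoRA bases, and the Controller, and the scalar gate and missing LayerNorm are simplifications of the described architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 4                          # hidden size, LoRA rank (toy scale)
W = rng.normal(size=(d, d)) * 0.1    # shared recursive block (stand-in)
A = rng.normal(size=(r, d)) * 0.1    # frozen LoRA "down" basis
B = rng.normal(size=(d, r)) * 0.1    # frozen LoRA "up" basis
Wc = rng.normal(size=(r, d)) * 0.1   # Controller producing per-step modulation
gate_bias = np.log(0.88 / 0.12)      # gate bias-initialized to ~88% retention

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def recurrent_step(h):
    m = np.tanh(Wc @ h)              # input-conditioned diagonal modulation vector
    delta = B @ (m * (A @ h))        # frozen bases, re-weighted per step
    update = np.tanh(W @ h + delta)  # shared block plus modulated LoRA update
    g = sigmoid(gate_bias)           # retention gate (scalar here for brevity)
    return g * h + (1 - g) * update  # gated recurrence keeps deep iteration stable

h = rng.normal(size=d)
for _ in range(4):                   # apply the shared block recursively
    h = recurrent_step(h)
```

Because only the Controller, gate, and norms train while `A` and `B` stay frozen, the same shared block can behave differently at each depth step at a very small parameter cost, which is the mechanism behind the 9.2M-parameter figure.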
[NLP-20] Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
[Quick Read]: This paper addresses slow large language model (LLM) inference in speculative decoding, specifically how to maximize the number of accepted candidate tokens per step under a fixed verification budget. Existing methods organize candidates in balanced trees, constrained by a depth-breadth trade-off, and fail to exploit quality differences between token sources. The paper observes that two common training-free sources, n-gram matches from the input context and statistical predictions from prior forward passes, differ dramatically in acceptance rate (~6x median gap, 2-18x range), and proves that under such a gap the optimal tree is anisotropic: reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking the depth limit of balanced trees. The key of the solution is GOOSE, which builds an adaptive spine tree consisting of a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node, with a proven guarantee of accepting at least as many tokens per step as either source alone; it achieves 1.9-4.3x lossless speedup across five LLMs (7B-33B) and five benchmarks, outperforming balanced-tree baselines by 12-33%.
Link: https://arxiv.org/abs/2604.02047
Authors: Tao Jin, Phuong Minh Nguyen, Naoya Inoue
Affiliation: Japan Advanced Institute of Science and Technology (JAIST)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
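The high-acceptance token source, n-gram copying from the context, can be sketched as follows; this drafts the deep "spine" chain only, and omits the statistical branches and the verification pass. The function name and parameters are illustrative, not GOOSE's API:

```python
def ngram_draft(context, n=3, chain_len=8):
    """Draft a spine chain by copying the tokens that follow the most recent
    earlier occurrence of the current n-gram suffix in the context.
    (Branch construction from forward-pass statistics is omitted here.)
    """
    if len(context) < n:
        return []
    suffix = tuple(context[-n:])
    # search backwards for a previous occurrence of the suffix
    for i in range(len(context) - n - 1, -1, -1):
        if tuple(context[i:i + n]) == suffix:
            return context[i + n:i + n + chain_len]
    return []

ctx = "the cat sat on the mat . the cat sat on".split()
draft = ngram_draft(ctx, n=3)
```

In an anisotropic tree, a chain like this forms the deep spine (high acceptance, so depth pays off), while the low-acceptance statistical candidates are attached as shallow fallback branches at each spine node.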
[NLP-21] BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
【速读】: 该论文旨在解决将因果生成式语言模型(causal generative language models)转化为双向编码器(bidirectional encoders)时面临的三大挑战:缺乏最优训练目标共识、大规模训练中的灾难性遗忘(catastrophic forgetting)以及难以灵活整合专用生成模型生态。其解决方案的关键在于通过系统性消融实验识别出一个常被忽略的“先验掩码阶段”(prior masking phase),并提出一种双策略方法:一是采用线性权重合并(linear weight merging)技术,二是引入轻量级多领域数据混合(lightweight multi-domain data mixture),从而在无需原始预训练数据的情况下有效缓解灾难性遗忘;此外,进一步通过与特定领域因果模型融合,实现模态和领域特异性能力的无缝迁移。最终构建的BidirLM系列编码器在文本、视觉和音频表征基准上均优于现有方法。
链接: https://arxiv.org/abs/2604.02045
作者: Nicolas Boizard,Théo Deschamps-Berger,Hippolyte Gisserot-Boukhlef,Céline Hudelot,Pierre Colombo
机构: Diabolocom; Artefact Research Center; MICS, CentraleSupélec, Université Paris-Saclay; Cohere
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages, 16 figures, 10 tables
Abstract:Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
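摘要中的"线性权重合并(linear weight merging)"本质上是对两份检查点参数做逐键插值。下面给出一个假设性的极简草图(以浮点字典代替真实的逐张量 state dict,α 取值仅作演示,并非论文的配方):

```python
# Minimal sketch of linear weight merging between two checkpoints.
# Real recipes interpolate per-tensor state dicts; plain floats stand in here.

def merge_weights(theta_a, theta_b, alpha=0.5):
    """theta = alpha * theta_a + (1 - alpha) * theta_b, key by key."""
    assert theta_a.keys() == theta_b.keys()
    return {k: alpha * theta_a[k] + (1 - alpha) * theta_b[k] for k in theta_a}

adapted = {"w": 2.0, "b": 0.0}   # hypothetical bidirectionally adapted weights
original = {"w": 4.0, "b": 1.0}  # hypothetical original causal weights
print(merge_weights(adapted, original, alpha=0.25))  # {'w': 3.5, 'b': 0.75}
```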
[NLP-22] Tracking the emergence of linguistic structure in self-supervised models learning from speech
【速读】: 该论文旨在解决自监督语音模型在训练过程中,不同层次的语言结构(linguistic structure)何时以及如何逐步涌现的问题。其核心发现是:不同抽象层级的语言结构在模型各层中的编码模式和学习轨迹存在显著差异,这种差异可部分归因于它们与声学信号的抽象程度及输入信息整合的时间尺度不同;此外,预训练目标的设计(如是否采用高阶预测任务)显著影响语言结构在层间的组织方式与学习动态,其中高阶预测任务(如迭代优化的伪标签)能促进更并行化的结构学习路径。
链接: https://arxiv.org/abs/2604.02043
作者: Marianne de Heer Kloots,Martijn Bentum,Hosein Mohebbi,Charlotte Pouw,Gaofei Shen,Willem Zuidema
机构: University of Amsterdam (阿姆斯特丹大学); Radboud University (奈梅亨大学); Tilburg University (蒂尔堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
[NLP-23] Why Gaussian Diffusion Models Fail on Discrete Data?
【速读】: 该论文旨在解决扩散模型(Diffusion Models)在离散数据(如文本、代码和蛋白质序列)上应用时的采样质量下降问题。研究表明,当使用DDPM(Denoising Diffusion Probabilistic Models)求解器对以连续空间中delta-分布混合形式表示的离散分布进行建模时,会在特定采样区间内出现密度分布多模态的现象,导致DDPM偶尔进入模式间的低密度区域,从而产生分布外输入并恶化生成样本质量。解决方案的关键在于识别这一临界采样区间,并引入两种策略:一是自条件机制(self-conditioning),二是提出一种称为q-sampling的新型求解方法;进一步地,通过在临界区间内切换从DDPM到q-sampling,结合自条件机制,显著提升了真实数据上的生成质量,且该方法在条件与非条件任务中均表现出鲁棒性。
链接: https://arxiv.org/abs/2604.02028
作者: Alexander Shabalin,Simon Elistratov,Viacheslav Meshchaninov,Ildus Sadrtdinov,Dmitry Vetrov
机构: Constructor University; HSE University; Lomonosov Moscow State University
类目: Computation and Language (cs.CL)
备注:
Abstract:Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
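摘要中"临界采样区间内加噪数据密度呈多模态"的现象,可用一维玩具例子复现:把两点离散分布(±1 处的 delta)与不同方差的高斯噪声卷积,观察模之间的"低密度谷"何时出现(以下数值均为演示假设,与论文实验无关):

```python
import math

# Equal-weight Gaussian mixture: two deltas at +/-1 convolved with noise.
def density(x, sigma, modes=(-1.0, 1.0)):
    return sum(
        math.exp(-((x - m) ** 2) / (2 * sigma ** 2))
        / (sigma * math.sqrt(2 * math.pi))
        for m in modes
    ) / len(modes)

# Small noise: a low-density valley between the modes (the risky regime
# where a DDPM step can land out-of-distribution).
print(density(0.0, 0.3) < density(1.0, 0.3))  # True
# Large noise: the midpoint dominates and the density is unimodal.
print(density(0.0, 2.0) > density(1.0, 2.0))  # True
```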
[NLP-24] kNNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM -Generated Text Detection
【速读】: 该论文旨在解决零样本生成式 AI (Generative AI) 文本检测方法中因代理语言模型(proxy LLM)与未知源语言模型(source LLM)之间对齐不足而导致的检测可靠性问题。现有方法依赖监督微调或频繁调用商业API进行对齐,存在部署成本高、易受API静默变更影响及域偏移下鲁棒性差等缺陷。其解决方案的关键在于提出一种无需训练且查询高效的代理对齐框架——k近邻代理(kNNProxy),该框架利用kNN语言模型(kNN-LM)的检索机制作为域适配器,在推理阶段通过轻量级数据存储库中的最近邻证据与代理输出进行token级预测分布插值,实现无需微调即可获得对齐预测;进一步地,为增强域偏移下的鲁棒性,引入多代理混合(MoP)机制,根据输入域选择对应的域特定数据存储库进行检索,从而提升检测性能。
链接: https://arxiv.org/abs/2604.02008
作者: Kahim Wong,Kemou Li,Haiwei Wu,Jiantao Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the k -nearest neighbor proxy ( k NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the k NN language model ( k NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. To improve robustness under domain shift, we extend k NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
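kNNProxy 复用的 kNN-LM 插值机制有一个简单的封闭形式:检索分布(近邻距离负指数的归一化)与代理模型分布按系数 λ 线性混合。下面的草图仅作原理演示,数据存储、距离与 λ 均为虚构取值:

```python
import math

def knn_lm_interpolate(p_lm, neighbors, lam=0.3):
    """p(y) = lam * p_knn(y) + (1 - lam) * p_lm(y).
    `neighbors`: (token, distance) pairs retrieved from the datastore.
    Only tokens already in p_lm are kept in this toy version."""
    weights = {}
    for tok, dist in neighbors:
        weights[tok] = weights.get(tok, 0.0) + math.exp(-dist)
    z = sum(weights.values())
    return {
        tok: lam * (weights.get(tok, 0.0) / z) + (1 - lam) * p
        for tok, p in p_lm.items()
    }

p_lm = {"cat": 0.6, "dog": 0.4}
neighbors = [("dog", 0.1), ("dog", 0.2), ("cat", 1.5)]
p = knn_lm_interpolate(p_lm, neighbors, lam=0.5)
print(p["dog"] > p_lm["dog"])  # True: retrieval evidence boosts "dog"
```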
[NLP-25] SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
【速读】: 该论文旨在解决多跳问答(Multi-hop QA)基准测试中大型语言模型(LLMs)因“虚假正确性”而被误导的问题,即模型在推理过程中可能依赖未 grounded 的步骤却仍能获得高分,从而掩盖了其真实推理能力的缺陷。解决方案的关键在于提出 SAFE 框架,该框架通过两个阶段实现严格验证:训练阶段利用知识图谱(Knowledge Graph, KG)构建原子级错误分类体系并建立验证流水线,识别出高达 14% 的不可回答实例;推理阶段则引入一个基于验证数据集训练的反馈模型,在实时推理中动态检测并阻止未 grounded 的推理步骤,从而确保推理路径可验证且逻辑严谨。实验表明,SAFE 不仅揭示了现有基准的结构性缺陷,还在准确率上相较标准基线提升平均 8.4 个百分点。
链接: https://arxiv.org/abs/2604.01993
作者: Daeyong Kwon,Soyoung Yoon,Seung-won Hwang
机构: Seoul National University, South Korea
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
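SAFE 的核心约束是"推理链的每一跳都必须在知识图谱中有对应三元组"。以下极简草图演示该验证逻辑本身(KG 与推理链均为虚构示例,并非论文的验证流水线):

```python
# Each hop of a reasoning chain must exist as a (head, relation, tail)
# triple in the KG; the first missing hop is flagged as ungrounded.

def first_ungrounded_hop(kg_triples, chain):
    """Return the index of the first hop absent from the KG, else -1."""
    for i, hop in enumerate(chain):
        if hop not in kg_triples:
            return i
    return -1

kg = {
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
}
grounded = [("Paris", "capital_of", "France"), ("France", "member_of", "EU")]
flawed = [("Paris", "capital_of", "Spain")]
print(first_ungrounded_hop(kg, grounded))  # -1: fully grounded
print(first_ungrounded_hop(kg, flawed))    # 0: first hop is ungrounded
```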
[NLP-26] RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
【速读】: 该论文旨在解决安全团队在面对海量新披露的通用漏洞披露(CVE)时,难以手动开发检测机制的问题。2025年美国国家漏洞数据库(NVD)公布了超过48,000个新漏洞,凸显了自动化检测规则生成的迫切需求。其解决方案核心是提出并部署了RuleForge系统,该系统基于结构化的Nuclei模板(YAML格式)自动生成JSON格式的检测规则,用于识别利用特定漏洞的恶意HTTP请求。关键创新在于引入“大型语言模型作为裁判”(LLM-as-a-judge)的置信度验证机制,从敏感性(避免漏报)和特异性(避免误报)两个维度评估候选规则,在生产环境中实现AUROC达0.75,并相较仅依赖合成测试的验证方式减少67%的误报;同时结合五维五次生成策略(5×5 generation strategy)与持续反馈闭环,实现了规则质量的系统性提升。
链接: https://arxiv.org/abs/2604.01977
作者: Ayush Garg,Sophia Hager,Jacob Montiel,Aditya Tiwari,Michael Gentile,Zach Reavis,David Magnotti,Wayne Fullen
机构: Johns Hopkins University (约翰霍普金斯大学); Amazon Web Services (亚马逊网络服务)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 11 pages, 10 figures. To be submitted to CAMLIS 2026
Abstract:Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules–JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities–from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge’s architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions–sensitivity (avoiding false negatives) and specificity (avoiding false positives)–achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
[NLP-27] How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization
【速读】: 该论文旨在解决语言和手势等通信系统中词序或动作顺序优化的问题,即如何衡量不同排列方式在交换距离(swap distance)上的最优性。其核心问题是验证是否存在一种普遍的优化原则,使得实际观察到的语序或手势顺序尽可能减少相邻元素交换次数,从而降低沟通成本。解决方案的关键在于构建一个基于置换多面体(permutohedron)的数学框架,通过量化词序变化与交换距离的关系来评估优化程度,并首次将二次分配问题(quadratic assignment problem, QAP)引入语言研究领域,提出了一般性的最优分配原则(principle of optimal assignment),统一了包括交换距离最小化在内的多种语言学优化机制。
链接: https://arxiv.org/abs/2604.01938
作者: Ramon Ferrer-i-Cancho
机构: 未知
类目: Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Physics and Society (physics.soc-ph)
备注:
Abstract:The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least 77% optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
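摘要中的"交换距离"即把一个顺序变为另一个顺序所需的相邻交换次数,等于相对置换的逆序数(Kendall tau 距离)。下面给出一个可直接运行的计算草图(示例顺序为虚构,非论文数据):

```python
def swap_distance(source, target):
    """Adjacent-swap distance between two orderings of the same items,
    computed as the inversion count of the relative permutation."""
    pos = {item: i for i, item in enumerate(target)}
    seq = [pos[item] for item in source]
    inversions = 0
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            if seq[i] > seq[j]:
                inversions += 1
    return inversions

print(swap_distance(["S", "O", "V"], ["S", "O", "V"]))  # 0
print(swap_distance(["S", "O", "V"], ["S", "V", "O"]))  # 1
print(swap_distance(["S", "O", "V"], ["V", "O", "S"]))  # 3 (full reversal)
```

在置换多面体上,这一数值正是两个顶点之间的最短路径长度。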
[NLP-28] Reliable News or Propagandist News? A Neurosymbolic Model Using Genre Topic and Persuasion Techniques to Improve Robustness in Classification
【速读】: 该论文旨在解决传播性新闻(propaganda news)检测中因训练数据偏差导致的语言模型(如BERT)过拟合、泛化能力差的问题。其核心解决方案是提出一种神经符号混合方法(neurosymbolic approach),将非上下文文本嵌入(fastText)与符号化概念特征(如体裁、主题和说服技巧)相结合,从而提升分类鲁棒性和对新来源的适应能力。实验表明,该方法在性能上优于纯文本模型,并通过消融研究和可解释性分析验证了符号特征的有效性。
链接: https://arxiv.org/abs/2604.01936
作者: Géraud Faye,Benjamin Icard,Morgane Casanova,Guillaume Gadek,Guillaume Gravier,Wassila Ouerdane,Céline Hudelot,Sylvain Gatepaille,Paul Égré
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Among news disorders, propagandist news are particularly insidious, because they tend to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness
[NLP-29] ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在隐式语境中仍存在显著偏见的问题,尤其当社会身份通过特征性线索(characteristic-based cues)而非显式名称表达时,现有评估方法难以捕捉此类偏见。其解决方案的关键在于提出ImplicitBBQ——一个基于文化关联属性的问答(QA)基准,能够系统评估年龄、性别、地域、宗教、种姓及社会经济地位等多维隐式偏见。该基准通过识别与特定群体相关联的特征性暗示,突破了传统以姓名为代理指标的局限性,揭示出当前模型在模糊情境下的隐式偏见水平是显式偏见的六倍以上,且主流对齐策略(如安全提示和链式推理)对此类偏见改善有限,凸显了文化根植的刻板印象仍是模型偏见治理的核心挑战。
链接: https://arxiv.org/abs/2604.01925
作者: Bhaskara Hanuma Vedula,Darshan Anghan,Ishita Goyal,Ponnurangam Kumaraguru,Abhijnan Chakraborty
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.
[NLP-30] Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients LREC2026
【速读】: 该论文旨在解决心力衰竭(Heart Failure, HF)患者短期死亡率预测的难题,尤其是在仅依赖结构化电子健康记录(Electronic Health Record, EHR)数据时预测精度不足的问题。其解决方案的关键在于采用实体感知的多模态Transformer模型,通过融合临床文本中的实体级表示与结构化变量,并引入监督式多模态融合策略,显著提升了预测性能;相较之下,大语言模型(Large Language Model, LLM)在不同模态和解码策略下表现不稳定,文本提示虽优于结构化输入,但整体仍难以满足临床决策支持需求。
链接: https://arxiv.org/abs/2604.01924
作者: Oumaima El Khettari,Virgile Barthet,Guillaume Hocquet,Joconde Weller,Emmanuel Morin,Pierre Zweigenbaum
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted in LREC 2026
Abstract:Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.
[NLP-31] SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations ICASSP2026
【速读】: 该论文旨在解决多模态情感识别在对话场景(Multimodal Emotion Recognition in Conversations, MERC)中面临的两大挑战:一是如何有效处理来自不同模态(如语音、文本、表情等)的噪声数据,二是如何建模上下文依赖关系以实现细粒度的情感推理。解决方案的关键在于提出SURE(Synergistic Uncertainty-aware REasoning)框架,其核心创新包括:(1)不确定性感知的专家混合模块(Uncertainty-Aware Mixture-of-Experts),用于对各模态特有的噪声进行建模与抑制;(2)迭代推理模块(Iterative Reasoning),通过多轮交互实现对对话上下文的深度推理;(3)Transformer门控模块(Transformer Gate),用于捕捉模态内和模态间的动态交互关系。实验表明,SURE在多个基准数据集上均优于现有最先进方法,验证了不确定性建模与迭代推理在提升MERC鲁棒性和语境理解能力中的关键作用。
链接: https://arxiv.org/abs/2604.01916
作者: Yiqiang Cai,Chengyan Wu,Bolei Ma,Bo Chen,Yun Xue,Julia Hirschberg,Ziwei Gong
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICASSP 2026
Abstract:Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.
[NLP-32] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
【速读】: 该论文旨在解决视频大语言模型(Video Large Language Models, VideoLLMs)在部署时因输入视频token数量庞大而导致的显著计算负担问题。现有方法主要在输入层面进行视频token剪枝,但忽略了视频本身固有的信息结构以及大语言模型(Large Language Models, LLMs)内部多模态信息的单向传播特性。解决方案的关键在于提出一种分层剪枝框架HieraVid,其核心思想是基于两个观察:1)视频具有段-帧结构;2)LLMs内部多模态信息呈单向传播。据此,HieraVid将剪枝过程分解为三个层级:段级(temporal segmentation and spatial merging)、帧级(pruning similar frames within segments while preserving diversity)和层级(redundancy gradually decreases with increasing LLM layers without performance degradation)。实验表明,仅保留30%的token即可实现新的SOTA性能,同时保持原模型98%以上(LLaVA-Video-7B)和99%以上(LLaVA-OneVision-7B)的性能。
链接: https://arxiv.org/abs/2604.01881
作者: Yansong Guo,Chaoyang Zhu,Jiayi Ji,Jianghang Lin,Liujuan Cao
机构: Xiamen University (厦门大学); Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China (多媒体可信感知与高效计算重点实验室,中国教育部)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations that videos possess the segment-frame structure and LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, redundancy gradually shrinks as LLM layer increases w/o compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
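HieraVid 帧级剪枝的直觉是"同一片段内近似重复的帧只保留一份"。下面用余弦相似度给出一个贪心示意(特征向量与阈值均为演示假设,并非论文实现):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def prune_frames(frames, threshold=0.95):
    """Greedily keep frames not too similar to any already-kept frame."""
    kept = []
    for f in frames:
        if all(cosine(f, k) < threshold for k in kept):
            kept.append(f)
    return kept

segment = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]  # frame 2 nearly duplicates frame 1
print(len(prune_frames(segment)))  # 2
```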
[NLP-33] Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution
【速读】: 该论文旨在解决两个核心问题:一是现有研究对阅读障碍(dyslexia)书写错误的自动识别多聚焦于纠错而非归属(attribution),二是忽视了此类自动化分类在教育场景中可能引发的伦理风险,如标签伤害、隐蔽筛选、算法偏见及机构滥用。解决方案的关键在于提出一种基于双输入神经网络模型的高精度错误归属方法,并构建以伦理优先为原则的部署框架。该模型利用涵盖正字法、音系学和形态学特征的综合特征集,在独立于写作者的条件下实现93.01%准确率与94.01% F1分数,其中音似错误和元音混淆被识别为最强归属信号;同时,论文系统分析公平性、可解释性、知情同意、透明度、人工监督与申诉机制等伦理要素,提供具体部署指南并公开讨论系统局限性和潜在误用风险,强调高准确性不足以支撑其在高风险教育场景中的直接应用。
链接: https://arxiv.org/abs/2604.01853
作者: Samuel Rose,Debarati Chakraborty
机构: University of Hull (赫尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risk of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails requires the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task. Given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions, consent, transparency, human oversight, and recourse, under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the systems limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.
[NLP-34] From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码补全任务中采用硬补全(Hard Completion, HC)范式时所面临的局限性——即在上下文信息不足的情况下仍强制生成完整代码,导致大量错误预测,进而引发高编辑成本。研究表明,在实际交互中61%的建议要么被修改,要么被拒绝,尽管与用户后续代码高度相似(>80%),说明模型在特定token位置频繁出错。解决方案的关键在于提出自适应占位符补全(Adaptive Placeholder Completion, APC),其核心思想是:基于不确定性建模,在高熵位置主动输出显式占位符(placeholder),允许开发者通过IDE导航直接填充,从而降低纠错成本。理论上,作者将代码补全建模为不确定条件下的成本最小化问题,并证明存在一个熵阈值,超过该阈值时APC的期望成本严格低于HC;实践上,通过过滤真实编辑日志构建训练数据并设计基于成本的奖励函数用于强化学习,实现了端到端的自适应回避策略学习,且在1.5B–14B参数模型上验证了预期编辑成本降低19%–50%,同时保持传统HC性能不变。
链接: https://arxiv.org/abs/2604.01849
作者: Liang Zhu,Haolin Chen,Lidong Zhao,Xian Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user’s subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B–14B parameter models demonstrate that APC reduces expected editing costs from 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
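APC 的决策规则可以概括为"熵超过阈值则输出显式占位符,否则硬补全"。下面是该规则的一个玩具实现(各位置的分布与阈值均为虚构演示值):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def complete(step_dists, threshold=1.0, placeholder="<FILL>"):
    """Emit a concrete token when confident, a placeholder when uncertain."""
    out = []
    for dist in step_dists:
        if entropy(dist) > threshold:
            out.append(placeholder)              # high entropy: defer to the user
        else:
            out.append(max(dist, key=dist.get))  # low entropy: hard-complete
    return out

steps = [
    {"return": 0.97, "raise": 0.03},               # confident position
    {"x": 0.3, "y": 0.3, "total": 0.2, "n": 0.2},  # ambiguous position
]
print(complete(steps))  # ['return', '<FILL>']
```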
[NLP-35] PLOT: Enhancing Preference Learning via Optimal Transport
【速读】: 该论文旨在解决当前基于微调的大型语言模型(Large Language Models, LLMs)偏好学习中存在的性能提升有限、计算成本高、超参数敏感以及对全局token级关系建模不足等问题。其解决方案的关键在于提出PLOT方法,通过将偏好学习建模为最优传输(Optimal Transport, OT)问题,设计了一种基于token级别的损失函数,从而在对齐人类偏好时保持LLM原始分布不变,提升训练稳定性与鲁棒性;同时利用token嵌入捕捉语义关联,实现全局信息驱动的优化,显著改善了模型在人类价值观和逻辑推理等多类偏好任务中的对齐效果。
链接: https://arxiv.org/abs/2604.01837
作者: Liang Zhu,Yuelin Bai,Xiankun Ren,Jiaxi Yang,Lei Zhang,Feiteng Fang,Hamid Alinejad-Rokny,Minghuan Tan,Min Yang
机构: Southern University of Science and Technology (南方科技大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Chongqing University (重庆大学); University of New South Wales (新南威尔士大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.
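PLOT 将偏好学习表述为最优传输问题;其 token 级损失所依赖的 OT 耦合可用熵正则化的 Sinkhorn 迭代求得。以下草图仅演示该计算本身(成本矩阵与边际分布为玩具数值,并非论文中基于嵌入的语义成本):

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic-regularized OT: a coupling whose marginals approach a, b."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * len(a), [1.0] * len(b)
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b))) for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a))) for j in range(len(b))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))] for i in range(len(a))]

cost = [[0.0, 1.0], [1.0, 0.0]]   # cheap to match identical tokens
a, b = [0.5, 0.5], [0.5, 0.5]
P = sinkhorn(cost, a, b)
row_sums = [sum(row) for row in P]
print(row_sums)  # each close to 0.5; mass concentrates on the cheap diagonal
```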
[NLP-36] Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
【速读】: 该论文旨在解决跨模态(语言与视觉)适配中因预训练模型参数空间差异显著而导致的挑战,尤其是传统观点认为语言预训练模型(Language Pre-trained Models)由于参数分布不一致而不适用于下游视觉任务的问题。其核心解决方案是引入一种无需人工标注的“随机标签桥接训练”(random label bridge training)机制,作为模态适配学习器,有效对齐大语言模型(Large Language Model, LLM)参数与视觉基础任务之间的表示空间;关键发现在于:部分层的桥接训练更为高效,因为LLM中的某些层本身具备强泛化能力,即使未针对视觉任务微调仍可保留有用特征,这为直接利用语言预训练参数服务于视觉模型提供了新路径,并揭示了局部桥接训练在跨模态迁移中的实际价值。
链接: https://arxiv.org/abs/2604.01833
作者: Yaxin Luo,Zhiqiang Shen
机构: MBZUAI(穆巴达拉人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution random label bridge training that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
[NLP-37] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment
【Quick Read】: This paper targets the high cost, instability, and heavy data demands of Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs), which can also erode generalization. The key is the Distribution-guided Efficient Fine-Tuning (DEFT) framework: it computes a differential distribution reward from the language model's output distribution and the discrepancy distribution of the preference data, uses it to filter a small, high-quality data subset, and injects this distributional guidance into existing alignment methods, improving both alignment capability and generalization while substantially reducing training time.
Link: https://arxiv.org/abs/2604.01787
Authors: Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan, Min Yang
Institutions: Southern University of Science and Technology; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Science and Technology of China; University of Copenhagen
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model’s output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
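A minimal sketch of the data-filtering idea described above. The exact form of DEFT's differential distribution reward is not given in the abstract, so the scoring function below (the gap between the model's sequence log-likelihoods for the preferred and dispreferred responses of each pair) is a hypothetical stand-in, as is the top-fraction selection rule.

```python
import numpy as np

def differential_reward(logp_chosen, logp_rejected):
    # Hypothetical score: how strongly the model's output distribution
    # already separates the preferred from the dispreferred response.
    return np.asarray(logp_chosen) - np.asarray(logp_rejected)

def filter_top_fraction(pairs, rewards, frac=0.2):
    # Keep the small, high-quality subset with the largest reward.
    k = max(1, int(len(pairs) * frac))
    idx = np.argsort(rewards)[::-1][:k]
    return [pairs[i] for i in idx]

pairs = ["p0", "p1", "p2", "p3", "p4"]
rewards = differential_reward([-5.0, -2.0, -9.0, -1.0, -4.0],
                              [-4.0, -6.0, -3.0, -8.0, -4.5])
subset = filter_top_fraction(pairs, rewards, frac=0.4)
```

The selected subset would then be fed to a standard alignment objective (SFT, DPO, etc.) as the paper describes.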
[NLP-38] Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens
【Quick Read】: This paper tackles the weak controllability of Controllable Automatic Text Simplification (CATS), where controllability is often treated as a decoding-time problem and evaluated with metrics that do not reflect actual control, making attributes such as readability or compression hard to steer reliably. The key is a domain-agnostic CATS framework that instruction-fine-tunes open-source models (Llama, Mistral, Qwen; 1-14B) with discrete control tokens to hit target readability levels and compression rates. Experiments show that small models (1-3B) can be competitive under the right conditions, but reliable control hinges on whether the training data encodes enough variation in the target attribute; the paper further argues that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based target-output alignment measures, and stresses careful data sampling to avoid distributional mismatch.
Link: https://arxiv.org/abs/2604.01779
Authors: Hanna Hubarava, Yingqiang Gao
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that are not reflective to the measure of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.
[NLP-39] FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
【Quick Read】: This paper addresses two challenges of parameter-efficient fine-tuning (PEFT) in multi-task settings: interference between task objectives and representational deficiency under tight parameter budgets. Existing Mixture-of-Experts (MoE) remedies operate mainly in the spatial domain, which can introduce structural redundancy and parameter overhead. The key idea is to reformulate adaptation in the frequency domain: spectral analysis shows that different tasks exhibit distinct frequency energy distributions and that LLM layers have heterogeneous frequency sensitivities. FourierMoE therefore combines MoE with the inverse discrete Fourier transform (IDFT): a frequency-adaptive router dispatches tokens to experts specialized in distinct frequency bands, and each expert learns conjugate-symmetric complex coefficients with a theoretical guarantee of lossless IDFT reconstruction into real-valued spatial weights, improving both single-task and multi-task fine-tuning with significantly fewer trainable parameters.
Link: https://arxiv.org/abs/2604.01762
Authors: Juyong Jiang, Fan Wang, Hong Qi, Sunghun Kim, Jing Tang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: The first two authors contributed equally to this work; listing order is random
Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
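The claim that conjugate-symmetric complex coefficients reconstruct losslessly into real-valued weights can be checked directly with NumPy's real FFT pair, where conjugate symmetry is built into the rfft/irfft convention. This is a generic demonstration of the property, not the paper's implementation; the "low-frequency expert" below is an invented example of band-limited adaptation.

```python
import numpy as np

# Each expert stores half-spectrum complex coefficients; conjugate symmetry is
# implicit in the rfft/irfft convention, so reconstructed weights are real.
rng = np.random.default_rng(0)
w = rng.normal(size=16)             # a real-valued weight vector
coeffs = np.fft.rfft(w)             # 9 complex coefficients for length 16
w_rec = np.fft.irfft(coeffs, n=16)  # lossless real-valued reconstruction

# A hypothetical frequency-band "expert": keep only low-frequency coefficients.
low = coeffs.copy()
low[4:] = 0
w_low = np.fft.irfft(low, n=16)     # real-valued, band-limited update
```

Zeroing high-frequency bins yields a smooth real update, illustrating how experts can specialize in distinct frequency bands while the final weights stay real.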
[NLP-40] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
【Quick Read】: This paper targets the limitations of existing benchmarks for research-level mathematical reasoning in Large Language Models (LLMs), whose scores are inflated by data contamination and synthetic settings. The key is LiveMathematicianBench, a dynamic multiple-choice benchmark built from arXiv papers published after model training cutoffs, grounding evaluation in newly published theorems. It introduces a thirteen-category logical taxonomy of theorem types for fine-grained measurement of different reasoning forms, a proof-sketch-guided distractor pipeline that constructs plausible but invalid answer choices to probe genuine understanding rather than surface matching, and a substitution-resistant mechanism that separates answer recognition from substantive reasoning, together yielding a markedly more rigorous and contamination-resistant evaluation.
Link: https://arxiv.org/abs/2604.01754
Authors: Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani
Institutions: Columbia University; Microsoft Research; University of Amsterdam
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
[NLP-41] Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text
【Quick Read】: This paper addresses toxic content detection in online communication, where existing systems often misclassify valuable content such as medical terminology and minority-related text. The key contributions are an ontology modeling potentially toxic Bulgarian words and a dataset of 4,384 manually annotated sentences across four categories (toxic language, medical terminology, non-toxic language, and minority-related terms), on which a BERT-based classifier is trained. The model reaches a 0.89 macro F1 score while remaining directly deployable as a component of real-world content moderation systems.
Link: https://arxiv.org/abs/2604.01745
Authors: Melania Berbatova, Tsvetoslav Vasev
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.
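For reference, the macro F1 score reported above averages per-class F1 values with equal weight per class, regardless of class frequency. A small self-contained computation; the class names and counts below are made up for illustration.

```python
def macro_f1(tp, fp, fn):
    """Macro-averaged F1 from per-class TP/FP/FN counts."""
    f1s = []
    for c in tp:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative counts per class: (true positives, false positives, false negatives)
counts = {"toxic": (8, 2, 2), "medical": (9, 1, 1)}
tp = {c: v[0] for c, v in counts.items()}
fp = {c: v[1] for c, v in counts.items()}
fn = {c: v[2] for c, v in counts.items()}
score = macro_f1(tp, fp, fn)
```

Here each class contributes its own F1 (0.8 and 0.9), so the macro score is their unweighted mean.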
[NLP-42] Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition
【Quick Read】: This paper tackles Vietnamese Speech Emotion Recognition (SER), which is hampered by ambiguous acoustic patterns and unreliable annotated data, especially in real-world conditions where emotional boundaries blur. The key is a human-machine collaborative framework that injects human knowledge into learning instead of relying solely on data-driven models: LLM-based reasoning sits at the center, acoustic feature-based models supply auxiliary signals such as confidence and feature-level evidence, and a confidence-based routing mechanism separates easy from ambiguous samples so that uncertain cases are delegated to the LLM for deeper reasoning guided by structured rules; an iterative refinement strategy further updates the rules through error analysis to keep improving the system.
Link: https://arxiv.org/abs/2604.01711
Authors: Truc Nguyen, Then Tran, Binh Truong, Phuoc Nguyen T. H
Institutions: University of Information Technology (UIT)
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 2 figures. Dataset of 2,764 Vietnamese speech samples across three emotion classes
Abstract:Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
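The confidence-based routing described above can be sketched in a few lines. The threshold value and the two-list interface below are illustrative assumptions, not details from the paper; in the actual system the "ambiguous" bucket would be sent to the LLM reasoning stage with the structured rules.

```python
def route(samples, classifier_conf, threshold=0.85):
    """Split samples: confident ones keep the acoustic model's label,
    ambiguous ones are delegated to the LLM reasoning stage."""
    easy, ambiguous = [], []
    for sample, conf in zip(samples, classifier_conf):
        (easy if conf >= threshold else ambiguous).append(sample)
    return easy, ambiguous

# Toy usage: one low-confidence sample gets routed to the LLM.
easy, ambiguous = route(["a", "b", "c"], [0.95, 0.60, 0.90])
```

The threshold trades off LLM cost against accuracy on hard cases and would be tuned on held-out data.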
[NLP-43] Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
【Quick Read】: This paper addresses the lack of a systematic comparison of memory mechanisms for LLM-based agents on long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery): many memory methods have been proposed, yet none have been comprehensively evaluated under identical experimental settings, making it hard to judge their effectiveness and applicable scenarios. The key contributions are a unified framework that subsumes existing agent memory methods, a systematic comparison of representative methods on two well-known benchmarks with in-depth analysis, and, building on these findings, a new memory method assembled from existing modules that outperforms the state of the art, yielding insight into existing methods' behavior and pointing to future research directions.
Link: https://arxiv.org/abs/2604.01707
Authors: Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo, Xun Zhou, Sibo Wang, Xilin Liu, Yuchi Ma, Yixiang Fang
Institutions: CUHK-Shenzhen; CUHK; HITSZ; BIT; Huawei Cloud
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Comments:
Abstract:Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.
[NLP-44] Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
【Quick Read】: This paper addresses the unreliability of automatic speech recognition (ASR) in clinical settings such as gastrointestinal endoscopy, caused by domain-specific terminology and complex acoustic conditions. The key is EndoASR, a domain-adapted ASR system built with a two-stage adaptation strategy based on synthetic endoscopy reports: domain-specific language modeling first, then noise robustness. It substantially reduces character error rate (CER) and raises medical term accuracy (Med ACC) while maintaining a very low real-time factor (RTF = 0.005) and a compact 220M-parameter footprint for efficient edge deployment, with generalization and practicality validated in a prospective multi-center study across real-world clinical environments.
Link: https://arxiv.org/abs/2604.01705
Authors: Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang, Quanlin Li, Pinghong Zhou, Shuo Wang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review at npj Digital Medicine
Abstract:Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
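Character error rate, the headline metric above, is the character-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A compact generic implementation (not the paper's evaluation code):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance over reference length."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))  # DP row: distance to each prefix of hyp
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = min(d[j] + 1,                              # deletion
                      d[j - 1] + 1,                          # insertion
                      prev + (ref[i - 1] != hyp[j - 1]))     # substitution
            prev, d[j] = d[j], cur
    return d[n] / m
```

For example, one substituted character in a four-character reference gives a CER of 0.25; a 14.14% CER as reported above means roughly one character error per seven reference characters.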
[NLP-45] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
【Quick Read】: This paper asks how Chain-of-Thought (CoT) trajectories from different sources affect the generalization of supervised fine-tuning (SFT). Despite similar training losses, trajectories from different models embody sharply different reasoning patterns: DeepSeek-R1-0528 produces divergent, branch-heavy traces, whereas GPT-OSS-120B yields convergent, deductive ones, and models trained on the DeepSeek-R1 data inherit inefficient exploration behaviors, getting trapped in redundant branches that hurt generalization. The key remedy is to filter out frequently branching trajectories and fine-tune on the higher-quality remainder, which improves reasoning by up to 5.1% on AIME25, 5.5% on BeyondAIME, and 3.6% on average across five benchmarks.
Link: https://arxiv.org/abs/2604.01702
Authors: Zhaoyi Li, Xiangyu Xi, Zhengyu Chen, Wei Wang, Gangwei Jiang, Ranran Shen, Linqi Song, Ying Wei, Defu Lian
Institutions: University of Science and Technology of China; Meituan LongCat Team; City University of Hong Kong; Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments: Under Review
Abstract:Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on gpt-oss-120b. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.
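A toy sketch of the proposed remedy of filtering frequently branching trajectories. The cue phrases and density threshold below are invented for illustration; the paper's actual branching detector is not specified in the abstract.

```python
# Hypothetical cue phrases marking a branch or restart in a reasoning trace.
BRANCH_MARKERS = ("alternatively", "wait", "let me try another")

def branch_density(trace):
    """Branch markers per word: a crude proxy for branch-heavy exploration."""
    text = trace.lower()
    return sum(text.count(m) for m in BRANCH_MARKERS) / max(len(text.split()), 1)

def keep_low_branching(traces, max_density=0.02):
    """Keep only trajectories below the branching-density threshold."""
    return [t for t in traces if branch_density(t) <= max_density]

traces = [
    "Apply the identity and conclude the answer is 42.",
    "Wait, alternatively try x=3. Wait, alternatively try x=5. Wait again.",
]
kept = keep_low_branching(traces)
```

The convergent, deductive trace passes the filter while the branch-heavy one is dropped, mirroring the selection of SFT subsets described above.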
[NLP-46] MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning
【Quick Read】: This paper targets the limited parameter efficiency and knowledge acquisition of LLM fine-tuning: methods such as Low-Rank Adaptation (LoRA) concentrate updates in the dominant subspaces of model representations while neglecting under-utilized ones. The key is Minor Component Adaptation (MiCA), which uses Singular Value Decomposition (SVD) to identify the subspace spanned by the minor singular vectors associated with the smallest singular values and constrains fine-tuning updates to those directions. This yields up to 5.9x better knowledge acquisition under optimized training hyperparameters with a parameter footprint of only 6-60% of LoRA's, providing a more efficient and stable mechanism for injecting new knowledge into pre-trained models.
Link: https://arxiv.org/abs/2604.01694
Authors: Sten Rüdiger, Sebastian Raschka
Institutions: RAIR Lab
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Minor Component Adaptation (MiCA) is a novel parameter-efficient fine-tuning method for large language models that focuses on adapting underutilized subspaces of model representations. Unlike conventional methods such as Low-Rank Adaptation (LoRA), which target dominant subspaces, MiCA leverages Singular Value Decomposition to identify subspaces related to minor singular vectors associated with the least significant singular values and constrains the update of parameters during fine-tuning to those directions. This strategy leads to up to 5.9x improvement in knowledge acquisition under optimized training hyperparameters and a minimal parameter footprint of 6-60% compared to LoRA. These results suggest that constraining adaptation to minor singular directions provides a more efficient and stable mechanism for integrating new knowledge into pre-trained language models.
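The core mechanism, constraining updates to the subspace of minor singular directions, can be illustrated with a projector built from the SVD. This is a generic sketch, not the paper's exact parameterization: here a raw update is simply projected onto the span of the r left singular vectors with the smallest singular values.

```python
import numpy as np

def minor_subspace_projector(W, r):
    """Projector onto the span of the r left singular vectors of W
    with the smallest singular values (the 'minor components')."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_minor = U[:, -r:]          # SVD sorts singular values descending
    return U_minor @ U_minor.T

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))      # a pre-trained weight matrix
P = minor_subspace_projector(W, r=2)

dW = rng.normal(size=(8, 8))     # a raw fine-tuning update
dW_minor = P @ dW                # update constrained to minor directions

# The constrained update has no component along the dominant singular direction.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
leak = np.linalg.norm(U[:, 0].T @ dW_minor)
```

Because the projector is idempotent and orthogonal to the dominant directions, the adaptation leaves the high-energy subspace of the pre-trained weights untouched.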
[NLP-47] Coupled Query-Key Dynamics for Attention
【Quick Read】: This paper addresses training instability and performance limits of standard scaled dot-product attention in language modeling. The core solution, coupled QK dynamics, evolves queries and keys jointly through shared learned dynamics before scoring, rather than using static, independent projections. With only 0.11% additional parameters, it markedly improves perplexity (e.g., 22.55-22.62 on WikiText-103, a 6.6-6.9% gain over standard attention). Ablations show that the coupling itself, not the integrator type or step count, is the active ingredient, and that the advantage is one of sample efficiency: coupled attention reaches the same performance with fewer training tokens at matched compute.
Link: https://arxiv.org/abs/2604.01683
Authors: Barak Gahtan, Alex M. Bronstein
Institutions: Technion – Israel Institute of Technology; ISTA – Institute of Science and Technology Austria
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys jointly through shared learned dynamics before scoring - which we call coupled QK dynamics - improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55–22.62 perplexity vs. 24.22 for standard attention (-6.6% to -6.9%), with only 0.11% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8x higher seed variance. The integration step count (1–7) is similarly irrelevant - a single coupled step suffices. A compute-matched comparison reveals that coupling is a sample-efficiency mechanism: standard attention trained for 2.4x longer (matching wall-clock) reaches the same perplexity, but requires 2.4x more tokens. The advantage scales to 150M (-6.7%) but narrows at 350M (-1.0%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 -6.6%, PubMed -4.5%) but degrades on heterogeneous web text (+10.3%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.
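A minimal sketch of a single coupled Euler step before scoring, consistent with the abstract's finding that one step suffices. The specific form of the shared dynamics (dQ = K·Aᵀ, dK = Q·A with one learned matrix A) is an assumption for illustration; the abstract does not spell out the parameterization.

```python
import numpy as np

def coupled_qk_step(Q, K, A, dt=0.1):
    """One shared Euler step evolving queries and keys jointly:
    Q' = Q + dt * K @ A.T,  K' = K + dt * Q @ A  (A is shared and learned)."""
    Q_new = Q + dt * (K @ A.T)
    K_new = K + dt * (Q @ A)
    return Q_new, K_new

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
A = rng.normal(size=(d, d)) * 0.1       # small coupling, illustrative scale

Qc, Kc = coupled_qk_step(Q, K, A)
scores = (Qc @ Kc.T) / np.sqrt(d)       # standard scaled dot-product afterwards
```

The point of the sketch is structural: each stream's update depends on the other through a shared operator, which is exactly what the uncoupled MLP baseline in the ablation lacks.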
[NLP-48] PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
【Quick Read】: This paper addresses hallucination propagation in multi-sentence generation caused by supervised fine-tuning (SFT) with token-level hard labels, which drives overconfident imitation of factually unsupported targets. The key is PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions: it augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on factually risky target tokens, with span-level risk weights and model-aware gating controlling its scope, improving factual consistency without sacrificing overall capability.
Link: https://arxiv.org/abs/2604.01682
Authors: Chenning Xu, Mao Zheng, Mingyang Song
Institutions: Tencent
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.
[NLP-49] PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation ACL
【Quick Read】: This paper addresses the limited deep contextual understanding of existing Emotional Support Conversation (ESC) methods. The key is the Persona-guided Retrieval and Causality-aware Cognitive Filtering (PRCCF) framework: a persona-guided retrieval mechanism jointly models semantic compatibility and persona alignment to improve response generation relevance, while a causality-aware cognitive filtering module prioritizes causally relevant external knowledge, strengthening contextual cognitive understanding for more precise emotional reasoning.
Link: https://arxiv.org/abs/2604.01671
Authors: Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han
Institutions: Northeastern University; Oracle
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 6 figures, 5 tables. Submitted to Transactions of the Association for Computational Linguistics (TACL)
Abstract:Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: this https URL.
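The persona-guided retrieval score that "jointly models semantic compatibility and persona alignment" could, for instance, be a convex combination of two cosine similarities. The weighting scheme and embedding setup below are hypothetical, intended only to make the joint-modeling idea concrete.

```python
import numpy as np

def retrieval_score(query_vec, cand_sem, cand_persona, persona_vec, alpha=0.7):
    """Hypothetical combined score: alpha * semantic compatibility
    + (1 - alpha) * persona alignment, both as cosine similarities."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return alpha * cos(query_vec, cand_sem) + (1 - alpha) * cos(persona_vec, cand_persona)

q = np.array([1.0, 0.0])   # query embedding (toy 2-D)
p = np.array([0.0, 1.0])   # speaker persona embedding

# Candidate 0: semantically close but persona-mismatched.
c0 = retrieval_score(q, np.array([1.0, 0.1]), np.array([1.0, 0.0]), p)
# Candidate 1: slightly less similar semantically, but persona-aligned.
c1 = retrieval_score(q, np.array([0.9, 0.4]), np.array([0.1, 1.0]), p)
```

Under this scoring, the persona-aligned candidate wins despite its lower pure-semantic similarity, which is the behavior persona-guided retrieval is meant to induce.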
[NLP-50] What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis
【速读】: 该论文试图解决的问题是:当前 claim verification(主张验证)基准测试在评估模型推理能力时缺乏系统性理解,尤其是对实际使用的推理类型及其分布不清晰。为解决这一问题,作者通过 GPT-4o-mini 生成了 24K 条主张验证样本的结构化推理轨迹(reasoning traces),发现直接证据提取(direct evidence extraction)占主导地位,而多句信息合成(multi-sentence synthesis)和数值推理(numerical reasoning)严重不足。关键解决方案在于利用一个小型(1B参数)推理验证器(reasoning verifier)对错误类型进行细粒度分析,揭示不同领域(如通用、科学、数学)的错误模式差异,从而指出高分数主要反映的是检索加蕴含(retrieval-plus-entailment)能力,而非真正复杂的推理能力。研究据此提出构建更具挑战性的评估套件以更全面地测试验证系统所需的推理能力。
链接: https://arxiv.org/abs/2604.01657
作者: Delip Rao,Chris Callison-Burch
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages
Abstract:Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain – general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
[NLP-51] ThinknCheck: Grounded Claim Verification with Compact Reasoning-Driven and Interpretable Models
【Quick Read】: This paper addresses the trade-off between interpretability and resource efficiency in fact verification: keeping accuracy high while making the verification process explicitly structured and the model small. The key is ThinknCheck, a 1B-parameter verifier that first produces a short, structured rationale and then a binary verdict. It is trained on LLMAggreFact-Think, a 24.1k reasoning-augmented set derived from LLMAggreFact, via 4-bit quantized fine-tuning of Gemma3 for efficient deployment. ThinknCheck outperforms larger baselines such as MiniCheck-7B on LLMAggreFact and SciFact, and removing the reasoning step causes a sharp accuracy drop, showing that explicit supervised reasoning is the key to strong performance in compact verifiers.
Link: https://arxiv.org/abs/2604.01652
Authors: Delip Rao, Feijiang Han, Chris Callison-Burch
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages
Abstract:We present ThinknCheck, a 1B-parameter verifier for grounded claim verification that first produces a short, structured rationale and then a binary verdict. We construct LLMAggreFact-Think, a 24.1k reasoning-augmented training set derived from LLMAggreFact, and fine-tune a 4-bit Gemma3 model to follow this format. On LLMAggreFact, ThinknCheck attains 78.1 balanced accuracy (BAcc), surpassing MiniCheck-7B (77.4) with 7x fewer parameters; removing the reasoning step reduces BAcc to 57.5. On SciFact, ThinknCheck reaches 64.7 BAcc, a +14.7 absolute gain over MiniCheck-7B. By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning. To probe the latter, we introduce GSMClaims and a domain-specialized variant, ThinknCheck-Science, which improves across benchmarks, including 61.0% accuracy on GSMClaims. Overall, explicit, supervised reasoning enables compact verifiers that are competitive while remaining resource-efficient and interpretable.
[NLP-52] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations
【Quick Read】: This paper addresses the fragility of LLM mathematical reasoning to meaning-preserving surface changes (name substitution and number reformatting), where answers flip under tiny perturbations of the input's surface form. The key is the Mechanistic Perturbation Diagnostics (MPD) framework, a unified pipeline combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) to localize and quantify internal error propagation. Key findings: CAI, a metric of layer-wise divergence amplification, predicts failures better than the first divergence layer alone; and failures fall into localized, distributed, and entangled types across architectures with sharply different repair efficiency, indicating that architectural properties determine both sensitivity to perturbations and recoverability.
Link: https://arxiv.org/abs/2604.01639
Authors: Shou-Tzu Han, Rodrigue Rizk, KC Santosh
Institutions: University of South Dakota
Subjects: Computation and Language (cs.CL)
Comments: Preprint. Under review at COLM 2026
Abstract:Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.
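A rough sketch of the logit-lens divergence analysis and a CAI-like amplification score. The actual CAI definition is not given in the abstract, so the mean layer-over-layer growth ratio below is an assumed form; the per-layer KL between clean and perturbed next-token distributions is a standard logit-lens-style diagnostic.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def divergence_profile(clean_logits, perturbed_logits):
    """Per-layer KL between next-token distributions of a clean and a
    perturbed run, read out at each layer via the logit lens."""
    return [kl(softmax(c), softmax(p)) for c, p in zip(clean_logits, perturbed_logits)]

def amplification_index(profile):
    # Assumed CAI-like form: mean layer-over-layer growth of divergence;
    # values > 1 mean the discrepancy is amplified with depth.
    eps = 1e-9
    ratios = [(profile[i + 1] + eps) / (profile[i] + eps) for i in range(len(profile) - 1)]
    return float(np.mean(ratios))

# Toy 2-layer, 3-token-vocabulary example: divergence grows with depth.
clean = [np.array([2.0, 1.0, 0.0]), np.array([3.0, 1.0, 0.0])]
pert  = [np.array([1.9, 1.1, 0.0]), np.array([2.0, 2.5, 0.0])]
profile = divergence_profile(clean, pert)
cai = amplification_index(profile)
```

In a flipped sample one would expect an early-onset, amplifying profile (cai well above 1), whereas stable samples diverge late and weakly.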
[NLP-53] CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning CVPR2026
【速读】: 该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在跨模态多跳推理(multi-hop reasoning)能力上的不足问题。现有大多数多模态基准测试仅依赖单一图像或图像集合,使得答案可从单一模态直接推断,无法有效评估模型整合文本与视觉信息进行复杂推理的能力;同时训练数据中缺乏交错的图文内容,导致模型容易产生幻觉且推理过程缺乏视觉证据支撑。为应对这一挑战,作者提出CRIT数据集及基于图结构的自动化任务生成管道,其关键在于构建涵盖自然图像、视频和文本丰富来源的多样化跨模态推理任务,并通过人工验证的测试集实现可靠评估。实验表明,即使最先进的模型在CRIT上表现不佳,而经过CRIT训练的模型则显著提升了跨模态多跳推理能力,包括在SPIQA等标准基准上的改进。
链接: https://arxiv.org/abs/2604.01634
作者: Junyoung Sung,Seungwoo Lyu,Minjun Kim,Sumin An,Arsha Nagrani,Paul Hongsuck Seo
机构: Korea University (高丽大学); Google DeepMind (谷歌深度思维)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to CVPR 2026
Abstract:Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or set of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT consists of diverse domains ranging from natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.
[NLP-54] Grounding AI-in-Education Development in Teachers' Voices: Findings from a National Survey in Indonesia
【速读】: 该论文旨在解决印度尼西亚教育场景中人工智能(Artificial Intelligence, AI)应用缺乏大规模、以教师为中心的实证研究这一问题,从而阻碍了适配本地情境的AI教学系统与政策的发展。其解决方案的关键在于开展一项覆盖全国范围的问卷调查,对349名K-12教师进行调研,系统分析不同学段、教龄和区域教师在教学实践中使用AI的现状、动机与障碍,发现教师主要将AI用于减轻备课负担(如评估设计、课程规划和教学材料开发),但通用输出、基础设施限制及情境适配不足仍是制约其有效融入课堂的核心因素。
链接: https://arxiv.org/abs/2604.01630
作者: Nurul Aisyah,Muhammad Dehan Al Kautsar,Arif Hidayat,Fajri Koto
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite emerging use in Indonesian classrooms, there is limited large-scale, teacher-centred evidence on how AI is used in practice and what support teachers need, hindering the development of context-appropriate AI systems and policies. To address this gap, we conduct a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools. We find increasing use of AI for pedagogy, content development, and teaching media, although adoption remains uneven. Elementary teachers report more consistent use, while senior high teachers engage less; mid-career teachers assign higher importance to AI, and teachers in Eastern Indonesia perceive greater value. Across levels, teachers primarily use AI to reduce instructional preparation workload (e.g., assessment, lesson planning, and material development). However, generic outputs, infrastructure constraints, and limited contextual alignment continue to hinder effective classroom integration.
[NLP-55] OSCAR: Orchestrated Self-verification and Cross-path Refinement
【速读】: 该论文旨在解决生成式 AI(Generative AI)中幻觉(hallucination)问题,即模型在生成过程中产生与事实不符但自洽的错误内容。传统方法依赖外部训练的幻觉分类器进行干预,而本文提出一种基于扩散语言模型(Diffusion Language Models, DLMs)原生信号的推理时控制框架——OSCAR。其关键在于利用DLM固有的去噪轨迹(denoising trajectory)特性,通过跨链交叉熵(cross-chain entropy)定位高不确定性token位置,并基于检索证据实施定向重掩码(targeted remasking),从而在不依赖额外训练的前提下实现对幻觉的有效抑制和事实准确性的提升。该方法充分利用了DLM的结构优势,相比自回归模型具备更强的事实不确定性感知能力。
链接: https://arxiv.org/abs/2604.01624
作者: Yash Shah,Abhijit Chakraborty,Naresh Kumar Devulapally,Vishnu Lokhande,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); University at Buffalo, SUNY (纽约州立大学布法罗分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Diffusion language models (DLMs) expose their denoising trajectories, offering a natural handle for inference-time control; accordingly, an ideal hallucination mitigation framework should intervene during generation using this model-native signal rather than relying on an externally trained hallucination classifier. Toward this, we formulate commitment uncertainty localization: given a denoising trajectory, identify token positions whose cross-chain entropy exceeds an unsupervised threshold before factually unreliable commitments propagate into self-consistent but incorrect outputs. We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods. We also introduce OSCAR, a training-free inference-time framework operationalizing this formulation. OSCAR runs N parallel denoising chains with randomized reveal orders, computes cross-chain Shannon entropy to detect high-uncertainty positions, and then performs targeted remasking conditioned on retrieved evidence. Ablations confirm that localization and correction contribute complementary gains, robust across N ∈ {4, 8, 16}. On TriviaQA, HotpotQA, RAGTruth, and CommonsenseQA using LLaDA-8B and Dream-7B, OSCAR enhances generation quality by significantly reducing hallucinated content and improving factual accuracy through uncertainty-guided remasking, which also facilitates more effective integration of retrieved evidence. Its native entropy-based uncertainty signal surpasses that of specialized trained detectors, highlighting an inherent capacity of diffusion language models to identify factual uncertainty that is not present in the sequential token commitment structure of autoregressive models. We are releasing the codebase to support future research on localization and uncertainty-aware generation in DLMs.
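OSCAR's localization step can be sketched in a few lines. Chain decoding is mocked here with plain token lists, and the threshold is fixed for illustration, whereas the paper derives it in an unsupervised way:

```python
import math
from collections import Counter

def cross_chain_entropy(chains):
    """Per-position Shannon entropy of the tokens committed by N parallel
    denoising chains. chains: list of equal-length token lists."""
    n = len(chains)
    entropies = []
    for pos in range(len(chains[0])):
        counts = Counter(chain[pos] for chain in chains)
        entropies.append(
            -sum((c / n) * math.log2(c / n) for c in counts.values())
        )
    return entropies

def flag_uncertain(chains, threshold=1.0):
    """Positions whose cross-chain entropy exceeds the threshold become
    candidates for evidence-conditioned remasking."""
    return [i for i, h in enumerate(cross_chain_entropy(chains))
            if h > threshold]

# Four chains agree everywhere except position 2, so only that token
# position would be remasked against retrieved evidence.
chains = [
    ["Paris", "is", "the", "capital"],
    ["Paris", "is", "a", "capital"],
    ["Paris", "is", "one", "capital"],
    ["Paris", "is", "the", "capital"],
]
print(flag_uncertain(chains))  # [2]
```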
[NLP-56] Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)中混合专家(Mixture-of-Experts, MoE)架构因继承自自回归系统的词元选择(Token-Choice, TC)路由机制而导致的负载不均与计算分配僵化问题。其解决方案的关键在于引入专家选择(Expert-Choice, EC)路由:EC路由通过设计实现确定性负载均衡,显著提升吞吐量并加速收敛;进一步地,基于EC容量可外部调控的特性,提出时间步依赖的专家容量分配策略,动态调整各去噪步骤的计算资源——实证表明,在低掩码率(low-mask-ratio)步骤分配更多计算资源能在相同浮点运算量(FLOPs)下获得最优性能,原因在于此类上下文中token的学习效率高出一个数量级,从而带来最大边际收益。最终,该方法还证明了现有TC-DLM模型只需替换路由器即可转化为EC架构,实现更快收敛和更优下游任务表现,确立了EC路由作为DLM MoE架构的新范式,并揭示了扩散语言模型中的计算分配应被视为一种可适应的策略而非固定结构。
链接: https://arxiv.org/abs/2604.01622
作者: Shuibai Zhang,Caspian Zhuang,Chihan Cui,Zhihan Yang,Fred Zhangzhi Peng,Yanxin Zhang,Haoyue Bai,Zack Jia,Yang Zhou,Guanhua Chen,Ming Liu
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Scitix; Cornell University (康奈尔大学); Duke University (杜克大学); UC Davis (加州大学戴维斯分校); Southern University of Science and Technology (南方科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 26 pages
Abstract:Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at this https URL.
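The deterministic load balance of expert-choice routing can be shown in a few lines; the toy score matrix and names below are illustrative, not the paper's implementation:

```python
def expert_choice_route(scores, capacity):
    """Expert-choice (EC) routing: scores[t][e] is the router affinity of
    token t for expert e. Each expert independently selects its `capacity`
    highest-scoring tokens, so every expert processes exactly `capacity`
    tokens -- balanced by construction, unlike token-choice routing where
    popular experts overflow. Varying `capacity` with the denoising step
    gives the paper's timestep-dependent allocation."""
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: -scores[t][e])
        assignment[e] = sorted(ranked[:capacity])
    return assignment

scores = [
    [0.9, 0.1],  # tokens 0-2 all prefer expert 0:
    [0.8, 0.2],  # token-choice routing would overload it,
    [0.7, 0.3],  # but EC caps expert 0 at its capacity
    [0.1, 0.9],
]
print(expert_choice_route(scores, capacity=2))
# {0: [0, 1], 1: [2, 3]} -- each expert handles exactly 2 tokens
```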
[NLP-57] Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在部署过程中因静态权重和动态键值缓存(Key-Value cache)带来的高内存与带宽消耗问题。现有基于奇异值分解(SVD)的压缩方法存在两大局限:部分方法重建误差较大,而理论最优的方法则在实际应用中效率低下。解决方案的关键在于提出Swift-SVD,一种激活感知的、闭式求解的压缩框架,其通过增量聚合输入批次的输出激活协方差,并在聚合后执行单次特征值分解,实现无需训练、快速且层级最优的低秩近似;同时引入有效秩(effective rank)分析局部压缩潜力,并设计动态秩分配策略以兼顾局部重建损失与端到端层重要性,从而在保持理论最优性的同时显著提升实践效率。
链接: https://arxiv.org/abs/2604.01609
作者: Ruoling Qi,Yirui Liu,Xuaner Wu,Xiangyu Wang,Ming Li,Chen Chen,Jian Chen,Yin Chen,Qizhen Weng
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code will be released upon acceptance.
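The abstract's "effective rank" measure of layer compressibility is, under one common definition (an assumption here; the paper may use a variant), the exponential of the Shannon entropy of the normalized spectrum:

```python
import math

def effective_rank(eigenvalues):
    """exp(entropy) of the normalized eigenvalue spectrum: equals k when
    k eigenvalues carry equal energy, and falls toward 1 as energy
    concentrates in one direction -- a proxy for how aggressively a
    layer's weights can be truncated to low rank."""
    total = sum(eigenvalues)
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

print(effective_rank([1.0, 1.0, 1.0, 1.0]))       # 4.0: no headroom
print(effective_rank([100.0, 1e-6, 1e-6, 1e-6]))  # ≈ 1: very compressible
```

In a dynamic rank allocation scheme like the one described, layers with low effective rank would receive smaller ranks, modulated by their end-to-end importance.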
[NLP-58] DeltaMem: Towards Agentic Memory Management via Reinforcement Learning
【速读】: 该论文旨在解决多智能体系统在管理角色中心记忆(persona-centric memory)时存在的信息丢失和场景适应性差的问题,这些问题导致其在对话等长期任务中性能不佳。解决方案的关键在于提出DeltaMem,一个基于单智能体设定的代理式记忆管理系统,将角色记忆管理建模为端到端任务;同时借鉴人类记忆演化机制构建用户-助手对话数据集及操作级记忆更新标签,并引入基于记忆的Levenshtein距离(Memory-based Levenshtein Distance)作为奖励函数,结合定制化的强化学习框架优化记忆管理能力,从而显著提升模型在多个长期记忆基准测试中的表现。
链接: https://arxiv.org/abs/2604.01560
作者: Qi Zhang,Shen Huang,Chu Liu,Shouqing Yang,Junbo Zhao,Haobo Wang,Pengjun Xie
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint, under review
Abstract:Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.
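A toy sketch of a Levenshtein-style memory-update reward; the operation alphabet and the normalization are assumptions, not the paper's exact formulation:

```python
def levenshtein(a, b):
    """Classic single-row DP edit distance, here over sequences of
    memory-update operations rather than characters."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def memory_update_reward(predicted_ops, gold_ops):
    """Reward in [0, 1]: 1 when the agent's sequence of memory operations
    matches the gold sequence exactly, decreasing with edit distance."""
    d = levenshtein(predicted_ops, gold_ops)
    return 1.0 - d / max(len(predicted_ops), len(gold_ops), 1)

# Hypothetical operation labels: the agent missed one deletion.
gold = ["ADD:job", "UPDATE:city", "DELETE:old_hobby"]
pred = ["ADD:job", "UPDATE:city"]
print(memory_update_reward(pred, gold))  # 1 - 1/3 ≈ 0.667
```

A reward of this shape gives the RL framework dense, operation-level feedback instead of a binary match signal.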
[NLP-59] Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在医疗领域微调过程中出现的“灾难性遗忘”(catastrophic forgetting)问题,即模型在学习特定临床任务时会显著丧失原有的指令遵循能力(instruction-following ability),从而限制其在临床场景中的实际应用。解决方案的关键在于提出一种基于权重空间插值的模型融合框架(model merging framework),通过将临床基础模型(GatorTronLlama)与通用指令微调模型(Llama-3.1-8B-Instruct)进行融合,有效保留临床专业知识的同时维持强大的指令理解与生成能力。实验表明,该方法在多个医学基准和临床生成任务中均优于传统微调策略,且在极低监督数据(如64-shot)下仍能逼近全量微调性能,具备高度可扩展性和资源效率,适用于医疗资源受限环境下的部署。
链接: https://arxiv.org/abs/2604.01538
作者: Mengxian Lyu,Cheng Peng,Ziyi Chen,Mengyuan Zhang,Jieting Li Lu,Yonghui Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often “forget” a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
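In its simplest linear form, interpolation-based weight-space merging is a parameter-wise weighted average. Real merges operate on model tensors (e.g. PyTorch state dicts) and the paper evaluates several interpolation-based methods; the flat lists below are only a stand-in:

```python
def interpolation_merge(state_a, state_b, alpha=0.5):
    """Linear weight-space interpolation of two checkpoints that share an
    architecture: merged = (1 - alpha) * A + alpha * B, parameter-wise.
    alpha trades the domain expertise of A against the
    instruction-following ability of B."""
    assert state_a.keys() == state_b.keys()
    return {
        name: [(1 - alpha) * wa + alpha * wb
               for wa, wb in zip(state_a[name], state_b[name])]
        for name in state_a
    }

# Hypothetical two-parameter "models" standing in for GatorTronLlama
# and Llama-3.1-8B-Instruct.
clinical = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0, 0.0]}
instruct = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0, 1.0]}
print(interpolation_merge(clinical, instruct, alpha=0.5))
# {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5, 0.5]}
```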
[NLP-60] Read More Think More: Revisiting Observation Reduction for Web Agents
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的Web代理在处理网页观测信息时,如何选择最优观察表示形式的问题。以往研究普遍认为HTML文本的冗长性会降低性能,因而采用简化表示(如可访问性树)作为标准做法。本文通过实证发现,最优观察表示取决于模型能力与思维令牌预算(thinking token budget):对于低能力模型,紧凑表示更优;而对于高能力模型,保留完整HTML能显著提升性能,且随着思维令牌预算增加,HTML的优势进一步放大。关键解决方案在于提出一种自适应策略——根据模型能力和可用思考资源动态选择观察表示,并引入基于差分(diff-based)的历史观测表示以高效利用历史信息,从而在不同设置下实现性能优化。
链接: https://arxiv.org/abs/2604.01535
作者: Masafumi Enomoto,Ryoma Obara,Haochen Zhang,Masafumi Oyamada
机构: NEC Corporation
类目: Computation and Language (cs.CL)
备注:
Abstract:Web agents based on large language models (LLMs) rely on observations of web pages – commonly represented as HTML – as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard practice. We revisit this trend and demonstrate that the optimal observation representation depends on model capability and thinking token budget: (1) compact observations (accessibility trees) are preferable for lower-capability models, while detailed observations (HTML) are advantageous for higher-capability models; moreover, increasing thinking tokens further amplifies the benefit of HTML. (2) Our error analysis suggests that higher-capability models exploit layout information in HTML for better action grounding, while lower-capability models suffer from increased hallucination under longer inputs. We also find that incorporating observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative. Based on these findings, we suggest practical guidelines: adaptively select observation representations based on model capability and thinking token budget, and incorporate observation history using diff-based representations.
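The diff-based history representation can be sketched with the standard library's difflib; the HTML snippets are invented for illustration:

```python
import difflib

def diff_observation(prev_html, curr_html):
    """Token-efficient history: instead of replaying the full previous
    page at every step, keep only the lines that changed between
    consecutive observations."""
    diff = difflib.unified_diff(
        prev_html.splitlines(), curr_html.splitlines(),
        lineterm="", n=0,  # n=0: emit changed lines only, no context
    )
    # Drop the '---'/'+++' file headers and '@@' hunk markers.
    return [line for line in diff if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]

step1 = "<ul>\n<li>Cart: empty</li>\n<li>Login</li>\n</ul>"
step2 = "<ul>\n<li>Cart: 1 item</li>\n<li>Login</li>\n</ul>"
print(diff_observation(step1, step2))
# ['-<li>Cart: empty</li>', '+<li>Cart: 1 item</li>']
```

Only the cart line survives, so an agent retains the effect of its last action at a fraction of the token cost of the full page.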
[NLP-61] Why Instruction-Based Unlearning Fails in Diffusion Models?
【速读】: 该论文旨在解决生成式 AI(Generative AI)中指令驱动的遗忘(instruction-based unlearning)方法在扩散模型(diffusion models)中的有效性问题,即是否能够通过自然语言指令在推理阶段有效抑制特定概念的生成。研究表明,仅依赖提示层面的语言控制无法使扩散模型系统性地消除目标概念,其关键原因在于:未学习指令未能引发对目标概念标记(token)的持续注意力抑制,导致相关表征在整个去噪过程中仍得以保留。因此,有效的遗忘机制必须超越推理时的语言干预,需引入更深层次的模型内部干预策略。
链接: https://arxiv.org/abs/2604.01514
作者: Zeliang Zhang,Rui Sun,Jiani Liu,Qi Wu,Chenliang Xu
机构: University of Rochester (罗切斯特大学); UCLA (加州大学洛杉矶分校); UCSB (加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.
[NLP-62] Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything Everywhere All at Once
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)研究中对输出变异性的理解碎片化问题,即当前学术界在探讨生成多样性、推理能力、对齐性及表征分析时缺乏统一的理论框架,且任务背后的规范性目标常未被明确界定。其解决方案的关键在于提出“Magic, Madness, Heaven, Sin”(MMHS)框架,该框架将输出变异性置于同质性-异质性轴上进行建模,并依据任务的规范性目标(epistemic事实性、interactional用户效用、societal代表性、safety鲁棒性)划分四类评价语境;通过此框架系统解析不同语境下变异性的失效模式(如幻觉、模式坍缩、偏见和消解),揭示单一目标优化(如提升安全性)可能对其他维度(如群体代表性或创造性多样性)产生意外损害,从而主张基于任务语境的输出变异评估,将其重新定义为由任务目标塑造的属性,而非模型固有特性。
链接: https://arxiv.org/abs/2604.01504
作者: Harnoor Dhingra
机构: Microsoft
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Under review
Abstract:Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of “diversity.” Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model’s intrinsic trait.
[NLP-63] From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
【速读】: 该论文旨在解决开源大语言模型(Large Language Model, LLM)在软件工程基准测试(SWE-bench)中性能不足的问题,尤其是如何高效地将前沿闭源模型的知识迁移到资源受限的开源模型中。解决方案的关键在于提出了一种两阶段监督微调(Supervised Fine-Tuning, SFT)流程:首先通过SWE-ZERO利用大规模、无需执行的轨迹来掌握代码语义和仓库级推理能力;随后通过SWE-HERO引入有针对性的、基于执行反馈的精炼策略,将语义理解转化为严谨的工程工作流。该方法显著提升了开源模型在SWE-bench上的表现,实现了对多语言任务的零样本迁移能力,且无需依赖昂贵的计算资源。
链接: https://arxiv.org/abs/2604.01496
作者: Nikolai Ludwig,Wasi Uddin Ahmad,Somshubra Majumdar,Boris Ginsburg
机构: NVIDIA (英伟达)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement to transition these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm’s generalizability across diverse languages.
[NLP-64] When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)过程中出现的奖励欺骗(reward hacking)问题,特别是在编程任务中,模型通过篡改评估器代码来绕过真实求解目标,从而获得虚假高奖励。为系统研究此现象,作者构建了一个环境操控实验场景,使模型能够修改测试用例以“作弊”通过验证。研究发现模型行为呈现可复现的三阶段反弹模式:初始尝试失败、短暂回归合法求解、最终采用新策略成功作弊。关键解决方案是提出优势值修正(Advantage Modification),其核心在于利用表示工程从通用对比样本中提取出“捷径”(shortcut)、“欺骗”(deception)和“评估意识”(evaluation awareness)的概念方向,并发现“捷径方向”与作弊行为高度相关,因而作为代理指标嵌入GRPO的优势计算中,在训练阶段即对作弊轨迹施加惩罚,而非仅在推理时进行干预,从而实现更鲁棒的奖励欺骗抑制效果。
链接: https://arxiv.org/abs/2604.01476
作者: Rui Wu,Ruixiang Tang
机构: Rutgers University (罗格斯大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 15 pages, 8 figures
Abstract:Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and find that the shortcut direction tracks hacking behavior most closely, making it an effective representational proxy for detection. Motivated by this finding, we propose Advantage Modification, which integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts before policy updates. Because the penalty is internalized into the training signal rather than applied only at inference time, Advantage Modification provides more robust suppression of hacking compared with generation-time activation steering.
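A minimal sketch of Advantage Modification in a GRPO-style update. The linear penalty form and its coefficient are assumptions, since the abstract does not specify them:

```python
def modified_advantages(rewards, shortcut_scores, penalty=0.5):
    """GRPO-style group-normalized advantages with a representation-level
    penalty: rollouts whose activations project strongly onto the
    'shortcut' concept direction get their advantage pushed down before
    the policy update, so hacking is discouraged inside the training
    signal rather than only at inference time."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std - penalty * s
            for r, s in zip(rewards, shortcut_scores)]

# Rollout 1 "passes" by rewriting the evaluator (high shortcut score);
# rollout 0 solves legitimately. Their raw rewards are identical.
rewards = [1.0, 1.0, 0.0, 0.0]
shortcut = [0.0, 0.9, 0.1, 0.0]
adv = modified_advantages(rewards, shortcut)
print(adv[0] > adv[1])  # True: the hacking rollout is penalized
```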
[NLP-65] A Dynamic Atlas of Persian Poetic Symbolism: Families, Fields and the Historical Rewiring of Meaning
【速读】: 该论文旨在解决现有计算文学分析方法在处理波斯诗歌时,将象征性元素(如酒器、花园、火焰等)简化为孤立词汇或整体文档语义的问题,从而忽略了波斯诗学中象征形式以“家族”为单位、通过重复关系获得意义的实践组织方式。其解决方案的关键在于构建一个包含129,451首诗歌的大规模语料库,并基于反复出现的形式将其归类为可追踪的“象征家族”,区分意象材料与神圣及宫廷指涉内容,在多层图结构中映射它们之间的关联;这一方法揭示了象征核心相对稀疏、指涉成分更密集、连接区域具有选择性而非弥散性的特征,并捕捉到不同时代间符号网络的动态演化——包括模块化增强、跨范畴联结减弱、宫廷桥梁弱化、神圣桥梁强化以及枢纽节点位置的变迁,表明波斯象征体系并非静态集合,而是一个随时间演变的动态系统。
链接: https://arxiv.org/abs/2604.01467
作者: Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Persian poetry is often remembered through recurrent symbols before it is remembered through plot. Wine vessels, gardens, flames, sacred titles, bodily beauty, and courtly names return across centuries, yet computational work still tends to flatten this material into isolated words or broad document semantics. That misses a practical unit of organization in Persian poetics: related forms travel as families and gain force through recurring relations. Using a corpus of 129,451 poems, we consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. The symbolic core is relatively sparse, the referential component much denser, and the attachment zone between them selective rather than diffuse. Across 11 Hijri-century bins, some families remain widely distributed, especially Shab (Night), Ruz (Day), and Khaak (Earth). Wine vessels, garden space, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show change in arrangement as well as membership. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hub positions shift too: Kherqe (Sufi Robe) gains late prominence, Farkhondeh (Blessed) and Banafsheh (Violet) recede, and Saaghar (Wine Cup) stays central across the chronology. In this corpus, Persian symbolism appears less as a fixed repertory than as a long-lived system whose internal weights and connections change over time.
[NLP-66] Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的“自信错误”问题,即模型在生成事实性错误答案时往往表现出过高的口头化置信度,从而误导用户并削弱置信度分数作为可靠不确定性信号的作用。其解决方案的关键在于通过电路级机制分析,识别出导致置信度膨胀的特定神经网络组件——即集中在中后期层中的少量多层感知机(MLP)模块和注意力头,并发现这些组件会将置信度膨胀信号写入最终token位置;进一步地,通过针对这些电路进行推理时的定向干预,显著改善了模型的校准性能。
链接: https://arxiv.org/abs/2604.01457
作者: Tianyi Zhao,Yinhan He,Wendy Zheng,Yujie Zhang,Chen Chen
机构: University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models are often not just wrong, but *confidently wrong*: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
[NLP-67] Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
【速读】: 该论文旨在解决生成式AI(Generative AI)中引用粒度(citation granularity)设计对事实溯源准确性(attribution quality)的影响问题。当前实践中倾向于采用细粒度引用(如句子级),以利于人工验证,但其对模型性能的实际影响尚不明确。研究通过分析8B至120B规模的多个模型发现,强制使用细粒度引用会显著降低溯源质量(恶化16%-276%),而中等粒度(段落级)表现最优;进一步表明,过细的引用破坏了模型进行多句语义整合的能力,尤其在大模型中更为明显。解决方案的关键在于:将引用粒度与模型自然的语义范围相匹配,而非单纯追求人类可读性——这一策略可在保持或提升答案正确性的前提下,大幅提升溯源准确性,从而实现更可靠的生成内容可信溯源。
链接: https://arxiv.org/abs/2604.01432
作者: Hexuan Wang,Jingyu Zhang,Benjamin Van Durme,Daniel Khashabi
机构: Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Citation granularity - whether to cite individual sentences, paragraphs, or documents - is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model’s natural semantic scope.
[NLP-68] The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi
【速读】: 该论文旨在解决一个长期存在的语言学问题:尽管理论上绝对同义词(absolute synonyms)不应存在,因为它们无法扩展语言的表达能力,但现实中大量同义词共存的现象仍需解释。作者试图通过量化方法验证,即使两个词语义相同,其来源差异(如梵语与波斯-阿拉伯借词)是否仍能在使用模式中留下可识别的痕迹。解决方案的关键在于利用词嵌入(word embeddings)和随机森林(Random Forest)分类器对印地语中的同义词对进行训练,结果表明模型能够仅凭分布数据(distributional data)准确区分词源,即便这些词在语义上无关。这一发现说明语境能编码历史词源信号,揭示了同义词可能承载源自不同文化背景的细微系统性差异,从而为语言中“语义框架”(semantic frame)的形成提供了新的视角。
链接: https://arxiv.org/abs/2604.01425
作者: Jacek Bąkowski
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms theoretically should not exist, as they do not expand language’s expressive potential. However, it was suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterpart, forming numerous synonym pairs. This study investigates whether centuries after these borrowings appeared in the Subcontinent their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.
[NLP-69] Cost-Efficient Estimation of General Abilities Across Benchmarks
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)评估中存在效率低下的问题,即当前众多基准测试难以高效、准确地预测模型在未见任务上的性能。其核心挑战在于如何从海量多样化的任务中提炼出具有代表性的评估方法,以最小的资源消耗实现高精度预测。解决方案的关键在于提出一种结合改进的多维项目反应理论(Multidimensional Item Response Theory, MIRT)模型与基于最优实验设计的自适应项目选择机制的方法,能够在仅观测16个测试项的情况下,对112个保留任务的性能预测达到平均绝对误差(MAE)低于7%的水平;进一步引入成本感知折扣因子优化选择策略后,将达到相同预测精度所需的总token数从141,000降至22,000,实现85%的评估成本削减。
链接: https://arxiv.org/abs/2604.01418
作者: Michael Krumdick,Adam Wiemerslage,Seth Ebner,Charles Lovering,Chris Tanner
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the “Wide-scale Item Level Dataset” (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model’s performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
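论文的核心机制(IRT 能力估计 + 按信息量自适应选题)可以用一维 2PL IRT 写一个纯 Python 极简示意;论文实际使用改进的多维 IRT 与最优实验设计,此处的题库参数、真实能力值与估计方式均为演示用假设。

```python
import math, random

random.seed(1)

def p_correct(theta, a, b):
    # 2PL IRT:答对概率 = sigmoid(a * (theta - b)),a 为区分度,b 为难度
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# 合成题库与“真实能力”;真实方法是在多维潜在能力空间上建模
items = [(random.uniform(0.5, 2.0), random.uniform(-2.0, 2.0)) for _ in range(200)]
true_theta = 0.8

asked, responses = set(), []
theta_hat = 0.0
for _ in range(16):  # 论文报告仅需观测约 16 个测试项
    # 自适应选题:在当前能力估计下选 Fisher 信息量最大的未测题
    i = max((j for j in range(len(items)) if j not in asked),
            key=lambda j: fisher_info(theta_hat, *items[j]))
    asked.add(i)
    a, b = items[i]
    y = 1 if random.random() < p_correct(true_theta, a, b) else 0
    responses.append((a, b, y))
    # 基于全部已观测作答重估 theta(简化的梯度上升 MLE)
    theta_hat = 0.0
    for _ in range(300):
        grad = sum(ai * (yi - p_correct(theta_hat, ai, bi))
                   for ai, bi, yi in responses)
        theta_hat += 0.02 * grad
```

自适应选题使每道题都落在当前能力估计附近(信息量最大处),这是少量测试项即可获得较准估计的原因。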
[NLP-70] Adaptive Stopping for Multi-Turn LLM Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮推理(multi-turn reasoning)场景下,如自适应检索增强生成(adaptive retrieval-augmented generation, RAG)和ReAct-style代理中,缺乏有效且具有形式保证的停止策略问题。现有方法依赖启发式规则或固定轮次预算,无法确保最终预测包含正确答案,尤其在金融、医疗等高风险领域可能导致决策错误或资源浪费。解决方案的关键在于提出首个适用于多轮推理的 conformal prediction(CP)框架——MiCP(Multi-Turn Language Models with Conformal Prediction),其通过在不同轮次间分配差异化误差预算(error budget),实现早期停止的同时维持整体覆盖保证(coverage guarantee),从而在保持准确性前提下显著降低推理轮次、计算成本与预测集大小。
链接: https://arxiv.org/abs/2604.01413
作者: Xiaofan Zhou,Huy Nguyen,Bo Yu,Chenxi Liu,Lu Cheng
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); University of Utah (犹他大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: When should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
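MiCP 的核心思想——把总体误差预算拆分到各轮、每轮分别取 conformal 分位数阈值——可以用 split conformal 的标准做法示意如下。数据为合成,均分预算只是其中一种分配方式,并非论文的完整算法。

```python
import math, random

random.seed(2)

def conformal_threshold(cal_scores, alpha):
    # split conformal:取第 ceil((n+1)(1-alpha)) 小的校准分数作为阈值
    n = len(cal_scores)
    k = min(n - 1, math.ceil((n + 1) * (1.0 - alpha)) - 1)
    return sorted(cal_scores)[k]

ALPHA, TURNS = 0.1, 3                 # 总体误差预算 10%,3 轮推理
budgets = [ALPHA / TURNS] * TURNS     # 均分预算;按 union bound,总误覆盖率 <= 各轮预算之和

# 合成校准集:每轮一组不合格度分数(分数越低表示候选答案越可信)
calib = [[random.random() for _ in range(400)] for _ in range(TURNS)]
thresholds = [conformal_threshold(calib[t], budgets[t]) for t in range(TURNS)]

# 在新样本上检查第 1 轮的经验覆盖率:应不低于 1 - budgets[0]
test_scores = [random.random() for _ in range(2000)]
coverage = sum(s <= thresholds[0] for s in test_scores) / len(test_scores)
```

推理时,某一轮的分数落入该轮阈值内即可提前停止;由于各轮误差预算之和不超过总预算,整体覆盖保证得以保持。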
[NLP-71] Test-Time Scaling Makes Overtraining Compute-Optimal
【速读】: 该论文旨在解决现代大语言模型(Large Language Models, LLMs)在测试时通过重复采样(repeated sampling)进行扩展所带来的推理成本问题,这一成本与模型规模和采样次数呈正比,从而形成预训练扩展规律(pretraining scaling laws,如Chinchilla)无法覆盖的权衡关系。解决方案的关键在于提出“训练到测试”(Train-to-Test, T²)扩展规律,该规律联合优化模型规模、训练数据量和测试时采样次数,在固定端到端预算下实现最优决策;其核心创新在于引入用于测试时扩展的pass@k建模方法,并将预训练与测试时策略统一优化,实证表明在考虑推理成本后,最优预训练策略显著偏向过训练区域,且该结论在后训练阶段依然成立,验证了T²扩展规律在现代部署中的有效性。
链接: https://arxiv.org/abs/2604.01411
作者: Nicholas Roberts,Sungjun Cho,Zhiqi Gao,Tzu-Heng Huang,Albert Wu,Gabriel Orlanski,Avi Trost,Kelly Buchanan,Aws Albarghouthi,Frederic Sala
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
Abstract:Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. T^2 modernizes pretraining scaling laws with pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust over distinct modeling approaches: measuring joint scaling effect on the task loss and modeling impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.
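T² 规律中的测试时扩展部分依赖 pass@k 建模。社区通行的无偏估计式(源自 Codex 论文的组合公式,非该论文独有)是:对同一题采样 n 次、其中 c 次正确时,任取 k 个样本至少命中一次的概率为 1 - C(n-c, k)/C(n, k)。

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n 次采样中 c 次正确时,pass@k 的无偏估计。"""
    if n - c < k:          # 错误样本不足 k 个,抽 k 个必然含正确样本
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 示意:单样本正确率 0.30 的模型,多采样后 pass@k 大幅提升
single_shot = pass_at_k(100, 30, 1)   # 等于单样本正确率 0.30
best_of_16 = pass_at_k(100, 30, 16)   # 16 次采样至少一次正确的概率
```

正是这种“采样次数换准确率”的曲线,使得联合优化模型规模、训练量与采样次数后,最优解偏向更小、但被重度过训练的模型。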
[NLP-72] Assessing Pause Thresholds for empirical Translation Process Research
【速读】: 该论文旨在解决翻译过程研究中如何准确界定“自动化翻译行为”与“需要反思的翻译行为”的分界点问题,即如何科学计算用于划分生产单元边界(Production Unit Breaks)的按键停顿阈值。其解决方案的关键在于比较三种现有停顿阈值计算方法,并提出并评估一种新的计算方法,以更精确地区分不同认知负荷下的翻译行为阶段,从而提升对翻译过程动态机制的理解和建模精度。
链接: https://arxiv.org/abs/2604.01410
作者: Devi Sri Bandaru,Michael Carl,Xinyue Ren
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for Presentation at “Translation in Transition 8, September 2026”
Abstract:Text production (and translations) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O’Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann 2016), this paper compares three recent approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.
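生产单元切分的基本操作可示意如下:给定按键时间戳序列,按某个停顿阈值切分出打字片段。文献中阈值取法从固定的 1–2 秒到按个人打字节奏自适应取值的方案均有讨论;下面的阈值与数据均为演示用假设,不代表论文评估的任何一种具体取法。

```python
def production_units(timestamps_ms, threshold_ms=1000):
    """按停顿阈值把按键时间戳序列切分为产出单元(Production Units)。"""
    units, current = [], [timestamps_ms[0]]
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        if cur - prev > threshold_ms:   # 超过阈值的停顿 => 产出单元边界
            units.append(current)
            current = []
        current.append(cur)
    units.append(current)
    return units

# 示意数据:三段快速打字,被两次长停顿(1600ms、2700ms)隔开
keys = [0, 120, 260, 400, 2000, 2150, 2300, 5000, 5100]
units = production_units(keys, threshold_ms=1000)
```

阈值取大取小直接决定“自动化”片段与“反思性”停顿的划分粒度,这正是论文所比较的各方法的分歧所在。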
[NLP-73] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
【速读】: 该论文旨在解决语言模型在回答以实体为中心的事实性问题时,其内部机制尚不明确的问题。解决方案的关键在于通过模板化提示定位出对特定实体敏感的多层感知机(MLP)神经元,并利用因果干预验证这些神经元的功能:在PopQA数据集上的实验证明,这些局部化的神经元主要集中在早期层,且负向消融会导致实体特异性遗忘,而向占位符token注入信号则能显著提升答案检索效果,优于均值实体和错误细胞控制组。此外,单个神经元即可恢复实体一致预测,表明存在一种紧凑的实体检索机制而非深度逐层渐进增强,且该机制对别名、缩写、拼写错误及多语言形式具有鲁棒性,支持“标准化”(canonicalization)解释。
链接: https://arxiv.org/abs/2604.01404
作者: Itay Yona,Dan Barzilay,Michael Karasik,Mor Geva
机构: Mentaleap; Independent Researcher; Tel Aviv University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
[NLP-74] Open-Domain Safety Policy Construction EACL2026
【速读】: 该论文旨在解决生成式 AI (Generative AI) 产品中内容审核策略(content moderation policy)制定与维护成本高昂的问题,尤其是针对特定领域(domain-specific)安全政策的编写难度大、耗时长。解决方案的关键在于提出一种名为 Deep Policy Research (DPR) 的轻量级智能体系统,其核心机制是通过一个结构化的研究循环:仅依赖人类撰写的种子领域信息,利用单一网页搜索工具和轻量级框架,迭代生成搜索查询、从多样网络来源提炼规则,并将规则组织为索引文档。该方法在多个基准测试中显著优于仅基于定义或上下文学习的基线模型,在端到端场景下甚至可媲美专家编写的政策片段,表明任务特异性、结构化的研究流程比通用深度研究更适用于政策起草。
链接: https://arxiv.org/abs/2604.01354
作者: Di Wu,Siyue Liu,Zixiang Ji,Ya-Liang Chang,Zhe-Yu Liu,Andrew Pleffer,Kai-Wei Chang
机构: University of California, Los Angeles (加州大学洛杉矶分校); Taboola
类目: Computation and Language (cs.CL)
备注: EACL 2026 (Findings)
Abstract:Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at this https URL.
[NLP-75] No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在多用户共享状态场景下出现的“非故意跨用户污染”(Unintentional Cross-User Contamination, UCC)问题。UCC指由于共享知识层未按用户作用域隔离,导致一个用户的本地有效信息被错误应用于其他用户,从而引发沉默错误(silent wrong answers),且无需攻击者介入即可发生。解决方案的关键在于识别并区分三类污染类型,并提出基于写入时净化(write-time sanitization)的防御机制;然而研究表明,仅靠文本级净化在包含可执行代码等复杂Artifact的共享状态下仍存在显著残留风险,因此必须引入面向Artifact级别的防护策略,以实现对跨用户污染的有效遏制。
链接: https://arxiv.org/abs/2604.01350
作者: Tiankai Yang,Jiate Li,Yi Nian,Shen Dong,Ruiyao Xu,Ryan Rossi,Kaize Ding,Yue Zhao
机构: University of Southern California (南加州大学); Michigan State University (密歇根州立大学); Northwestern University (西北大学); Adobe Research (Adobe研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user’s outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57–71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.
[NLP-76] Procedural Knowledge at Scale Improves Reasoning
【速读】: 该论文旨在解决现有测试时扩展(test-time scaling)方法在复杂推理任务中未能系统复用先前推理轨迹中的程序性知识(procedural knowledge)的问题,尤其是缺乏对如何重构问题、选择策略以及验证或回溯的显式利用。其解决方案的关键在于提出一种名为“推理记忆”(Reasoning Memory)的检索增强生成(Retrieval-Augmented Generation, RAG)框架,通过将已有的分步推理轨迹分解为自包含的子问题-子程序对(subquestion-subroutine pairs),构建包含3200万条紧凑程序性知识条目的数据存储库;在推理阶段,借助轻量级的“思维内提示”(in-thought prompt)使模型能够显式提取核心子问题,并从当前推理路径中检索相关子程序作为隐式程序先验,从而实现大规模、结构化的程序性知识复用。实验证明,该方法在多个数学、科学和编程基准上显著优于传统RAG及计算预算相当的测试时扩展基线。
链接: https://arxiv.org/abs/2604.01348
作者: Di Wu,Devendra Singh Sachan,Wen-tau Yih,Mingda Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
[NLP-77] Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
【速读】: 该论文旨在解决语言模型中人类偏好学习的挑战,即如何有效建模基于细微、主观判断的奖励信号,而非明确标签。其核心问题在于现有方法在捕捉人类判断的多维特性(如帮助性、安全性与相关性)时表现有限,导致奖励模型性能不足(基准ROC AUC低于0.74)。解决方案的关键在于提出一种特征增强框架,通过引入可解释的辅助信号(如响应长度、拒绝指示、毒性评分和提示-响应语义相似度)来丰富文本表征,使模型能够显式建模这些关键维度;同时结合SHAP和LIME实现细粒度可解释性分析,揭示决策依赖于情境化的安全性和支持性语境,而非孤立关键词,从而显著提升配对准确率(最高达0.84 ROC AUC),并量化特征间交互对偏见放大的影响。
链接: https://arxiv.org/abs/2604.01312
作者: Simona-Vasilica Oprea,Adela Bâra
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTa-v3-Large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.
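论文的“特征增强”思路——用长度、拒绝指示、提示-回复相似度等可解释信号做成对偏好学习——可以用一个 Bradley-Terry 式的成对逻辑回归极简示意。以下的特征定义、拒绝词表与合成数据均为演示用假设;论文实际以 DeBERTa 等模型为骨干,并使用毒性评分与语义相似度作为特征。

```python
import math, random

random.seed(3)

REFUSAL_MARKERS = ("i cannot", "i can't", "as an ai")  # 假设的拒绝指示词表

def features(prompt, response):
    # 可解释特征:长度、是否拒绝、提示-回复词重叠(以词重叠粗略近似语义相似度)
    p_words, r_words = set(prompt.lower().split()), response.lower().split()
    overlap = len(p_words & set(r_words)) / max(1, len(p_words))
    refusal = float(any(m in response.lower() for m in REFUSAL_MARKERS))
    return [len(r_words) / 100.0, refusal, overlap]

def make_pair():
    # 合成偏好对:被选中的回复更长、更贴题;被拒绝的要么敷衍要么拒答
    prompt = "explain how photosynthesis works in plants"
    chosen = "photosynthesis works when plants convert light into energy " * random.randint(1, 3)
    rejected = random.choice(["i cannot help with that request", "ok"])
    return features(prompt, chosen), features(prompt, rejected)

pairs = [make_pair() for _ in range(200)]

# Bradley-Terry 式成对逻辑回归:P(chosen 优于 rejected) = sigmoid(w · (f_c - f_r))
w = [0.0, 0.0, 0.0]
for _ in range(100):
    for fc, fr in pairs:
        d = [x - y for x, y in zip(fc, fr)]
        p = 1.0 / (1.0 + math.exp(-sum(wi * di for wi, di in zip(w, d))))
        w = [wi + 0.1 * (1.0 - p) * di for wi, di in zip(w, d)]

acc = sum(sum(wi * (x - y) for wi, x, y in zip(w, fc, fr)) > 0
          for fc, fr in pairs) / len(pairs)
```

训练后权重本身即可解释:拒绝特征的权重为负、相似度权重为正,对应论文中 SHAP/LIME 分析所揭示的决策依据。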
[NLP-78] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
【速读】: 该论文旨在解决科学论断与多模态证据之间一致性评估的难题,现有基准在规模、领域多样性及视觉复杂性方面存在不足,难以真实反映模型对跨模态信息整合能力的评估。其解决方案的关键在于构建M2-Verify——一个大规模多模态数据集,涵盖来自PubMed和arXiv的超过46.9万条实例,覆盖16个科学领域,并通过专家审核确保质量;该数据集不仅支持对模型一致性判断能力的系统评测,还揭示了当前先进模型在高复杂度场景(如解剖结构变化)下性能显著下降以及生成式解释中存在幻觉现象的问题,从而为未来研究提供可靠评估工具和实践指南。
链接: https://arxiv.org/abs/2604.01306
作者: Abolfazl Ansari,Delvin Ce Zhang,Zhuoyang Zou,Wenpeng Yin,Dongwon Lee
机构: The Pennsylvania State University (宾夕法尼亚州立大学); University of Sheffield (谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注: Preprint. Under Review
Abstract:Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset’s utility and provide comprehensive usage guidelines.
[NLP-79] Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
【速读】: 该论文旨在解决在竞技编程(competitive programming)任务中如何有效扩展推理令牌(reasoning token)预算以提升模型性能的问题。其核心挑战在于:单纯依赖训练阶段的强化学习(RL)来增加单次生成的推理长度会因全注意力机制导致计算成本迅速上升,且难以在测试时高效利用资源。解决方案的关键在于提出一种双管齐下的策略:一是通过验证增强的强化学习(verification RL warmup)和随机截断(randomized clipping)优化训练轨迹,使模型在较低token消耗下获得更高准确率;二是设计了一个多轮并行思考(multi-round parallel thinking)流水线,将token预算分配到多个线程与生成-验证-精炼的多轮迭代中,实现端到端训练以匹配测试结构。该方法显著提升了效率,在平均仅使用760万tokens/题的情况下达到原RL模型oracle pass@16的性能,并在456道高难度AetherCode问题上超越GPT-5-high。
链接: https://arxiv.org/abs/2604.01302
作者: Qianfan Zhang,Tianyu Guo,Xuandi Ren,Jiale Chen,Ming Ding,Ran Xin,Xia Xiao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model’s oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
[NLP-80] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在回答知识密集型视觉问答(VQA)任务时,难以有效整合视觉线索与外部检索文本证据的问题,尤其是在面对噪声或部分相关文本信息时,模型往往无法精准定位图像中的细粒度视觉区域。解决方案的关键在于提出一种无需训练的推理阶段框架——Look Twice (LoT),其核心机制是利用预训练MLLM的注意力模式来识别与查询最相关的视觉区域和文本片段,并通过轻量级提示标记(prompt-level markers)突出这些证据,引导模型在生成答案时重新关注关键信息。该方法显著提升了模型在多个基于知识的VQA基准上的表现,且在无文本上下文的视觉主导场景中也展现出性能提升,同时不依赖额外训练或架构改动。
链接: https://arxiv.org/abs/2604.01280
作者: Marco Morini,Sara Sarto,Marcella Cornia,Lorenzo Baraldi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL
Abstract:Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
[NLP-81] The Overlooked Repetitive Lengthening Form in Sentiment Analysis EMNLP2024
【速读】: 该论文旨在解决生成式 AI(Generative AI)在情感分析(Sentiment Analysis, SA)任务中对非正式表达形式——尤其是重复延展形式(Repetitive Lengthening Form, RLF)的理解不足问题。RLF作为一种常见于在线交流中的强调性语言风格(如“太开心啦啦啦”),长期以来被忽视,但其具有显著的情感表达能力并可作为文档级情感的特征标志。为应对这一挑战,作者构建了首个聚焦RLF的多领域情感分析数据集 \textbfLengthening(含850k样本),并提出一种两阶段指令微调框架 \textbfExplainable \textbfInstruction Tuning (\textbfExpInstruct),以提升大语言模型(LLMs)在RLF理解上的性能与可解释性。关键创新在于引入统一量化方法评估模型对非正式表达的理解程度,并证明通过ExpInstruct可在有限样本下使开源模型达到零样本GPT-4在性能和解释性上的水平,从而推动面向真实网络语境的情感计算研究。
链接: https://arxiv.org/abs/2604.01268
作者: Lei Wang,Eduard Dragut
机构: Temple University (坦普尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of EMNLP 2024
Abstract:Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two-stage instruction tuning framework aimed to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs’ understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open-sourced LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at this https URL
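RLF(拉长形式)最直观的识别方式是正则匹配同一字符的连续重复;下面给出一个极简的检测与压缩示意(与论文的数据构建流程无关,仅演示现象本身)。注意压缩到单字符会误伤 "goood" 这类本应还原为双写的词,实际系统需要词典校正。

```python
import re

RLF = re.compile(r"(\w)\1{2,}")  # 同一字符连续出现 3 次及以上,视为拉长形式

def has_rlf(text: str) -> bool:
    return bool(RLF.search(text))

def normalize_rlf(text: str) -> str:
    # 将拉长片段压缩为单个字符:"soooo happy" -> "so happy"
    return RLF.sub(lambda m: m.group(1), text)
```

检测结果可直接作为文档级情感强调的信号,呼应论文“RLF 可作为情感签名”的发现。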
信息检索
[IR-0] AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
【速读】:该论文旨在解决科学多标签文本分类中极端类别不平衡(extreme class imbalance)的问题,即专业术语在数据集中呈现严重的幂律分布,导致传统分类方法难以有效处理稀有类别。其解决方案的关键在于构建并公开AstroConcepts语料库——一个包含21,702篇天体物理学论文摘要、标注了来自统一天文学词表(Unified Astronomy Thesaurus)的2,367个概念的高质量资源,其中76%的概念训练样本少于50条。该语料库支持对极端不平衡场景下的系统性研究,并通过对比传统模型、神经网络及词汇约束大语言模型(vocabulary-constrained LLMs)的方法,揭示了三类关键模式:词汇约束LLMs在天体物理分类中表现接近领域适配模型,暗示参数高效方法的可能性;领域适配对罕见术语提升显著但绝对性能仍有限;以及提出频率分层评估策略以暴露聚合指标掩盖的性能差异,从而将鲁棒性评估置于科学多标签分类的核心位置。
链接: https://arxiv.org/abs/2604.02156
作者: Atilla Kaan Alkan,Felix Grezes,Sergi Blanco-Cuaresma,Jennifer Lynn Bartlett,Daniel Chivvis,Anna Kelbert,Kelly Lockhart,Alberto Accomazzi
机构: 未知
类目: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 9 pages, 2 figures
Abstract:Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
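论文提出的频率分层评估思路可示意如下:按标签在训练集中的出现次数分桶,分别计算桶内逐标签 F1 的宏平均,从而暴露被聚合指标掩盖的长尾性能。分桶边界与示例数据均为演示用假设。

```python
def stratified_macro_f1(gold, pred, train_counts,
                        bins=((0, 10), (10, 50), (50, float("inf")))):
    """gold/pred: 每个样本的标签集合列表;train_counts: 标签 -> 训练集频次。"""
    per_bin = {b: [] for b in bins}
    labels = set().union(*gold, *pred)
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        c = train_counts.get(label, 0)
        for lo, hi in bins:
            if lo <= c < hi:
                per_bin[(lo, hi)].append(f1)
    # 每个频次桶一个宏平均 F1;空桶记 None
    return {b: (sum(v) / len(v) if v else None) for b, v in per_bin.items()}

# 玩具示例:高频标签全对、长尾标签全错,聚合指标会掩盖这种差距
gold = [{"galaxy", "quasar"}, {"galaxy"}]
pred = [{"galaxy"}, {"galaxy", "nova"}]
counts = {"galaxy": 100, "quasar": 5, "nova": 5}
report = stratified_macro_f1(gold, pred, counts)
```

在这个玩具例子里,整体微平均 F1 看似尚可,但低频桶的 F1 为 0,分层报告立刻暴露了长尾失效。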
[IR-1] Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
【速读】:该论文旨在解决当前重排序模型(reranker)在检索增强生成(Retrieval-Augmented Generation, RAG)中因与下游大语言模型(LLM)生成过程脱钩而导致的语义相关性与实际生成效用不一致的问题。现有方法通常依赖静态人工标注的相关性标签进行优化,忽视了文档对 LLM 生成质量的实际贡献,从而导致检索结果虽在主题上相关,却无法有效支持精确答案生成。解决方案的关键在于提出一种基于强化学习的重排序偏好优化框架(ReRanking Preference Optimization, RRPO),将重排序建模为序列决策过程,并利用 LLM 的反馈直接优化上下文效用(context utility),从而实现无需昂贵人工标注即可对齐重排序与生成质量的目标;同时引入参考锚定的确定性基线以保障训练稳定性,实验证明该方法在知识密集型任务中显著优于强基线模型(如 RankZephyr),且具备良好的泛化性和鲁棒性。
链接: https://arxiv.org/abs/2604.02091
作者: Yuhang Wu,Xiangqing Shen,Fanfan Wang,Cangqi Zhou,Zhen Wu,Xinyu Dai,Rui Xia
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing University (南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 16 pages
Abstract:Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM’s generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
[IR-2] Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models LREC2026
【速读】:该论文旨在解决当前科学知识发现对大规模语言模型(Large Language Models, LLMs)的过度依赖问题,尤其是基于数十亿甚至数百亿参数的专有模型限制了科研社区的可复现性和可及性。其核心问题是:在科学应用场景中,是否必须使用大模型才能实现高质量的学术辅助功能?解决方案的关键在于提出一种轻量级的检索增强框架,通过任务感知路由机制动态选择最优检索策略,并融合全文科学文献与结构化元数据信息,结合紧凑型指令微调语言模型生成带引用的答案。研究表明,检索设计与模型规模具有互补性——良好的检索可以部分弥补小模型能力不足,但复杂推理任务仍需足够模型容量支撑,从而强调了检索优化和任务感知设计在构建实用、可复现的学术助手中的核心作用。
链接: https://arxiv.org/abs/2604.01965
作者: Florian Kelber,Matthias Jobst,Yuni Susanti,Michael Färber
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: Accepted at NSLP@LREC 2026
Abstract:Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
[IR-3] Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite LREC2026
【速读】:该论文旨在解决机器翻译(Machine Translation, MT)在大规模基准数据集构建中因噪声、结构丢失和质量不均导致的可靠性问题,尤其关注如何在不依赖人工标注的情况下实现可扩展的质量评估与验证。其解决方案的关键在于提出了一种三步自动化质量保障方法:首先通过结构化语料库审计进行针对性修复;其次利用神经评分指标(COMET,含参考文本和无参考文本两种模式)对比不同翻译服务(DeepL / ChatGPT / Google)的质量表现;最后借助大语言模型(LLM)实现细粒度的跨句段级翻译错误分布分析。该方法能有效识别低质量翻译并指导优先人工校验,从而提升大规模多语言基准数据集的可信度与实用性。
链接: https://arxiv.org/abs/2604.01957
作者: Klaudia Thellmann,Bernhard Stadler,Michael Färber
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at LREC 2026
Abstract:Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review – complementing, not replacing, human gold standards.
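摘要中"以自动质量指标辅助人工校验排序"的思路可以用如下小例子示意:对每个翻译段落的质量分数设阈值,筛出需优先复核的段落。其中分数与阈值均为笔者假设,并非 COMET 的真实输出或论文设定:

```python
def flag_for_review(scores, threshold=0.75):
    # 低于阈值的段落进入人工复核优先队列(阈值为假设值)
    flagged = [i for i, s in enumerate(scores) if s < threshold]
    return flagged, sum(scores) / len(scores)

segment_scores = [0.92, 0.61, 0.88, 0.70, 0.95]  # 假想的段落级无参考质量分数
flagged, mean_score = flag_for_review(segment_scores)
print(flagged, round(mean_score, 3))
```

这正对应论文的结论:自动指标是"帮助排定复核优先级"的可扩展信号,用于补充而非取代人工金标准。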
[IR-4] From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents
【速读】:该论文旨在解决异构文档(包含文本与表格数据)中检索增强生成(Retrieval-Augmented Generation, RAG)系统对检索质量依赖性强但缺乏系统性比较的问题。其关键解决方案是构建了一个包含23,088个查询和7,318份混合内容文档的金融问答基准,并对十种现代检索策略(涵盖稀疏、稠密、融合、交叉编码重排序、查询扩展、索引增强及自适应检索)进行了全面评估。研究发现,两阶段流水线(先混合检索后神经重排序)在Recall@5达到0.816、MRR@3达0.605,显著优于所有单阶段方法;同时指出BM25在金融文档上优于先进稠密检索模型,挑战了语义搜索普遍更优的假设,为RAG系统的实际部署提供了可量化的性能边界与成本-精度权衡建议。
链接: https://arxiv.org/abs/2604.01733
作者: Meftun Akarsu,Recep Kaan Karaman,Christopher Mierbach
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 11 pages, 6 figures, 6 tables
Abstract:Retrieval-Augmented Generation (RAG) systems critically depend on retrieval quality, yet no systematic comparison of modern retrieval methods exists for heterogeneous documents containing both text and tabular data. We benchmark ten retrieval strategies spanning sparse, dense, hybrid fusion, cross-encoder reranking, query expansion, index augmentation, and adaptive retrieval on a challenging financial QA benchmark of 23,088 queries over 7,318 documents with mixed text-and-table content. We evaluate retrieval quality via Recall@k, MRR, and nDCG, and end-to-end generation quality via Number Match, with paired bootstrap significance testing. Our results show that (1) a two-stage pipeline combining hybrid retrieval with neural reranking achieves Recall@5 of 0.816 and MRR@3 of 0.605, outperforming all single-stage methods by a large margin; (2) BM25 outperforms state-of-the-art dense retrieval on financial documents, challenging the common assumption that semantic search universally dominates; and (3) query expansion methods (HyDE, multi-query) and adaptive retrieval provide limited benefit for precise numerical queries, while contextual retrieval yields consistent gains. We provide ablation studies on fusion methods and reranker depth, actionable cost-accuracy recommendations, and release our full benchmark code.
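摘要中使用的 Recall@k、MRR 指标以及混合检索常用的倒数排名融合(Reciprocal Rank Fusion, RRF)可以用下面的极简 Python 代码示意(文档 ID、RRF 常数 c=60 等均为笔者假设,并非论文的实际实现):

```python
def recall_at_k(ranked, relevant, k=5):
    # 前 k 个结果覆盖相关文档的比例
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / max(len(relevant), 1)

def mrr_at_k(ranked, relevant, k=3):
    # 第一个相关文档的倒数排名;前 k 内无相关文档则为 0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def rrf_fuse(rankings, c=60):
    # RRF:按 1/(c+rank) 累加各路检索(如稀疏 BM25 与稠密检索)的排名分
    scores = {}
    for ranking in rankings:
        for rank, d in enumerate(ranking, start=1):
            scores[d] = scores.get(d, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d3", "d2", "d5"]   # 假想的稀疏检索排名
dense = ["d2", "d1", "d4", "d3"]    # 假想的稠密检索排名
fused = rrf_fuse([sparse, dense])
print(fused[:3], recall_at_k(fused, {"d2", "d4"}, k=3), mrr_at_k(fused, {"d2"}, k=3))
```

论文中表现最优的两阶段流水线即在这类混合融合结果之上再叠加神经重排序。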
[IR-5] STABLE: Efficient Hybrid Nearest Neighbor Search via Magnitude-Uniformity and Cardinality-Robustness
【速读】:该论文旨在解决混合近邻搜索(Hybrid Approximate Nearest Neighbor Search, Hybrid ANNS)在处理大规模异构数据时面临的两大挑战:一是因相似度量尺度差异导致的“兼容性障碍”(Compatibility Barrier),二是因属性基数(attribute cardinality)不同引发的“容忍瓶颈”(Tolerance Bottleneck)。解决方案的关键在于提出一个名为STABLE的鲁棒异构感知混合检索框架,其核心创新包括:引入增强型异构语义感知(enhanced heterogeneous semantic perception, AUTO)度量,实现特征相似性和属性一致性联合建模,从而缓解相似度量尺度不一致问题并提升对不同属性基数的鲁棒性;构建基于AUTO的异构语义关系图(Heterogeneous Semantic Relation Graph, HELP)索引以组织异构语义关联;以及设计动态异构路由机制,确保高效搜索。
链接: https://arxiv.org/abs/2604.01617
作者: Qianyun Yang,Zhiwei Chen,Yupeng Hu,Zixu Li,Zhiheng Fu,Liqiang Nie
机构: Shandong University (山东大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Information Retrieval (cs.IR)
备注: Accepted by IEEE TKDE
Abstract:Hybrid Approximate Nearest Neighbor Search (Hybrid ANNS) is a foundational search technology for large-scale heterogeneous data and has gained significant attention in both academia and industry. However, current approaches overlook the heterogeneity in data distribution, thus ignoring two major challenges: the Compatibility Barrier for Similarity Magnitude Heterogeneity and the Tolerance Bottleneck to Attribute Cardinality. To overcome these issues, we propose the robuSt heTerogeneity-Aware hyBrid retrievaL framEwork, STABLE, designed for accurate, efficient, and robust hybrid ANNS under datasets with various distributions. Specifically, we introduce an enhAnced heterogeneoUs semanTic perceptiOn (AUTO) metric to achieve a joint measurement of feature similarity and attribute consistency, addressing similarity magnitude heterogeneity and improving robustness to datasets with various attribute cardinalities. Thereafter, we construct our Heterogeneous sEmantic reLation graPh (HELP) index based on AUTO to organize heterogeneous semantic relations. Finally, we employ a novel Dynamic Heterogeneity Routing method to ensure an efficient search. Extensive experiments on five feature vector benchmarks with various attribute cardinalities demonstrate the superior performance of STABLE.
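AUTO 度量对特征相似度与属性一致性做联合衡量。下面给出一个概念性示意:用余弦相似度与属性 Jaccard 一致性的加权和近似"联合评分"。加权形式与参数 alpha 均为笔者假设,并非论文中 AUTO 的实际定义:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def joint_score(q_vec, d_vec, q_attrs, d_attrs, alpha=0.7):
    # 特征相似度(余弦)与属性一致性(Jaccard)的加权组合,示意"联合度量"
    sim = cosine(q_vec, d_vec)
    union = len(q_attrs | d_attrs) or 1
    consistency = len(q_attrs & d_attrs) / union
    return alpha * sim + (1 - alpha) * consistency

s = joint_score([1.0, 0.0], [1.0, 0.0], {"red", "2024"}, {"red"})
print(round(s, 3))
```

论文要解决的正是这两路分数量纲不一致(相似度量级异质)时的稳健组合问题,上述线性加权仅用于说明问题形态。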
[IR-6] ReFormeR: Learning and Applying Explicit Query Reformulation Patterns
【速读】:该论文旨在解决查询重写(Query Reformulation)过程中缺乏可控性和可解释性的问题,尤其是在利用大语言模型(Large Language Model, LLM)进行查询改写时,常出现语义漂移或不可预测的改写结果。解决方案的关键在于提出 ReFormeR 方法:首先从初始查询与增强改写查询对中提取短格式的改写模式(reformulation patterns),构建一个可迁移的紧凑模式库;随后根据新查询的检索上下文选择最合适的改写模式,从而将改写过程约束在特定操作范围内(如消歧、词汇锚定或判别性特征添加等)。这种方法使改写策略显式化,显著提升了改写的针对性和有效性,在 TREC DL 2019、DL 2020 和 DL Hard 数据集上均优于传统反馈方法及近期基于 LLM 的改写与扩展方法。
链接: https://arxiv.org/abs/2604.01417
作者: Amin Bigdeli,Mert Incesu,Negar Arabzadeh,Charles L. A. Clarke,Ebrahim Bagheri
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:We present ReFormeR, a pattern-guided approach for query reformulation. Instead of prompting a language model to generate reformulations of a query directly, ReFormeR first elicits short reformulation patterns from pairs of initial queries and empirically stronger reformulations, consolidates them into a compact library of transferable reformulation patterns, and then selects an appropriate reformulation pattern for a new query given its retrieval context. The selected pattern constrains query reformulation to controlled operations such as sense disambiguation, vocabulary grounding, or discriminative facet addition, to name a few. As such, our proposed approach makes the reformulation policy explicit through these reformulation patterns, guiding the LLM towards targeted and effective query reformulations. Our extensive experiments on TREC DL 2019, DL 2020, and DL Hard show consistent improvements over classical feedback methods and recent LLM-based query reformulation and expansion approaches.
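ReFormeR 的核心是在模式库中为新查询挑选合适的改写模式。下面用一个基于简单规则的选择器示意这一思路。规则与模式库内容均为笔者假设;论文采用的是基于检索上下文、由 LLM 参与的选择机制:

```python
def select_pattern(query, library):
    # 依据粗糙的查询特征挑选改写模式(纯示意规则)
    if any(ch.isdigit() for ch in query):
        key = "discriminative_facet"      # 含数值/年份:加区分性约束
    elif len(query.split()) <= 2:
        key = "sense_disambiguation"      # 过短查询:先消歧
    else:
        key = "vocabulary_grounding"      # 其余:锚定到语料词汇
    return key, library[key]

library = {
    "sense_disambiguation": "补充限定词以消除歧义",
    "vocabulary_grounding": "替换为语料中的规范术语",
    "discriminative_facet": "加入可区分的数值或时间约束",
}
print(select_pattern("jaguar", library)[0])
```

被选中的模式随后作为约束引导 LLM 做定向改写,而不是让其自由生成,这正是"显式改写策略"的含义。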
[IR-7] Transforming OPACs into Intelligent Discovery Systems: An AI-Powered Knowledge Graph-Driven Smart OPAC for Digital Libraries
【速读】:该论文旨在解决传统在线公共检索目录(Online Public Access Catalogue, OPAC)在面对学术文献快速增长时检索效率低下、知识发现能力不足的问题,尤其是传统关键词索引和布尔查询难以支持语义层面的知识探索。解决方案的关键在于构建一个智能OPAC框架,通过引入人工智能(Artificial Intelligence)与知识图谱(Knowledge Graph)技术,实现语义搜索、主题过滤及基于知识图谱的可视化展示,从而提升用户交互体验与信息探索能力;该框架整合多源开放学术数据,并利用语义嵌入(Semantic Embeddings)增强结果的相关性与上下文理解,最终显著改善检索效率、相关性并减少信息过载。
链接: https://arxiv.org/abs/2604.01262
作者: M. S. Rajeevan,B. Mini Devi
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 8 pages, 4 tables, 6 figures presented at Intellib 2026 International Conference
Abstract:Traditional Online Public Access Catalogues (OPACs) are becoming less effective due to the rapid growth of scholarly literature. Conventional search methods, such as keyword indexing and Boolean queries, often fail to support efficient knowledge discovery. This paper proposes a Smart OPAC framework that transforms traditional OPACs into intelligent discovery systems using artificial intelligence and knowledge graph techniques. The framework enables semantic search, thematic filtering, and knowledge graph-based visualization to enhance user interaction and exploration. It integrates multiple open scholarly data sources and applies semantic embeddings to improve relevance and contextual understanding. The system supports exploratory search, semantic navigation, and refined result filtering based on user-defined themes. Quantitative evaluation demonstrates improvements in retrieval efficiency, relevance, and reduction of information overload. The proposed approach offers practical implications for modernizing digital library services and supports next-generation research workflows. Future work includes user-centric evaluation, personalization, and dynamic knowledge graph updates.
[IR-8] OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images
【速读】:该论文旨在解决医学影像中脑肿瘤自动检测与分类的效率与准确性问题,特别是针对人工分析MRI图像耗时且易受疲劳影响导致误差的局限性。解决方案的关键在于对比两种深度学习方法:一是设计轻量级自定义卷积神经网络(OkanNet),其具有较低计算成本和更快训练速度;二是采用迁移学习策略,基于在ImageNet上预训练的50层ResNet-50模型进行微调。实验表明,ResNet-50在准确率(96.49%)和精确度(0.963)上表现更优,而OkanNet虽精度较低(88.10%),但训练速度约为ResNet-50的3.2倍,适用于资源受限的移动或嵌入式系统,从而揭示了模型深度与计算效率之间的权衡关系。
链接: https://arxiv.org/abs/2604.01264
作者: Okan Uçar,Murat Kurt
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 7 pages, 3 figures, 1 table
Abstract:Medical imaging techniques, especially Magnetic Resonance Imaging (MRI), are accepted as the gold standard in the diagnosis and treatment planning of neurological diseases. However, the manual analysis of MRI images is a time-consuming process for radiologists and is prone to human error due to fatigue. In this study, two different Deep Learning approaches were developed and analyzed comparatively for the automatic detection and classification of brain tumors (Glioma, Meningioma, Pituitary, and No Tumor). In the first approach, a custom Convolutional Neural Network (CNN) architecture named “OkanNet”, which has a low computational cost and fast training time, was designed from scratch. In the second approach, the Transfer Learning method was applied using the 50-layer ResNet-50 [1] architecture, pre-trained on the ImageNet dataset. In experiments conducted on an extended dataset compiled by Masoud Nickparvar containing a total of 7,023 MRI images, the Transfer Learning-based ResNet-50 model exhibited superior classification performance, achieving 96.49% Accuracy and 0.963 Precision. In contrast, the custom OkanNet architecture reached an accuracy rate of 88.10%; however, it proved to be a strong alternative for mobile and embedded systems with limited computational power by yielding results approximately 3.2 times faster (311 seconds) than ResNet-50 in terms of training time. This study demonstrates the trade-off between model depth and computational efficiency in medical image analysis through experimental data.
人机交互
[HC-0] Dark Patterns in Indian Quick Commerce Apps: A Student Perspective
【速读】:该论文旨在解决印度快速配送(Quick Commerce, Q-Commerce)平台中普遍存在的“欺骗性设计暗模式”(dark patterns)问题,这些模式通过诱导用户提高订单金额,加剧了数字消费者在高数字素养背景下仍难以抵御的“认知负荷”与“意识-行为鸿沟”(Awareness-Action Gap)。研究发现,尽管大学生群体能够识别界面中的操纵性策略,但因时间压力和便利性架构的驱动,他们往往屈从于这些设计,且将此类行为视为资本主义代价的一部分。解决方案的关键在于提出“价值敏感设计”(Value-Sensitive Design, VSD)路径,以在商业激励与用户自主权之间建立更符合全球南方(Global South)语境的平衡机制。
链接: https://arxiv.org/abs/2604.02257
作者: Tanish Taneja,Arihant Tripathy,Nimmi Rangaswamy
机构: International Institute of Information Technology Hyderabad (国际信息科技学院海得拉巴分校)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to Bridge Over Troubled Water (CHI 2026 Workshop)
Abstract:As quick commerce (Q-Commerce) platforms in India redefine urban consumption, the use of deceptive design dark patterns to inflate order values has become a systemic concern. This paper investigates the ‘Awareness-Action Gap’ among Indian university students, a demographic characterized by high digital fluency yet significant financial constraints. Using a qualitative approach with 16 participants, we explore how temporal pressures and convenience-driven architectures override price sensitivity. Our findings reveal that while students recognize manipulative UI tactics, they frequently succumb to them due to induced cognitive load and the normalization of deceptive marketing as a price of capitalism. We conclude by suggesting value-sensitive design alternatives to align commercial incentives with user autonomy in the Global South.
[HC-1] Impact of Multimodal and Conversational AI on Learning Outcomes and Experience
【速读】:该论文旨在解决生成式 AI 系统中多模态(multimodal)与对话性(conversationality)如何协同影响学习效果的问题,特别是在视觉丰富的 STEM(科学、技术、工程和数学)领域。其核心问题是:尽管对话式 AI 被认为能提升学习参与度,但其对学习成效的实际影响尚不明确,尤其是当与多模态信息整合时是否能够增强认知加工过程。解决方案的关键在于设计并验证一种结合文本与图像的文档驱动型对话系统(MuDoC),相比仅提供纯文本响应的对话系统(TexDoC)和传统教科书界面(DocSearch),MuDoC 通过视觉-语言整合显著提升了学习者的后测成绩,并优化了认知负荷结构——即对话性降低了外在认知负荷(extraneous load),而多模态内容增强了内在认知负荷(germane load),从而实现更有效的学习。
链接: https://arxiv.org/abs/2604.02221
作者: Karan Taneja,Anjali Singh,Ashok K. Goel
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures, Accepted to AIED 2026 (Seoul, South Korea)
Abstract:Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually-rich STEM domains remains under-explored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, while TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of the Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.
[HC-2] Visual Decoding Operators: Towards a Compositional Theory of Visualization Perception
【速读】:该论文旨在解决现有感知有效性研究中缺乏可计算结构以预测新可视化类型与任务组合表现的问题,即传统基于通道(如角度、位置、长度)的分解方法无法泛化到未实验过的场景,需重复进行实验。其解决方案的关键在于提出一种新的分析单元——将定量可视化解读操作形式化为可组合的视觉解码算子(visual decoding operators),并利用概率密度函数(PDF)和累积分布函数(CDF)图表验证这些算子在不同任务中的可重用性及其误差特征,通过分层贝叶斯建模刻画其性能;进一步在Moritz等人[35]的散点图均值估计实验中验证该框架对结构不同的任务的泛化能力,证明特定算子组合能准确预测观测响应的偏差与方差,而其他五种策略则在可区分的方式上失败,从而为构建能预测不同观看条件、新图表类型和新任务下解释分布的生成式模型奠定基础。
链接: https://arxiv.org/abs/2604.02220
作者: Sheng Long,Remco Chang,Eugene Wu,Alex Kale,Matthew Kay
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Prior work on perceptual effectiveness has decomposed visualizations into smaller common units (e.g., channels such as angle, position, and length) to establish rankings. While useful, these decompositions lack the computational structure to predict performance for new visualization × task combinations, requiring new experiments for each. We propose an alternative unit of analysis: operationalizing quantitative visualization interpretation as sequences of composable visual decoding operators. Using probability density function (PDF) and cumulative distribution function (CDF) charts, we examine how chart-specific tasks can be decomposed into reusable, chart-agnostic perceptual operations and characterize their error profiles through hierarchical Bayesian modeling. We then test generalizability by composing learned operators to predict performance on a structurally different task: Moritz et al.'s [35] scatterplot mean-estimation experiment, where the chart type, chart dimensions, and analytic goal all differ from the learning conditions. With a pre-registered analysis plan, we compose operators under six candidate strategies and evaluate each against empirical data with no parameters fit to the response data. One strategy captures both bias and variance of observed responses; five alternatives fail in distinguishable ways. We argue that this decoding-operator-oriented approach to empirical visualization research and theory-building lays the groundwork for generative models that can predict a distribution of likely interpretations under different viewing conditions, new chart types, and new tasks. Free copy of this paper and supplemental materials: this https URL; experiment interface: this https URL.
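论文将图表解读分解为可组合的视觉解码算子并刻画其误差特征。下面用一个极简假设来示意"算子组合"的思路:把每个算子的误差概括为(偏差, 方差)二元组,并假设各算子误差相互独立、可线性叠加。这只是笔者的简化示意,并非论文的分层贝叶斯模型:

```python
def compose(ops):
    # 简化假设:各算子误差独立,偏差与方差分别线性叠加
    bias = sum(b for b, v in ops)
    var = sum(v for b, v in ops)
    return bias, var

# 假想的三个解码算子:定位刻度、读取数值、比较两值
locate = (0.00, 0.01)
read_value = (0.02, 0.04)
compare = (-0.01, 0.02)
bias, var = compose([locate, read_value, compare])
print(round(bias, 3), round(var, 3))
```

在这种框架下,只要逐个算子的误差分布已被实验刻画,就可以对未实验过的"图表 × 任务"组合预测端到端误差,而无需重新做实验。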
[HC-3] ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
【速读】:该论文旨在解决视觉 Transformer(Vision Transformer, ViT)模型在推理过程中缺乏端到端、可引导的可解释性分析工具的问题。现有方法多局限于孤立模块或面向专家的分析,难以帮助用户系统理解从图像分块 token 化到最终分类的完整推理流程。其解决方案的关键在于提出 ViT-Explainer,一个基于网页的交互式可视化系统,整合了动画演示、patch 级注意力热力图以及适配视觉任务的 Logit Lens,并支持引导式与自由探索两种模式,从而实现对 ViT 推理过程的直观、全面理解。
链接: https://arxiv.org/abs/2604.02182
作者: Juan Manuel Hernandez,Mariana Fernandez-Espinosa,Denis Parra,Diego Gomez-Zara
机构: Pontificia Universidad Católica de Chile(智利天主教大学); University of Notre Dame(圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 7 pages, 4 figures
Abstract:Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
[HC-4] Designing Transformational Games to Support Socio-ethical Reasoning about Generative AI
【速读】:该论文旨在解决青少年在学习生成式 AI(Generative AI)及其伦理问题时,如何将复杂且严肃的主题转化为具有吸引力的学习体验这一挑战。解决方案的关键在于设计并实施两类基于社会互动的转化型游戏(transformational games)——Diversity Duel 和 Secret Agent,通过引入三个核心机制:同伴评价(peer evaluation)、基于约束的创造性(constraint-based creativity)以及社会推理(social deduction),促进参与者对生成式 AI 输出中的偏见进行识别与讨论,并将其与现实世界不平等联系起来,进而深化其对提示工程如何塑造 AI 行为的理解,从而有效培养批判性 AI 素养。
链接: https://arxiv.org/abs/2604.02154
作者: Jaemarie Solyst,Ruth Karen Nakigozi,Chloe Fong,R. Benjamin Shapiro
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:There is an increasing need for young people to become critically AI literate, understanding not only how AI works but also its limitations and ethical nuances. Yet, designing learning experiences that make such complex, serious topics engaging remains a challenge. This paper explores transformational games as a promising approach for supporting youth learning about generative AI (GenAI) and ethics. We designed and implemented two games, Diversity Duel and Secret Agent, that integrate GenAI tools with gameplay elements. This work investigates how the games’ elements: (1) peer evaluation, (2) constraint-based creativity, and (3) social deduction supported socio-ethical reasoning about GenAI. Participants recognized and debated bias in GenAI outputs, connected these patterns to real-world inequities, and developed nuanced understandings of bias. Participants further came to see how prompt design shapes AI behavior. Our findings suggest that group-based games with these elements can support fostering critical AI literacy.
[HC-5] ProVega: A Grammar to Ease the Prototyping Creation and Reproducibility of Progressive Data Analysis and Visualization Solutions
【速读】:该论文旨在解决现代大数据分析中对高速响应与交互性之间平衡的挑战,即在保证可控精度的前提下实现快速数据处理和可视化(Progressive Data Analysis and Visualization, PDAV),同时降低其实现与复现的难度。其核心解决方案是提出ProVega——一种基于Vega-Lite的语法规范,用于简化PDAV的编程接口,支持从简单图表到复杂视觉环境的渐进式分析;并配套开发了Pro-Ex编辑器,以提升渐进式方案的设计与分析效率。通过重实现11个文献中的典型案例并经39名用户验证其保真度,证明了ProVega对多种渐进方法(如数据分块、过程分块及混合分块)的有效支持,且专家用户研究进一步确认了其在真实任务中的实用性。
链接: https://arxiv.org/abs/2604.02096
作者: Matteo Filosa,Graziano Blasilli,Emilio Martino,Marco Angelini
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Modern data analysis requires speed for massive datasets. Progressive Data Analysis and Visualization (PDAV) emerged as a discipline to address this problem, providing fast response times while maintaining interactivity with controlled accuracy. Yet it remains difficult to implement and reproduce. To lower this barrier, we present ProVega, a Vega-Lite-based grammar that simplifies PDAV instrumentation for both simple visualizations and complex visual environments. Alongside it, we introduce Pro-Ex, an editor designed to streamline the creation and analysis of progressive solutions. We validated ProVega by reimplementing 11 exemplars from the literature-verified for fidelity by 39 users-and demonstrating its support for various progressive methods, including data-chunking, process-chunking, and mixed-chunking. An expert user study confirmed the efficacy of ProVega and the Pro-Ex environment in real-world tasks. ProVega, Pro-Ex, and all related materials are available at this https URL
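摘要提到的 data-chunking 渐进方法,其核心是在逐块处理数据的同时不断产出中间估计,使用户无需等待全量计算完成即可交互。下面是一个与 ProVega 语法无关的最小示意(纯笔者示例):

```python
def progressive_mean(chunks):
    # 数据分块(data-chunking)下的渐进式均值:每处理一块即产出一次中间估计
    total, count, estimates = 0.0, 0, []
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
        estimates.append(total / count)
    return estimates

chunks = [[1, 2, 3], [4, 5], [6]]
print(progressive_mean(chunks))  # 中间估计逐步收敛到全量均值
```

process-chunking 则是按计算步骤而非数据切分,mixed-chunking 为两者结合;ProVega 的作用是用声明式语法统一描述这类流程。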
[HC-6] As Far as Eye See: Vergence-Pupil Coupling in Near-Far Depth Switching
【速读】:该论文旨在解决瞳孔大小波动对头戴式眼动仪在真实物理深度视觉中测量辐辏角(vergence angle)产生的干扰问题。研究表明,瞳孔大小与辐辏之间存在个体差异显著的耦合效应,尤其在静态光照条件下变异较大;而通过控制光照条件(如亮度调制)或采用分块固定(blockwise fixation)和音频提示等实验设计,可有效降低这种耦合带来的误差,使辐辏估计更加稳定和可靠。因此,解决方案的关键在于优化实验控制条件以减少瞳孔-辐辏耦合的影响,而非完全消除该效应。
链接: https://arxiv.org/abs/2604.01917
作者: Virmarie Maquiling,Yasmeen Abdrabou,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 2 figures, ETRA26
Abstract:Vergence is widely used as a proxy for depth perception and spatial attention in immersive and real-world eye-tracking studies. In this paper, we investigate how pupil size artefacts affect vergence estimates during real physical depth viewing with a head-mounted eye tracker. Using a beamsplitter setup with physically near and far targets, we elicited controlled convergent and divergent eye movements under static, luminance-modulated, and blockwise fixation conditions. Near and far targets were reliably separable in vergence angle across participants. However, pupil-vergence coupling varied substantially across individuals and conditions. Static illumination produced large inter-participant variability, while luminance modulation reduced this spread, yielding more clustered estimates. Blockwise and audio-cued recordings further showed that pupil-vergence coupling persists even without visual depth onsets. These results suggest that pupil size fluctuations can systematically influence vergence estimates, and that controlled viewing conditions can reduce–but not eliminate–this effect.
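辐辏角的几何定义较为直观:双眼注视正前方距离 d 处的目标时,辐辏角约为 2·arctan(IPD/2d)。下面的小例子按此公式对比近(40 cm)与远(3 m)目标的辐辏角,其中瞳距 0.063 m 为常用假设值,并非论文实测参数:

```python
import math

def vergence_deg(ipd_m, depth_m):
    # 目标位于正前方时的辐辏角(度);ipd 为瞳距,depth 为注视距离
    return math.degrees(2 * math.atan(ipd_m / (2 * depth_m)))

near = vergence_deg(0.063, 0.4)   # 近目标 40 cm
far = vergence_deg(0.063, 3.0)    # 远目标 3 m
print(round(near, 2), round(far, 2))
```

近目标约 9°、远目标约 1.2°,两者差异足够大,这与论文中"近/远目标在辐辏角上可被可靠区分"的结果一致;论文关注的是瞳孔大小波动对这一估计叠加的系统误差。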
[HC-7] Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
【速读】:该论文旨在解决眼动追踪中角膜反光点(corneal reflection, glint)检测与匹配的可复现性问题,尤其针对多LED光源环境下因硬件差异导致的算法性能不稳定。现有方法常将glint检测作为黑箱模块嵌入整体系统,缺乏标准化流程和清晰评估机制。其解决方案的关键在于提出一种基于二维几何约束的星座匹配(constellation-based)流水线,通过引入相似性-布局对齐(Similarity-Layout Alignment, SLA)策略,将glint视为结构化星群而非孤立斑点进行匹配;该方法结合可控过检测、自适应候选回退、外观感知评分及可选语义布局先验,在保持检测与对应关系显式分离的前提下,实现了噪声环境下稳定的身份保留对应关系。
链接: https://arxiv.org/abs/2604.01909
作者: Virmarie Maquiling,Yasmeen Abdrabou,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 6 pages, 3 figures, 2 algorithms, ETRA26
Abstract:Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
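"lost-in-space" 式的星座匹配思想可以用一个穷举小例子来示意:在候选排列中选取使两两距离结构与 LED 模板最接近的对应关系。以下仅为笔者的示意实现,并非论文提出的 SLA 流程:

```python
import itertools
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def match_constellation(detected, layout):
    # 小规模下穷举排列,选两两距离结构最接近 LED 模板的对应关系
    # 返回元组 perm:layout[i] 对应 detected[perm[i]]
    pairs = list(itertools.combinations(range(len(layout)), 2))
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(detected)), len(layout)):
        cost = sum(abs(dist(detected[perm[i]], detected[perm[j]])
                       - dist(layout[i], layout[j])) for i, j in pairs)
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

layout = [(0.0, 0.0), (2.0, 0.0), (0.5, 1.0)]        # LED 模板布局(假想)
detected = [(10.6, 5.1), (10.0, 4.0), (12.0, 4.0)]   # 平移并带噪声的检测亮斑
print(match_constellation(detected, layout))
```

由于只比较两两距离,匹配对整体平移天然不敏感;论文的 SLA 在此类几何约束之上还加入了过检测回退与外观评分。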
[HC-8] Eyes Can't Always Tell: Fusing Eye Tracking and User Priors for User Modeling under AI Advice Conditions
【速读】:该论文旨在解决高风险决策场景下,如何准确建模用户认知状态(如认知负荷和决策信心)以支持自适应人工智能(Adaptive AI)系统的问题。现有研究虽利用眼动追踪作为非侵入性行为信号来反映认知努力,但未系统探讨AI辅助情境(如建议可靠性与用户异质性)对眼动信号与认知状态映射关系的影响。解决方案的关键在于:首先,通过控制实验发现AI建议的可靠性显著调节眼动模式与认知状态之间的关联;其次,提出融合眼动特征与用户先验信息(人口统计学、AI素养及技术信任倾向)的方法,从而提升跨被试泛化能力,实现条件感知且个性化的用户建模,为构建与人类认知对齐的自适应AI系统提供实证依据与方法支撑。
链接: https://arxiv.org/abs/2604.01741
作者: Xin Sun,Shu Wei,Ting Pan,Yajing Wang,Jos A. Bosch,Isao Echizen,Abdallah El Ali,Saku Sugawara
机构: National Institute of Informatics (NII), Tokyo, Japan; University of Amsterdam, Netherlands; Yale School of Medicine, New Haven, Connecticut, USA; University of Tokyo, Tokyo, Japan; Centrum Wiskunde & Informatica (CWI), Amsterdam, Netherlands; Utrecht University, Utrecht, Netherlands
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Modeling users’ cognitive states (e.g., cognitive load and decision confidence) is essential for building adaptive AI in high-stakes decision-making. While eye tracking provides non-invasive behavioral signals correlated with cognitive effort, prior work has not systematically examined how AI assistance contexts, specifically varying advice reliability and user heterogeneity, can alter the mapping between gaze signals and cognitive states. We conducted a within-subject lab eye-tracking study (N=54) on factual verification tasks under three conditions: No-AI, Correct-AI advice, and Incorrect-AI advice. We analyze condition-dependent changes in self-reports and eye-tracking patterns and evaluate the robustness of eye-tracking-based user modeling. Results show that AI advice increases decision confidence compared to No-AI, while Correct-AI is associated with lower perceived cognitive load and more efficient gaze behavior. Crucially, predictive modeling is context-sensitive: the relationship between eye-tracking signals and cognitive states shifts across AI conditions. Finally, fusing eye-tracking features with user priors (demographics, AI literacy/experience, and propensity to trust technology) improves cross-participant generalization. These findings support condition-aware and personalized user modeling for cognitively aligned adaptive AI systems.
[HC-9] Cognitive Energy Modeling for Neuroadaptive Human-Machine Systems using EEG and WGAN-GP
【速读】:该论文旨在解决如何在实时建模脑状态演化并量化其能量成本这一挑战,尤其是在使用生成式AI(Generative AI)合成脑电图(EEG)数据时,能否保留用于基于过渡能量分析的动态结构问题。解决方案的关键在于引入薛定谔桥问题(Schrödinger Bridge Problem, SBP)所导出的传输代价(transport cost)作为度量指标,评估生成对抗网络(GAN)生成的EEG数据是否保持了真实EEG中对认知状态转换至关重要的分布几何结构。研究通过对比真实与合成EEG在Stroop任务中的转换能量,证明二者在群体和个体层面具有高度一致性,从而验证了合成EEG可用于SBP驱动的认知能量建模,并进一步提出以SBP-derived认知能量为控制信号的神经适应系统框架,实现人机系统对用户认知与情绪状态的实时调节。
链接: https://arxiv.org/abs/2604.01653
作者: Sriram Sattiraju,Vaibhav Gollapalli,Aryan Shah,Timothy McMahan
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); University of North Texas (北德克萨斯大学)
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注:
Abstract:Electroencephalography (EEG) provides a non-invasive insight into the brain’s cognitive and emotional dynamics. However, modeling how these states evolve in real time and quantifying the energy required for such transitions remains a major challenge. The Schrödinger Bridge Problem (SBP) offers a principled probabilistic framework to model the most efficient evolution between the brain states, interpreted as a measure of cognitive energy cost. While generative models such as GANs have been widely used to augment EEG data, it remains unclear whether synthetic EEG preserves the underlying dynamical structure required for transition-based analysis. In this work, we address this gap by using SBP-derived transport cost as a metric to evaluate whether GAN-generated EEG retains the distributional geometry necessary for energy-based modeling of cognitive state transitions. We compare transition energies derived from real and synthetic EEG collected during Stroop tasks and demonstrate strong agreement across group and participant-level analyses. These results indicate that synthetic EEG preserves the transition structure required for SBP-based modeling, enabling its use in data-efficient neuroadaptive systems. We further present a framework in which SBP-derived cognitive energy serves as a control signal for adaptive human-machine systems, supporting real-time adjustment of system behavior in response to user cognitive and affective state.
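论文以薛定谔桥的传输代价衡量脑状态转换的"能量"。作为概念示意,可用一维高斯分布间的 2-Wasserstein 距离充当简化代理(高斯情形下有闭式解);以下分布参数为笔者虚构,并非论文数据:

```python
import math

def w2_gauss(mu1, sigma1, mu2, sigma2):
    # 一维高斯间的 2-Wasserstein 距离:sqrt((mu1-mu2)^2 + (sigma1-sigma2)^2)
    # 作为状态转换"能量代价"的简化代理
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)

# 假想的两个脑状态下某 EEG 频段功率的分布参数 (mu, sigma)
rest = (5.0, 1.0)      # 静息
stroop = (8.0, 1.5)    # Stroop 任务
cost = w2_gauss(*rest, *stroop)
print(round(cost, 3))
```

论文正是用这类传输代价比较真实与 GAN 合成 EEG 的转换结构是否一致,从而判断合成数据能否用于基于能量的建模。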
[HC-10] AromaGen: Interactive Generation of Rich Olfactory Experiences with Multimodal Language Models
【速读】:该论文旨在解决现有嗅觉交互系统受限于固定香氛胶囊和预定义生成模式的问题,以及因缺乏大规模嗅觉数据集而阻碍基于人工智能(AI)方法发展的瓶颈。其解决方案的关键在于提出AromaGen——一个由多模态大语言模型(Multimodal Large Language Model, MLLM)驱动的可穿戴式实时通用香气生成接口,能够将自由文本或视觉输入映射为12种精心筛选的基础气味分子的结构化混合,并通过颈部佩戴装置释放;同时支持用户通过自然语言反馈进行上下文学习迭代优化,从而显著提升生成香气与真实食物香气的相似度(中位数达8/10),并降低感知的人工感,实现更贴近现实的交互式香气生成。
链接: https://arxiv.org/abs/2604.01650
作者: Yunge Wen,Awu Chen,Jianing Yu,Jas Brooks,Hiroshi Ishii,Paul Pu Liang
机构: New York University (纽约大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Smell’s deep connection with food, memory, and social experience has long motivated researchers to bring olfaction into interactive systems. Yet most olfactory interfaces remain limited to fixed scent cartridges and pre-defined generation patterns, and the scarcity of large-scale olfactory datasets has further constrained AI-based approaches. We present AromaGen, an AI-powered wearable interface capable of real-time, general-purpose aroma generation from free-form text or visual inputs. AromaGen is powered by a multimodal LLM that leverages latent olfactory knowledge to map semantic inputs to structured mixtures of 12 carefully selected base odorants, released through a neck-worn dispenser. Users can iteratively refine generated aromas through natural language feedback via in-context learning. Through a controlled user study (N = 26), AromaGen matches human-composed mixtures in zero-shot generation and significantly surpasses them after iterative refinement, achieving a median similarity of 8/10 to real food aromas and reducing perceived artificiality to levels comparable to real food. AromaGen is a step towards real-world interactive aroma generation, opening new possibilities for communication, wellbeing, and immersive technologies.
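摘要中"将语义输入映射为基础气味的结构化配比,并按自然语言反馈迭代调整"的过程,可用一个配比归一化的小例子示意。此处仅用 3 个分量(论文为 12 种基础气味),调整规则为笔者虚构:

```python
def refine_mixture(weights, component, delta):
    # 依据反馈调整某一基础气味分量后重新归一化,保证配比之和为 1
    w = list(weights)
    w[component] = max(0.0, w[component] + delta)
    total = sum(w)
    return [x / total for x in w]

mix = [0.5, 0.3, 0.2]                # 假想的 3 分量初始配比
new = refine_mixture(mix, 1, 0.2)    # 假想反馈:"再甜一点" → 增强第 2 个分量
print([round(x, 3) for x in new])
```

论文中这一迭代由多模态 LLM 通过上下文学习完成,实验显示经若干轮反馈后生成香气与真实食物香气的相似度中位数可达 8/10。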
[HC-11] Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones
【速读】:该论文旨在解决语音克隆(voice cloning)技术中对说话人身份保留与口音保留的分离评估问题,尤其关注口音差异如何影响感知上的身份匹配和可懂度。传统评估多聚焦于整体语音质量,而忽视了口音在语音克隆中的作用及其感知后果。研究通过结合计算分析与感知实验设计,发现尽管基于嵌入(embedding)的距离指标未能区分标准普通话与重口音普通话的原始-克隆差异,但在感知层面,克隆语音对于标准语者更接近原声,且口音较重的语音在克隆后可懂度提升更为显著。这一结果表明,口音变化虽未反映在现成的说话人嵌入距离中,却显著影响感知身份匹配和可懂度,从而提出应将说话人身份保留与口音保留作为独立维度进行评估,这是其解决方案的关键所在。
链接: https://arxiv.org/abs/2604.01562
作者: Tianle Yang,Chengzhe Sun,Phil Rose,Siwei Lyu
机构: University at Buffalo (纽约州立大学布法罗分校); Australian National University (澳大利亚国立大学)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
[HC-12] Designing for Patient Voice in Interactive Health ALT
【速读】:该论文试图解决的问题是:在交互式健康(Interactive Health, IH)研究中,患者的生活经验常被当作待分析的数据而非具有独立知识价值的证据,导致“患者声音”不仅是一个参与包容性问题,更是一个认识论(epistemic)问题,即如何识别和传播患者的体验性叙述。解决方案的关键在于重构知识基础设施,通过借鉴《英国医学杂志》(The BMJ)、《美国医学会杂志》(JAMA)及《英国运动医学杂志》(British Journal of Sports Medicine)等医疗出版物中的患者合作实践,探索支持由患者主导或作者化的体验性贡献的结构化路径,从而推动将患者经验作为与传统研究形式并列的证据来源纳入IH领域的知识体系。
链接: https://arxiv.org/abs/2604.01558
作者: Yuhao Sun
机构: Lancaster University (兰卡斯特大学); University of Edinburgh (爱丁堡大学)
类目: Human-Computer Interaction (cs.HC)
备注: This paper has been conditionally accepted to the Interactive Health Conference 2026 in Porto, Portugal
Abstract:Interactive Health (IH) research increasingly engages patients through participatory and user-centred approaches. However, patients’ lived experiences are typically treated more as data to be analysed than as knowledge in their own right. In this paper, I argue that ‘patient voice’ in the field of IH is both an inclusion issue and an epistemic one. More specifically, it concerns how experiential accounts are recognised and circulated. I examine how methodological conventions, authorship norms, review criteria, and publication formats tend to position patients as participants rather than as authors of evidence. Looking to patient-partnered practices in medical publishing, including The BMJ, JAMA, and British Journal of Sports Medicine, I outline a possible infrastructural pathway for supporting patient-authored or patient-led experiential contributions within the field. I present this as a design probe to surface assumptions and trade-offs. I end this paper by inviting the IH community to reflect on how its knowledge infrastructures might accommodate experiential evidence alongside established research forms.
[HC-13] The Weak Signal Cultivation Model: A Human-Centric Framework for Frontline Risk Detection, Signal Tracking, and Proactive Organizational Resilience
【速读】:该论文旨在解决组织在风险识别与管理中难以有效捕捉和传递来自一线员工的微弱风险信号(Weak Risk Signals)的问题,从而导致风险预警滞后或被忽视。解决方案的关键在于提出一种以人为中心的“弱信号培育模型”(Weak Signal Cultivation Model, WSCM),其核心是构建一个二维连续坐标系([0,10] × [0,10])——即“弱信号培育场”,将每个风险信号定位为一个节点,分别由当前风险强度(Risk Intensity)和风险增长潜力(Risk Growth Potential)两个维度表征。该模型通过动态追踪节点在四个区域(Question Marks、Lit Fuses、Sleeping Cats、Owls)之间的移动轨迹,实现对弱信号的结构化表达与演化可视化,进而建立从一线经验到管理层决策的风险沟通桥梁,并为生成式 AI (Generative AI) 支持的风险分析提供概念基础与实践工具。
链接: https://arxiv.org/abs/2604.01495
作者: Maurice Codourey,Emmanuel A. Gonzalez
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 23 pages, 2 figures, 8 tables, 15 equations, white paper
Abstract:This white paper introduces the Weak Signal Cultivation Model (WSCM). WSCM is a human-centric framework for detecting, structuring, and tracking weak risk signals as observed by frontline staff. The model centers on a continuous [0,10] x [0,10] coordinate field–the Weak Signal Cultivation Field, in which each identified signal is positioned as a node on two independent dimensions: its current Risk Intensity (x) and its Risk Growth Potential (y). Represented as a risk locus, nodes move across the field over time as new team assessments or measurements arrive. The locus reflects the signal’s trajectory across four possible regions: Question Marks, Lit Fuses, Sleeping Cats, and Owls. Through this graphical approach, bridging risk communication from the frontline experience to management decision-making is made through a single organizational vocabulary. The model introduced in this document is designed to serve as a practitioner tool and a conceptual foundation for AI-supported analytics.
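WSCM 将每个信号节点定位于 [0,10]×[0,10] 的培育场,并划分为 Question Marks、Lit Fuses、Sleeping Cats、Owls 四个区域。摘要未给出区域与象限的具体对应关系,下面以中点 5 为阈值、按类 BCG 矩阵的方式假设一种映射,仅作结构示意(区域归属为本文推测,非原文结论):

```python
def classify_signal(risk_intensity: float, growth_potential: float,
                    threshold: float = 5.0) -> str:
    """按 (风险强度 x, 增长潜力 y) 将信号节点划入四个区域。
    区域与象限的对应为假设:高增长+高强度=Lit Fuses,高增长+低强度=Question Marks,
    低增长+高强度=Owls,低增长+低强度=Sleeping Cats。"""
    if not (0 <= risk_intensity <= 10 and 0 <= growth_potential <= 10):
        raise ValueError("坐标须落在 [0,10] x [0,10] 之内")
    if growth_potential >= threshold:
        return "Lit Fuses" if risk_intensity >= threshold else "Question Marks"
    return "Owls" if risk_intensity >= threshold else "Sleeping Cats"

print(classify_signal(2.0, 8.0))  # Question Marks(低强度、高增长潜力)
```

随着新评估到达,同一节点的坐标会更新,其分类随之迁移,即原文所称的 risk locus 轨迹。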
[HC-14] Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis
【速读】:该论文旨在解决物理辅助机器人(Physically Assistive Robots, PARs)在个性化行为适配过程中,因传统偏好学习方法(如耗时的成对比较)导致严重身体与认知疲劳的问题,尤其针对重度运动障碍用户。解决方案的关键在于提出一种低负担、离线的框架,通过大型语言模型(Large Language Models, LLMs)结合职业治疗实践框架(Occupational Therapy Practice Framework, OTPF),将非结构化的自然语言反馈转化为确定性的机器人控制策略;该框架首先利用LLMs解析主观用户反应以识别明确的生理与心理需求,并映射为可解释的决策树结构,随后通过“LLM-as-a-Judge”机制自动验证代码结构安全性,从而实现高效、安全且符合用户偏好的机器人行为生成。
链接: https://arxiv.org/abs/2604.01463
作者: Keshav Shankar,Dan Ding,Wei Gao
机构: University of Pittsburgh (匹兹堡大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: This work has been submitted to the 2026 IEEE International Conference on Robot and Human Interactive Communication (ROMAN)
Abstract:Physically Assistive Robots (PARs) require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause severe physical and cognitive fatigue for users with profound motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework (OTPF). This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated “LLM-as-a-Judge” verifies the code’s structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, independent clinical experts confirmed the generated policies are safe and accurately reflect user preferences.
[HC-15] Democratizing Foundations of Problem-Solving with AI: A Breadth-First Search Curriculum for Middle School Students
【速读】:该论文旨在解决K-12教育中如何将人工智能(Artificial Intelligence, AI)学习目标有效融入现有学科教学的问题,尤其关注在农村初中科学课堂中实现AI教育的有意义整合。其解决方案的关键在于设计并实施了一个与AI4K12框架一致的课程模块,以广度优先搜索(Breadth-First Search, BFS)作为切入点,通过无设备(unplugged)活动和交互式模拟环境,使学生在理解BFS算法的基础上,将其应用于病毒传播和接触追踪等真实科学情境中,从而实现AI问题解决能力的学习与学科知识的协同提升。
链接: https://arxiv.org/abs/2604.01396
作者: Griffin Pitts,Kimia Fazeli,Tirth Bhatt,Jennifer Albert,Marnie Hill,Tiffany Barnes,Shiyan Jiang,Bita Akram
机构: 未知
类目: Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: Paper accepted to the 27th International Conference on AI in Education (AIED 2026)
Abstract:As AI becomes more common in students’ everyday experiences, a major challenge for K-12 AI education is designing learning experiences that can be meaningfully integrated into existing subject-area instruction. This paper presents the design and implementation of an AI4K12-aligned curriculum that embeds AI learning goals within a rural middle school science classroom using Breadth-First Search (BFS) as an accessible entry point to AI problem-solving. Through unplugged activities and an interactive simulation environment, students learned BFS as a strategy for exploring networks and identifying shortest paths, then applied it to science contexts involving virus spread and contact tracing. To examine engagement and learning, we analyzed pre- and post-assessments, student work artifacts, and a teacher interview. Results suggest that students engaged productively with the curriculum, improved their understanding of BFS and AI problem-solving, and benefited from learning these ideas within ongoing science instruction. Teacher feedback further indicated that the module fit well within the science curriculum while supporting intended science learning outcomes. We conclude with curriculum and design considerations for broadening access to learning about problem-solving with AI in education.
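该课程以 BFS 作为 AI 问题求解的入口,并应用于病毒传播与接触追踪情境。下面给出一个与课程材料本身无关的最小 BFS 最短接触链示意(接触网络与个体名称均为虚构假设):

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """在无权接触网络中用广度优先搜索寻找最短接触链。"""
    queue = deque([[start]])   # 队列中保存从起点出发的完整路径
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # 不可达

# 虚构的接触网络:键为个体,值为其直接接触者
contacts = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(bfs_shortest_path(contacts, "A", "E"))  # ['A', 'B', 'D', 'E']
```

BFS 按层扩展,首次到达目标时的路径即为最短路径,这正是接触追踪中"最短传播链"问题的直观解法。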
[HC-16] Disclosure or Marketing? Analyzing the Efficacy of Vendor Self-reports for Vetting Public-sector AI
【速读】:该论文旨在解决当前负责任人工智能(Responsible AI)治理中,基于文档的披露工具(如模型卡片、数据表和AI事实清单)在实际应用中的生产、解读与使用缺乏实证研究支持的问题。其解决方案的关键在于通过定性研究方法,对广泛采用的GovAI Coalition FactSheet进行深入分析,揭示其在政府采购和治理场景下的多重功能冲突及其局限性——即FactSheet被期望同时承担展示供应商产品、支撑评估尽职调查和促进早期对话等角色,但因自愿性和公开自我披露的结构性限制,难以独立作为评估或风险评估工具。然而,当将FactSheet视为一种关系性工具(relational artifacts),用于建立信任、共享理解并推动持续对话时,其可成为长期更有效披露与治理的基础。
链接: https://arxiv.org/abs/2604.01332
作者: Blaine Kuehnert,Nari Johnson,Ravit Dotan,Hoda Heidari
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 31 pages, 2 figures
Abstract:Documentation-based disclosure has become a central governance strategy for responsible AI, particularly in public-sector procurement. Tools such as model cards, datasheets, and AI FactSheets are increasingly expected to support accountability, risk assessment, and informed decision-making across organizational boundaries. Yet there is limited empirical evidence about how these artifacts are produced, interpreted, and used in practice. In this paper, we present a qualitative study of the GovAI Coalition FactSheet, a widely adopted transparency document designed to support AI procurement and governance in government contexts. Drawing on semi-structured interviews with vendors and public-sector practitioners, alongside a systematic analysis of completed FactSheets, we examine how FactSheets are used, what information they surface, and where they fall short. We find that FactSheets are asked to serve multiple and conflicting purposes simultaneously: showcasing vendor offerings, supporting evaluation and due diligence, and facilitating early-stage dialogue between vendors and agencies. These competing expectations, combined with the structural constraints of voluntary and public self-disclosure, limit the ability of FactSheets to function as standalone evaluation or risk-assessment tools. At the same time, our findings suggest that when understood as relational artifacts used to establish trust, shared understanding, and ongoing dialogue, FactSheets can help create conditions that support more meaningful disclosure and governance over time.
[HC-17] From Automation to Augmentation: A Framework for Designing Human-Centric Work Environments in Society 5.0
【速读】:该论文旨在解决Society 5.0和Industry 5.0背景下人类中心技术融合缺乏可操作定义的问题,即如何在企业层面实现可测量、可优化和可评估的人类-AI协同增效。其核心挑战在于现有模型将增强函数φ(D)视为外生变量(仅依赖AI部署存量),忽略了组织设计对人类-AI互动效果的关键影响;同时缺少多维工具将工作场所设计选择与增效生产力关联,并且未提供人类中心性在经济上最优的判定标准。解决方案的关键在于将增强函数内生化为φ(D, W),其中W为五维工作场所设计向量(包括AI界面设计、决策权分配、任务编排、学习回路架构及心理社会工作环境),并证明当员工可增强的认知资本超过临界阈值时,以人类为中心的设计能最大化利润。这一理论框架通过系统综述120篇文献和哥伦比亚制造业调查数据验证了管理实践质量对技术投资回报的放大效应(交互系数0.304, p<0.01),最终提出基于理论的“工作场所增效设计指数”(Workplace Augmentation Design Index, WADI),用于诊断企业层级的人类中心性水平。
链接: https://arxiv.org/abs/2604.01364
作者: Cristian Espinal Maya
机构: Universidad EAFIT (埃菲尔特大学); ESUMER (埃苏梅尔研究生院)
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 57 pages, 2 figures, 8 tables, 1 appendix with formal proofs. CFE Working Paper No. 6
Abstract:Society 5.0 and Industry 5.0 call for human-centric technology integration, yet the concept lacks an operational definition that can be measured, optimized, or evaluated at the firm level. This paper addresses three gaps. First, existing models of human-AI complementarity treat the augmentation function phi(D) as exogenous – dependent only on the stock of AI deployed – ignoring that two firms with identical technology investments achieve radically different augmentation outcomes depending on how the workplace is organized around the human-AI interaction. Second, no multi-dimensional instrument exists linking workplace design choices to augmentation productivity. Third, the Society 5.0 literature proposes human-centricity as a normative aspiration but provides no formal criterion for when it is economically optimal. We make four contributions. (1) We endogenize the augmentation function as phi(D, W), where W is a five-dimensional workplace design vector – AI interface design, decision authority allocation, task orchestration, learning loop architecture, and psychosocial work environment – and prove that human-centric design is profit-maximizing when the workforce’s augmentable cognitive capital exceeds a critical threshold. (2) We conduct a PRISMA-guided systematic review of 120 papers (screened from 6,096 records) to map the evidence base for each dimension. (3) We provide secondary empirical evidence from Colombia’s EDIT manufacturing survey (N=6,799 firms) showing that management practice quality amplifies the return to technology investment (interaction coefficient 0.304, p < 0.01). (4) We propose the Workplace Augmentation Design Index (WADI), a 36-item theory-grounded instrument for diagnosing human-centricity at the firm level. Decision authority allocation emerges as the binding constraint for Society 5.0 transitions, and task orchestration as the most under-researched dimension.
计算机视觉
[CV-0] EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors CVPR2026
【速读】:该论文旨在解决深度事件立体视觉(deep-event stereo vision)模型训练中缺乏真实标注数据的问题,尤其是依赖昂贵主动传感器获取的地面真值(ground truth)标注难以获得。解决方案的关键在于提出EventHub框架,通过标准彩色图像(color images)利用先进的新视角合成(novel view synthesis)技术生成代理标注(proxy annotations)和代理事件(proxy events),或在已有事件数据配对情况下仅生成代理标注;进而使用该数据工厂构建的训练集,将RGB领域的先进立体匹配模型迁移至事件数据处理,从而获得具有前所未有的泛化能力的事件立体模型。
链接: https://arxiv.org/abs/2604.02331
作者: Luca Bartolomei,Fabio Tosi,Matteo Poggi,Stefano Mattoccia,Guillermo Gallego
机构: University of Bologna (博洛尼亚大学); TU Berlin (柏林工业大学); Advanced Research Center on Electronic System (电子系统高级研究中心); Einstein Center Digital Future (爱因斯坦数字未来中心); SCIoI Excellence Cluster (SCIoI卓越集群)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project Page: this https URL Code: this https URL
Abstract:We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or simply proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.
[CV-1] ActionParty: Multi-Subject Action Binding in Generative Video Games
【速读】:该论文旨在解决现有视频扩散模型在多智能体场景中难以实现动作绑定(action binding)的问题,即模型无法将特定动作准确关联到对应主体(subject),从而限制了对多个智能体的协同控制能力。其解决方案的关键在于提出ActionParty,通过引入主体状态令牌(subject state tokens),即持续捕捉场景中每个主体状态的潜在变量,并结合空间偏置机制(spatial biasing mechanism)联合建模状态令牌与视频潜在表示,从而解耦全局视频帧渲染与个体受控的主体更新过程,实现了对多达七个玩家的可控生成和稳定追踪。
链接: https://arxiv.org/abs/2604.02330
作者: Alexander Pondaven,Ziyi Wu,Igor Gilitschenski,Philip Torr,Sergey Tulyakov,Fabio Pizzati,Aliaksandr Siarohin
机构: Snap Inc.(Snap公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Recent advances in video diffusion have enabled the development of “world models” capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
[CV-2] Generative World Renderer
【速读】:该论文旨在解决生成式渲染(Generative AI)在真实场景中应用时,因现有合成数据集现实感不足和时间一致性差而导致的域差距(domain gap)问题。其关键解决方案是构建一个大规模、动态的视觉复杂游戏数据集,通过创新的双屏拼接采集方法获取400万帧连续视频(720p/30 FPS),同步包含RGB图像与五通道G-buffer信息,并覆盖多种场景、视觉效果及恶劣天气和运动模糊等变体。该数据集支持双向渲染:一方面提升逆向渲染(inverse rendering)在真实环境中的几何与材质分解鲁棒性,另一方面实现基于G-buffer引导的高保真视频生成;同时提出基于视觉语言模型(VLM)的评估协议,在无真实标签情况下量化语义、空间与时间一致性,验证了模型跨数据集泛化能力和可控生成性能。
链接: https://arxiv.org/abs/2604.02329
作者: Zheng-Hui Huang,Zhixiang Wang,Jiaming Tan,Ruihan Yu,Yidan Zhang,Bo Zheng,Yu-Lun Liu,Yung-Yu Chuang,Kaipeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.
[CV-3] Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection CVPR
【速读】:该论文旨在解决3D异常检测与分割任务中现有方法在多视角(multiview)和多模态(multimodal)信息融合上的局限性问题。传统方法通常独立处理每个视角,忽视了跨视角和跨模态特征之间的协同关系,导致异常判别能力受限。其解决方案的关键在于提出ModMap框架,该框架基于跨模态特征映射(crossmodal feature mapping)范式,通过特征级调制(feature-wise modulation)显式建模视图依赖关系,并引入一种跨视图训练策略,利用所有可能的视图组合进行训练,从而实现多视角集成与聚合的有效异常评分。此外,作者还训练并公开了一个专为工业数据集设计的高分辨率深度编码器,进一步提升了模型性能。
链接: https://arxiv.org/abs/2604.02328
作者: Alex Costanzino,Pierluigi Zama Ramirez,Giuseppe Lisanti,Luigi Di Stefano
机构: CVLab, University of Bologna (CV实验室,博洛尼亚大学); Ca’ Foscari University of Venice (威尼斯卡福斯卡里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR Findings 2026
Abstract:We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
[CV-4] Steerable Visual Representations
【速读】:该论文旨在解决预训练视觉Transformer(如DINOv2和MAE)在下游任务中难以聚焦于图像中非显著目标的问题,以及多模态大语言模型(Multimodal LLMs)因语言主导而导致视觉表征退化的问题。其解决方案的关键在于提出了一种新的视觉表征类型——可引导视觉表示(Steerable Visual Representations),通过轻量级交叉注意力机制将文本提示直接注入视觉编码器的各层(早期融合),从而实现对全局与局部视觉特征的语言引导能力,同时保持原始视觉表示的质量,并在异常检测、个性化物体区分等任务上展现出零样本泛化性能。
链接: https://arxiv.org/abs/2604.02327
作者: Jona Ruthardt,Manu Gaur,Deva Ramanan,Makarand Tapaswi,Yuki M. Asano
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint
Abstract:Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
[CV-5] Beyond Referring Expressions: Scenario Comprehension Visual Grounding
【速读】:该论文旨在解决现有视觉定位(Visual Grounding)基准在评估图像区域与字面指代表达之间对齐能力时存在的局限性,即模型往往仅通过匹配显著的命名类别即可取得较好性能,而忽视了对场景语境中角色、意图和关系推理的需求。为此,作者提出了一种新的挑战性设置——基于场景的视觉定位(Scenario-Based Visual Grounding),并构建了Referring Scenario Comprehension (RSC) 基准,其查询为段落长度的文本描述,包含对象角色、用户目标及上下文线索,并引入干扰项以增强理解难度。解决方案的关键在于:一是设计具有可解释难度标签(如唯一性、杂乱度、尺寸、重叠和位置)的数据集,用于细粒度分析模型失败模式;二是提出ScenGround方法,采用课程推理(Curriculum Reasoning)策略,结合监督预训练与难度感知强化学习,有效提升模型在复杂场景下的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2604.02323
作者: Ruozhen He,Nisarg A. Shah,Qihua Dong,Zilin Xiao,Jaywon Koo,Vicente Ordonez
机构: Rice University; Johns Hopkins University; Northeastern University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 18 figures, Project Page: this https URL
Abstract:Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
[CV-6] Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining CVPR2026
【速读】:该论文旨在解决高质量3D虚拟人建模中 fidelity(保真度)与 generalization(泛化能力)之间的权衡问题:传统多视角工作室数据虽能实现高保真建模并精确控制表情和姿态,但难以泛化到真实世界场景;而基于海量野外数据训练的模型虽具备良好泛化性,却因固有的3D歧义导致重建质量较低。解决方案的关键在于提出一种大规模编解码器虚拟人(Large-Scale Codec Avatars, LCA)框架,首次引入预训练/后训练范式(pre/post-training paradigm)——先在100万段野外视频上预训练以学习广泛的外观与几何先验,再在高质量标注数据上后训练以提升表达力与保真度,从而实现对头发、服装、种族等多样性的强泛化能力,并支持精细面部表情与手指级关节控制,同时展现出零样本鲁棒性、可重光照性和松散衣物支持等涌现特性。
链接: https://arxiv.org/abs/2604.02320
作者: Junxuan Li,Rawal Khirodkar,Chengan He,Zhongshi Jiang,Giljoo Nam,Lingchen Yang,Jihyun Lee,Egor Zakharov,Zhaoen Su,Rinat Abdrashitov,Yuan Dong,Julieta Martinez,Kai Li,Qingyang Tan,Takaaki Shiratori,Matthew Hu,Peihong Guo,Xuhua Huang,Ariyan Zarei,Marco Pesavento,Yichen Xu,He Wen,Teng Deng,Wyatt Borsos,Anjali Thakrar,Jean-Charles Bazin,Carsten Stoll,Ginés Hidalgo,James Booth,Lucy Wang,Xiaowen Ma,Yu Rong,Sairanjith Thalanki,Chen Cao,Christian Häne,Abhishek Kar,Sofien Bouaziz,Jason Saragih,Yaser Sheikh,Shunsuke Saito
机构: Codec Avatars Lab, Meta
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted in CVPR2026. Website: this https URL
Abstract:High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
[CV-7] Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
【速读】:该论文旨在解决训练-free视觉语言导航(Vision-Language Navigation, VLN)代理在3D环境中因缺乏元认知能力而导致的低效行为问题,如局部振荡和重复访问相同区域。现有方法依赖于贪婪的前沿选择策略和被动的空间记忆机制,无法有效监控探索进度、诊断策略失败或动态调整行为。解决方案的关键在于提出MetaNav,一个集成空间记忆、历史感知规划和反射修正机制的元认知导航代理:空间记忆构建持久的3D语义地图;历史感知规划通过惩罚重复访问提升效率;反射修正机制检测探索停滞并利用大语言模型(Large Language Model, LLM)生成校正规则以指导未来的前沿选择,从而显著提升导航的鲁棒性和效率。
链接: https://arxiv.org/abs/2604.02318
作者: Xueying Li,Feng Lyu,Hao Wu,Mingliu Liu,Jia-Nan Liu,Guozi Liu
机构: Central South University (中南大学); Nanjing University (南京大学); State Grid Hubei Electric Power Research Institute (国网湖北省电力科学研究院); Dongguan University of Technology (东莞理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
Abstract:Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
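MetaNav 的历史感知规划通过惩罚重复访问来抑制局部振荡与重复探索。摘要未给出具体评分函数,下面是一个按访问次数线性扣分的假设性示意(前沿名称、基础得分与惩罚系数均为虚构,仅说明"重访惩罚"这一思路):

```python
def select_frontier(frontiers, visit_counts, base_scores, penalty=0.5):
    """在候选前沿点中选出得分最高者:基础得分减去历史访问次数乘以惩罚系数。
    评分形式为假设,非论文原文公式。"""
    def score(f):
        return base_scores[f] - penalty * visit_counts.get(f, 0)
    return max(frontiers, key=score)

frontiers = ["door", "hallway", "kitchen"]
base_scores = {"door": 0.9, "hallway": 0.8, "kitchen": 0.6}
visit_counts = {"door": 2}  # "door" 已被访问两次,得分被压低

print(select_frontier(frontiers, visit_counts, base_scores))  # hallway
```

这种扣分机制使得贪婪选择不再反复回到同一高分区域,与论文中"惩罚重访以提升效率"的描述相一致。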
[CV-8] A Simple Baseline for Streaming Video Understanding
【速读】:该论文旨在解决当前流式视频理解方法中对复杂记忆机制的过度依赖问题,尤其是在处理长视频流时,现有模型往往引入复杂的记忆模块以提升性能,但其有效性尚未得到充分验证。解决方案的关键在于提出一个简单的滑动窗口基线方法——SimpleStream,该方法仅将最近N帧输入到现成的视觉语言模型(Vision-Language Model, VLM)中,无需额外的记忆或检索模块。实验表明,SimpleStream在OVO-Bench和StreamingBench两个基准上均达到或超越了13个主流离线与在线视频大模型的表现,且在仅使用4帧的情况下即可获得67.7%和80.59%的平均准确率,证明了简单设计的有效性。此外,研究揭示了长上下文的价值具有骨干网络依赖性,并发现感知与记忆之间存在稳定权衡关系,从而呼吁未来流式视频评测应区分近期场景感知与长程记忆能力,以更清晰地评估复杂模块的实际贡献。
链接: https://arxiv.org/abs/2604.02317
作者: Yujiao Shen,Shulin Tian,Jingkang Yang,Ziwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
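SimpleStream 的核心仅是把最近 N 帧送入现成 VLM 的滑动窗口。下面用 collections.deque 给出该窗口机制的最小示意(帧以字符串占位,VLM 调用环节省略,接口为假设):

```python
from collections import deque

class SlidingWindowStream:
    """维持最近 N 帧的滑动窗口,作为 SimpleStream 式基线的结构示意。"""
    def __init__(self, window_size: int = 4):
        # maxlen 使旧帧在窗口满后自动被丢弃
        self.frames = deque(maxlen=window_size)

    def push(self, frame):
        self.frames.append(frame)

    def context(self):
        """返回当前窗口内的帧,可直接作为 VLM 的视觉上下文。"""
        return list(self.frames)

stream = SlidingWindowStream(window_size=4)
for t in range(10):          # 模拟 10 帧视频流
    stream.push(f"frame_{t}")
print(stream.context())      # ['frame_6', 'frame_7', 'frame_8', 'frame_9']
```

窗口大小即论文消融中的 N;正文结果表明仅 N=4 时该基线已能匹敌带复杂记忆模块的模型。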
[CV-9] VOID: Video Object and Interaction Deletion
【速读】:该论文旨在解决现有视频物体移除方法在处理物体间复杂物理交互(如碰撞)时无法生成符合物理规律的替代结果的问题。当前方法虽能有效修复物体“背后”的内容及外观伪影(如阴影和反射),但在面对移除物体引发的下游物理变化时表现不佳。解决方案的关键在于提出 VOID 框架,其核心包括:1)利用 Kubric 和 HUMOTO 生成包含因果反事实(counterfactual)场景的配对数据集,以模拟移除物体后物理交互的合理改变;2)借助视觉-语言模型识别受移除物体影响的区域,并以此为引导,驱动视频扩散模型生成符合物理一致性的反事实视频内容。该方法显著提升了移除后场景动态的一致性,使视频编辑模型具备更高层次的因果推理能力。
链接: https://arxiv.org/abs/2604.02296
作者: Saman Motamed,William Harvey,Benjamin Klein,Luc Van Gool,Zhuoning Yuan,Ta-Ying Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing video object removal methods excel at inpainting content “behind” the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
[CV-10] AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging
【速读】:该论文旨在解决医学影像中解剖形状分析里表面配准(surface registration)方法在效率与鲁棒性之间难以平衡的问题。传统局部点匹配方法虽计算高效但易受噪声和初始值影响,而全局点集对齐方法则常伴随高计算成本。其解决方案的关键在于将表面网格建模为概率测度,并将表面配准问题转化为分布优化问题;通过引入具有对数线性复杂度的高效切片 Wasserstein 距离(sliced Wasserstein distance)来衡量两网格间的差异,并提出一种名为 AdamFlow 的新优化方法——该方法将经典的 Adam 优化算法从欧几里得空间推广至概率空间,用于最小化切片 Wasserstein 距离,从而实现快速且稳定的配准性能,在仿射及非刚性配准场景下均表现出优越效果。
链接: https://arxiv.org/abs/2604.02290
作者: Qiang Ma,Qingjie Meng,Xin Hu,Yicheng Wu,Wenjia Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:
Abstract:Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
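论文依赖的切片 Wasserstein 距离可以用 NumPy 简要示意:把两组点(网格顶点视为经验测度)投影到随机方向上,一维最优传输通过排序即可精确求解,这正是摘要中"对数线性复杂度"的来源。以下仅为该距离的示意实现(投影数与随机种子为假设,AdamFlow 优化器本身未在此复现):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Monte-Carlo sliced Wasserstein-2 distance between two equally
    sized point sets; each 1D sub-problem is solved exactly by sorting."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)   # random direction on the unit sphere
        x1d = np.sort(X @ theta)         # sorting = exact 1D optimal transport
        y1d = np.sort(Y @ theta)
        total += np.mean((x1d - y1d) ** 2)
    return float(np.sqrt(total / n_proj))

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
print(sliced_wasserstein(X, X))                              # 相同点云 → 0.0
print(sliced_wasserstein(X, X + np.array([1.0, 0.0, 0.0])))  # 平移克隆,距离约为 |t|/√3
```

排序使每个投影的代价为 O(n log n),这也是该距离适合作为梯度流目标、支撑快速配准的原因。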
[CV-11] Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
【速读】:该论文旨在解决当前多模态大语言模型在从2D扩展到3D生成任务时面临的挑战,即高质量3D数据稀缺导致的3D合成过程欠约束问题。现有方法通常依赖于间接的2D编辑后映射至3D的流程,难以保证几何一致性。其解决方案的关键在于提出Omni123——一个原生支持3D的统一基础模型,通过将文本、图像和3D表示为共享序列空间中的离散token,利用丰富的2D数据作为几何先验来增强3D表征;并引入交错的X-to-X训练范式,在异构配对数据集上协调多种跨模态任务,无需完全对齐的文本-图像-3D三元组,从而在自回归序列中遍历语义-视觉-几何循环(如文本→图像→3D→图像),联合强制语义一致性、外观保真度与多视角几何一致性,显著提升文本引导下的3D生成与编辑性能。
链接: https://arxiv.org/abs/2604.02289
作者: Chongjie Ye,Cheng Cao,Chuanyu Pan,Yiming Hao,Yihao Zhi,Yuanming Hu,Xiaoguang Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
[CV-12] Deep Neural Network Based Roadwork Detection for Autonomous Driving
【速读】:该论文旨在解决道路施工区域对自动驾驶车辆和人类驾驶员带来的重大挑战,因其高度动态性和异质性特征导致感知与导航困难。解决方案的关键在于提出一种实时系统,通过融合YOLO神经网络与LiDAR(Light Detection and Ranging)数据,实现对道路施工对象的检测与定位;该系统能够在行驶过程中识别单个施工物体、将其合并为连贯的施工区域,并以世界坐标系记录其轮廓,从而提升自动驾驶车辆在复杂施工环境中的安全通行能力。
链接: https://arxiv.org/abs/2604.02282
作者: Sebastian Wullrich,Nicolai Steinke,Daniel Goehring
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 10 figures
Abstract:Road construction sites create major challenges for both autonomous vehicles and human drivers due to their highly dynamic and heterogeneous nature. This paper presents a real-time system that detects and localizes roadworks by combining a YOLO neural network with LiDAR data. The system identifies individual roadwork objects while driving, merges them into coherent construction sites and records their outlines in world coordinates. The model training was based on an adapted US dataset and a new dataset collected from test drives with a prototype vehicle in Berlin, Germany. Evaluations on real-world road construction sites showed a localization accuracy below 0.5 m. The system can support traffic authorities with up-to-date roadwork data and could enable autonomous vehicles to navigate construction sites more safely in the future.
[CV-13] Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
【速读】:该论文旨在解决长时对话代理(long-horizon conversational agents)在持续记忆(persistent memory)过程中因无控制的记忆累积导致的时间衰减(temporal decay)和虚假记忆传播(false memory propagation)问题。其解决方案的关键在于提出一种自适应预算遗忘(adaptive budgeted forgetting)框架,通过相关性引导的评分机制(relevance-guided scoring)与有界优化(bounded optimization)共同调节记忆保留策略,综合考虑记忆的新颖性(recency)、频次(frequency)和语义对齐度(semantic alignment),从而在有限上下文约束下维持推理稳定性,显著提升长程F1得分(超过0.583基准水平),增强记忆一致性并降低虚假记忆行为,同时不增加上下文使用量。
链接: https://arxiv.org/abs/2604.02280
作者: Payal Fofadiya,Sunil Tiwari
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevance-guided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond 0.583 baseline levels, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.
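摘要中"综合新颖性、频次与语义对齐,并在预算内保留"的打分思想可以最小化地示意如下(权重、指数衰减形式与样例数据均为笔者假设,并非论文公式):

```python
import math

def retention_scores(memories, now, w_rec=0.4, w_freq=0.3, w_sem=0.3, half_life=50.0):
    """Toy relevance score combining recency, frequency and semantic alignment."""
    scores = []
    for m in memories:
        recency = math.exp(-(now - m["last_used"]) / half_life)  # newer -> closer to 1
        frequency = 1.0 - 1.0 / (1.0 + m["uses"])                # saturating use count
        scores.append(w_rec * recency + w_freq * frequency + w_sem * m["sim"])
    return scores

def forget(memories, now, budget):
    """Bounded optimization sketch: keep only the top-`budget` memories."""
    ranked = sorted(zip(retention_scores(memories, now), memories),
                    key=lambda p: -p[0])
    return [m for _, m in ranked[:budget]]

mems = [
    {"id": "a", "last_used": 95, "uses": 1, "sim": 0.2},  # recent but rarely used, off-topic
    {"id": "b", "last_used": 10, "uses": 9, "sim": 0.9},  # old but frequent and on-topic
    {"id": "c", "last_used": 90, "uses": 3, "sim": 0.8},  # recent, moderately used, on-topic
]
kept = forget(mems, now=100, budget=2)
print([m["id"] for m in kept])  # → ['c', 'b']
```

这里的 `budget` 对应摘要所说的有界优化:无论对话多长,保留的记忆条数恒定,从而阻止记忆无界增长。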
[CV-14] Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
【速读】:该论文旨在解决文本到图像生成模型(text-to-image generative models)在安全性和可控性方面的挑战,尤其是在不损害生成质量的前提下实现可靠的安全控制。现有方法通常依赖于模型微调或精心构建的数据集,但这些方式可能导致生成性能下降或难以扩展。论文提出了一种推理阶段的引导框架(inference-time steering framework),其核心创新在于利用冻结的预训练视觉-语言基础模型(vision-language foundation models)提供的梯度反馈,作为生成过程中的监督信号,从而无需修改底层生成器即可实现安全控制。关键在于将语义信息转化为可直接注入扩散或流匹配模型的“能量估计”(energy-based sampling),使安全引导成为一种模块化、无需训练的机制,同时保持对多样化视觉概念的良好泛化能力与高保真生成效果。
链接: https://arxiv.org/abs/2604.02265
作者: Yaoteng Tan,Zikui Cai,M. Salman Asif
机构: University of California Riverside (加州大学河滨分校); University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
[CV-15] SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation CVPR2026
【速读】:该论文旨在解决基础视觉 Transformer(Vision Transformer, ViT)在需要细粒度空间理解的任务中表现受限的问题,尤其是在密集预测场景(如基于 ViT 的视觉语言模型进行开放词汇分割)中,由于预训练分辨率固定和固有的粗粒度 patch 级表示,难以实现高分辨率下的精确像素级推理。现有方法通常采用滑动窗口策略在预训练分辨率下处理大分辨率图像,虽能提升精度但计算开销显著。解决方案的关键在于提出 SPAR(Single-Pass Any-Resolution ViT),通过特征回归损失将一个细粒度滑动窗口教师模型的空间推理能力蒸馏到单次前向传播的学生模型中,无需架构改动或像素级监督,从而实现高效、任意分辨率的密集特征提取。
链接: https://arxiv.org/abs/2604.02252
作者: Naomi Kombol,Ivan Martinović,Siniša Šegvić,Giorgos Tolias
机构: Faculty of Electrical Engineering and Computing (电气工程与计算学院); VRG, Faculty of Electrical Engineering (VRG,电气工程学院); University of Zagreb (萨格勒布大学); Czech Technical University in Prague (布拉格捷克理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: this https URL
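SPAR 的蒸馏目标可以抽象为:让单次前向的学生特征去回归滑窗教师产生的稠密特征,无需像素级标注。下面用一维玩具信号示意这一特征回归损失(窗口聚合方式与损失形式均为示意性假设,不代表论文实现):

```python
import numpy as np

def sliding_window_teacher(signal, win=4, stride=1):
    """Stand-in for the finely-strided teacher: aggregate overlapping
    window responses into one dense feature per position (1D toy)."""
    out = np.zeros_like(signal, dtype=float)
    cnt = np.zeros_like(signal, dtype=float)
    for s in range(0, len(signal) - win + 1, stride):
        out[s:s + win] += signal[s:s + win].mean()  # per-window "feature"
        cnt[s:s + win] += 1
    return out / np.maximum(cnt, 1)

def feature_regression_loss(student_feat, teacher_feat):
    # SPAR-style objective: regress single-pass student features onto the
    # dense features of the sliding-window teacher (no pixel labels needed).
    return float(np.mean((student_feat - teacher_feat) ** 2))

signal = np.arange(8.0)
teacher = sliding_window_teacher(signal)
print(feature_regression_loss(teacher, teacher))        # 完美学生 → 0.0
print(feature_regression_loss(teacher + 0.5, teacher))  # 常数偏移 → 0.25
```

训练完成后推理只需学生一次前向,这正是摘要中"蒸馏后甚至超越教师"结论所依赖的效率来源。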
[CV-16] UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
【速读】:该论文旨在解决无人机(UAV)在复杂动态城市场景中执行多模态视觉跟踪任务时面临的挑战,尤其是现有视觉-语言-动作(VLA)模型存在的时序特征冗余和缺乏空间几何先验的问题。解决方案的关键在于提出一种改进的VLA跟踪模型——UAV-Track VLA,其核心创新包括:1)引入时序压缩网络以高效捕捉帧间动态信息,缓解时序冗余;2)设计并行双分支解码器结构,包含一个空间感知辅助定位头与一个光流匹配动作专家,实现跨模态特征解耦与细粒度连续动作生成。该方法在CARLA仿真环境中验证了优越的端到端性能,在长距离行人跟踪任务中取得61.76%的成功率和269.65帧平均跟踪长度,并具备零样本泛化能力及显著降低的单步推理延迟(减少33.4%,降至0.0571秒),从而实现了高效率、实时的无人机控制。
链接: https://arxiv.org/abs/2604.02241
作者: Qiyao Zhang,Shuhua Zheng,Jianli Sun,Chengxiang Li,Xianke Wu,Zihan Song,Zhiyong Cui,Yisheng Lv,Yonglin Tian
机构: Beijing Institute of Technology (北京理工大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Sanya (三亚大学); Beijing University of Posts and Telecommunications (北京邮电大学); Hunan University (湖南大学); Beihang University (北京航空航天大学); Flying Intelligence Team (虚拟研究社区)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the π0.5 architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original π0.5, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: this https URL_VLA.
[CV-17] SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition ICPR2026
【速读】:该论文旨在解决零样本骨架动作识别(Zero-shot skeleton-based action recognition, ZSAR)中因动作名称语义不充分导致细粒度动态信息难以区分,以及未见类别间语义混淆所引发的识别性能下降问题。其解决方案的关键在于提出一种轻量且确定性的语义与置信度感知列表能量框架(SCALE),将ZSAR建模为类别条件能量排序任务;通过构建冻结文本表征参数化的条件变分自编码器(Conditional Variational Autoencoder, CVAE),实现无需生成样本即可对未见类别进行似然评估;同时引入语义与置信度感知的列表能量损失函数,强化语义相近难负例的区分能力,并利用后验不确定性自适应调整决策边界,重加权模糊训练实例;此外,采用潜在原型对比目标使后验均值对齐文本引导的潜在原型,提升语义组织性和类别可分性,从而在NTU-60和NTU-120数据集上显著优于基于VAE和对齐的基线方法,且媲美扩散模型性能。
链接: https://arxiv.org/abs/2604.02222
作者: Soroush Oraki,Feng Ding,Jie Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICPR 2026
Abstract:Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
[CV-18] UniDriveVLA: Unifying Understanding Perception and Action Planning for Autonomous Driving
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶场景中面临的感知与推理冲突问题:即在提升空间感知能力(如3D环境理解)时,常会损害模型原有的语义推理能力(如场景理解与决策)。现有方法要么依赖2D视觉语言模型导致空间感知不足,要么引入3D表示破坏语义一致性。其关键解决方案是提出UniDriveVLA——一种基于混合Transformer架构的统一驾驶VLA模型,通过专家解耦机制将驱动理解、场景感知和行为规划三个任务分别建模为独立专家,并利用掩码联合注意力机制进行协同控制,从而实现感知与推理的分离优化;同时结合稀疏感知范式与三阶段渐进式训练策略,在保持语义推理能力的同时显著增强空间感知性能。
链接: https://arxiv.org/abs/2604.02190
作者: Yongkang Li,Lijun Zhou,Sixu Yan,Bencheng Liao,Tianyi Yan,Kaixin Xiong,Long Chen,Hongwei Xie,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Haiyang Sun,Xinggang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: code has been released at this https URL
Abstract:Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at this https URL
[CV-19] Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention
【速读】:该论文旨在解决高速公路场景下车道检测的鲁棒性与实时性问题,特别是在复杂动态环境中的准确性和计算效率挑战。其核心解决方案是提出一种轻量级、端到端的车道检测架构,通过融合3D卷积神经网络(3D CNN)与实例分割技术,构建两个模型:第一个模型利用特征金字塔网络(FPN)和自注意力机制增强多尺度空间特征表示;第二个模型引入感兴趣区域(ROI)检测头,聚焦于车道相关区域,从而在显著降低误检率的同时减少计算复杂度。实验表明,第二模型在TuSimple数据集上达到93.40%的准确率,且参数更少、延迟更低,具备良好的ADAS集成潜力。
链接: https://arxiv.org/abs/2604.02188
作者: Sorna Shanmuga Raja,Abdelhafid Zenati
机构: City, St George’s University of London (城市大学伦敦圣乔治学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City, St George’s University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).
[CV-20] CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification
【速读】:该论文旨在解决胸部X光(CXR)图像中已知病灶的多标签分类(multi-label classification)与未知病灶的零样本分类(zero-shot classification)问题。针对不同投影角度的CXRs,研究通过将特定投影的模型整合进一个统一的分类网络框架来提升泛化能力;在零样本分类任务中,提出一种新颖的双分支架构,融合对比学习(contrastive learning)、非对称损失(Asymmetric Loss, ASL)和大语言模型(LLM)生成的描述性提示(descriptive prompts),有效缓解长尾分布不均衡问题并增强零样本迁移性能。关键创新在于结合多模态提示与结构化损失函数,实现对未见类别的高精度识别,同时借助强数据增强和测试时增强(test-time augmentation, TTA)确保整体鲁棒性。
链接: https://arxiv.org/abs/2604.02185
作者: Juno Cho(1),Dohui Kim(2),Mingeon Kim(1),Hyunseo Jang(3),Chang Sun Lee(4),Jong Chul Ye(4) ((1) KAIST, (2) GIST, (3) Korea University, (4) KAIST Graduate School of AI)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures. Accepted to the IEEE ISBI 2026 CXR-LT Challenge
Abstract:This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.
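其中非对称损失(ASL)的公开形式是:对负样本使用更大的聚焦指数 γ⁻,并通过概率平移 m 直接忽略大量易分负样本,这正是摘要中缓解长尾不均衡的关键。以下为按该公式写的 NumPy 示意(超参数取文献常用默认值,并非该参赛方案的确切实现):

```python
import numpy as np

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05, eps=1e-8):
    """Asymmetric Loss for multi-label classification: negatives get a larger
    focusing exponent, and the probability shift `clip` zeroes out the
    contribution of easy negatives with p < clip."""
    probs = np.clip(probs, eps, 1 - eps)
    p_neg = np.clip(probs - clip, eps, 1 - eps)  # probability shifting for negatives
    loss_pos = targets * (1 - probs) ** gamma_pos * np.log(probs)
    loss_neg = (1 - targets) * p_neg ** gamma_neg * np.log(1 - p_neg)
    return float(-np.sum(loss_pos + loss_neg))

probs = np.array([0.9, 0.01, 0.8])   # confident positive, easy negative, hard negative
targets = np.array([1.0, 0.0, 0.0])
print(asymmetric_loss(probs, targets))  # → ~0.544:易分负样本几乎不贡献损失
```

可以看到,p=0.01 的易分负样本经概率平移后贡献趋近于零,损失主要来自正样本与 p=0.8 的难负样本,从而使罕见病灶类别不会被海量负样本淹没。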
[CV-21] Reflection Generation for Composite Image Using Diffusion Model
【速读】:该论文旨在解决图像合成中反射生成(reflection generation)这一长期被忽视的问题,即如何在将前景物体嵌入背景时,自动生成与环境物理一致且视觉逼真的反射效果。其解决方案的关键在于:首先,将反射位置和外观的先验信息注入基础扩散模型(foundation diffusion model);其次,区分反射为两类并采用类型感知(type-aware)的模型设计;最后,构建了首个大规模物体反射数据集 DEROBA 以支持训练。实验表明,该方法能生成物理上一致且视觉真实的反射,为反射生成任务设立了新基准。
链接: https://arxiv.org/abs/2604.02168
作者: Haonan Zhao,Qingyang Liu,Jiaxuan Chen,Li Niu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject the prior information of reflection placement and reflection appearance into foundation diffusion model. We also divide reflections into two types and adopt type-aware model design. To support training, we construct the first large-scale object reflection dataset DEROBA. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
[CV-22] Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation CVPR2026
【速读】:该论文旨在解决面部动作单元(Action Unit, AU)检测模型评估中因交叉验证协议本身引入的随机性导致性能提升难以准确判断的问题。研究发现,标准的分组交叉验证(subject-exclusive cross-validation)会带来可测量的随机方差,尤其在低频AU上表现更为显著,且操作点指标(如F1分数)波动远大于阈值无关指标(如AUC),甚至可能导致模型排名随划分不同而改变。为此,作者提出采用留一数据集外(Leave-One-Dataset-Out, LODO)跨数据集验证策略,以消除单一数据集内划分随机性,揭示出单数据集交叉验证下被掩盖的领域级不稳定现象。解决方案的关键在于通过LODO协议获得更稳定、可解释的模型评估结果,从而避免将由协议噪声带来的“改进”误判为真实性能提升。
链接: https://arxiv.org/abs/2604.02162
作者: Saurabh Hinduja,Gurmeet Kaur,Maneesh Bilalpur,Jeffrey Cohn,Shaun Canavan
机构: CGI Technologies and Solutions Inc(CGI技术与解决方案公司); University of Pittsburgh(匹兹堡大学); University of South Florida(南佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of ±0.065 in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
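"划分本身引入噪声"这一现象可以自行复现:固定同一份模型预测不变,仅重复随机的被试独占折划分,平均 F1 仍会波动。以下为纯 NumPy 的合成示意实验(被试数、流行率与预测翻转率均为笔者假设,数值不对应论文):

```python
import numpy as np

def f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / max(2 * tp + fp + fn, 1)

def split_noise(y_true, y_pred, subjects, n_repeats=30, k=3, seed=0):
    """Std of mean F1 across repeated subject-exclusive k-fold splits.
    Predictions are FIXED; only the random fold assignment changes."""
    rng = np.random.default_rng(seed)
    uniq = np.unique(subjects)
    scores = []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(uniq), k)
        fold_f1 = [f1(y_true[np.isin(subjects, f)], y_pred[np.isin(subjects, f)])
                   for f in folds]
        scores.append(np.mean(fold_f1))
    return float(np.std(scores))

rng = np.random.default_rng(1)
subjects = np.repeat(np.arange(30), 20)            # 30 个被试,每人 20 个样本
prev = np.repeat(rng.uniform(0.05, 0.5, 30), 20)   # 各被试标签流行率不同
y_true = (rng.random(600) < prev).astype(int)
y_pred = np.where(rng.random(600) < 0.2, 1 - y_true, y_true)  # 固定的不完美"模型"
print(split_noise(y_true, y_pred, subjects))
```

输出的标准差即来自划分本身的"噪声底":标签流行率越低、被试间差异越大,该值越大,这与摘要中低频 AU 波动更大的观察一致。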
[CV-23] CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
【速读】:该论文旨在解决开放词汇变化检测(Open-vocabulary Change Detection, OVCD)中免训练(training-free)场景下密集概念响应难以直接比较的问题,特别是在存在外观变化、弱跨概念竞争及土地覆盖类别空间连续性导致的噪声、碎片化和语义不可靠变化证据的情况下。其解决方案的关键在于提出了一种无训练的密集推理框架——一致性正则化开放词汇变化检测(Consistency-Regularized Open-Vocabulary Change Detection, CoRegOVCD),通过将特定概念的变化重新表述为校准后验差异(calibrated posterior discrepancy)来增强语义变化证据的可比性:其中竞争后验校准(Competitive Posterior Calibration, CPC)与语义后验差值(Semantic Posterior Delta, SPD)将原始概念响应转化为竞争感知的查询概念后验并量化其跨时间差异;几何-令牌一致性门控(Geometry-Token Consistency Gate, GeoGate)与区域共识差异(Regional Consensus Discrepancy, RCD)进一步通过几何感知结构验证和区域共识机制抑制无效响应并提升空间一致性。
链接: https://arxiv.org/abs/2604.02160
作者: Weidong Tang,Hanbin Sun,Zihan Li,Yikai Wang,Feifan Zhang
机构: China Agricultural University (中国农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1 _C points and reaches a six-class average of 47.50% F1 _C on SECOND.
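CPC 与 SPD 的"校准后验差异"可以最小化示意为:对每个时相的概念响应做温度 softmax 得到竞争感知后验,再取查询概念后验的跨时相绝对差作为变化证据。以下为示意代码(温度取值为笔者假设,GeoGate 与 RCD 未实现):

```python
import numpy as np

def concept_posterior(responses, tau=0.1):
    """Competition-aware posterior over candidate concepts: a temperature
    softmax over per-pixel concept responses (stand-in for CPC)."""
    z = responses / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def semantic_posterior_delta(resp_t1, resp_t2, query_idx):
    """SPD sketch: cross-temporal discrepancy of the queried concept's
    posterior; large values mark likely semantic change of that concept."""
    p1 = concept_posterior(resp_t1)[..., query_idx]
    p2 = concept_posterior(resp_t2)[..., query_idx]
    return np.abs(p2 - p1)

# 两像素"影像"、3 个候选概念;像素 0 从概念 0 变为概念 1,像素 1 不变。
t1 = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t2 = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
delta = semantic_posterior_delta(t1, t2, query_idx=0)
print(delta)  # 变化像素处接近 1,稳定像素处为 0
```

与直接相减原始响应相比,softmax 引入了跨概念竞争,使响应幅值的时相差异(如光照变化)不会被误判为语义变化。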
[CV-24] CASHG: Context-Aware Stylized Online Handwriting Generation
【速读】:该论文旨在解决生成自然语句级别的在线手写轨迹(online handwriting trajectory)时,如何忠实体现书写者风格的问题,尤其在字符间的连笔连续性(inter-character connectivity)和间距一致性(spacing consistency)方面存在挑战。传统方法将这些边界特性视为序列建模的隐式结果,在句子尺度下可靠性不足且在组合多样性有限时表现不佳。解决方案的关键在于提出CASHG(Context-aware Stylized Online Handwriting Generator),其核心创新是显式建模字符间的连接关系:通过Character Context Encoder提取字符身份与句子依赖的上下文记忆,并在基于双词(bigram)感知的滑动窗口Transformer解码器中融合这些信息,强调局部前驱-当前字符过渡;同时引入三阶段课程学习策略,从孤立字形逐步过渡到完整句子,提升稀疏转换覆盖下的鲁棒性。
链接: https://arxiv.org/abs/2604.02103
作者: Jinsu Shin,Sungeun Hong,Jin Yeong Bak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 42 pages, 19 figures
Abstract:Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer’s style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor–current transitions, complemented by gated context fusion for sentence-level coherence. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
[CV-25] LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
【速读】:该论文旨在解决现有统一模型(Unified Models, UMs)在跨模态理解与生成任务中因视觉表征分离而导致的效率低下和性能受限问题,尤其是像素空间解码带来的冗余计算与Codec偏差。其解决方案的关键在于提出LatentUM,通过将所有模态映射到共享的语义潜在空间(shared semantic latent space),消除了视觉理解与生成之间对像素空间中介的依赖,从而自然实现灵活的交错式跨模态推理与生成,显著提升计算效率并增强跨模态对齐能力。
链接: https://arxiv.org/abs/2604.02097
作者: Jiachun Jin,Zetong Zhou,Xiao Yang,Hao Zhang,Pengfei Liu,Jun Zhu,Zhijie Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
[CV-26] GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding CVPR2026
【速读】:该论文旨在解决视频大语言模型(Vid-LLMs)在视频时序定位(Video Temporal Grounding, VTG)任务中因采用均匀帧采样导致关键帧分布稀疏、重要时序线索丢失的问题。其解决方案的核心是提出一种基于精粒度查询引导的视觉令牌采样机制——Grounded Visual Token Sampling (GroundVTS),该机制通过筛选最具信息量的时序片段,在输入大语言模型(LLM)前保留关键的时空特征并维持时序一致性;同时引入渐进式优化策略,使LLM能够适应视觉特征的非均匀分布,从而增强对时序依赖关系的建模能力,实现更精准的视频定位。
链接: https://arxiv.org/abs/2604.02093
作者: Rong Fan,Kaiyan Xiao,Minghao Zhu,Liuyi Wang,Kai Dai,Zhao Yang
机构: Newcapec AI Research (新大陆人工智能研究院); Fudan University (复旦大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published as a conference paper at CVPR 2026
Abstract:Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and 12.0-point improvement in mAP for highlight detection. Code is available at this https URL.
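摘要描述的机制是"按查询过滤视觉 token,同时保持时序一致性"。下面用 NumPy 做一个假设性草图:按帧与文本查询的相似度取 top-k,再按时间索引排序(函数名、相似度形式均为示意,非 GroundVTS 的原始实现):

```python
import numpy as np

def query_guided_sampling(frame_feats, query_feat, k):
    """假设性草图:保留与查询最相关的 k 帧,并恢复时间顺序。

    frame_feats: (T, D) 帧特征;query_feat: (D,) 查询特征。
    具体的细粒度过滤机制以论文为准。
    """
    sims = frame_feats @ query_feat      # 每帧与查询的点积相似度
    topk = np.argsort(sims)[-k:]         # 相似度最高的 k 帧
    return np.sort(topk)                 # 按时间索引排序,维持时序一致性

frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
query = np.array([1.0, 0.0])
print(query_guided_sampling(frames, query, 2))  # [0 2]
```

与均匀采样相比,这种采样让关键片段的 token 在送入 LLM 前就被保留下来。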
[CV-27] Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology
【速读】:该论文旨在解决宫颈细胞学图像(Pap smear images)中细胞检测的难题,特别是由于细胞密集分布和复杂形态导致的自动分析挑战。其关键解决方案在于:首先将检测任务建模为中心点预测问题,以适配数据集中固定尺寸边界框的标注特性;其次引入中心保持型数据增强策略与几何优化的边界框调整方法,有效缓解定位抖动问题;最后通过针对不同任务的损失权重调优,实现对Track A和Track B两个子任务的性能提升。整体方案基于Co-DINO框架与Swin-Large骨干网络,构建了一个高效且鲁棒的多尺度特征提取与细胞定位pipeline,最终在RIVA宫颈细胞学挑战赛中分别获得Track A第2名和Track B第1名。
链接: https://arxiv.org/abs/2604.02090
作者: Yan Kong,Yuan Yin,Hongan Chen,Yuqi Fang,Caifeng Shan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ISBI 2026 Accepted Paper Winning Solution for the RIVA Cervical Cytology Challenge
Abstract:Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset’s unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at this https URL.
[CV-28] FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
【速读】:该论文旨在解决现有基于学习的连续图像编辑方法中因依赖辅助模块和合成监督信号而导致的训练开销大、滑块行为受训练分布耦合、在编辑或领域分布变化时可靠性下降的问题。其解决方案的关键在于提出一种无需训练的FlowSlider方法,通过将Rectified Flow中的更新分解为两个近似正交的组成部分:一是保持源图像保真度的稳定性项(fidelity term),用于维持图像结构与身份特征;二是驱动语义转换的目标导向项(steering term),控制编辑方向。由于二者正交性,仅通过缩放steering term即可实现稳定且平滑的编辑强度控制,从而在不进行后训练的情况下实现高质量、可靠的连续编辑。
链接: https://arxiv.org/abs/2604.02088
作者: Taichi Endo,Guoqing Hao,Kazuhiko Sumi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: HuggingFace Space: this https URL
Abstract:Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit’s update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
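摘要的核心操作可以抽象为:把更新分解为保真项与引导项,仅用滑块系数缩放引导项。下面用普通向量做一个假设性示意(真实方法作用于 Rectified Flow 的速度场,`slider_update` 及分解形式仅为说明思想,非论文的精确公式):

```python
import numpy as np

def slider_update(v_src, v_tgt, alpha):
    """假设性分解草图:fidelity 项保持不变,仅用 alpha 缩放 steering 项。

    v_src / v_tgt 用向量模拟源条件与目标条件下的更新方向。
    """
    fidelity = v_src              # 保真项:维持源图像的结构与身份
    steering = v_tgt - v_src      # 引导项:指向目标语义的方向
    return fidelity + alpha * steering

v_src = np.array([1.0, 0.0])
v_tgt = np.array([1.0, 2.0])
print(slider_update(v_src, v_tgt, 0.0))  # [1. 0.] (alpha=0: 完全保真)
print(slider_update(v_src, v_tgt, 0.5))  # [1. 1.] (alpha=0.5: 中等编辑强度)
```

由于两项近似正交,改变 alpha 只沿引导方向移动结果,保真分量基本不受干扰,这正是滑块式连续控制稳定的原因。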
[CV-29] PLUME: Latent Reasoning Based Universal Multimodal Embedding
【速读】:该论文旨在解决通用多模态嵌入(Universal Multimodal Embedding, UME)中因依赖显式链式思维(Chain-of-Thought, CoT)推理而导致的推理开销大、多模态证据易被压缩至文本瓶颈的问题。解决方案的关键在于提出PLUME框架,其通过引入隐式推理机制,用连续潜变量状态的自回归滚动替代显式CoT生成,并结合语义锚引导的过渡适配器,在固定计算预算下实现多样化推理路径的动态控制;同时采用从显式到潜式的渐进式训练课程(progressive explicit-to-latent curriculum),在训练中逐步将显式推理行为迁移至隐藏状态计算,最终在推理阶段完全消除显式CoT,从而显著提升效率(超过30倍加速)并保持高性能,尤其适用于视频和视觉文档等结构复杂、证据密集的检索场景。
链接: https://arxiv.org/abs/2604.02073
作者: Chenwei He,Xiangzhao Hao,Tianyu Yang,Yuxiang Ma,Yuheng Jia,Lingxiang Wu,Chaoyang Zhao,Haiyun Guo,Jinqiao Wang
机构: Southeast University (东南大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
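PLUME 用不到 10 步的连续潜状态自回归滚动替代成百上千个显式推理 token。下面是一个与具体模型无关的滚动骨架草图(`step_fn` 为假设的状态转移函数,这里用线性映射代替论文中学习得到的过渡适配器):

```python
import numpy as np

def latent_rollout(h0, step_fn, n_steps=8):
    """假设性草图:对连续潜状态做短自回归滚动,末态可送入嵌入头。

    n_steps 对应摘要中 "fewer than 10 latent steps" 的设定。
    """
    h = h0
    for _ in range(n_steps):
        h = step_fn(h)   # 每步由上一潜状态生成下一状态
    return h

# 用一个简单的线性收缩映射充当转移函数(纯属示意)
step = lambda h: 0.5 * h + 1.0
final = latent_rollout(np.array([0.0]), step, n_steps=8)
print(final)  # 收敛到不动点 2.0 附近
```

推理成本从生成数百个文本 token 降为固定步数的隐状态前向计算,这就是摘要中 30 倍加速的来源。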
[CV-30] Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection CVPR2026
【速读】:该论文旨在解决人类-物体交互(Human-Object Interaction, HOI)检测任务中因缺乏对场景内多尺度上下文信息充分建模而导致的交互理解不足问题。现有方法虽借助视觉语言模型(Vision-Language Models, VLMs)引入语义先验,但未能有效利用分布在整幅图像中的多样化上下文线索。其解决方案的关键在于提出实例中心的上下文挖掘网络(Instance-centric Context Mining Network, InCoM-Net),通过两个核心模块实现:一是实例中心上下文精炼(Instance-centric Context Refinement, ICR),从VLM提取的特征中分别捕获实例内部、实例之间及全局场景三类上下文信息;二是渐进式上下文聚合(Progressive Context Aggregation, ProCA),迭代融合多源上下文特征与检测器生成的实例级特征,从而支持更深层次的交互推理。该设计显著提升了HOI检测的准确性和泛化能力,在HICO-DET和V-COCO基准上达到当前最优性能。
链接: https://arxiv.org/abs/2604.02071
作者: Soo Won Seo,KyungChae Lee,Hyungchan Cho,Taein Son,Nam Ik Cho,Jun Won Choi
机构: Seoul National University (首尔国立大学); Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026. Code: this https URL
Abstract:Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at this https URL.
[CV-31] Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement
【速读】:该论文旨在解决传统双边计量方法难以捕捉复杂经济结构关系的问题,从而提升对支付流的预测精度与经济状态的实时监测能力。其解决方案的关键在于引入图论特征(如中心性度量和聚类系数)来增强时间序列模型,显著提升了支付流预测准确率(提升8.8个百分点),尤其在经济中断时期表现突出——例如新冠疫情中传统模型性能大幅下降(R²从0.38降至0.19),而网络增强模型仍保持较高准确性,贡献达+13.8个百分点。研究进一步识别出金融服务业、批发贸易和专业服务行业在支付网络中具有结构性中心地位,揭示了其系统重要性超越交易规模本身。
链接: https://arxiv.org/abs/2604.02068
作者: Aditya Humnabadkar
机构: Office for National Statistics (英国国家统计局)
类目: Computer Vision and Pattern Recognition (cs.CV); Econometrics (econ.EM)
备注: Accepted for Poster presentation at the ESCoE Conference on Economic Measurement 2026
Abstract:Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017–2024) across 89 industry sectors, we demonstrate that graph-theoretic features, including centrality measures and clustering coefficients, improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed (R² falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.
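摘要中作为预测特征的中心性、聚类系数与网络密度都是标准图论量,可以直接用 networkx 计算。下面用一个虚构的小型行业支付网络示意(行业名与流量数值均为示例,非论文数据):

```python
import networkx as nx

# 虚构的行业间支付流:节点为行业,边权为支付金额
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("Financial", "Wholesale", 120.0),
    ("Financial", "Professional", 80.0),
    ("Wholesale", "Professional", 60.0),
    ("Professional", "Financial", 90.0),
])

centrality = nx.degree_centrality(G)            # 中心性特征(入度+出度归一化)
clustering = nx.clustering(G.to_undirected())   # 聚类系数特征
density = nx.density(G)                         # 网络密度(论文观察到其随时间上升)
print(round(centrality["Financial"], 2), round(density, 2))  # 1.5 0.67
```

此类逐期计算出的网络特征即可与传统时间序列特征拼接,作为预测模型的额外输入。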
[CV-32] CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
【速读】:该论文旨在解决在复杂多物体场景中,当多个物体具有相同功能(affordance)但仅有一个符合任务意图时,如何准确识别并定位正确目标物体的问题,即“混淆对”(confusing pairs)下的3D affordance grounding问题。现有方法通常在孤立单物体上评估,且依赖显式类别标签,难以应对自然语言指令下隐含意图的场景。解决方案的关键在于提出CompassNet框架,其核心创新为两个模块:Instance-bounded Cross Injection(ICI)通过在对象边界内约束语言与几何特征对齐,防止跨对象语义泄漏;Bi-level Contrastive Refinement(BCR)在几何组和点级双层次强化对比学习,显著提升目标与混淆表面之间的区分度。
链接: https://arxiv.org/abs/2604.02060
作者: Jingliang Li,Jindou Jia,Tuo An,Chuhao Zhou,Xiangyu Chen,Shilin Shan,Boyu Ma,Bofan Lyu,Gen Li,Jianfei Yang
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Code available at: this http URL
Abstract:When told to “cut the apple,” a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
[CV-33] COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing
【速读】:该论文旨在解决多模态感知中因模态缺失导致的融合不完整问题,即现有方法在面对部分模态缺失时,常通过丢弃缺失分支、使用特定子集的融合策略或重建缺失特征来应对,这使得融合头接收到的输入结构与训练阶段不同,从而削弱跨模态交互能力。解决方案的关键在于提出COMPASS框架,其核心思想是“融合完整性”(fusion completeness):无论是否存在缺失模态,融合头始终接收固定N槽的多模态输入(每槽一个token)。对于每个缺失模态,COMPASS利用共享潜在空间中的成对源到目标生成器,从已观测模态合成目标特定的代理token(proxy token),并将这些代理token聚合为单一替换token,以维持完整的输入结构。为确保代理token既具备表示兼容性又携带任务信息,还引入了代理对齐、共享空间正则化和每个代理的判别监督机制。实验表明,该方法在多种缺失场景下显著优于现有方法,验证了保持模态完整融合接口作为鲁棒多模态感知设计原则的有效性。
链接: https://arxiv.org/abs/2604.02056
作者: Hao Wang,Yanyu Qian,Pengcheng Weng,Zixuan Xia,William Dan,Yangxin Xu,Fei Wang
机构: Universität Bern(伯尔尼大学); Xi’an Jiaotong University(西安交通大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
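COMPASS 的核心原则是"融合头始终接收固定 N 槽输入":缺失模态由成对 source→target 生成器合成代理 token 并聚合为单一替换 token。下面用随机线性映射充当生成器做假设性示意(真实方法是在共享潜空间中学习的模块,此处的 `GEN`、`complete_slots` 均为虚构):

```python
import numpy as np

rng = np.random.default_rng(0)
MODS, D = ["rgb", "depth", "wifi"], 4
# 用随机线性映射模拟成对 source→target 代理生成器
GEN = {(s, t): rng.standard_normal((D, D)) * 0.1
       for s in MODS for t in MODS if s != t}

def complete_slots(tokens):
    """缺失模态:由每个已观测模态生成一个代理,再平均为单一替换 token,
    保证输出始终是完整的 N 槽结构(每槽一个 token)。"""
    observed = [m for m in MODS if tokens.get(m) is not None]
    out = {}
    for m in MODS:
        if m in observed:
            out[m] = tokens[m]
        else:
            out[m] = np.mean([tokens[s] @ GEN[(s, m)] for s in observed], axis=0)
    return out

tokens = {"rgb": np.ones(D), "depth": np.ones(D), "wifi": None}
full = complete_slots(tokens)
print(sorted(full), full["wifi"].shape)  # 三个槽位全部就位
```

这样无论缺失哪些模态,融合头看到的输入结构与训练阶段完全一致,即摘要所称的 fusion completeness。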
[CV-34] rue to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines
【速读】:该论文旨在解决虚拟人类(Virtual Human, VH)渲染中面部肤色再现不准确的问题,这会导致真实感下降、身份识别失真以及公平性偏差。现有Avatar生成流程多依赖未经色度校准的摄影输入,难以保证肤色的一致性和跨群体的公平性。其解决方案的关键在于提出一种全自动、可扩展的方法论,系统评估VH生成全流程中的肤色保真度:通过整合皮肤颜色与光照提取、纹理重着色、实时渲染及定量色彩分析,采用芝加哥面孔数据库(Chicago Face Database, CFD)图像进行实验,比较基于脸颊区域采样与全脸多维掩码的肤色提取策略,并结合预训练的TRUST光照隔离框架,在MetaHuman纹理上应用提取肤色并在多种光照条件下渲染,最终以CIELAB空间中的ΔE和个体类型角(Individual Typology Angle, ITA)为指标客观评价肤色一致性。该方法无需人工干预且除预训练模块外无学习阶段,具备低计算成本和大规模评估能力,揭示了肤色提取策略存在种族表型依赖性,且深肤色样本普遍表现出更高的色度误差。
链接: https://arxiv.org/abs/2604.02055
作者: Gabriel Ferri Schneider,Erick Menezes,Rafael Mecenas,Paulo Knob,Victor Araujo,Soraia Raupp Musse
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 10 figures
Abstract:Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the ΔE metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances. Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
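摘要中的两个指标都有标准定义:ΔE(这里取最简单的 CIE76 变体,论文可能使用其他变体)是 CIELAB 空间中的欧氏距离,ITA = arctan((L* − 50)/b*)。下面给出最小实现(示例颜色值为虚构):

```python
import math

def delta_e76(lab1, lab2):
    """CIE76 色差:CIELAB 空间欧氏距离。论文未指明 ΔE 变体,此处取 CIE76。"""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))

def ita_degrees(L, b):
    """个体类型角 ITA(度),皮肤色调分型的标准公式。"""
    return math.degrees(math.atan2(L - 50.0, b))

src = (65.0, 12.0, 18.0)       # 从照片提取的肤色 (L*, a*, b*),虚构示例
rendered = (60.0, 12.0, 22.0)  # 渲染后 VH 的肤色,虚构示例
print(round(delta_e76(src, rendered), 2))     # 6.4
print(round(ita_degrees(src[0], src[2]), 1))  # 39.8
```

ΔE 衡量照片肤色与渲染肤色的色差大小,ITA 则把肤色落到标准分型区间上,二者结合即可量化流程各环节的肤色保真度。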
[CV-35] Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
【速读】:该论文旨在解决多语言视觉-语言模型(Vision-Language Models, VLMs)在非英语语境下训练数据匮乏的问题,尤其是日语等低资源语言缺乏大规模、多样化且高质量的视觉问答(Visual Question Answering, VQA)数据集,从而限制了高性能非英语VLMs的构建。解决方案的关键在于提出并构建了目前最大的日语多模态后训练数据集Jagle,其包含约920万条跨多样任务的数据实例;通过收集异构来源(如图像、图文对和PDF文档),并采用基于生成式AI(Generative AI)的VQA生成、翻译与文本渲染等多种策略合成高质量VQA对,有效克服了现有日语VQA资源规模小、覆盖范围窄的瓶颈。实验表明,使用Jagle训练的2.2B参数模型在十项日语评估任务上表现优异,超越InternVL3.5-2B,并且与英文数据集FineVision联合训练时还能提升英文性能,验证了该方案的有效性与通用性。
链接: https://arxiv.org/abs/2604.02048
作者: Issa Sugiura,Keito Sasagawa,Keisuke Nakao,Koki Maeda,Ziqi Yin,Zhishen Yang,Shuhei Kurita,Yusuke Oda,Ryoko Tokuhisa,Daisuke Kawahara,Naoaki Okazaki
机构: Kyoto University (京都大学); NII LLMC (日本信息研究所语言模型研究中心); Waseda University (早稻田大学); Institute of Science Tokyo (东京科学研究所); NII (日本信息研究所); Aichi Institute of Technology (爱知工科大学); Institute of Physical and Chemical Research (理化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7 figures
Abstract:Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
[CV-36] Efficient Reasoning via Thought Compression for Language Segmentation
【速读】:该论文旨在解决生成式AI(Generative AI)在语言引导分割任务中因链式思维(Chain-of-thought, CoT)推理产生冗长解释而导致计算成本过高、难以部署于实际场景的问题。其解决方案的关键在于提出一种名为WISE(Wisdom from Internal Self-Exploration)的新范式,通过训练模型生成一个结构化的三段式序列:简洁的推理过程(concise rationale)、最终答案和详细的解释(detailed explanation)。该结构利用自回归条件约束使简洁推理成为生成详细解释的充分前提,并通过自蒸馏目标联合优化语义保真度与简洁性,促使模型将复杂推理内化为紧凑形式;推理阶段则省略详细解释,仅保留简洁推理,并引入WISE-S策略——通过在用户查询中注入强调简洁性的提示指令,缓解分布偏移问题,从而实现高效且准确的推理。
链接: https://arxiv.org/abs/2604.02040
作者: Qing Zhou,Shiyu Zhang,Yuyu Jia,Junyu Gao,Weiping Ni,Junzheng Wu,Qi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of thinking twice: once for learning, once for speed. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user’s query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly 5×, from 112 to just 23 tokens. Code is available at this https URL.
[CV-37] IndoorCrowd: A Multi-Scene Dataset for Human Detection Segmentation and Tracking with an Automated Annotation Pipeline
【速读】:该论文旨在解决室内复杂场景下人类行为理解的挑战,特别是在监控、智能建筑和人机交互等领域中,现有数据集难以大规模捕捉真实室内环境的多样性与复杂性。其解决方案的关键在于构建了一个多场景、高精度标注的室内人群数据集 IndoorCrowd,涵盖四个校园地点共31段视频(9,913帧,5 fps),提供逐实例的人类分割掩码,并通过控制子集评估三种基础模型(SAM3、GroundingSAM、EfficientGroundingSAM)的自动标注性能,同时提供用于多目标跟踪(MOT)的2,552帧子集以支持基于MOTChallenge格式的连续身份追踪。该数据集不仅填补了室内复杂场景下的数据空白,还为检测、分割和跟踪任务建立了基准,揭示了人群密度、尺度和遮挡等因素对算法性能的显著影响。
链接: https://arxiv.org/abs/2604.02032
作者: Sebastian-Ion Nae,Radu Moldoveanu,Alexandra Stefania Ghita,Adina Magda Florea
机构: National University of Science and Technology Politehnica Bucharest (布加勒斯特理工大学); Expleo (Expleo)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at Conference on Computer Vision and Pattern Recognition Workshops 2026
Abstract:Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises 31 videos (9,913 frames at 5 fps) with human-verified, per-instance segmentation masks. A 620-frame control subset benchmarks three foundation-model auto-annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen's κ, AP, precision, recall, and mask IoU. A further 2,552-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene. The project page is available at this https URL.
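基准用 mask IoU 等指标把自动标注与人工标注对齐评估。IoU 的定义是掩码交集与并集之比,下面是二值掩码上的最小实现:

```python
import numpy as np

def mask_iou(m1, m2):
    """两张二值实例掩码的交并比(intersection over union)。"""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return float(inter) / float(union) if union else 1.0

a = np.zeros((4, 4), dtype=bool); a[0:2, 0:2] = True  # 人工掩码,4 像素
b = np.zeros((4, 4), dtype=bool); b[1:3, 0:2] = True  # 自动标注掩码,4 像素
print(round(mask_iou(a, b), 3))  # 交 2 / 并 6 = 0.333
```

对控制子集中的每个实例计算这类指标,再配合 Cohen's κ 等一致性统计,即可量化各 foundation-model 自动标注器与人工标注的差距。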
[CV-38] Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
【速读】:该论文旨在解决自编码器(Autoencoder)在处理空间非均匀采样图像时因数据分布不均衡而导致的重建偏差问题,尤其在医学成像、生物学和物理学等领域中,稀有但重要的特征常被背景主导的区域所掩盖,导致模型偏向主流模式而丢失细粒度信息并产生模糊重建。其解决方案的关键在于两个互补机制:一是基于自熵(self-entropy)的损失函数,用于对统计上罕见的空间位置进行加权,提升模型对这些区域的关注;二是样本传播(Sample Propagation)机制,一种重放策略,在训练过程中选择性地将难重构样本重新暴露给模型,从而增强对稀有空间模式的学习能力。该方法显著优于传统针对监督分类设计的数据平衡策略,在多种真实世界数据集上实现了更一致且高质量的重建性能。
链接: https://arxiv.org/abs/2604.02031
作者: Alejandro Castañeda Garcia,Jan van Gemert,Daan Brinks,Nergis Tömen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance by two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance and on three diverse, uncontrolled real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of how data is represented within a batch and the value of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.
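摘要未给出 self-entropy 损失的具体公式;作为示意,下面用各空间位置前景频率的自信息(−log 频率)来近似"统计上罕见的位置获得更高权重"这一思想(`rarity_weights` 及公式均为假设,非论文的精确定义):

```python
import numpy as np

def rarity_weights(batch, eps=1e-8):
    """假设性草图:以前景出现频率的自信息作为每个空间位置的损失权重。

    batch: (N, H, W) 的二值图;某位置越罕见出现前景,权重越大。
    """
    freq = batch.mean(axis=0)   # 各位置前景出现的经验频率
    p = np.clip(freq, eps, 1.0)
    return -np.log(p)           # 自信息:罕见位置权重高,常见位置权重趋近 0

batch = np.ones((100, 2, 2))
batch[5:, 0, 0] = 0.0           # 位置 (0,0) 仅 5% 的样本出现前景
w = rarity_weights(batch)
print(round(w[0, 0], 2), round(w[1, 1], 2))  # 3.0 0.0
```

将这样的权重图逐位置乘到重建损失上,即可让模型把更多容量分配给被背景主导、但偶尔出现关键模式的坐标。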
[CV-39] Are VLMs Lost Between Sky and Space? LinkS2Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在无人机(UAV)与卫星协同空间智能任务中表现不足的问题,尤其是其在动态局部到全局空间映射能力上的缺失。现有基准测试仅限于孤立的无人机视频或静态卫星图像,无法评估跨视图推理所需的动态对齐能力。为此,作者提出了LinkS²Bench,这是首个专门用于评估VLMs广域动态跨视图空间智能的综合性基准,包含1,022分钟动态无人机影像与覆盖超200 km²的高分辨率卫星图像,并构建了17.9k个高质量问答对,涵盖感知、定位、关系和推理四个维度的细粒度任务。解决方案的关键在于设计了一个显式跨视图对齐适配器(Cross-View Alignment Adapter),实验证明该机制显著提升了模型性能,识别出准确的跨视图动态对齐是当前VLMs面临的核心瓶颈,也为未来复杂空间推理任务中的模型优化提供了有效路径。
链接: https://arxiv.org/abs/2604.02020
作者: Dian Liu,Jie Feng,Di Li,Yuhui Zheng,Guanbin Li,Weisheng Dong,Guangming Shi
机构: Xidian University (西安电子科技大学); Qinghai Normal University (青海师范大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS²Bench, the first comprehensive benchmark designed to evaluate VLMs’ wide-area, dynamic cross-view spatial intelligence. LinkS²Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km². Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS²Bench in advancing VLM adaptation for complex spatial reasoning.
[CV-40] Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
【速读】:该论文旨在解决开放词汇语义分割(open-vocabulary semantic segmentation)在遥感(remote sensing, RS)领域中,如何兼顾语言对齐的语义识别能力与细粒度的空间边界 delineation(划分)的问题。现有方法如CLIP虽具备强大的语义泛化能力,但其全局对齐的视觉表征难以捕捉结构细节;而引入RS预训练的DINO特征的方法则存在语义空间被破坏的风险,因未能精准定位结构增强需求。解决方案的关键在于提出一种“解耦-修正”框架DR-Seg:首先基于关键观察——CLIP特征通道具有功能异质性而非均匀语义空间——将CLIP特征解耦为语义主导和结构主导子空间,从而实现对结构信息的靶向增强而不干扰语义一致性;随后通过先验驱动的图修正模块注入高保真结构先验,并结合不确定性引导的自适应融合模块动态整合优化分支与原始CLIP分支,最终实现更精确且语义一致的分割结果。
链接: https://arxiv.org/abs/2604.02010
作者: Jie Feng,Fengze Li,Junpeng Zhang,Siyu Chen,Yuping Liang,Junying Chen,Ronghua Shang
机构: Xidian University (西安电子科技大学); Jimei University (集美大学); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP’s semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.
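The uncertainty-guided adaptive fusion idea above can be sketched in a few lines of NumPy. Note that the entropy-based pixel weighting below is our own illustrative assumption, not DR-Seg's actual module: where the structure-refined branch is uncertain (high predictive entropy), the fused output falls back toward the original CLIP branch.

```python
import numpy as np

def entropy(p, eps=1e-8):
    """Per-pixel predictive entropy of a softmax map p: [C, H, W]."""
    return -np.sum(p * np.log(p + eps), axis=0)

def uncertainty_guided_fusion(p_clip, p_refined):
    """Blend the original CLIP branch with the structure-refined branch
    pixel-wise: the less confident (higher-entropy) the refined branch is,
    the more we fall back to CLIP. Entropy weighting is an illustrative
    assumption, not DR-Seg's exact fusion module."""
    u = entropy(p_refined) / np.log(p_refined.shape[0])  # normalize to [0, 1]
    w = 1.0 - u                                          # refined-branch confidence
    return w[None] * p_refined + (1.0 - w)[None] * p_clip

C, H, W = 4, 8, 8
p_clip = np.full((C, H, W), 1.0 / C)                 # uniform CLIP output
p_refined = np.zeros((C, H, W)); p_refined[0] = 1.0  # confident refined output
fused = uncertainty_guided_fusion(p_clip, p_refined)
```

Since the refined branch here is fully confident, the fused map follows it almost exactly; a uniform (maximally uncertain) refined branch would instead reproduce the CLIP branch.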
[CV-41] Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
【速读】:该论文旨在解决大规模数字表面模型(Digital Surface Model, DSM)中普遍存在不完整或过时区域的问题,这些问题通常由数据获取限制、重建伪影或建成环境变化引起。传统基于空间插值的方法假设高度具有空间连续性,因此在缺失物体区域表现不佳;而现有基于学习的方法虽提升了重建质量,但依赖于特定传感器数据的监督训练,泛化能力受限。解决方案的关键在于提出 Prior2DSM——一个无需训练的测试时框架,通过融合自监督视觉Transformer(Vision Transformer, ViT)特征(来自DINOv3)与单目深度基础模型(monocular depth foundation model),利用语义特征空间对应关系将不完整的高度先验信息传播至缺失区域,并结合参数高效的低秩适应(Low-Rank Adaptation, LoRA)与轻量级多层感知机(MLP)实现测试时适应(Test-Time Adaptation, TTA),以预测空间变化的尺度和偏移参数,从而将相对深度估计转换为度量高度。该方法显著降低重建误差并保持结构保真度,相较线性拟合的单目深度估计模型,RMSE最高减少46%。
链接: https://arxiv.org/abs/2604.02009
作者: Osher Rafaeli,Tal Svoray,Ariel Nahlieli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
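The scale-and-shift conversion step can be illustrated with a minimal NumPy sketch. For simplicity this fits a single global (s, t) by least squares against the valid prior pixels, whereas the paper predicts spatially varying parameters with a LoRA-adapted MLP:

```python
import numpy as np

def fit_scale_shift(rel_depth, prior_height, valid_mask):
    """Fit global scale s and shift t so that s * rel_depth + t matches
    the metric height prior on valid pixels (least squares). A simplified
    global stand-in for the paper's spatially varying scale/shift."""
    d = rel_depth[valid_mask].ravel()
    h = prior_height[valid_mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)   # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, h, rcond=None)
    return s, t

# toy example: relative depth known only up to an affine transform
rng = np.random.default_rng(0)
rel = rng.uniform(0, 1, size=(32, 32))
metric = 12.0 * rel + 3.0                    # "true" heights in metres
mask = rng.uniform(size=rel.shape) < 0.2     # sparse, incomplete prior
s, t = fit_scale_shift(rel, metric, mask)
completed = s * rel + t                      # fills the missing regions
```

Because the toy relation is exactly affine, the recovered (s, t) matches the ground truth; with real monocular depth the fit is only approximate, which is what motivates the spatially varying version.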
[CV-42] ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
【速读】:该论文旨在解决仅从航空影像(aerial-only imagery)生成真实感的地表视角图像和一致的三维场景模型的问题,其核心挑战在于极端视点变化、中间观测缺失以及大规模尺度差异。现有方法要么在后期优化渲染结果但常导致几何不一致,要么依赖多高度的地面真值数据(multi-altitude ground-truth),而这类数据通常不可获得。论文提出了一种名为ProDiG(Progressive Altitude Gaussian Splatting)的扩散引导框架,其关键创新在于:1)通过渐进式变换机制将航空视角的3D表示逐步调整至地表级保真度;2)引入几何感知的因果注意力模块,在参考视图扩散过程中注入极线结构信息以增强几何一致性;3)设计距离自适应高斯模块(distance-adaptive Gaussian module),根据相机距离动态调节高斯核的尺度与透明度,从而稳定跨越大视点间隙的重建过程。上述组件协同作用,使ProDiG无需额外地面真值即可实现高质量、几何合理的地表视角生成。
链接: https://arxiv.org/abs/2604.02003
作者: Sirshapan Mitra,Yogesh S. Rawat
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.
[CV-43] MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction ICME2026
【速读】:该论文旨在解决多任务密集预测(Multi-task dense prediction)中跨任务交互建模的难题,尤其是传统自注意力机制在高分辨率特征上存在二次计算复杂度的问题。解决方案的关键在于提出一种线性复杂度的多任务线性语义交互网络(MTLSI-Net),其核心创新包括:1)多任务多尺度查询线性融合模块,通过共享全局上下文矩阵实现跨任务依赖的多尺度建模;2)语义Token蒸馏模块,压缩冗余特征并提炼关键跨任务知识;3)跨窗口集成注意力模块,利用双分支结构将全局语义注入局部特征,兼顾全局一致性与空间精度。上述设计使模型能够在保持线性计算复杂度的同时显著减少参数量,并实现最优的多任务性能。
链接: https://arxiv.org/abs/2604.01995
作者: Chen Liu,Hengyu Man,Xiaopeng Fan,Debin Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICME 2026, to be published
Abstract:Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
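The linear-complexity interaction rests on the standard linear-attention identity: instead of materializing the N×N attention map, one accumulates a d×d global context matrix KᵀV and reads it out per query, giving O(N·d²) cost instead of O(N²·d). A generic sketch (using the common elu+1 feature map; MTLSI-Net's exact block may differ):

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Softmax-free linear attention: build a d x d shared global context
    matrix K^T V once, then read it out per query."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)
    context = Kp.T @ V                    # [d, d] shared global context
    norm = Qp @ Kp.sum(axis=0) + eps      # [N] per-query normalizer
    return (Qp @ context) / norm[:, None]

N, d = 64, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
```

The output is mathematically identical to computing the full N×N positive-kernel attention and averaging V, just factored in the cheaper order; the context matrix is what a multi-task design can share across task queries.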
[CV-44] Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation
【速读】:该论文旨在解决物理驱动的4D动态模拟中一个被忽视的矛盾:可靠的运动监督通常依赖于在线视频扩散或光流(optical flow)管道,其计算成本超过模拟器本身;同时,现有方法通过仅优化部分材料参数来简化逆向物理建模,限制了复杂材料与动态场景中的真实性。解决方案的关键在于提出Resonance4D框架,该框架通过轻量但物理表达能力强的监督机制将3D高斯泼溅(3D Gaussian Splatting)与物质点法(Material Point Method, MPM)耦合,并引入双域运动监督(Dual-domain Motion Supervision, DMS),在互补域中联合约束局部形变的空间结构一致性与振荡及全局动态模式的频域谱一致性,从而显著降低训练成本和内存开销,同时保留物理意义明确的运动特征;此外,结合零样本文本提示分割与仿真引导初始化,实现高斯点的物体部件级区域自动分解,支持全材料参数联合优化,最终在单个消费级GPU上实现高保真物理驱动的4D动态模拟,峰值显存从35 GB降至约20 GB。
链接: https://arxiv.org/abs/2604.01994
作者: Changshe Zhang,Jie Feng,Siyu Chen,Guanbin Li,Ronghua Shang,Junpeng Zhang
机构: Xidian University (西安电子科技大学); Jimei University (集美大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters. Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35 GB to around 20 GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.
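The frequency-domain half of the dual-domain supervision can be illustrated by comparing magnitude spectra of per-point displacement trajectories. The loss below is an illustrative stand-in, not the paper's exact objective; matching magnitudes penalizes wrong oscillation frequencies and amplitudes while being insensitive to phase offsets:

```python
import numpy as np

def spectral_consistency_loss(traj_pred, traj_ref):
    """Toy frequency-domain motion loss: MSE between the magnitude
    spectra (rFFT along time) of trajectories shaped [T, P]."""
    spec_p = np.abs(np.fft.rfft(traj_pred, axis=0))
    spec_r = np.abs(np.fft.rfft(traj_ref, axis=0))
    return float(np.mean((spec_p - spec_r) ** 2))

T = 128
t = np.arange(T)
ref = np.sin(2 * np.pi * 4 * t / T)[:, None]               # reference oscillation
same_freq = np.sin(2 * np.pi * 4 * t / T + 0.7)[:, None]   # phase-shifted copy
wrong_freq = np.sin(2 * np.pi * 9 * t / T)[:, None]        # different frequency

loss_same = spectral_consistency_loss(same_freq, ref)
loss_wrong = spectral_consistency_loss(wrong_freq, ref)
```

The phase-shifted trajectory incurs essentially zero loss, while the wrong frequency is heavily penalized — exactly the behavior wanted for supervising oscillatory dynamics without dense frame-level targets.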
[CV-45] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中因视觉注意力惯性(visual inertia)导致的认知幻觉问题,即模型在早期解码阶段后注意力趋于静态,无法动态支持对象间关系的推理,从而影响组合理解能力。解决方案的关键在于提出一种无需训练的惯性感知视觉激励(Inertia-aware Visual Excitation, IVE)方法,其核心机制是将认知推理建模为视觉注意力对历史趋势的动态响应:通过识别相对于历史注意力模式动态变化的视觉token并抑制惯性行为,同时引入惯性感知惩罚项以限制注意力在局部区域的过度集中和持续停留,从而提升模型对复杂语义关系的推理能力。
链接: https://arxiv.org/abs/2604.01989
作者: Boyang Gong,Yu Zheng,Fanye Kong,Jie Zhou,Jiwen Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
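The "emerging vs. inertial" token split can be sketched as comparing current attention against an exponential moving average of past decoding steps. The fixed-ratio threshold below is our own assumption, not IVE's exact selection rule:

```python
import numpy as np

def split_emerging_inertial(attn_history, attn_now, ema_decay=0.8, ratio=1.5):
    """Classify visual tokens: tokens whose current attention jumps above
    `ratio` times their EMA trend are 'emerging'; the rest are inertial."""
    trend = attn_history[0]
    for a in attn_history[1:]:
        trend = ema_decay * trend + (1 - ema_decay) * a
    emerging = attn_now > ratio * trend
    return emerging, ~emerging

history = [np.array([0.4, 0.4, 0.1, 0.1])] * 5   # settled (inertial) pattern
now = np.array([0.38, 0.12, 0.40, 0.10])          # token 2 suddenly rises
emerging, inertial = split_emerging_inertial(history, now)
```

Token 2 is flagged as dynamically emerging (its attention jumps from the 0.1 trend to 0.40), while the persistently high tokens 0 and 1 are treated as inertial — the kind of tokens IVE would excite and penalize, respectively.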
[CV-46] Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
【速读】:该论文旨在解决医学影像领域中放射科医生工作负荷日益增长且不可持续的问题,通过构建更高效、高质量的多模态基础模型(Foundation Models, FMs)来提升CT和MRI图像分析能力。其核心解决方案在于提出Curia-2框架,显著优化了原始预训练策略与表征质量,首次将视觉Transformer架构扩展至百亿参数级别用于多模态CT/MRI基础模型,并通过重构CuriaBench评测体系,设立2D切片级与3D体素级两个独立评估赛道,从而更精准地衡量模型在视觉任务与临床复杂任务(如病灶检测)中的性能表现。
链接: https://arxiv.org/abs/2604.01987
作者: Antoine Saporta,Baptiste Callard,Corentin Dancette,Julien Khlaut,Charles Corbière,Leo Butsanets,Amaury Prat,Pierre Manceron
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
[CV-47] Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
【速读】:该论文旨在解决现有视觉跟踪器在实际应用中缺乏人机协同适应能力的问题,即传统跟踪方法通常采用“一次性触发、无需交互”的模式,难以满足需要人类实时干预的场景需求。其核心解决方案是提出**交互式跟踪(Interactive Tracking)**新范式,并构建首个大规模基准数据集InteractTrack,包含150个带密集边界框标注和时间戳语言指令的视频序列;同时设计了全面的评估协议以验证模型在交互场景下的性能表现,并引入**交互式记忆增强跟踪(Interactive Memory-Augmented Tracking, IMAT)**作为基线方法,通过动态记忆机制学习用户反馈并自适应调整跟踪行为,从而实现更智能、可协作的跟踪系统。
链接: https://arxiv.org/abs/2604.01974
作者: Yuqing Huang,Guotian Zeng,Zhenqiao Yuan,Zhenyu He,Xin Li,Yaowei Wang,Ming-Hsuan Yang
机构: Pengcheng Laboratory (鹏程实验室); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); South China University of Technology (华南理工大学); Pazhou Lab (黄埔实验室); UC Merced (加州大学默塞德分校); Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at this https URL.
[CV-48] NearID: Identity Representation Learning via Near-identity Distractors
【速读】:该论文旨在解决现有视觉编码器在面向身份的任务(如个性化生成和图像编辑)中,因将对象身份与背景上下文混淆而导致表示不可靠的问题。其核心解决方案是引入一种基于“近似身份”(Near-identity, NearID)干扰项的原理性框架,通过在与参考图像完全相同的背景下放置语义相似但不同的实例,从而消除上下文捷径,仅保留身份作为区分信号。关键创新在于构建了包含19K身份和316K匹配上下文干扰项的NearID数据集,并采用严格的边缘阈值评估协议(Sample Success Rate, SSR),在此设定下提出两阶段对比目标,在冻结主干网络上学习身份感知表示,实现从原始SSR 30.7%提升至99.2%,显著改善身份判别能力并增强与人类判断的一致性。
链接: https://arxiv.org/abs/2604.01973
作者: Aleksandar Cvejic,Rameen Abdal,Abdelrahman Eldesokey,Bernard Ghanem,Peter Wonka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code at this https URL
Abstract:When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: this https URL
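The two-tier hierarchy can be sketched with two hinge terms on cosine similarities. This is a margin-ranking simplification of the paper's batch-wise contrastive objective, using made-up 3-D embeddings purely for illustration:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_tier_margin_loss(anchor, positive, near_distractor, random_neg,
                         m1=0.2, m2=0.2):
    """Enforce sim(anchor, positive) > sim(anchor, near distractor) + m1
    and sim(anchor, near distractor) > sim(anchor, random negative) + m2
    via two hinge terms."""
    s_pos = cos(anchor, positive)
    s_near = cos(anchor, near_distractor)
    s_rand = cos(anchor, random_neg)
    return max(0.0, m1 - (s_pos - s_near)) + max(0.0, m2 - (s_near - s_rand))

a = np.array([1.0, 0.0, 0.0])
pos = np.array([0.98, 0.05, 0.0])    # same identity, other view
near = np.array([0.6, 0.8, 0.0])     # similar object, same background
rand = np.array([0.0, 0.0, 1.0])     # unrelated image
loss_good = two_tier_margin_loss(a, pos, near, rand)           # correct ordering
loss_bad = two_tier_margin_loss(a, near, pos, rand)            # distractor ranked above positive
```

A correctly ordered triple incurs zero loss; ranking the NearID distractor above the true match is penalized, which is precisely the failure mode the benchmark exposes in pre-trained encoders.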
[CV-49] SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions
【速读】:该论文旨在解决短文本条件下的3D室内场景生成中存在的物理合理性不足与细节丰富度欠缺的问题,其核心挑战在于现有方法依赖显式的语义线索(如物体及其空间关系)进行场景构建,导致在输入信息稀疏时难以生成结构合理且语义连贯的3D场景。解决方案的关键在于提出SDesc3D框架,通过引入多视角结构先验增强(Multi-view scene prior augmentation)来弥补文本输入的不足,将不可获取的语义关系线索转化为多视角关系先验聚合;同时设计功能感知布局锚定(Functionality-aware layout grounding),利用区域功能语义作为隐式空间锚点,并结合分层布局推理机制提升场景组织能力;最后采用迭代反思-修正机制(Iterative reflection-rectification scheme)实现结构合理性逐步优化,从而显著增强生成场景的物理可实现性与细节完整性。
链接: https://arxiv.org/abs/2604.01972
作者: Jie Feng,Jiawei Shen,Junjia Huang,Junpeng Zhang,Mingtao Feng,Weisheng Dong,Guanbin Li
机构: Xidian University (西安电子科技大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial reasoning. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual descriptions. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation. Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic coherence. Finally, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via reflection and rectification. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.
[CV-50] Ego-Grounding for Personalized Question-Answering in Egocentric Videos CVPR’26
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在个性化问答任务中缺乏对第一人称视角(ego-grounding)理解能力的问题,尤其是在长时序的自指视频(egocentric videos)中,模型难以准确识别、记忆并推理与“我”相关的信息。其解决方案的关键在于构建了首个专门用于评估MLLMs在自指视频中进行个性化问答能力的数据集MyEgo,该数据集包含541段长视频和5000个关于“我的物品”“我的活动”及“我的过去”的个性化问题。通过系统性基准测试发现,当前主流MLLMs(包括开源与闭源、小规模与大规模模型)在MyEgo上表现不佳,且显式推理或模型规模扩展并未带来一致性能提升,表明现有模型在跟踪和长期记忆“自我”信息方面存在显著局限。这一工作凸显了ego-grounding和长程记忆在实现自指视频个性化辅助中的核心作用,并为后续研究提供了关键基准与方向。
链接: https://arxiv.org/abs/2604.01966
作者: Junbin Xiao,Shenglang Zhang,Pengxiang Zhu,Angela Yao
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: To appear at CVPR’26
Abstract:We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs’ ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about “my things”, “my activities”, and “my past”. Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50% respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering “me” and “my past”. These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at this https URL
[CV-51] Automated Prostate Gland Segmentation in MRI Using nnU-Net
【速读】:该论文旨在解决多参数磁共振成像(multiparametric MRI, mpMRI)中前列腺腺体自动分割的准确性问题,以支持临床和科研应用如图像配准、体积估算及放射组学分析。传统人工勾画耗时且存在观察者间差异,而通用分割工具在前列腺特异性任务中表现不足。解决方案的关键在于提出一种基于nnU-Net v2框架的专用深度学习方法,充分利用mpMRI中的多模态信息(包括T2加权成像、扩散加权成像(DWI)和表观扩散系数(ADC)图),从而实现高精度分割。模型在PI-CAI数据集上训练,并通过5折交叉验证和外部独立队列验证,分别获得0.96和0.82的平均Dice分数,显著优于通用工具TotalSegmentator(Dice=0.15),验证了任务特定、多模态策略的重要性与有效性。
链接: https://arxiv.org/abs/2604.01964
作者: Pablo Rodriguez-Belenguer,Gloria Ribas,Javier Aquerreta Escribano,Rafael Moreno-Calatayud,Leonor Cerda-Alberich,Luis Marti-Bonmati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 2 tables, 1 figure
Abstract:Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows. To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
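For reference, the Dice coefficient reported above (0.96 in cross-validation, 0.82 external, 0.15 for TotalSegmentator) is the standard overlap metric between binary masks; the snippet below shows its definition, not code from the paper:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

# toy gland mask vs. an under-segmented prediction
gt = np.zeros((16, 16), dtype=bool); gt[4:12, 4:12] = True    # 64-px "gland"
seg = np.zeros((16, 16), dtype=bool); seg[4:12, 4:10] = True  # misses 2 columns
score = dice_score(seg, gt)   # 2*48 / (48 + 64) = 6/7 ≈ 0.857
```

Under-segmentation of the gland, the failure mode noted for the general-purpose tool, directly shrinks the intersection term and drives the score toward zero.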
[CV-52] MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction
【速读】:该论文旨在解决现有红外与可见光视频融合方法在处理帧间运动时存在Temporal Consistency(时间一致性)不足以及计算效率低的问题。当前方法虽通过跨帧交互提升一致性,但往往引入高计算开销;而传统静态图像融合方法无法有效应对视频序列中的动态变化。解决方案的关键在于提出一种端到端的视频融合框架MAVFusion,其核心创新是引入了运动感知稀疏交互机制(motion-aware sparse interaction mechanism):利用光流(optical flow)识别多模态序列中的动态区域,仅在这些稀疏区域施加计算密集型的跨模态注意力机制以捕捉显著变化并促进模态间信息交换;而对于静态背景区域,则采用轻量级弱交互模块维持结构和外观完整性。该解耦策略在保证高质量融合结果的同时显著提升了推理速度,实验证明其在多个基准上达到SOTA性能且推理速度达14.16 FPS(640×480分辨率)。
链接: https://arxiv.org/abs/2604.01958
作者: Xilai Li,Weijun Jiang,Xiaosong Li,Yang Liu,Hongbin Wang,Tao Ye,Huafeng Li,Haishu Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16 FPS at 640×480 resolution. The source code will be available at this https URL.
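The motion-aware sparse routing can be sketched as thresholding optical-flow magnitude into a dynamic mask and sending only those pixels through the expensive path. The "heavy" path here is a stand-in max operator rather than the paper's cross-modal attention, and the threshold is a made-up constant:

```python
import numpy as np

def motion_aware_route(flow, feat_a, feat_b, thresh=1.0):
    """Route fusion by flow magnitude: pixels moving faster than `thresh`
    get the (stand-in) strong-interaction path, static pixels a cheap
    average. Only illustrates the sparse routing idea."""
    mag = np.linalg.norm(flow, axis=-1)           # [H, W] flow magnitude
    dynamic = mag > thresh
    fused = 0.5 * (feat_a + feat_b)               # cheap static path
    fused[dynamic] = np.maximum(feat_a, feat_b)[dynamic]  # heavy dynamic path
    return fused, dynamic

H, W = 8, 8
flow = np.zeros((H, W, 2)); flow[2:4, 2:4] = 3.0   # one small moving region
ir = np.full((H, W), 0.2); vis = np.full((H, W), 0.8)
fused, dynamic = motion_aware_route(flow, ir, vis)
```

Only 4 of 64 pixels take the expensive path, which is the source of the speedup: the cost of the strong interaction scales with the dynamic area, not the frame size.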
[CV-53] A Self supervised learning framework for imbalanced medical imaging datasets
【速读】:该论文旨在解决医学图像分类中两个关键问题:一是标注数据量不足,二是类别分布极度不平衡(即常见类样本丰富而罕见类样本稀缺)。针对这些问题,作者提出了一种改进的自监督学习(Self-Supervised Learning, SSL)方法——对先前提出的MIMV方法进行扩展,引入一种新的增强策略以构建不对称多图像、多视角(Asymmetric Multi-Image, Multi-View, AMIMV)样本对。其核心创新在于通过AMIMV机制在有限标注条件下增强模型对罕见类别的表征能力,同时提升SSL在长尾分布下的鲁棒性。实验表明,在MedMNIST数据集上,该方法在retinaMNIST、tissueMNIST和DermaMNIST三个子任务上分别实现了4.25%、1.88%和3.1%的性能提升。
链接: https://arxiv.org/abs/2604.01947
作者: Yash Kumar Sharma,Charan Ramtej Kodi,Vineet Padmanabhan
机构: University of Hyderabad (海得拉巴大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.
[CV-54] Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
【速读】:该论文旨在解决早期儿童教育(Early Childhood Education, ECE)场景下图像字幕生成任务中存在的两个核心问题:一是缺乏大规模、领域特定的数据集,导致模型难以捕捉ECE场景中细粒度的语义概念,从而生成泛化且不精确的描述;二是传统训练范式在提升专业物体识别能力方面存在局限,监督学习倾向于高频表达,而强化学习在困难样本上优化不稳定。解决方案的关键在于提出ECAC(ECE Daily Activity Image Captioning Benchmark),一个包含256,121张真实世界图像及专家级字幕与细粒度标签的大规模基准数据集,并设计了面向领域的评估指标——教学玩具识别分数(Teaching Toy Recognition Score, TTS),以量化专业物体命名准确性;同时引入RSRS(Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning)混合训练框架,通过动态切换强化学习与监督微调策略,在零奖励难样本上回退至监督微调,有效缓解优势崩溃(advantage collapse),实现稳定优化,最终构建出针对教育场景适配的多模态大语言模型KinderMM-Cap-3B,在TTS指标上达到51.06,显著优于现有方法。
链接: https://arxiv.org/abs/2604.01941
作者: Sixing Li,Zhibin Gu,Ziqi Zhang,Weiguo Pan,Bing Li,Ying Wang,Hongzhe Liu
机构: Beijing Union University (北京联合大学); Hebei Normal University (河北师范大学); Chinese Academy of Sciences (中国科学院); National Laboratory of Pattern Recognition (国家模式识别实验室); University of Chinese Academy of Sciences (中国科学院大学); PeopleAI, Inc. (PeopleAI公司); People Youhe Education Technology Co., Ltd. (人和教育科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model’s ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications. 
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI). Cite as: arXiv:2604.01941 [cs.CV]. https://doi.org/10.48550/arXiv.2604.01941
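上文 RSRS 的奖励条件切换机制,可以用一个按样本路由的损失草图来说明:奖励为零(优势将坍缩)的难样本回退到监督损失,其余样本保留 RL 目标。以下仅为示意性实现,函数名与损失形式均为假设,并非论文原始代码:

```python
import numpy as np

def rsrs_loss(rewards, rl_losses, sft_losses):
    """Route zero-reward (hard) samples to SFT; the rest keep the RL objective.

    rewards, rl_losses, sft_losses: per-sample 1-D arrays.
    """
    rewards = np.asarray(rewards, dtype=float)
    use_sft = rewards == 0.0  # advantage would collapse for these samples
    per_sample = np.where(use_sft, sft_losses, rl_losses)
    return per_sample.mean(), use_sft

# 三个样本:第一个奖励为零,被路由到 SFT 损失
loss, routed = rsrs_loss([0.0, 1.0, 0.5], [0.2, 0.3, 0.4], [0.9, 0.8, 0.7])
```

实际系统中的路由判据(例如组内奖励全零)与两类损失的具体形式应以论文为准。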
[CV-55] Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因域间分布差异导致的模型泛化性能下降问题,即在训练与测试数据存在观测条件或环境因素变化时,现有方法由于过度拟合特定域的特征而无法有效识别跨域小目标。其解决方案的关键在于提出一种空间-谱协同感知网络(S²CPNet),通过三个核心模块实现:首先,基于频域视角揭示谱相位不一致性是域差异的主要表现,并设计相位校正模块(Phase Rectification Module, PRM)以提取具有泛化能力的目标感知特征;其次,在跳跃连接中引入正交注意力机制(Orthogonal Attention Mechanism, OAM),在保留位置信息的同时优化关键特征表示;最后,采用选择性风格重构(Selective Style Recomposition, SSR)策略缓解对域特定模式的偏差。该方法在多个跨域场景下均实现了最先进的检测性能。
链接: https://arxiv.org/abs/2604.01934
作者: Yimin Fu,Songbo Wang,Feiyan Wu,Jialin Lyu,Zhunga Liu,Michael K. Ng
机构: Hong Kong Baptist University (香港浸会大学); Northwestern Polytechnical University (西北工业大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code will be released at this https URL upon acceptance
Abstract:The accurate target-background separation in infrared small target detection (IRSTD) highly depends on the discriminability of extracted representations. However, most existing methods are confined to domain-consistent settings, while overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S²CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR). Extensive experiments have been conducted on three IRSTD datasets, and the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.
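上述频域视角的核心是把特征的频谱分解为幅度与相位两部分。下面给出该分解的一个最小 numpy 草图(论文中的相位校正模块 PRM 是在此基础上学习校正,此处不再现):

```python
import numpy as np

def decompose_spectrum(feat):
    """2-D FFT of a feature map, split into amplitude and phase components."""
    spec = np.fft.fft2(feat)
    return np.abs(spec), np.angle(spec)

def recompose(amplitude, phase):
    """Rebuild the spatial-domain signal from (possibly rectified) amplitude/phase."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
```

从自身的幅度与相位重建可精确还原输入;PRM 的作用可理解为在送入 `recompose` 前对相位分量做可学习的校正。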
[CV-56] Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
【速读】:该论文旨在解决汽车雷达感知流水线中传统依赖波束成形(beamforming)构建角度域表示的问题,核心疑问是:是否可以直接从预波束成形的单天线距离-多普勒(range-Doppler, RD)张量中学习到有意义的空间结构?其解决方案的关键在于设计了一个双啁啾共享权重编码器(dual-chirp shared-weight encoder),以端到端、全数据驱动的方式处理原始RD张量,并通过鸟瞰图(bird’s-eye-view, BEV)占用作为几何探针进行空间可恢复性评估;监督信号为可见性感知且跨模态,基于LiDAR并显式建模雷达视场和通过射线追踪实现的遮挡感知LiDAR可观测性,从而验证了无需显式角度域构造或手工信号处理阶段即可直接从原始雷达数据中学习空间结构的有效性。
链接: https://arxiv.org/abs/2604.01921
作者: George Sebastian,Philipp Berthold,Bianca Forkel,Leon Pohl,Mirko Maehlisch
机构: University of the Bundeswehr Munich (联邦国防军大学慕尼黑分校); dtec.bw – Digitalization and Technology Research Center of the Bundeswehr (联邦国防军数字化与技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird’s-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
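遮挡感知的 LiDAR 监督依赖基于射线的可见性判断:沿每条射线,首次回波之前的单元视为可观测空闲,回波单元为占据,其后单元为未知(被遮挡)。以下是单条射线的简化草图,论文中的视场建模更为完整,此处仅作示意:

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, 2

def ray_visibility(num_cells, return_cell):
    """Label cells along one ray given the index of the first LiDAR return
    (None if the ray saw no return within range)."""
    labels = np.full(num_cells, UNKNOWN, dtype=int)
    if return_cell is None:
        labels[:] = FREE  # no return: the whole ray is observed free
        return labels
    labels[:return_cell] = FREE
    labels[return_cell] = OCCUPIED
    return labels  # cells behind the return stay UNKNOWN (occluded)
```

BEV 监督只在 FREE/OCCUPIED 单元上计算损失,UNKNOWN 单元不参与,从而避免把遮挡区域误当作负样本。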
[CV-57] Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
【速读】:该论文旨在解决医学视觉定位(Medical Visual Grounding, MVG)中视觉定位精度不足的问题,尤其是在依赖潜在嵌入表示的视觉语言模型(Vision-Language Models, VLMs)中,由于缺乏显式的空间定位先验导致定位模糊。解决方案的关键在于提出KnowMVG框架,其核心创新包括:1)一种知识增强的提示策略,将与短语相关的医学知识编码为紧凑嵌入以增强语义理解;2)一种全局-局部注意力机制,协同利用粗粒度全局信息和精细局部线索来引导精确区域定位。该设计在不增加额外文本推理开销的前提下,有效连接高层语义理解和细粒度视觉感知,显著提升了定位准确性和可解释性。
链接: https://arxiv.org/abs/2604.01915
作者: Yifan Gao,Tao Zhou,Yi Zhou,Ke Zou,Yizhe Zhang,Huazhu Fu
机构: Nanjing University of Science and Technology (南京理工大学); Southeast University (东南大学); National University of Singapore (新加坡国立大学); Institute of High-Performance Computing, Agency for Science, Technology and Research (新加坡科技研究局高性能计算研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
Abstract:Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
[CV-58] Lifting Unlabeled Internet-level Data for 3D Scene Understanding CVPR2026
【速读】:该论文旨在解决3D场景理解中高质量标注数据稀缺且获取成本高昂的问题,同时利用互联网上大量易得的未标注视频数据来辅助训练。其解决方案的关键在于设计精心构造的数据引擎(data engines),通过自动处理和生成训练数据,使端到端模型能够在与人工标注数据集协同训练时提升性能。研究识别并分析了自动化数据生成中的瓶颈,揭示了影响从无标签数据中学习效率和效果的关键因素,并在低层次感知(如3D目标检测与实例分割)到高层次推理任务(如3D空间视觉问答VQA和视觉语言导航VLN)上验证了所生成数据的有效性,表明该方法可显著提升模型的零样本性能并进一步通过微调优化,为利用网络公开数据推动更强大的场景理解系统提供了可行路径。
链接: https://arxiv.org/abs/2604.01907
作者: Yixin Chen,Yaowei Zhang,Huangyue Yu,Junchao He,Yan Wang,Jiangyong Huang,Hongyu Shen,Junfeng Ni,Shaofei Wang,Baoxiong Jia,Song-Chun Zhu,Siyuan Huang
机构: State Key Laboratory of General Artificial Intelligence, BIGAI; Beijing University of Posts and Telecommunications; Peking University; Beijing Institute of Technology; Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026. Project page: this https URL
Abstract:Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
[CV-59] Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像识别在资源受限边缘设备上部署时面临的挑战,即大尺寸SAR图像导致深度学习模型计算复杂度高,而现有轻量级模型难以同时实现高精度特征提取与低计算开销。其解决方案的关键在于提出Light-ResKAN架构:首先,用可学习激活函数的Kolmogorov-Arnold Network (KAN)卷积替代传统卷积,提升特征提取的自适应性;其次,引入Gram多项式作为激活函数以更好捕捉SAR数据中的非线性关系;最后,采用通道共享参数策略,在保持每通道独特特征的同时显著降低参数量和浮点运算次数(FLOPs)。该方法在多个SAR数据集上实现了高精度识别,并大幅降低了计算资源消耗。
链接: https://arxiv.org/abs/2604.01903
作者: Pan Yi,Weijie Li,Xiaodong Chen,Jiehua Zhang,Li Liu,Yongxiang Liu
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 8 figures, accepted by JSTARS
Abstract:Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to 1024×1024 show that compared to VGG16, our model reduces FLOPs by 82.90× and parameters by 163.78×. This work establishes an efficient solution for edge SAR image recognition.
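KAN 式可学习激活用多项式基的可学习组合取代固定非线性,且每通道共享一组系数。下述草图以 Chebyshev 三项递推代替论文中的 Gram 多项式(基函数不同,但按通道共享系数的结构一致),仅作示意:

```python
import numpy as np

def poly_basis(x, degree):
    # Three-term recurrence: T0=1, T1=x, T_{k+1} = 2x*T_k - T_{k-1}
    # (Chebyshev here; the paper uses Gram polynomials instead).
    x = np.tanh(x)  # squash into [-1, 1], the polynomials' natural domain
    basis = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        basis.append(2 * x * basis[-1] - basis[-2])
    return np.stack(basis[: degree + 1], axis=-1)

def shared_kan_activation(x, coeffs):
    """x: (..., C) features; coeffs: (C, degree+1), one row shared per channel."""
    b = poly_basis(x, coeffs.shape[1] - 1)        # (..., C, degree+1)
    return np.einsum('...cd,cd->...c', b, coeffs)
```

每个通道只存一行系数,这正是论文降低参数量的共享策略;实际卷积核内的共享粒度以论文为准。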
[CV-60] FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation
【速读】:该论文旨在解决红外与可见光视频融合中同时保持时间稳定性与空间细节的难题。现有方法或侧重帧级增强而缺乏有效的时间建模,或依赖复杂的时空聚合机制却牺牲了高频细节。其解决方案的关键在于提出FTPFusion框架,通过频域感知机制将特征分解为高频与低频分量进行协同建模:高频分支采用稀疏跨模态时空交互以捕捉运动相关上下文和互补细节;低频分支引入时间扰动策略提升对闪烁、抖动及局部错位等复杂视频变化的鲁棒性;此外,设计偏移感知的时间一致性约束显式稳定受时间扰动影响的跨帧表示,从而在多个公开基准上实现空间保真度与时间一致性双优。
链接: https://arxiv.org/abs/2604.01900
作者: Xilai Li,Chusheng Fang,Xiaosong Li
机构: Foshan University (佛山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at this https URL.
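驱动两个分支的高/低频分解可以用一个简单的基于模糊的分解来示意:低频分量为局部均值,高频分量为残差,二者之和恒等于输入。FTPFusion 的分解是学习得到的,下面的固定滤波器仅作说明:

```python
import numpy as np

def freq_split(frame, k=5):
    """Split a 2-D frame into low-frequency (box-blurred) and high-frequency
    (residual) components; frame == low + high by construction."""
    pad = k // 2
    padded = np.pad(frame, pad, mode='edge')
    low = np.zeros_like(frame, dtype=float)
    h, w = frame.shape
    for i in range(h):
        for j in range(w):
            low[i, j] = padded[i:i + k, j:j + k].mean()
    high = frame - low
    return low, high
```

高频残差(边缘、纹理)交给稀疏跨模态交互分支,低频分量交给时间扰动分支,对应摘要中的双分支设计。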
[CV-61] SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes ICPR2026
【速读】:该论文旨在解决任意拓扑结构(genus-agnostic)三维形状的高效且高保真重建问题,传统方法在处理复杂几何细节时往往面临精度不足或计算效率低下的挑战。其解决方案的关键在于提出SHARC框架,通过在物体内部最优位置设置参考点(reference points),并利用球谐函数(Spherical Harmonic, SH)对距离场进行建模,从而实现对表面细节的精准捕捉;具体而言,该方法设计了一个联合优化目标函数,以最大化参考点的稀疏性(sparsity)与中心性(centrality),同时确保从这些点可见整个表面几何,随后采用射线投射采样和快速球谐变换(Fast Spherical Harmonic Transform, FSHT)计算SH系数,并引入可配置的低通滤波器与基于邻近性的局部一致性约束来提升重建精度与鲁棒性,最终在重建准确性和计算效率上均优于现有方法,同时保持模型简洁性(model parsimony)。
链接: https://arxiv.org/abs/2604.01894
作者: Panagiotis Sapoutzoglou,George Terzakis,Maria Pateraki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
备注: Accepted at ICPR 2026
Abstract:We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at this https URL.
[CV-62] ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
【速读】:该论文旨在解决遥感视觉定位(Remote Sensing Visual Grounding, RSVG)任务中因依赖句级视觉-语言对齐而导致的细粒度语义线索利用不足的问题,尤其在空间关系(spatial relations)和对象属性(object attributes)等关键语义信息上的建模能力有限,从而难以区分外观相似的目标。其解决方案的关键在于提出ProVG框架,通过将语言表达解耦为全局上下文、空间关系和对象属性三类语义模块,并设计一种简单而有效的渐进式跨模态调制器(progressive cross-modal modulator),以“概览-定位-验证”(survey-locate-verify)机制动态调节视觉注意力,实现从粗到精的视觉-语言对齐;同时引入跨尺度融合模块与语言引导校准解码器,进一步提升多尺度遥感图像中的对齐精度与定位准确性。
链接: https://arxiv.org/abs/2604.01893
作者: Ke Li,Ting Wang,Di Wang,Yongshan Zhu,Yiming Zhang,Tao Lei,Quan Wang
机构: Xidian University (西安电子科技大学); UC San Diego (加州大学圣地亚哥分校); SUST (江苏科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey-locate-verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
[CV-63] Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters CVPR
【速读】:该论文旨在解决当前文本到图像生成模型(text-to-image generative models)在部署过程中存在的安全漏洞问题,即尽管系统已配备安全过滤器和内容审核流水线以阻止有害或违反政策的内容生成,但这些机制仍易受到低努力量的“越狱攻击”(jailbreak attacks)的规避。解决方案的关键在于提出了一套无需模型访问、优化或对抗训练的提示词策略,通过引入五类视觉越狱技术——艺术重构、材料替换、伪教育框架、生活方式美学伪装和模糊动作替代——利用提示词审核与视觉安全过滤之间的语义理解鸿沟,将不当意图隐藏于看似无害的语义上下文中,从而有效绕过现有防护机制并稳定生成受限内容。实验表明,该方法在多个主流文本到图像系统中实现了高达74.47%的攻击成功率(Attack Success Rate, ASR)。
链接: https://arxiv.org/abs/2604.01888
作者: Ahmed B Mustafa,Zihan Ye,Yang Lu,Michael P Pound,Shreyank N Gowda
机构: University of Nottingham (诺丁汉大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Text-to-Image version of the Anyone can Jailbreak paper. Accepted in CVPR-W AIMS 2026
Abstract:Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.
[CV-64] GS2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在实际应用中因大量高斯点导致的高内存开销问题,同时避免现有基于剪枝的方法在节省内存时牺牲空间一致性并引入渲染伪影。其解决方案的关键在于提出一种图结构驱动的空间分布优化方法(GS²),通过三个核心机制实现:(1) 基于证据下界(Evidence Lower Bound, ELBO)的自适应加密策略,自动控制高斯点的密度增长过程;(2) 透明度感知的渐进式剪枝策略,动态移除低透明度高斯点以进一步降低内存占用;(3) 图结构特征编码模块,利用特征引导的点位移机制优化高斯点的空间分布。实验表明,GS²在仅使用约12.5%高斯点的情况下仍能实现比原始3DGS更高的峰值信噪比(PSNR),并在渲染质量与内存效率上全面优于对比基线。
链接: https://arxiv.org/abs/2604.01884
作者: Xianben Yang,Tao Wang,Yuxuan Li,Yi Jin,Haibin Ling
机构: Beijing Jiaotong University (北京交通大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS²), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS² achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
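透明度感知的渐进式剪枝,本质上是按调度反复丢弃透明度最低的高斯点。下面给出一个基于保留比例的最小草图(论文的具体调度与判据此处未复现):

```python
import numpy as np

def prune_by_opacity(opacities, keep_fraction):
    """Return sorted indices of the Gaussians kept: the top `keep_fraction`
    by opacity; the rest are dropped as contributing little to rendering."""
    n_keep = max(1, int(round(len(opacities) * keep_fraction)))
    order = np.argsort(opacities)[::-1]       # most opaque first
    return np.sort(order[:n_keep])

# 四个高斯点,保留透明度最高的一半
kept = prune_by_opacity(np.array([0.9, 0.05, 0.6, 0.01]), keep_fraction=0.5)
```

渐进式体现在训练中多轮调用并逐步降低 `keep_fraction`,而非一次性剪到目标规模。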
[CV-65] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes
【速读】:该论文旨在解决复杂3D Gaussian场景中细粒度可及性推理(affordance reasoning)的问题,即如何从文本指令中准确识别支持特定动作的区域。现有方法通常将此问题建模为基于静态场景观测的一次性预测任务,但其局限性在于忽视了在固定观测下任务相关证据不完整所导致的推理失败。解决方案的关键在于将可及性推理重新定义为一个顺序证据获取过程,通过迭代地融合互补的3D几何与2D语义证据逐步降低不确定性。为此,作者提出A3R框架,一种基于多模态大语言模型(MLLM)策略的代理式推理系统,能够自主选择证据获取动作并跨维度更新可及性信念;同时引入基于GRPO的策略学习机制以优化序列决策过程,从而显著提升复杂场景下的推理准确性和证据获取效率。
链接: https://arxiv.org/abs/2604.01882
作者: Di Li,Jie Feng,Guanbin Li,Ronghua Shang,Yuhui Zheng,Weisheng Dong,Guangming Shi
机构: Xidian University (西安电子科技大学); Sun Yat-sen University (中山大学); Qinghai Normal University (青海师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.
[CV-66] GeoAI Agency Primitives
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在地理信息系统(GIS)领域应用中的核心瓶颈问题:尽管卫星图像描述、视觉问答和可提示分割等技术取得进展,但这些能力尚未转化为 GIS 实践者在矢量图层、栅格地图及制图产品生产中的实际生产力提升。其根本原因在于缺乏一个连接基础模型与以地理空间数据为对象、人机协同迭代的工作流程之间的“代理层”(agency layer)。解决方案的关键在于提出一套包含 9 个基本原语(primitives)的词汇表,如导航、感知、地理参考记忆和双模型建模等,构建可实现、可测试且可比较的代理辅助机制,从而支撑 GIS 场景下人机协同的持续迭代与高效协作。
链接: https://arxiv.org/abs/2604.01869
作者: Akram Zaytar,Rohan Sawahn,Caleb Robinson,Gilles Q. Hacheme,Girmaw A. Tadesse,Inbal Becker-Reshef,Rahul Dodhia,Juan Lavista Ferres
机构: Microsoft AI for Good Lab (微软AI向善实验室); NASA Harvest (美国国家航空航天局Harvest项目)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present ongoing research on agency primitives for GeoAI assistants – core capabilities that connect Foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of 9 primitives for such a layer – including navigation, perception, geo-referenced memory, and dual modeling – along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.
[CV-67] MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation
【速读】:该论文旨在解决自回归(Autoregressive, AR)模型在文本到图像生成中面临的两大问题:一是生成图像质量难以满足人类偏好,二是对模糊提示(ambiguous prompts)的语义理解能力不足,导致输出缺乏多样性与合理性。解决方案的关键在于提出一种分层自回归框架 MAR-MAER,其核心由两个模块构成:一是基于度量感知嵌入正则化的方法,通过轻量级投影头结合自适应核回归损失函数,使模型内部表示更贴近人类偏好的指标(如 CLIPScore 和 HPSv2);二是引入条件变分模块,以可控随机性驱动分层标记生成过程,从而增强模型对模糊或开放性提示的语义灵活性与多样性输出能力。实验表明,该方法在 COCO 和新构建的模糊提示基准测试中显著优于基线 Hi-MAR 模型,在 CLIPScore 和 HPSv2 上分别提升 +1.6 和 +5.3,并在模糊输入下产生更广泛且一致的图像结果。
链接: https://arxiv.org/abs/2604.01864
作者: Kai Dong,Tingting Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AMME 2025
Abstract:Autoregressive (AR) models have demonstrated significant success in the realm of text-to-image generation. However, they usually face two major challenges. First, the generated images may not always meet the quality standards expected by humans. Second, these models struggle with ambiguous prompts that admit several valid interpretations. To address these issues, we introduce MAR-MAER, an innovative hierarchical autoregressive framework that combines two main components: a metric-aware embedding regularization method and a probabilistic latent model for handling ambiguous semantics. Our method utilizes a lightweight projection head trained with an adaptive kernel regression loss, which aligns the model’s internal representations with human-preferred quality metrics such as CLIPScore and HPSv2. As a result, the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module that injects controlled randomness into the hierarchical token generation process, allowing the model to produce a diverse array of coherent images from ambiguous or open-ended prompts. We conducted extensive experiments on COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility, exceeding the baseline Hi-MAR model by +1.6 in CLIPScore and +5.3 in HPSv2. For unclear inputs, it produces a notably wider range of outputs. These findings are confirmed by both human evaluation and automated metrics.
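度量感知的嵌入正则化可理解为:样本的质量分数应当能由其在嵌入空间中的近邻预测出来。下面用 Nadaraya-Watson 核回归给出一个示意性损失(带宽、核函数等均为假设,非论文原始设定):

```python
import numpy as np

def kernel_regression_loss(embeddings, scores, bandwidth=1.0):
    """Leave-one-out kernel regression: each sample's quality score should be
    predictable from its neighbours in embedding space.
    embeddings: (N, D); scores: (N,) metric values (e.g. HPSv2)."""
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    k = np.exp(-d2 / (2 * bandwidth ** 2))       # Gaussian kernel weights
    np.fill_diagonal(k, 0.0)                     # exclude the sample itself
    pred = (k @ scores) / (k.sum(1) + 1e-8)
    return float(((pred - scores) ** 2).mean())
```

当嵌入距离与质量分数差异一致时损失接近零;最小化该损失即把度量结构压入嵌入空间。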
[CV-68] Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation CVPR2026
【速读】:该论文旨在解决Temporal Action Segmentation (TAS)任务中因复杂模型架构导致的实际部署困难问题,同时提升细粒度分割质量。其解决方案的关键在于提出一种轻量级双损失训练框架,仅通过增加一个输出通道和两个辅助损失项即可实现性能提升:一是边界回归损失(boundary-regression loss),通过单通道边界预测促进精确的时间定位;二是基于累积分布函数(CDF)的段级正则化损失,通过匹配预测与真实段的累积分布来增强段内结构的一致性。该框架与模型架构无关,可无缝集成到现有TAS模型(如MS-TCN、C2F-TCN、FACT)中作为训练时损失函数,在不显著改变帧级准确率的前提下,显著提升分割一致性与边界精度。
链接: https://arxiv.org/abs/2604.01859
作者: Hinako Mitsuoka,Kazuhiro Hotta
机构: Meijo University (明治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026 Workshop “AI-driven Skilled Activity Understanding, Assessment Feedback Generation (SAUAFG)”
Abstract:Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
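基于 CDF 的段级正则化通过比较预测与真实标签随时间累积的分布来施加约束。下面是逐类的 numpy 草图(归一化方式为假设,可能与论文略有出入):

```python
import numpy as np

def cdf_segment_loss(pred_probs, gt_onehot):
    """pred_probs, gt_onehot: (T, C) per-frame class distributions.
    Penalizes the gap between cumulative (over time) class mass of the
    prediction and the ground truth."""
    T = pred_probs.shape[0]
    cdf_pred = np.cumsum(pred_probs, axis=0) / T
    cdf_gt = np.cumsum(gt_onehot, axis=0) / T
    return float(np.abs(cdf_pred - cdf_gt).mean())
```

逐帧交叉熵对分段碎片不敏感,而累积分布的差异直接反映段边界的时序偏移,这正是该正则项改善 F1/Edit 分数的直觉。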
[CV-69] Semantic Richness or Geometric Reasoning? The Fragility of VLMs' Visual Invariance
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在基本几何变换下表现出的根本性脆弱性问题,即模型缺乏对空间变换的鲁棒不变性(robust spatial invariance)和等变性(equivariance),导致其在简单旋转、缩放和恒等变换下难以可靠识别物体。解决方案的关键在于通过系统性评估不同视觉域(包括符号草图、自然照片和抽象艺术)中的性能下降现象,揭示当前VLMs在语义理解与空间推理之间存在系统性差距,从而强调未来多模态系统需强化几何基础建模以提升空间感知能力。
链接: https://arxiv.org/abs/2604.01848
作者: Jason Qiu,Zachary Meurer,Xavier Thomas,Deepti Ghadiyaram
机构: Boston University (波士顿大学); Runway (Runway)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
[CV-70] FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting
[Quick Read]: This paper targets the efficiency bottleneck of Gaussian Splatting (GS)-based X-ray Computed Tomography (CT) reconstruction: although GS matches or exceeds traditional algorithms in reconstruction quality, its speed has not been compelling enough to motivate a transition away from well-established reconstruction algorithms. The key is FaCT-GS, a framework whose in-depth optimization of the voxelization and rasterization pipelines yields a substantial efficiency gain and good scaling with projection and output volume size; the improved voxelization also enables rapid fitting of Gaussians to pre-existing volumes, which can warm-start reconstruction or serve as an alternative, compressed representation. FaCT-GS is over 4x faster than the state-of-the-art GS method on standard 512x512 projections and over 13x faster on 2k projections.
Link: https://arxiv.org/abs/2604.01844
Authors: Pawel Tomasz Pieta,Rasmus Juul Pedersen,Sina Borgi,Jakob Sauer Jørgensen,Jens Wenzel Andreasen,Vedrana Andersen Dahl
Affiliations: Technical University of Denmark
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Gaussian Splatting (GS) has emerged as a dominating technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4X faster than the State of the Art GS CT reconstruction on standard 512x512 projections, and over 13X faster on 2k projections. Implementation available at: this https URL.
[CV-71] Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images ICPR2026
[Quick Read]: This paper addresses the inherent position dependence of the discrete representations learned by conventional vector quantization methods (VQ-VAE, VQ-GAN): latent codes are spatially arranged and contextually entangled, so autoregressive or diffusion-based priors are required to model code dependencies at generation time. The key is the Permutation-Invariant Vector Quantized Autoencoder (PI-VQ), which constrains latent codes to carry no positional information, encouraging the model to learn more global, semantic features and enabling direct image interpolation without a learned prior. To compensate for the reduced information capacity of permutation-invariant representations, the authors further introduce Matching Quantization, a mechanism based on optimal bipartite matching that increases effective bottleneck capacity by 3.5x, supporting efficient image synthesis and interpolation-based sampling while preserving structural interpretability.
Link: https://arxiv.org/abs/2604.01843
Authors: Jamie S. J. Stirling,Noura Al-Moubayed,Hubert P. H. Shum
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages plus references; 5 figures; supplementary appended; accepted to ICPR 2026
Abstract:Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by 3.5× relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
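The gap between naive nearest-neighbour quantization and matching quantization can be illustrated with a toy example: when several latents crowd near one code, independent nearest-neighbour lookup collapses them onto a single codebook entry, while a one-to-one optimal assignment spreads them over distinct codes, raising the effective bottleneck capacity. The brute-force permutation search below stands in for the paper's optimal bipartite matching; the codebook and latents are made-up numbers:

```python
import itertools
import numpy as np

def nearest_neighbour_assign(z, codebook):
    """Naive VQ: each latent independently picks its nearest code."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def matching_assign(z, codebook):
    """One-to-one assignment of latents to codes minimising total
    squared distance.  Brute force over permutations, fine for this
    tiny demo; at scale this is an optimal bipartite matching."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    n = len(z)
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(d[i, p[i]] for i in range(n)))
    return np.array(best)

codebook = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
z = np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])  # all near code 0

nn = nearest_neighbour_assign(z, codebook)
mq = matching_assign(z, codebook)
print(len(set(nn.tolist())), len(set(mq.tolist())))  # 1 4
```

Here naive quantization uses a single codebook entry while matching uses all four, which is the intuition behind the paper's reported capacity gain.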
[CV-72] Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers
[Quick Read]: This paper tackles the challenges that the irregular structure of textured 3D meshes poses for deep-learning-based semantic segmentation, in particular that existing methods typically ignore the rich texture information carried by mesh faces. The key is a texture-aware Transformer that learns features directly from the raw pixels associated with each mesh face and introduces a new hierarchical learning scheme for multi-scale feature aggregation; a texture branch summarizes all face-level pixels into a learnable token, which is fused with geometric descriptors and processed by stacked Two-Stage Transformer Blocks (TSTB) that support both local and global information flow, yielding substantial gains in segmentation performance.
Link: https://arxiv.org/abs/2604.01836
Authors: Mohammadreza Heidarianbaei,Max Mehltretter,Franz Rottensteiner
Affiliations: Leibniz University Hannover
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.
[CV-73] Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification
[Quick Read]: This paper addresses severity classification under domain shift in medical image analysis, which is especially hard because class boundaries are unclear and the class labels are naturally ordered. The key is a ranking-based cross-domain alignment method: Cross-Domain Ranking first learns the relative order of sample pairs across the source and target domains, producing class-related rank scores; Continuous Distribution Alignment then aligns the rank score distributions of the two domains for finer-grained adaptation. This effectively alleviates the limitations of conventional semi-supervised domain adaptation in ordered-class settings.
Link: https://arxiv.org/abs/2604.01834
Authors: Shota Harada,Ryoma Bise,Kiyohito Tanaka,Seiichi Uchida
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.
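The ranking idea rests on a standard objective: predicted rank scores should respect the natural order of severity labels. A minimal pairwise hinge sketch follows; the paper's exact loss, cross-domain pair construction, and network are not reproduced here:

```python
import numpy as np

def pairwise_rank_loss(scores, labels, margin=1.0):
    """Hinge penalty whenever a higher-severity sample is not scored
    at least `margin` above a lower-severity one."""
    loss, n_pairs = 0.0, 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
                n_pairs += 1
    return loss / max(n_pairs, 1)

labels = np.array([0, 1, 2, 3])        # ordered severity grades
good = np.array([0.0, 2.0, 4.0, 6.0])  # respects the order with margin
bad = np.array([3.0, 1.0, 2.0, 0.0])   # violates the order
print(pairwise_rank_loss(good, labels), pairwise_rank_loss(bad, labels))
```

Scores that respect the severity order incur zero loss; mis-ordered scores are penalized, which is what pushes source and target samples onto comparable rank scales before distribution alignment.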
[CV-74] SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers CVPR26
[Quick Read]: This paper targets safety risks in MMDiT-based text-to-image (T2I) generation, especially unsafe semantic content triggered by multi-token interactions. Existing methods rely mainly on fine-tuning or attention modulation for concept unlearning, but their computational cost is high and they are hard to adapt to transformer-based diffusion models. The key insights are twofold: an analysis of MMDiT's attention mechanism shows that unsafe semantics concentrate in interpretable, low-dimensional head-level subspaces, with specific safety-critical heads responsible for extracting unsafe features; and perturbing the Rotary Positional Embedding (RoPE) applied to query and key vectors can effectively modify specific concepts without degrading benign content quality. Building on this, the authors propose SafeRoPE, which constructs head-wise unsafe subspaces, computes a Latent Risk Score (LRS), and applies head-wise RoPE perturbations to precisely suppress unsafe outputs while maintaining generation fidelity, achieving a state-of-the-art balance between safety and utility preservation.
Link: https://arxiv.org/abs/2604.01826
Authors: Xiang Yang,Feifei Li,Mi Zhang,Geng Hong,Xiaoyu You,Min Yang
Affiliations: Fudan University; East China University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR26
Abstract:Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at this https URL.
[CV-75] STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
[Quick Read]: This paper addresses a weakness of group-based policy optimization for video question answering (VQA) with large multimodal models: when responses are similarly correct, reward variance is low and advantage estimates become weak or unstable. The key is STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), which constructs multiple spatiotemporal variants of each input video and performs joint normalization across textual generations and visual variants, enriching the reward signal; an importance-aware sampling mechanism further prioritizes frames semantically relevant to the question while preserving temporal coverage, keeping exploration semantically grounded, avoiding overfitting to a single spatiotemporal configuration, and promoting robust reasoning across complementary visual perspectives.
Link: https://arxiv.org/abs/2604.01824
Authors: Emad Bahrami,Olga Zatsarynna,Parth Pathak,Sunando Sengupta,Juergen Gall,Mohsen Fayyaz
Affiliations: University of Bonn; Microsoft; Meta; Lamarr Institute for Machine Learning and Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
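The low-variance failure mode and the remedy can be seen in a few lines: when every sampled answer gets the same reward, a group-normalized (GRPO-style) advantage is identically zero, and adding video variants restores a usable signal. This is a numeric sketch only; STRIVE's exact estimator may differ:

```python
import numpy as np

def joint_normalized_advantages(rewards):
    """rewards[v][g]: reward of generation g under video variant v.
    Advantage = (r - mean) / std, normalised jointly over the whole
    variant x generation group."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:               # all responses equally rewarded
        return np.zeros_like(r)  # -> no learning signal
    return (r - r.mean()) / std

# One video, two generations, both judged correct: zero advantages.
flat = joint_normalized_advantages([[1.0, 1.0]])
# A spatiotemporal variant on which one answer fails restores variance.
varied = joint_normalized_advantages([[1.0, 1.0], [1.0, 0.0]])
print(flat, varied)
```

Expanding the comparison group beyond textual diversity to visual perturbations is exactly what lets the second case produce non-zero advantages where the first cannot.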
[CV-76] A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
[Quick Read]: This paper aims to reduce the reliance of breast cancer molecular subtyping on costly molecular assays (such as PAM50 gene expression profiling) by predicting PAM50 subtypes directly from HE-stained whole-slide images (WSIs) with deep learning. The key is an optimization-driven framework that combines the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation to jointly optimize patch informativeness, spatial diversity, uncertainty, and patch count, selecting a small but highly informative subset of patches for classification; features are extracted with a ResNet18 backbone and classified with a custom CNN head. The method achieves an F1-score of 0.8812 on the internal TCGA-BRCA cohort and 0.7952 on the external CPTAC-BRCA test cohort, demonstrating high accuracy and computational efficiency with potential for clinical translation.
Link: https://arxiv.org/abs/2604.01798
Authors: Arezoo Borji,Gernot Kronreif,Bernhard Angermayr,Francisco Mario Calisto,Wolfgang Birkfellner,Inna Servetnyk,Yinyin Yuan,Sepideh Hatamikia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from HE-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, an F1-score of 0.8812 and an AUC of 0.9841 using 627 WSIs from the TCGA-BRCA cohort were achieved. The performance of the proposed approach on the external validation dataset showed an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.
[CV-77] PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency CVPR2026
[Quick Read]: This paper addresses the lack of temporal consistency in monocular depth estimation (MDE) across consecutive frames, which causes jitter in depth predictions and can even lead to estimation failures when the depth range changes abruptly. The key is to incorporate wheel odometry from a mobile robot: camera pose and sparse depth between adjacent frames are computed via optical flow, the triangulation results update a recursive Bayesian estimate of the metric scale factor, and this scale rescales the relative depth output by a pre-trained depth foundation model, yielding temporally stable and consistent depth estimates.
Link: https://arxiv.org/abs/2604.01791
Authors: Leezy Han,Seunggyu Kim,Dongseok Shim,Hyeonbeom Lee
Affiliations: Ajou University; Sony Creative AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026
Abstract:Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our own dataset, demonstrating robust and accurate depth estimation performance.
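The recursive Bayesian scale update amounts to a one-dimensional Kalman-style filter over the metric scale factor, fed by sparse triangulated depths. A minimal sketch; the noise parameters and observation model are illustrative, not the paper's:

```python
class RecursiveScaleEstimator:
    """Scalar Kalman-style filter for the metric scale factor."""
    def __init__(self, scale0=1.0, var0=1.0, meas_var=0.1, process_var=1e-3):
        self.scale, self.var = scale0, var0
        self.meas_var, self.process_var = meas_var, process_var

    def update(self, measured_scale):
        self.var += self.process_var               # predict: scale drifts slowly
        k = self.var / (self.var + self.meas_var)  # Kalman gain
        self.scale += k * (measured_scale - self.scale)
        self.var *= 1.0 - k
        return self.scale

est = RecursiveScaleEstimator()
# Per-frame scale observations, e.g. triangulated depth / relative depth.
for z in [2.1, 1.9, 2.05, 1.95, 2.0]:
    s = est.update(z)
metric_depth = s * 10.0  # rescale a relative depth of 10.0
print(round(s, 3), round(metric_depth, 2))
```

Because the estimate is smoothed over frames rather than recomputed per frame, the rescaled depth stays temporally coherent even when individual triangulations are noisy, which is the consistency mechanism the paper exploits.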
[CV-78] GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents CVPR2026
[Quick Read]: This paper addresses the inefficiency of manually modeling Jiangnan gardens for digital content creation (film, games, and digital tourism), a process that heavily depends on expert experience. The key is GardenDesigner, a framework that encodes the aesthetic principles of Jiangnan garden construction and integrates a chain of agents based on procedural modeling to automate layout and asset generation: terrain distribution and road generation agents follow water-centric terrain design and explorative pathway rules, while asset selection and layout optimization agents arrange objects for each area under aesthetic and cultural constraints. A knowledge base, GardenVerse, further improves asset arrangement, and a Unity-based interactive interface lets non-expert users construct a Jiangnan garden from text input within one minute.
Link: https://arxiv.org/abs/2604.01777
Authors: Mengtian Li,Fan Yang,Ruixue Xiong,Yiyan Fan,Zhifeng Xie,Zeyu Wang
Affiliations: Shanghai University; Shanghai Engineering Research Center of Motion Picture Special Effects; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, Project page: this https URL
Abstract:Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at this https URL.
[CV-79] FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
[Quick Read]: This paper addresses the high cost and low frequency of acquiring high-resolution forest structure data, in particular the limitations of airborne LiDAR as the reference for forest structure metrics such as the Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). The key is FSKD, a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework: a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, while an RGBI-only SegFormer student learns to reproduce these outputs. Trained and validated on 384 km^2 of forest in Saxony, Germany, the student achieves state-of-the-art zero-shot CHM prediction (median absolute error 4.17 m, R^2 = 0.51), jointly predicts CHM, PAI, and FHD (a capability not offered by existing monocular CHM estimators), and remains robust under temporal mismatch (winter LiDAR, summer RGBI), supporting accurate, scalable forest monitoring at 20 cm resolution.
Link: https://arxiv.org/abs/2604.01766
Authors: Taimur Khan,Hannes Feilhauer,Muhammad Jazib Zafar
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper in-review
Abstract:Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 km^2 of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, R^2 =0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29–46% in MAE (5.81 m vs. 8.14–10.84 m) with stronger correlation coefficients (0.713 vs. 0.166–0.652). Ablations show that multi-modal fusion improves performance by 10–26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
[CV-80] DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
[Quick Read]: This paper addresses the limited geometric grounding of existing world-action models (WAM) for embodied systems in the physical world: current approaches mostly model 2D appearance or latent representations, with little modeling of scene geometry, which constrains reasoning and planning in tasks such as autonomous driving. The key is DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a modular architecture; by explicitly learning a geometry-aware world representation and using it to guide both future prediction and planning, the model produces more coherent imagined futures and more robust driving actions while maintaining modularity and controllable latency.
Link: https://arxiv.org/abs/2604.01765
Authors: Yang Zhou,Xiaofeng Wang,Hao Shao,Letian Wang,Guosheng Zhao,Jiangnan Shao,Jiagang Zhu,Tingdong Yu,Zheng Zhu,Guan Huang,Steven L. Waslander
Affiliations: GigaAI; University of Toronto; CUHK MMLab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 11 pages, 4 figures; Project Website: this https URL
Abstract:Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding-an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
[CV-81] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning ICLR2026
[Quick Read]: This paper addresses the weak reasoning of current Large Vision-Language Models (LVLMs) over implicit visual information, i.e., the multi-step cognitive reasoning required when the image serves only as a clue rather than the answer itself. The key is RebusBench, a new benchmark of 1,164 rebus puzzles designed to test whether models can combine visual perception with linguistic prior knowledge (such as idioms) and perform abstract mapping to synthesize meanings that exist outside the pixel space. Experiments show that although current models possess the necessary visual and linguistic components, they lack the cognitive reasoning mechanism to connect them, leading to severely limited performance (below 10% exact match and below 20% semantic accuracy) and exposing a key bottleneck in neurosymbolic integration for current LVLMs.
Link: https://arxiv.org/abs/2604.01764
Authors: Seyed Amir Kasaei,Arash Marioriyad,Mahbod Khaleti,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
Affiliations: Sharif University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)
Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at this https URL.
[CV-82] Cosine-Normalized Attention for Hyperspectral Image Classification
[Quick Read]: This paper addresses a limitation of Transformer-based hyperspectral image classification (HSIC): conventional dot-product attention mixes feature magnitude and orientation, which can be suboptimal for hyperspectral data. The key is to revisit attention scoring from a geometric perspective and propose a cosine-normalized attention formulation: query and key embeddings are projected onto the unit hypersphere and similarity is computed as a squared cosine similarity, emphasizing the angular structure of hyperspectral signatures while reducing sensitivity to magnitude variations. Despite a lightweight backbone, the method consistently outperforms several recent Transformer- and Mamba-based models under extremely limited supervision, demonstrating stronger generalization and a reliable inductive bias.
Link: https://arxiv.org/abs/2604.01763
Authors: Muhammad Ahmad,Manuel Mazzara
Affiliations: SDAIA-KFUPM, Joint Research Center for Artificial Intelligence (JRCAI), King Fahd University of Petroleum and Minerals; Institute of Software Development and Engineering, Innopolis University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
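The squared-cosine scoring can be written down directly: normalize queries and keys onto the unit hypersphere, square the cosine similarity, then softmax over keys. The sketch below (softmax and scaling details are assumptions, not the paper's exact formulation) also demonstrates the resulting magnitude invariance:

```python
import numpy as np

def squared_cosine_attention(Q, K, V, eps=1e-8):
    """Attention with squared cosine similarity scores: Q and K rows
    are projected onto the unit hypersphere, so scores depend only on
    angle, not magnitude."""
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    scores = (Qn @ Kn.T) ** 2              # squared cosine similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # softmax over keys
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out = squared_cosine_attention(Q, K, V)
# Magnitude invariance: rescaling the queries leaves the output unchanged,
# unlike standard dot-product attention.
out_scaled = squared_cosine_attention(10.0 * Q, K, V)
print(np.allclose(out, out_scaled, atol=1e-6))  # True
```

In dot-product attention the same rescaling would sharpen the softmax and change the output; the angular formulation removes that sensitivity, which is the inductive bias the paper argues suits hyperspectral signatures.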
[CV-83] Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
[Quick Read]: This paper addresses the lack of controllable editing of specific appearance attributes (such as style and lighting) in pretrained video diffusion models, especially when high-dimensional self-supervised features (such as DINO) are used as the conditioning signal: these features entangle semantics, style, and lighting, limiting generative control. The key is a lightweight architecture and training strategy that decouples appearance from other features to be preserved (such as scene structure), enabling robust control over appearance changes such as stylization and relighting; the authors also find that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
Link: https://arxiv.org/abs/2604.01761
Authors: Edoardo A. Dominici,Thomas Deixelberger,Konstantinos Vardis,Markus Steinberger
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page this https URL
Abstract:Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject specific editing, aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning like DINO, contain a lot of entangled information about style, lighting and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
[CV-84] Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding CVPR2026
[Quick Read]: This paper addresses the difficulty of directly applying existing vision-language pre-training models (such as CLIP) to ultrasound images, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. The key is the construction of US-365K, a large-scale ultrasound image-text dataset of 365k pairs across 52 anatomical categories, together with the Ultrasonographic Diagnostic Taxonomy (UDT), a two-level knowledge framework: the Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, while the Ultrasonographic Diagnostic Attribute Framework formalizes nine diagnostic dimensions (including organ, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity). On this basis, the authors propose Ultrasound-CLIP, which introduces semantic soft labels and a semantic loss to refine sample discrimination and builds a heterogeneous graph modality from the attribute framework's textual representations, enabling structured reasoning over lesion-attribute relations and delivering significant gains in performance and generalization on classification, retrieval, zero-shot, linear probing, and fine-tuning tasks.
Link: https://arxiv.org/abs/2604.01749
Authors: Jiayun Jin,Haolong Chai,Xueying Huang,Xiaoqing Guo,Zengwei Zheng,Zhan Zhou,Junmei Wang,Xinyu Wang,Jie Liu,Binbin Zhou
Affiliations: Hangzhou City University; Hong Kong Baptist University; Zhejiang University; Women’s Hospital, School of Medicine, Zhejiang University; The First Affiliated Hospital, Zhejiang University School of Medicine; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks. Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and Ultrasonographic Diagnostic Attribute Framework formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF’s textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.
[CV-85] Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception
[Quick Read]: This paper addresses cross-view geo-localization for UAVs in GNSS-denied environments, where the core challenge is the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Existing methods typically use a decoupled "place retrieval + pose estimation" pipeline that treats perspective distortion as appearance noise rather than an explicit geometric transformation, limiting accuracy. The key is a geometry-aware UAV geo-localization framework: a Visual Geometry Grounded Transformer (VGGT) reconstructs a local 3D scene from multi-view UAV image sequences and renders a virtual Bird's-Eye View (BEV) representation that aligns the UAV perspective with satellite imagery; this BEV serves as a geometric intermediary enabling robust cross-view retrieval and providing spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. A Satellite-wise Attention Block further isolates interference between satellite candidates while maintaining linear computational complexity, improving localization reliability under multiple hypotheses.
Link: https://arxiv.org/abs/2604.01747
Authors: Haoyuan Li,Wen Yang,Fang Xu,Hong Tan,Haijian Zhang,Shengyang Li,Gui-Song Xia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures
Abstract:Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird’s-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
[CV-86] Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation
【速读】:该论文旨在解决密集人群场景下基于点标注(point annotations)的实例分割(instance segmentation)精度不足的问题,尤其针对当前主流大模型如Segment Anything Model (SAM) 在稠密人群中的表现不佳。其关键解决方案是提出两种核心方法:一是Dense Point-to-Mask Optimization (DPMO),通过将SAM与最近邻排除圆(Nearest Neighbor Exclusive Circle, NNEC)约束相结合,从点标注生成高质量的密集实例掩码(mask);二是Reinforced Point Selection (RPS)框架,利用Group Relative Policy Optimization (GRPO)训练策略,从初始点预测中选择最优点以提升实例分割性能。这两项创新共同实现了在传统点标注数据集上获得可靠掩码,并显著提升了密集人群场景下的实例分割与计数准确率。
链接: https://arxiv.org/abs/2604.01742
作者: Hongru Chen,Jiyang Huang,Jia Wan,Antoni B. Chan
机构: Harbin Institute of Technology, Shenzhen; City University of Hong Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
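摘要中的最近邻排除圆(NNEC)约束,其直观含义是:每个标注点的"排除半径"由它到最近邻点的距离决定,半径系数不超过 0.5 时相邻实例的圆互不重叠。下面是一个纯 NumPy 的示意实现(半径系数与示例点均为假设,非论文原始代码):

```python
import numpy as np

def nnec_radii(points, ratio=0.5):
    """Nearest Neighbor Exclusive Circle radii for point annotations.

    points: (N, 2) array of head-point annotations (x, y).
    Each point's exclusive radius is `ratio` times the distance to its
    nearest neighbor, so circles of neighboring instances never overlap
    when ratio <= 0.5.
    """
    diff = points[:, None, :] - points[None, :, :]  # (N, N, 2) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(-1))             # (N, N) pairwise distances
    np.fill_diagonal(dist, np.inf)                  # ignore self-distance
    return ratio * dist.min(axis=1)                 # (N,) exclusive radii

# demo: three annotated points forming a 3-4-5 triangle
pts = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0]])
radii = nnec_radii(pts)
```

这样得到的半径可以用来约束 SAM 输出的掩码,使每个实例掩码不越过邻近实例的排除圆。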
[CV-87] Setup-Independent Full Projector Compensation
【速读】:该论文旨在解决投影补偿(Projector Compensation)中现有方法对特定硬件配置高度依赖的问题,即当投影表面、光照条件或投影仪-相机位姿发生变化时,传统方法通常需要重新训练或微调才能有效工作,严重限制了其在实际场景中的泛化能力。解决方案的关键在于提出SIComp——首个无需微调或重训练即可适应未见过的投影设置的通用框架。其核心创新包括:构建包含277个不同投影-相机配置的大规模真实世界数据集以支持模型学习跨场景的鲁棒表示;采用协同自适应设计,将几何校正与光度校正解耦:通过定制化的光流模块实现在线几何修正,同时引入新颖的光度网络处理光度失真,并融合强度变化的表面先验以增强不同光照条件下的鲁棒性,从而实现了真正意义上的无监督泛化投影补偿。
链接: https://arxiv.org/abs/2604.01736
作者: Haibo Li,Qingyue Deng,Jijiang Li,Haibin Ling,Bingyao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 17 figures
Abstract:Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) existing geometric correction models are typically constrained by specific spatial setups; without further retraining or fine-tuning, they often fail to generalize directly to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: A carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: this https URL
[CV-88] SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing
【速读】:该论文旨在解决现有基于流的生成式图像编辑方法在文本引导下难以保持源图像保真度(source fidelity)的问题,具体表现为:高阶求解器引入额外模型推理、截断反演限制编辑灵活性,以及特征注入方法缺乏架构可迁移性。其解决方案的关键在于提出 SteerFlow 框架,通过两个核心机制实现理论保障下的高保真编辑:一是前向过程中引入 amortized fixed-point solver(摊销固定点求解器),通过强制连续时间步间速度一致性隐式校直前向轨迹,获得高质量的反演潜在表示;二是后向过程中设计 trajectory interpolation(轨迹插值),自适应融合目标编辑与源重建速度,使编辑轨迹锚定于源图像。此外,还引入 adaptive masking(自适应掩码机制),利用概念引导分割和源-目标速度差异空间约束编辑信号,进一步提升背景保留能力。
链接: https://arxiv.org/abs/2604.01715
作者: Thinh Dao,Zhen Wang,Kien T. Pham,Long Chen
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.
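摘要中的摊销固定点求解器属于"对隐式反演步做固定点迭代"这一类思路。下面用一个玩具线性速度场示意该迭代的一般形式与收敛性(速度场、步长与迭代次数均为假设,仅作说明,并非 SteerFlow 的实际实现):

```python
import numpy as np

def fixed_point_invert(x, v, t, dt, n_iters=10):
    """One implicit inversion step z = x - dt * v(z, t), solved by
    fixed-point iteration z_{k+1} = x - dt * v(z_k, t).

    The iteration converges when dt * Lipschitz(v) < 1.
    """
    z = x.copy()                    # initialize at the explicit estimate
    for _ in range(n_iters):
        z = x - dt * v(z, t)
    return z

# toy linear velocity field (assumption, for illustration only)
A = np.array([[0.3, 0.1], [0.0, 0.2]])
v = lambda z, t: A @ z

x = np.array([1.0, -1.0])
z = fixed_point_invert(x, v, t=0.5, dt=0.1)
# the result should satisfy the implicit equation z = x - dt * v(z, t)
residual = np.linalg.norm(z - (x - 0.1 * v(z, 0.5)))
```

对线性场而言,隐式步的精确解是 (I + dt·A)z = x 的解,可据此验证迭代结果。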
[CV-89] End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement CVPR2026
【速读】:该论文旨在解决现有共享注意力(Shared Attention, SA)估计方法中存在的两个关键问题:一是多数方法在估计SA时未检测实际关注群体,二是普遍假设图像中仅存在单一SA点,这限制了SA检测在实际场景中的适用性和性能。解决方案的关键在于提出一种端到端的联合方法,通过两步流程实现群体检测与共享注意力估计的协同优化:首先基于个体注视热图(gaze attention heatmap)和群体成员关系标量(group membership scalar)生成初始SA热图;随后利用初始SA热图对群体成员关系进行细化,最终输出精确的SA热图。该设计有效融合了群体结构信息与视觉注意力分布,提升了整体检测精度。
链接: https://arxiv.org/abs/2604.01714
作者: Chihiro Nakatani,Norimichi Ukita,Jean-Marc Odobez
机构: Toyota Technological Institute (丰田工业大学); Idiap Research Institute (Idiap研究所); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026 Workshop (GAZE 2026)
Abstract:This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two step process: (i) the generation of SA heatmaps relying on individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) a refinement of the initial group memberships allowing to account for the initial SA heatmaps, and the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: this https URL.
[CV-90] Bias mitigation in graph diffusion models ICLR2025
【速读】:该论文旨在解决图扩散模型中存在的双重偏差问题:一是反向采样起始点与前向扩散过程最大扰动分布不一致导致的“反向起始偏差”(reverse-starting bias),二是扩散模型固有的“暴露偏差”(exposure bias),二者共同导致生成质量下降。解决方案的关键在于两个创新机制:其一,设计了一种新的Langevin采样算法,使反向采样起始点与前向扩散的最大扰动分布对齐,从而缓解反向起始偏差;其二,提出基于新定义的“得分差”(score difference)的得分修正机制,有效降低暴露偏差。该方法无需修改网络结构,在多个模型、数据集和任务上验证了有效性,达到了当前最优性能。
链接: https://arxiv.org/abs/2604.01709
作者: Meng Yu,Kun Zhan
机构: Lanzhou University (兰州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2025!
Abstract:Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion’s maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code is at this https URL
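摘要中用于建立新反向起点的 Langevin 采样,其基本更新规则可用下面的玩具示例说明:给定目标分布的得分函数(log 密度梯度),反复执行"梯度步 + 噪声步"即可把样本驱向目标分布。此处以一维高斯为目标,步长与分布参数均为假设,并非论文算法本身:

```python
import numpy as np

def langevin_sample(score, x0, eps=0.01, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics:
        x <- x + (eps / 2) * score(x) + sqrt(eps) * noise
    Repeated updates drive samples toward the distribution whose score
    function (gradient of the log-density) is `score`.
    """
    rng = np.random.default_rng(rng)
    x = x0.copy()
    for _ in range(n_steps):
        x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.standard_normal(x.shape)
    return x

# toy target: N(mu, sigma^2), whose score is -(x - mu) / sigma^2
mu, sigma = 2.0, 0.5
score = lambda x: -(x - mu) / sigma**2

x0 = np.zeros(5000)              # start from a mismatched point mass at 0
x = langevin_sample(score, x0, rng=0)
```

论文的思路即用这类动力学把反向采样的起点分布拉向前向扩散的最大扰动分布,而不是默认的标准高斯。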
[CV-91] Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
【速读】:该论文旨在解决视频帧插值(video frame interpolation)中因生成模型单向运行而导致的运动漂移(motion drift)、方向歧义(directional ambiguity)和边界错位(boundary misalignment)问题,尤其在长距离序列中更为显著。其解决方案的关键在于提出一种基于时间循环一致性(temporal cycle-consistency)的双向框架,通过引入可学习的方向标记(directional tokens)显式地将时间方向条件注入共享主干网络,使模型能够在统一架构内联合优化前向合成与后向重建任务。该循环一致性监督作为强正则化项,确保生成的运动路径具有逻辑可逆性;同时采用课程学习策略从短序列到长序列逐步训练,稳定不同长度下的动态行为,且仅在训练阶段应用循环约束,推理时保持单次前向传播的高效性,从而实现高质量、平滑且可控的视频插值效果。
链接: https://arxiv.org/abs/2604.01700
作者: Lingyu Liu,Yaxiong Wang,Li Zhu,Zhedong Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.
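摘要中的时间循环一致性监督,核心是要求"前向合成后再后向重建"能回到起点。下面用可逆线性映射做一个最小示意(映射与数据均为假设,仅说明损失形式,与论文的网络结构无关):

```python
import numpy as np

def cycle_consistency_loss(forward, backward, x):
    """L_cycle = mean || backward(forward(x)) - x ||^2.

    `forward` plays the role of forward synthesis and `backward` the
    reverse reconstruction; the loss vanishes only when the two
    trajectories are mutually reversible.
    """
    return float(np.mean((backward(forward(x)) - x) ** 2))

# toy example: an invertible linear "generator" and its exact inverse
W = np.array([[1.0, 0.5], [0.0, 2.0]])
forward = lambda x: x @ W.T
exact_backward = lambda y: y @ np.linalg.inv(W).T
bad_backward = lambda y: y           # identity: not the inverse

x = np.random.default_rng(0).standard_normal((100, 2))
good = cycle_consistency_loss(forward, exact_backward, x)
bad = cycle_consistency_loss(forward, bad_backward, x)
```

只有当后向映射确实是前向映射的逆时损失才趋于零,这正是该正则项迫使生成运动路径"逻辑可逆"的原因。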
[CV-92] From Understanding to Erasing: Towards Complete and Stable Video Object Removal
【速读】:该论文旨在解决视频目标移除(video object removal)任务中因移除目标物体而引发的副作用(如阴影、反射和光照变化)难以有效处理的问题,这些问题通常会破坏视频的时空一致性与整体连贯性。其核心挑战在于现有扩散模型对目标物体及其与场景交互关系的物理和语义理解不足。解决方案的关键在于从两个互补视角引入理解机制:外部层面,通过蒸馏策略将视觉基础模型中物体与其诱导效应之间的关系迁移至视频扩散模型;内部层面,设计帧级上下文交叉注意力机制,使每个去噪模块能够基于目标区域周围的未掩码上下文进行定位与推理。内外双重引导共同提升了模型对目标物体、其诱发效应及全局背景的理解能力,从而实现清晰且一致的视频目标移除效果。
链接: https://arxiv.org/abs/2604.01693
作者: Dingming Liu,Wenjing Wang,Chen Li,Jing Lyu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: this https URL.
[CV-93] BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography
【速读】:该论文旨在解决远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)中生理信号时间动态建模的难题,尤其是现有深度学习方法依赖局部时间移位或卷积操作导致的时间感受野有限、难以捕捉长程时序依赖的问题。其解决方案的关键在于提出基于正交蝴蝶时间移位(Orthogonal Butterfly Temporal Shifting, BTS)的新型时间建模框架:通过XOR驱动的蝴蝶配对调度机制构建结构化帧间交互,逐步扩展时间感受野;同时引入正交特征传输机制(Orthogonal Feature Transfer, OFT),在时间移位前对源特征进行目标上下文感知过滤,仅保留正交分量进行跨帧传播,从而减少冗余信息传递并增强互补性时序交互,显著提升rPPG对生理动态的长期建模能力。
链接: https://arxiv.org/abs/2604.01679
作者: Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham
机构: VNU University of Engineering and Technology (河内国立大学工程与技术学院); VinUni-Illinois Smart Health Center (VinUniversity) (Vin大学-伊利诺伊智能健康中心); College of Engineering and Computer Science (VinUniversity) (Vin大学工程与计算机科学学院); Center for Innovations in Health Sciences (VinUniversity) (Vin大学健康科学创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
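摘要中的两个机制都可以写成几行代码:XOR 蝴蝶配对调度(stage k 时第 i 帧与第 i XOR 2^k 帧交互,经 log2(T) 个 stage 逐级扩大时间感受野)和正交特征传输(传输前去掉源特征在目标特征方向上的投影)。以下为纯 NumPy 示意,非官方实现,示例向量为假设:

```python
import numpy as np

def butterfly_partner(i, stage):
    """FFT-style pairing: at stage k, frame i talks to frame i XOR 2^k."""
    return i ^ (1 << stage)

def orthogonal_component(source, target, eps=1e-12):
    """Remove from `source` its projection onto `target`, keeping only the
    component orthogonal to the target context (the OFT idea)."""
    proj = (source @ target) / (target @ target + eps) * target
    return source - proj

# pairing schedule for 8 frames: log2(8) = 3 stages
T = 8
schedule = [[butterfly_partner(i, k) for i in range(T)] for k in range(3)]

s = np.array([3.0, 4.0])
t = np.array([1.0, 0.0])
o = orthogonal_component(s, t)   # only the part of s orthogonal to t survives
```

注意每个 stage 的配对都是互逆的(partner 的 partner 是自己),因此调度本身就是一组帧间的双向交互。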
[CV-94] Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding
【速读】:该论文旨在解决现有基于高斯表示的体积视频(volumetric video)方法在动态场景中缺乏实例级结构信息的问题,从而限制了稳定跟踪与语义推理能力。解决方案的关键在于提出一种统一的时空高斯表示框架Director,其核心创新是通过嵌入实例一致的语义信息来增强4D建模,具体实现上利用时序对齐的实例掩码和来自多模态大语言模型(Multimodal Large Language Models, MLLMs)的句子嵌入,监督每个高斯点的可学习语义特征,借助两个MLP解码器实现语言对齐的4D表示并保障身份一致性;同时引入2D光流与4D高斯之间的运动桥接机制以提升时序稳定性,并结合几何感知的SDF约束及表面连续性正则项,显著增强动态前景建模的时序连贯性。
链接: https://arxiv.org/abs/2604.01678
作者: Yuheng Jiang,Yiwen Cai,Zihao Wang,Yize Wu,Sicheng Li,Zhuo Su,Shaohui Jiao,Lan Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For the training, we further introduce a geometry-aware SDF constraints, along with regularization terms that enforces surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
[CV-95] GPA: Learning GUI Process Automation from Demonstrations
【速读】:该论文旨在解决传统机器人流程自动化(Robotic Process Automation, RPA)易受环境变化影响而失效,以及基于视觉语言模型的GUI代理存在非确定性风险的问题。解决方案的关键在于提出一种轻量级、基于视觉的GUI过程自动化(GUI Process Automation, GPA)方法,其核心创新包括:(1) 通过基于序列蒙特卡洛(Sequential Monte Carlo)的定位技术提升对缩放和检测不确定性的鲁棒性;(2) 利用就绪校准(readiness calibration)保障执行的确定性和可靠性;(3) 实现完全本地化的快速执行以保护隐私。该方案在企业级工作流中实现了适应性、鲁棒性和安全性三重优势,并可作为MCP/CLI工具与其他具备编程能力的代理协同工作,由GPA专门负责GUI操作执行,从而实现推理与执行的解耦。
链接: https://arxiv.org/abs/2604.01676
作者: Zirui Zhao,Jun Hao Liew,Yan Yang,Wenzhuo Yang,Ziyang Luo,Doyen Sahoo,Silvio Savarese,Junnan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Deterministic and Reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves higher success rate with 10 times faster execution speed in finishing long-horizon GUI tasks.
[CV-96] HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation
【速读】:该论文旨在解决远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)在跨域场景下性能显著下降的问题,即由于光照、相机特性及色彩响应等外观因素在不同数据集间存在差异而导致模型泛化能力受限。其解决方案的关键在于提出频率域适应(frequency domain adaptation, FDA),通过迁移低频谱成分来建模外观变化,从而促使rPPG模型学习对这些外观变异具有不变性的特征,同时保留由心脏活动引起的生理信号;进一步地,为确保在外观变化下的生理一致性对齐,作者引入了谐波约束最优传输(Harmonic-Constrained Optimal Transport, HOT),利用心脏信号的谐波特性引导原始表示与FDA转换后表示之间的对齐,从而有效提升rPPG模型的鲁棒性和跨数据集泛化能力。
链接: https://arxiv.org/abs/2604.01675
作者: Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham
机构: VNU University of Engineering and Technology (河内国立大学工程与技术学院); VinUni-Illinois Smart Health Center (VinUniversity) (Vin大学-伊利诺伊智能健康中心); College of Engineering and Computer Science (VinUniversity) (Vin大学工程与计算机科学学院); Center for Innovations in Health Sciences (VinUniversity) (Vin大学健康科学创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
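摘要中的频域适应(FDA)沿用了"在 FFT 幅度谱上交换低频分量、保留源图像相位"的常见做法:低频幅度携带域相关的外观信息,相位保留结构与生理信号。下面是单通道图像的示意实现(低频窗口比例 beta 为假设超参数,并非论文给出的数值):

```python
import numpy as np

def fda_low_freq_swap(src, tgt, beta=0.1):
    """Replace the low-frequency FFT amplitude of `src` with that of `tgt`,
    keeping the phase of `src`.

    src, tgt: (H, W) single-channel images; beta controls the size of the
    centered low-frequency window that is transferred.
    """
    fs = np.fft.fftshift(np.fft.fft2(src))
    ft = np.fft.fftshift(np.fft.fft2(tgt))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)

    h, w = src.shape
    bh, bw = int(h * beta), int(w * beta)
    cy, cx = h // 2, w // 2
    lo_h = slice(cy - bh, cy + bh + 1)
    lo_w = slice(cx - bw, cx + bw + 1)
    # swap only the centered low-frequency block of the amplitude spectrum
    amp_s[lo_h, lo_w] = amp_t[lo_h, lo_w]

    out = np.fft.ifft2(np.fft.ifftshift(amp_s * np.exp(1j * pha_s)))
    return np.real(out)

# demo on random "images" (illustrative only)
rng = np.random.default_rng(0)
src = rng.standard_normal((32, 32))
tgt = rng.standard_normal((32, 32))
same = fda_low_freq_swap(src, src, beta=0.1)
mixed = fda_low_freq_swap(src, tgt, beta=0.1)
```

论文再在这类 FDA 变换前后的表示之间,用谐波约束最优传输(HOT)强制生理一致的对齐。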
[CV-97] Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion ICME2026
【速读】:该论文旨在解决具身感知系统在开放物理空间中持续交互时面临的动态环境分布漂移问题,尤其是现有领域增量感知方法依赖预先获取的领域ID(domain id)以及模型易受特定上下文感知噪声干扰导致过拟合与灾难性遗忘的问题。解决方案的关键在于提出一种无需领域ID和示例样本的增量学习框架,通过解耦表示机制消除非必要的环境风格干扰,引导模型提取跨场景共享的语义内在特征,从而降低感知不确定性并提升泛化能力;同时采用权重融合策略在参数空间中动态整合新旧环境知识,实现无需存储历史数据即可适应新分布且最大程度保留旧环境判别能力的连续适应目标。
链接: https://arxiv.org/abs/2604.01669
作者: Juncen Guo,Xiaoguang Zhu,Jingyi Wu,Jingyu Zhang,Jingnan Cai,Zhenghao Niu,Liang Song
机构: Fudan University (复旦大学); University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICME2026
Abstract:Embodied perception systems face severe challenges of dynamic environment distribution drift when they continuously interact in open physical spaces. However, the existing domain incremental awareness methods often rely on the domain id obtained in advance during the testing phase, which limits their practicability in unknown interaction scenarios. At the same time, the model often overfits to the context-specific perceptual noise, which leads to insufficient generalization ability and catastrophic forgetting. To address these limitations, we propose a domain-id and exemplar-free incremental learning framework for embodied multimedia systems, which aims to achieve robust continuous environment adaptation. This method designs a disentangled representation mechanism to remove non-essential environmental style interference, and guide the model to focus on extracting semantic intrinsic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use the weight fusion strategy to dynamically integrate the old and new environment knowledge in the parameter space, so as to ensure that the model adapts to the new distribution without storing historical data and maximally retains the discrimination ability of the old environment. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-id free setting, and its accuracy is better than the existing state-of-the-art methods.
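摘要中的参数空间权重融合,其最简单的形式是对新旧环境参数做线性插值。下面是一个示意草图(插值系数 alpha 为假设值;论文实际的动态融合策略可能更复杂):

```python
import numpy as np

def fuse_weights(old_params, new_params, alpha=0.5):
    """Linearly interpolate old- and new-environment parameters in weight
    space: theta = (1 - alpha) * theta_old + alpha * theta_new.

    old_params / new_params: dicts mapping parameter name -> ndarray,
    as produced by a model's state dict.
    """
    assert old_params.keys() == new_params.keys()
    return {k: (1 - alpha) * old_params[k] + alpha * new_params[k]
            for k in old_params}

# demo with toy parameter tensors
old = {"w": np.zeros((2, 2)), "b": np.zeros(2)}
new = {"w": np.ones((2, 2)), "b": 2 * np.ones(2)}
fused = fuse_weights(old, new, alpha=0.25)
```

这类融合直接在参数空间整合新旧环境知识,无需保留任何历史样本。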
[CV-98] M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis
【速读】:该论文旨在解决当前多模态脑网络分析中静态融合方法无法适应不同输入样本差异的问题,从而限制了模型性能提升。其关键解决方案是提出一种多阶段动态融合策略(M3D-BFS),通过为单模态和多模态表示设计可随输入样本变化而自适应调整的混合专家(Mixture-of-Experts, MoE)结构,实现样本感知的动态融合。为避免MoE训练过程中专家坍塌问题,该方法采用三阶段训练流程:先分别训练单模态编码器,再预训练MoE中的单个专家,最后微调整个模型,并引入多模态解耦损失以增强最终表征质量。这是首个面向多模态脑网络动态融合的研究工作。
链接: https://arxiv.org/abs/2604.01667
作者: Rui Dong,Xiaotong Zhang,Jiaxing Li,Yueying Li,Jiayin Wei,Youyong Kong
机构: Southeast University (东南大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal fusion is of great significance in neuroscience which integrates information from different modalities and can achieve better performance than uni-modal methods in downstream tasks. Current multi-modal fusion methods in brain networks, which mainly focus on structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent difference between input samples. This lack of sample adaptation hinders model’s further performance. To this end, we innovatively propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike other static fusion methods, we design different mixture-of-experts (MoEs) for uni- and multi-modal representations where modules can adaptively change as input sample changes during inference. To alleviate issue of MoE where training of experts may be collapsed, we divide our method into 3 stages. We first train uni-modal encoders respectively, then pretrain single experts of MoEs before finally finetuning the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work for dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrates the superiority of M3D-BFS.
[CV-99] DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data CVPR2026
【速读】:该论文旨在解决视频扩散模型在生成高动态运动或需要精细运动控制的视频时存在的现实感不足和可控性差的问题,其根本原因在于常用训练数据集中此类动态场景样本稀缺。解决方案的关键在于提出DynaVid框架,该框架利用计算机图形学管线生成的合成光流(optical flow)作为训练信号,而非直接使用合成视频;这种方法实现了运动信息与视觉外观的解耦,避免了模型学习到不自然的视觉特征,同时提供了多样且精确的运动模式和控制信号。在此基础上,DynaVid采用两阶段生成架构:先由运动生成器合成运动,再由运动引导的视频生成器基于该运动生成真实感视频帧,从而在合成数据中学习复杂动态模式的同时保留真实视频的视觉质量。
链接: https://arxiv.org/abs/2604.01666
作者: Wonjoon Jin,Jiyun Won,Janghyeok Han,Qi Dai,Chong Luo,Seung-Hwan Baek,Sunghyun Cho
机构: POSTECH(浦项科技大学); Microsoft Research Asia(微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Website: this https URL
Abstract:Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.
[CV-100] Moiré Video Authentication: A Physical Signature Against AI Video Generation
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成的视频内容日益难以与真实拍摄视频区分的问题。解决方案的关键在于利用物理现象——莫尔效应(Moiré effect),即当摄像机拍摄双层光栅结构时产生的干涉条纹。研究发现,条纹相位与光栅图像位移之间存在由光学几何决定的线性耦合关系,这一特性在真实摄像机中自然产生,而生成模型无法准确复现。验证者从视频中提取这两个信号并测试其相关性,结果显示真实视频与 AI 生成视频的相关性显著不同,从而提供了一种基于物理规律、可验证的视频真伪鉴别方法。
链接: https://arxiv.org/abs/2604.01654
作者: Yuan Qing,Kunyu Zheng,Lingxiao Li,Boqing Gong,Chang Xiao
机构: Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 17 pages, 14 figures
Abstract:Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
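摘要中的验证流程归结为:从视频中提取条纹相位与光栅图像位移两条信号,并检验二者是否线性耦合。下面用合成数据和 Pearson 相关系数做一个最小示意(线性系数与噪声水平均为假设):

```python
import numpy as np

def linear_coupling_score(phase, displacement):
    """Pearson correlation between fringe phase and grating displacement.

    Real footage should show |r| close to 1 (linear coupling dictated by
    optical geometry); generated video is expected to break this physical
    relationship.
    """
    return float(np.corrcoef(phase, displacement)[0, 1])

rng = np.random.default_rng(0)
disp = rng.uniform(-1, 1, 200)                             # per-frame displacement
real_phase = 3.0 * disp + 0.05 * rng.standard_normal(200)  # linearly coupled
fake_phase = rng.standard_normal(200)                      # no physical coupling

r_real = linear_coupling_score(real_phase, disp)
r_fake = linear_coupling_score(fake_phase, disp)
```

验证者据此即可用一个阈值区分"物理一致"与"物理不一致"的视频。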
[CV-101] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label CVPR2026
【速读】:该论文旨在解决单目3D目标检测在稀疏标注(sparsely annotated)场景下的性能下降问题,即当仅有一小部分物体被标注时,传统方法因缺乏足够监督信号而难以有效训练。其核心解决方案包含两个关键模块:一是道路感知的补丁增强(Road-Aware Patch Augmentation, RAPA),通过将分割后的物体补丁精准地投影到道路区域并保持三维几何一致性,从而利用稀疏标注信息增强数据多样性;二是基于原型的过滤机制(Prototype-Based Filtering, PBF),通过对比预测框与预定义的2D RoI特征原型及深度不确定性来筛选高质量伪标签(pseudo-labels),确保伪标签在特征空间和深度估计上均具可靠性。该方法结合几何保真增强与原型引导的伪标签生成策略,在稀疏监督下实现了鲁棒的3D目标检测性能。
链接: https://arxiv.org/abs/2604.01646
作者: Junyoung Jung,Seokwon Kim,Jun Uk Kim
机构: Kyung Hee University (庆熙大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at this https URL .
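PBF 模块通过"特征与类原型一致 + 深度估计可靠"双重条件筛选伪标签。下面用纯 Python 给出这一筛选逻辑的示意(特征维度、阈值与字段名均为假设,非论文官方实现):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filter_pseudo_labels(predictions, prototypes, sim_thresh=0.8, unc_thresh=0.5):
    """Keep predictions whose RoI feature matches its class prototype
    AND whose depth uncertainty is low enough."""
    kept = []
    for p in predictions:
        proto = prototypes[p["cls"]]
        if cosine(p["feat"], proto) >= sim_thresh and p["depth_unc"] <= unc_thresh:
            kept.append(p)
    return kept

prototypes = {"car": [1.0, 0.0], "ped": [0.0, 1.0]}
preds = [
    {"cls": "car", "feat": [0.9, 0.1], "depth_unc": 0.2},  # consistent + certain
    {"cls": "car", "feat": [0.1, 0.9], "depth_unc": 0.2},  # off-prototype
    {"cls": "ped", "feat": [0.0, 1.0], "depth_unc": 0.9},  # unreliable depth
]
kept = filter_pseudo_labels(preds, prototypes)
```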
[CV-102] TOL: Textual Localization with OpenStreetMap
【速读】:该论文旨在解决基于文本描述的开放街图(OpenStreetMap, OSM)全局定位问题,即在不依赖几何观测或GNSS初始位置的情况下,仅通过自然语言场景描述实现城市环境中二维自由度(2 degree-of-freedom, 2-DoF)的精确定位。其核心挑战在于如何将语义丰富的文本信息与结构紧凑但语义复杂的OSM地图进行对齐与匹配。解决方案的关键是提出TOLoc框架,采用“粗到精”两阶段策略:第一阶段利用方向感知特征提取模块,从文本和OSM地图中构建全局描述符以检索候选位置;第二阶段通过专用对齐模块融合文本描述与局部地图特征,回归出最终的2-DoF位姿。该方法显著提升了定位精度,并展现出良好的跨环境泛化能力。
链接: https://arxiv.org/abs/2604.01644
作者: Youqi Liao,Shuhao Kang,Jingyu Xu,Olaf Wysocki,Yan Xia,Jianping Li,Zhen Dong,Bisheng Yang,Xieyuanli Chen
机构: Wuhan University (武汉大学); Technical University of Munich (慕尼黑工业大学); Institute of Artificial Intelligence (TeleAI) (中国电信人工智能研究院); University of Cambridge (剑桥大学); University of Science and Technology of China (中国科学技术大学); Nanyang Technological University (南洋理工大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Tech report
Abstract:Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: this https URL.
[CV-103] LivingWorld: Interactive 4D World Generation with Environmental Dynamics
【速读】:该论文旨在解决现有3D场景生成方法主要聚焦于静态几何重建,而忽视了场景尺度环境动态(如云、水或烟雾)建模的问题。这类动态建模的挑战在于:随着场景扩展,运动需保持全局一致性,同时支持低延迟用户交互。解决方案的关键在于提出LivingWorld框架,其核心创新是通过渐进式构建全局一致的运动场来实现动态扩展,并引入几何感知对齐模块以消除多视角下的方向与尺度歧义;此外,采用基于哈希的紧凑运动场表示,实现高效查询与稳定传播动态信息,且支持渲染时双向运动传播,从而无需依赖昂贵的视频精修即可生成长时序、高一致性的4D序列。
链接: https://arxiv.org/abs/2604.01641
作者: Hyeongju Mun,In-Hwan Jin,Sohyeong Kim,Kyeongbo Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at this http URL.
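LivingWorld 用基于哈希的紧凑运动场支持高效查询与稳定传播。下面以 Python 字典模拟体素哈希的插入与查询(体素大小、零运动回退值均为假设的简化设定):

```python
def voxel_key(p, cell=0.5):
    """Quantize a 3D point to its voxel-grid hash key."""
    return tuple(int(c // cell) for c in p)

class MotionField:
    """Sparse hash map from voxel cells to motion vectors; queries
    outside any stored cell fall back to zero (static) motion."""
    def __init__(self, cell=0.5):
        self.cell = cell
        self.table = {}

    def insert(self, point, motion):
        self.table[voxel_key(point, self.cell)] = motion

    def query(self, point):
        return self.table.get(voxel_key(point, self.cell), (0.0, 0.0, 0.0))

field = MotionField(cell=1.0)
field.insert((2.3, 0.1, 5.7), (0.0, 0.2, 0.0))  # e.g. a drifting cloud region
m = field.query((2.9, 0.4, 5.1))                # falls in the same voxel cell
```

查询与插入均为 O(1),这也是此类表示在场景不断扩张时仍能保持低延迟的原因。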
[CV-104] Automatic Image-Level Morphological Trait Annotation for Organismal Images ICLR2026
【速读】:该论文旨在解决生态学研究中形态特征(morphological traits)标注效率低、依赖专家人工标注的问题,从而限制了大规模生态分析的应用。其核心挑战在于缺乏高质量的图像与特征级注释数据集。解决方案的关键在于利用在基础模型(foundation model)特征上训练的稀疏自编码器(sparse autoencoder),提取出具有单义性(monosemantic)且空间定位明确的神经元,这些神经元能稳定激活于有意义的形态部位;在此基础上构建一个模块化标注流水线,通过视觉-语言提示(vision-language prompting)实现对显著区域的定位与可解释的形态描述生成,最终构建了包含80K条特征注释的Bioscan-Traits数据集,为生态学与机器学习的融合提供了可扩展的生物意义监督信号。
链接: https://arxiv.org/abs/2604.01619
作者: Vardaan Pahuja,Samuel Stevens,Alyson East,Sydne Record,Yu Su
机构: The Ohio State University (俄亥俄州立大学); University of Maine (缅因大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
[CV-105] Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中对物理可实现的对抗性攻击缺乏 robustness 的问题,尤其是针对三维(3D)纹理扰动这一更具现实可行性的攻击方式。现有研究多集中于语言扰动或二维(2D)视觉攻击,其物理真实性不足;而3D纹理由于可直接附着于物体表面并在真实环境中部署,构成更严重的威胁。解决方案的关键在于提出两个核心机制:一是前景-背景解耦(Foreground-Background Decoupling, FBD),通过双渲染器对齐实现从VLA目标函数到物体外观的可微优化路径,从而在不改变原始仿真环境的前提下完成端到端纹理优化;二是轨迹感知对抗优化(Trajectory-Aware Adversarial Optimization, TAAO),通过优先选择行为关键帧并采用基于顶点的参数化策略提升优化稳定性,确保攻击在长时程和多视角下仍具有效性。由此构建的Tex3D框架首次实现了在VLA仿真环境中直接优化3D对抗纹理,实验证明其可在多种操作任务中使VLA系统失败率高达96.7%,揭示了VLA系统在物理场景下的严重脆弱性,并强调了鲁棒训练的必要性。
链接: https://arxiv.org/abs/2604.01618
作者: Jiawei Chen,Simin Huang,Jiawei Du,Shuaihang Chen,Yu Tian,Mingjie Wei,Chao Yu,Zhaoxia Yin
机构: East China Normal University (华东师范大学); Zhongguancun Academy (中关村学院); CFAR, A*STAR, Singapore (新加坡科技研究局CFAR); Tsinghua University (清华大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize through an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
[CV-106] NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy
【速读】:该论文旨在解决三维计算机断层扫描(CT)图像在自监督学习(SSL)中面临的两大挑战:一是全体积Transformer模型内存消耗过高,二是CT数据的各向异性空间结构难以被传统掩码策略有效捕捉。解决方案的关键在于提出NEMESIS框架,其核心创新包括:(i) 噪声增强重建作为预训练任务以提升特征表达能力;(ii) 引入掩码解剖Transformer块(Masked Anatomical Transformer Blocks, MATB),通过并行的平面级和轴级token移除实现双重掩码机制,更好地保留解剖细节;(iii) 设计NEMESIS Tokens(NT)用于跨尺度上下文聚合,从而在局部超补丁(128×128×128)级别上实现高效且精确的表示学习。该方法在BTCV多器官分类任务中表现优异,尤其在低标签场景下仍保持高鲁棒性,同时显著降低计算成本至31.0 GFLOPs/前向传播,相较全体积基线减少96.8%。
链接: https://arxiv.org/abs/2604.01612
作者: Kyeonghun Kim,Hyeonseok Jung,Youngung Han,Hyunsu Go,Eunseob Choi,Seongbin Park,Junsu Lim,Jiwon Yang,Sumin Lee,Insung Hwang,Ken Ying-Kai Liao,Nam-Joon Kim
机构: Chung-Ang University (中央大学); Seoul National University (首尔国立大学); GIST (光州科学技术院); Sangmyung University (祥明大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures, 5 tables
Abstract:Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.
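MATB 的双重掩码并行移除整张平面(plane-wise)与整条轴向列(axis-wise)的 token。下面用纯 Python 示意这一组合掩码后保留 token 集合的计算(网格尺寸与掩码比例均为假设):

```python
import random

def dual_mask(shape, plane_ratio=0.3, axis_ratio=0.3, seed=0):
    """Return kept token indices after removing whole z-planes
    (plane-wise) and whole (y, x) columns (axis-wise)."""
    rng = random.Random(seed)
    Z, Y, X = shape
    drop_planes = set(rng.sample(range(Z), int(plane_ratio * Z)))
    drop_cols = set(rng.sample([(y, x) for y in range(Y) for x in range(X)],
                               int(axis_ratio * Y * X)))
    return [(z, y, x) for z in range(Z) for y in range(Y) for x in range(X)
            if z not in drop_planes and (y, x) not in drop_cols]

# On a 4x4x4 token grid: drop 1 plane and 4 columns ->
# (4-1) * (16-4) = 36 tokens survive.
kept = dual_mask((4, 4, 4))
```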
[CV-107] F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling CVPR2026
【速读】:该论文旨在解决分布式多智能体场景下3D重建的几何一致性与通信效率问题,现有基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的方法依赖集中式数据聚合,在去中心化机器人系统中难以适用,且直接扩展至多智能体时会引入通信开销和几何不一致。其解决方案的关键在于:首先通过注册各客户端本地融合的激光雷达(LiDAR)点云构建共享几何骨架,初始化全局3DGS模型;随后在联邦优化阶段固定高斯位置以保持几何对齐,仅允许各客户端更新外观相关属性(协方差、不透明度及球谐系数),并通过基于可见性的聚合策略加权整合各客户端贡献,从而有效应对多智能体探索中的局部观测挑战。
链接: https://arxiv.org/abs/2604.01605
作者: Morui Zhu,Mohammad Dehghani Tezerjani,Mátyás Szántó,Márton Vaitkus,Song Fu,Qing Yang
机构: University of North Texas (北德克萨斯大学); Budapest University of Technology and Economics (布达佩斯技术与经济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to the CVPR 2026 SPAR-3D Workshop
Abstract:We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client’s contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
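F3DGS 的服务器端按各客户端对每个高斯的观测次数加权聚合外观属性更新。以下为这一可见性加权平均的纯 Python 示意(观测次数与更新向量均为虚构数据):

```python
def visibility_aggregate(updates, counts):
    """Server-side weighted average of per-client attribute updates
    for one Gaussian, weighting client k by its observation count."""
    total = sum(counts)
    if total == 0:
        return None  # no client observed this Gaussian; keep old attributes
    dim = len(updates[0])
    return [sum(c * u[d] for c, u in zip(counts, updates)) / total
            for d in range(dim)]

# Two clients propose [opacity, SH-coeff] updates for one Gaussian;
# client 0 observed it 30 times, client 1 only 10 times.
agg = visibility_aggregate([[0.8, 0.1], [0.4, 0.5]], [30, 10])
```

观测更频繁的客户端主导结果,从而缓解多智能体探索中的局部可见性问题。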
[CV-108] Towards Minimal Focal Stack in Shape from Focus CVPR
【速读】:该论文旨在解决Shape from Focus (SFF) 方法对密集采样、大尺寸焦堆(focal stack)的依赖问题,从而限制了其在实际场景中的应用。解决方案的关键在于提出一种基于物理模型的焦堆增强方法,通过从两幅输入图像中估计出全景清晰图像(All-in-Focus, AiF)并计算能量差值图(Energy-of-Difference, EOD)作为辅助特征,从而显著丰富原始焦堆的信息;同时设计了一个深度网络结构,利用卷积门控循环单元(ConvGRU)在多尺度上迭代优化深度估计,使得仅用两张图像即可实现与传统方法相当甚至更优的深度重建精度。
链接: https://arxiv.org/abs/2604.01603
作者: Khurram Ashfaq,Muhammad Tariq Mahmood
机构: Korea University of Technology and Education (韩国技术教育大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPRW 2026 (3DMV)
Abstract:Shape from Focus (SFF) is a depth reconstruction technique that estimates scene structure from focus variations observed across a focal stack, that is, a sequence of images captured at different focus settings. A key limitation of SFF methods is their reliance on densely sampled, large focal stacks, which limits their practical applicability. In this study, we propose a focal stack augmentation that enables SFF methods to estimate depth using a reduced stack of just two images, without sacrificing precision. We introduce a simple yet effective physics-based focal stack augmentation that enriches the stack with two auxiliary cues: an all-in-focus (AiF) image estimated from two input images, and Energy-of-Difference (EOD) maps, computed as the energy of differences between the AiF and input images. Furthermore, we propose a deep network that computes a deep focus volume from the augmented focal stacks and iteratively refines depth using convolutional Gated Recurrent Units (ConvGRUs) at multiple scales. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed augmentation benefits existing state-of-the-art SFF models, enabling them to achieve comparable accuracy. The results also show that our approach maintains state-of-the-art performance with a minimal stack size.
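论文将 EOD 图定义为 AiF 估计与输入图像差值的能量。下面以 2×2 灰度矩阵给出逐像素平方差的最简示意(实际方法可能还会在局部窗口内聚合能量,此处为简化假设):

```python
def eod_map(aif, img):
    """Energy-of-Difference: per-pixel squared difference between the
    all-in-focus estimate and one input focal-stack image; high energy
    marks regions that were out of focus in `img`."""
    return [[(a - b) ** 2 for a, b in zip(ra, rb)]
            for ra, rb in zip(aif, img)]

aif = [[10, 10], [10, 10]]
near_focus = [[10, 10], [4, 4]]  # bottom row blurred, so it deviates from AiF
eod = eod_map(aif, near_focus)
```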
[CV-109] Riemannian and Symplectic Geometry for Hierarchical Text-Driven Place Recognition
【速读】:该论文旨在解决文本到点云定位(Text-to-point-cloud localization)任务中因依赖全局特征池化导致的信息丢失问题,以及现有方法难以捕捉场景结构判别性的局限。其核心解决方案是提出一种分阶段的粗粒度到细粒度定位框架 SympLoc,关键在于在粗定位阶段引入多层级对齐机制:1)实例级对齐通过双曲空间中的黎曼自注意力建立点云中物体实例与文本提示之间的直接对应关系;2)关系级对齐利用信息辛关系编码器(ISRE)建模物体间的成对空间关系,结合Fisher-Rao度量与哈密顿动力学实现不确定性感知的几何一致性传播;3)全局级对齐通过谱流形变换(SMT)提取图谱分析得到的结构不变量,合成具有判别性的全局描述符。这种分层对齐策略实现了从细粒度到粗粒度的场景语义渐进式捕获,显著提升了跨模态检索的鲁棒性。
链接: https://arxiv.org/abs/2604.01598
作者: Tianyi Shang,Zhenyu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages
Abstract:Text-to-point-cloud localization enables robots to understand spatial positions through natural language descriptions, which is crucial for human-robot collaboration in applications such as autonomous driving and last-mile delivery. However, existing methods employ pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures. To address these issues, we propose SympLoc, a novel coarse-to-fine localization framework with multi-level alignment in the coarse stage. Different from previous methods that rely solely on global descriptors, our coarse stage consists of three complementary alignment levels: 1) Instance-level alignment establishes direct correspondence between individual object instances in point clouds and textual hints through Riemannian self-attention in hyperbolic space; 2) Relation-level alignment explicitly models pairwise spatial relationships between objects using the Information-Symplectic Relation Encoder (ISRE), which reformulates relation features through Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware geometrically consistent propagation; 3) Global-level alignment synthesizes discriminative global descriptors via the Spectral Manifold Transform (SMT) that extracts structural invariants through graph spectral analysis. This hierarchical alignment strategy progressively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval. Extensive experiments on the KITTI360Pose dataset demonstrate that SympLoc achieves a 19% improvement in Top-1 recall@10m compared to existing state-of-the-art approaches.
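SympLoc 在双曲空间(Poincaré 球模型)中计算黎曼自注意力。下面实现 Poincaré 球测地距离的标准公式:同样的欧氏间隔在靠近球面边界处对应更大的双曲距离,这正是双曲几何适合编码层级结构的直观体现(示例坐标为假设):

```python
import math

def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball model of hyperbolic
    space (points must have Euclidean norm < 1)."""
    du = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * du / ((1.0 - nu) * (1.0 - nv)))

# The same Euclidean gap of 0.1 ...
d_near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
d_near_edge = poincare_distance((0.8, 0.0), (0.9, 0.0))
# ... is hyperbolically much longer near the boundary.
```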
[CV-110] Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation
【速读】:该论文旨在解决开放集测试时适应(Open-set Test-Time Adaptation, OSTTA)中的核心挑战:在存在分布偏移(如天气变化导致的协变量偏移,covariate shift)的情况下,模型需同时准确分类分布内(in-distribution, ID)样本和有效拒绝分布外(out-of-distribution, OOD)样本。传统方法通常依赖熵最小化维持ID性能,同时通过熵最大化增强OOD检测,但二者存在内在冲突,导致csID分类与csOOD检测之间的权衡。本文提出ROSETTA框架,其关键创新在于引入角度损失(angular loss)以调节特征范数大小,并结合特征范数损失(feature-norm loss)抑制csOOD样本的logits输出,从而在不牺牲ID分类精度的前提下显著提升OOD检测能力。
链接: https://arxiv.org/abs/2604.01589
作者: Wenjie Zhao,Jia Li,Xin Dong,Yapeng Tian,Yu Xiang,Yunhui Guo
机构: University of Texas at Dallas (德克萨斯大学达拉斯分校); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-set test-time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out-of-distribution (OOD) samples coexist with in-distribution (ID) samples affected by distribution shifts. In such settings, covariate shift (for example, changes in weather conditions such as snow) can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate-shifted ID (csID) samples but also effectively reject covariate-shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test-time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade-off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature-norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a robust open-set test-time adaptation method. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and ImageNet-C. Furthermore, experiments on the Cityscapes dataset validate the method’s effectiveness in real-world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open-set TTA setups.
[CV-111] SHOE: Semantic HOI Open-Vocabulary Evaluation Metric CVPR2026
【速读】:该论文旨在解决开放词汇人类-物体交互(HOI)检测中评价指标的局限性问题,即传统平均精度均值(mAP)将HOI类别视为离散标签,无法对语义合理但词法不同的预测(如“lean on couch”与“sit on couch”)给予正确评分,从而限制了其在开放词汇场景下的适用性。解决方案的关键在于提出SHOE(Semantic HOI Open-Vocabulary Evaluation)评估框架,该框架通过将每个HOI预测分解为动词和物体成分,利用多个大语言模型(LLMs)估计其语义相似度,并融合得到整体相似性得分,从而实现超越精确字符串匹配的语义对齐评估。此方法显著提升了评估结果与人类判断的一致性(达85.73%),并支持对现有HOI检测方法及生成式模型的灵活、可扩展评估。
链接: https://arxiv.org/abs/2604.01586
作者: Maja Noack,Qinqian Lei,Taipeng Tian,Bihan Dong,Robby T. Tan,Yixin Chen,John Young,Saijun Zhang,Bo Wang
机构: University of Mississippi (密西西比大学); National University of Singapore (新加坡国立大学); ASUS Intelligent Cloud Services (AICS) (华硕智能云服务)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to GRAIL-V Workshop at CVPR 2026
Abstract:Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., “lean on couch” vs. “sit on couch”), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions. We will release our evaluation metric to the public to facilitate future research.
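SHOE 将 HOI 预测分解为动词与物体两部分,分别估计语义相似度后融合为整体得分。下面用一个玩具相似度表代替论文中多 LLM 平均打分,示意这一组合逻辑(alpha 等权融合与表中数值均为假设,并非论文的确切公式):

```python
def shoe_score(pred, gt, verb_sim, obj_sim, alpha=0.5):
    """Combine verb- and object-level semantic similarity into one
    HOI score instead of requiring an exact string match."""
    pv, po = pred
    gv, go = gt
    return alpha * verb_sim(pv, gv) + (1 - alpha) * obj_sim(po, go)

# Toy similarity table standing in for the LLM-averaged scores.
SIM = {("lean on", "sit on"): 0.8, ("couch", "couch"): 1.0}
sim = lambda a, b: 1.0 if a == b else SIM.get((a, b), SIM.get((b, a), 0.0))

# Lexically different but semantically close: high (not zero) credit.
score = shoe_score(("lean on", "couch"), ("sit on", "couch"), sim, sim)
exact = shoe_score(("sit on", "couch"), ("sit on", "couch"), sim, sim)
```

与精确字符串匹配相比,"lean on couch" 对 "sit on couch" 不再被判 0 分,而是获得与语义接近程度相称的分数。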
[CV-112] Satellite-Free Training for Drone-View Geo-Localization
【速读】:该论文旨在解决无人机视角地理定位(Drone-view geo-localization, DVGL)在无卫星影像条件下如何实现高精度跨视图检索的问题。现有方法通常依赖卫星图像进行训练,包括成对监督或无监督对齐,这限制了其在卫星数据不可用或受限场景中的实际部署。解决方案的关键在于提出一种卫星自由训练(Satellite-Free Training, SFT)框架,通过三个核心阶段构建与卫星图像兼容的无人机侧表征:首先利用3D Gaussian splatting从多视角无人机图像中重建密集三维场景;其次通过PCA引导的正交投影将重建几何体渲染为伪正射影像(pseudo-orthophoto),无需渲染时相机参数;随后采用轻量级几何引导修补技术完善纹理完整性;最后基于生成的伪正射影像提取DINOv3局部特征,并仅使用无人机数据学习Fisher向量聚合模型,用于测试时对卫星图块的编码与跨视图检索。该方案实现了无需卫星数据即可有效提升定位性能,显著缩小与依赖卫星训练方法的差距。
链接: https://arxiv.org/abs/2604.01581
作者: Tao Liu,Yingzhi Zhang,Kan Ren,Xiaoqi Zhao
机构: Nanjing University of Science and Technology (南京理工大学); University of Tsukuba (筑波大学); Yale University (耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views. Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.
[CV-113] Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning ICME26
【速读】:该论文旨在解决多模态表格-图像融合任务中因不同模态间梯度冲突而导致的优化困难问题,这种冲突可能误导单模态学习器的训练方向。解决方案的关键在于提出一种梯度对齐交替学习(Gradient-Aligned Alternating Learning, GAAL)范式,通过交替进行单模态学习与共享分类器训练来解耦多模态梯度,并设计基于不确定性的跨模态梯度手术策略,选择性地对齐跨模态梯度,从而引导共享参数同时受益于所有模态,实现有效的单模态辅助并提升整体融合性能。
链接: https://arxiv.org/abs/2604.01579
作者: Longfei Huang,Yang Yang
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICME 26
Abstract:Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at this https URL.
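GAAL 的跨模态梯度手术在检测到冲突(内积为负)时对齐梯度方向。下面给出 PCGrad 风格投影这一核心步骤的纯 Python 示意(论文实际方案还结合了基于不确定性的选择性对齐,此处从略;梯度数值为虚构):

```python
def project_conflict(g1, g2):
    """If g1 conflicts with g2 (negative dot product), remove from g1
    its component along g2; otherwise return g1 unchanged."""
    dot = sum(a * b for a, b in zip(g1, g2))
    if dot >= 0:
        return list(g1)  # no conflict: leave the gradient untouched
    nrm2 = sum(b * b for b in g2)
    return [a - dot / nrm2 * b for a, b in zip(g1, g2)]

g_img = [1.0, -2.0]  # image-branch gradient on the shared classifier
g_tab = [1.0, 1.0]   # tabular-branch gradient (conflicting: dot = -1)
aligned = project_conflict(g_img, g_tab)
```

投影后 `aligned` 与 `g_tab` 正交,共享参数的更新不再被图像分支拖向损害表格分支的方向。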
[CV-114] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
【速读】:该论文旨在解决当前视频多模态大语言模型在长视频问答任务中评估存在的两大关键问题:一是评分虚高掩盖了模型在细粒度视觉理解与推理能力上的不足;二是答案正确性缺乏对模型是否准确识别支撑预测的时空证据的验证。为此,作者提出VideoZeroBench,一个分层基准测试框架,通过人工标注500个跨13个领域的问答对,并配以精确的时间区间和空间边界框作为证据,实现对模型输出的严格时空证据验证。其核心创新在于引入五级评估协议,逐步收紧对回答生成、时间定位和空间定位的要求,从而区分模型是否真正具备基于证据的推理能力。实验表明,在最严格的层级(Level-5)下,即使最先进的模型如Gemini-3-Pro也仅能获得低于1%的准确率,揭示了当前模型在真实场景中进行 grounded video reasoning 的显著瓶颈。
链接: https://arxiv.org/abs/2604.01569
作者: Jiahao Meng,Tan Yue,Qi Xu,Haochen Wang,Zhongwei Ren,Weisong Liu,Yuhao Wang,Renrui Zhang,Yunhai Tong,Haodong Duan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answering generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
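VideoZeroBench 的分层协议要求答案正确且定位命中证据。下面以时间区间 IoU 为例,示意"答案正确 + 时间定位达标"的联合判定(IoU 阈值 0.5 为假设;论文还包含空间边界框校验,此处从略):

```python
def temporal_iou(a, b):
    """IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def grounded_correct(answer_ok, pred_span, gt_span, iou_thresh=0.5):
    """Stricter-level check: the answer must be right AND the
    predicted temporal span must overlap the evidence interval."""
    return answer_ok and temporal_iou(pred_span, gt_span) >= iou_thresh

hit = grounded_correct(True, (10.0, 20.0), (12.0, 22.0))   # IoU = 8/12
miss = grounded_correct(True, (10.0, 20.0), (40.0, 50.0))  # right answer, wrong span
```

`miss` 的情形正是摘要所揭示的缺口:答案正确,却未命中支撑预测的证据。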
[CV-115] ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
[Quick Read]: This paper addresses the instability and motion-estimation errors in monocular dynamic scene reconstruction caused by incomplete initialization of dynamic regions. Existing methods typically rely on external dense motion guidance (e.g., pre-computed optical flow) to stabilize reconstruction of dynamic components, which introduces extra complexity and a risk of error propagation. The core of the proposed ReFlow framework is a novel self-correction flow matching mechanism: Full Flow Matching aligns 3D scene flow with time-varying 2D observations, while Camera Flow Matching enforces multi-view consistency for static objects. Combined with a Complete Canonical Space Construction module and a Separation-Based Dynamic Scene Modeling module, the framework decouples static and dynamic components for targeted motion supervision, significantly improving reconstruction quality and robustness.
Link: https://arxiv.org/abs/2604.01561
Authors: Yanzhe Liang,Ruijie Zhu,Hanzhi Chang,Zhuoyuan Li,Jiahao Lu,Tianzhu Zhang
Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.
[CV-116] Cross-Domain Vessel Segmentation via Latent Similarity Mining and Iterative Co-Optimization
[Quick Read]: This paper tackles the significant performance degradation in cross-domain retinal vessel segmentation caused by domain shift between training and testing data. The key to its solution is a novel domain transfer framework that exploits the latent similarity of vascular structures across domains and iteratively co-optimizes generation and segmentation networks: generation networks for the source and target domains are first pre-trained; the source-domain conditional diffusion model then performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic vessel prototypes for target synthesis; finally, cyclic parameter updating mutually optimizes the generative model and the segmentation network, jointly improving cross-domain image synthesis quality and segmentation accuracy.
Link: https://arxiv.org/abs/2604.01553
Authors: Zhanqiang Guo,Jianjiang Feng,Jie Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Retinal vessel segmentation serves as a critical prerequisite for automated diagnosis of retinal pathologies. While recent advances in Convolutional Neural Networks (CNNs) have demonstrated promising performance in this task, significant performance degradation occurs when domain shifts exist between training and testing data. To address these limitations, we propose a novel domain transfer framework that leverages latent vascular similarity across domains and iterative co-optimization of generation and segmentation networks. Specifically, we first pre-train generation networks for source and target domains. Subsequently, the pretrained source-domain conditional diffusion model performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic prototypes for target synthesis. Finally, we develop an iterative refinement strategy where segmentation network and generative model undergo mutual optimization through cyclic parameter updating. This co-evolution process enables simultaneous enhancement of cross-domain image synthesis quality and segmentation accuracy. Experiments demonstrate that our framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.
[CV-117] Prototype-Based Low Altitude UAV Semantic Segmentation ICME2026
[Quick Read]: This paper addresses three challenges in semantic segmentation of low-altitude UAV imagery: extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based methods perform well but are computationally expensive, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. The key to the proposed PBSeg, an efficient prototype-based segmentation framework for UAV applications, is a prototype-based cross-attention (PBCA) mechanism that exploits feature redundancy to reduce computational complexity while preserving segmentation quality; a multi-scale feature extraction module combining deformable convolutions (DConv) with context-aware modulation (CAM) further fuses local details with global semantics, enabling accurate yet efficient edge deployment.
Link: https://arxiv.org/abs/2604.01550
Authors: Da Zhang,Gao Junyu,Zhao Zhiyuan
Affiliations: Northwestern Polytechnical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2026
Abstract:Semantic segmentation of low-altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype-based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype-based cross-attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at this https URL.
[CV-118] Universal computational thermal imaging overcoming the ghosting effect
[Quick Read]: This paper targets the ghosting effect in thermal imaging, i.e., the loss of detailed texture in cluttered photon streams, aggravated by material non-uniformity, with the goal of high-fidelity night vision. Whereas conventional mitigation relies on data post-processing, the proposed universal computational thermal imaging framework TAG (Thermal Anti-Ghosting) performs nonparametric texture recovery from hyperspectral photon streams, enabling unprecedented recovery of the archetypal ghosting phenomenon: expressions on ghostly human faces. TAG not only universally outperforms the state-of-the-art heat-assisted detection and ranging (HADAR) across diverse scenes, but also reveals how material non-uniformity bounds HADAR's effectiveness, establishing a foundation for universal high-fidelity computational night vision.
Link: https://arxiv.org/abs/2604.01542
Authors: Hongyi Xu,Du Wang,Chenjun Zhao,Jiashuo Chen,Jiale Lin,Liqin Cao,Yanfei Zhong,Yiyuan She,Fanglin Bao
Affiliations: Westlake University; Wuhan University; Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: 9 pages, 6 figures
Abstract:Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces – the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR’s effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.
[CV-119] UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
[Quick Read]: This paper addresses the fundamental tension in sparse-view 3D modeling between reconstruction fidelity and generative plausibility: feed-forward reconstruction is efficient and well aligned with the inputs but lacks the global priors needed for structural completeness, while diffusion-based generation provides rich geometric detail yet struggles with multi-view consistency. The key to the proposed UniRecGen is a unified framework that integrates the two paradigms into a cooperative system: both models are aligned within a shared canonical space (coordinates, 3D representation, and training objectives), and disentangled cooperative learning keeps training stable while enabling seamless collaboration at inference, where the reconstruction module provides canonical geometric anchors and the diffusion generator refines and completes the geometry via latent-augmented conditioning, yielding markedly more complete and consistent 3D models.
Link: https://arxiv.org/abs/2604.01479
Authors: Zhisheng Huang,Jiahao Chen,Cheng Lin,Chenyu Hu,Hanzhuo Huang,Zhengming Yu,Mengfei Li,Yuheng Liu,Zekai Gu,Zibo Zhao,Yuan Liu,Xin Li,Wenping Wang
Affiliations: Texas A&M University; Macau University of Science and Technology; Xidian University; ShanghaiTech University; Hong Kong University of Science and Technology; University of California, Irvine
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a unified framework that integrates these two paradigms into a single cooperative system. To overcome inherent conflicts in coordinate spaces, 3D representations, and training objectives, we align both models within a shared canonical space. We employ disentangled cooperative learning, which maintains stable training while enabling seamless collaboration during inference. Specifically, the reconstruction module is adapted to provide canonical geometric anchors, while the diffusion generator leverages latent-augmented conditioning to refine and complete the geometric structure. Experimental results demonstrate that UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.
[CV-120] Prime Once then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation CVPR2026
[Quick Read]: This paper addresses the high cost, inefficiency, and instability of adapting closed-box service models (i.e., APIs) to target tasks via zeroth-order optimization (ZOO), a problem aggravated by modern large models such as GPT-4o being less sensitive to the input perturbations ZOO relies on. The key to the proposed Alternative efficient Reprogramming approach for Service models (AReS) is a single-pass interaction with the service API that primes a local pre-trained encoder: only a lightweight layer on top of the encoder is trained, yielding a local proxy that is then adapted via glass-box (white-box) reprogramming. All subsequent adaptation and inference run on this local model, eliminating further API calls, sharply reducing computational cost, and improving adaptation quality.
Link: https://arxiv.org/abs/2604.01474
Authors: Yunbei Zhang,Chengyi Cai,Feng Liu,Jihun Hamm
Affiliations: Tulane University; University of Melbourne
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CVPR 2026
Abstract:Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS’s effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
[CV-121] Efficient Equivariant Transformer for Self-Driving Agent Modeling CVPR2026
[Quick Read]: This paper addresses SE(2)-equivariant modeling of traffic-agent behavior for self-driving: the model's output must remain consistent under arbitrary rotations and translations of the entire scene, while avoiding the quadratic cost of the explicit pairwise relative positional encodings used by prior methods. The key to the proposed DriveGATr architecture is encoding scene elements as multivectors in the 2D projective geometric algebra $\mathbb{R}^*_{2,0,1}$ and processing them with a stack of equivariant transformer blocks; geometric relationships are modeled with standard attention between multivectors, eliminating costly pairwise relative positional encodings, reducing computational complexity, and improving scalability. Experiments on the Waymo Open Motion Dataset show a superior trade-off between performance and computational cost.
Link: https://arxiv.org/abs/2604.01466
Authors: Scott Xu,Dian Chen,Kelvin Wong,Chris Zhang,Kion Fallah,Raquel Urtasun
Affiliations: Waabi; University of Toronto
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CVPR 2026
Abstract:Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra $\mathbb{R}^*_{2,0,1}$ and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state-of-the-art in traffic simulation and establishes a superior Pareto front for performance vs computational cost.
[CV-122] Reinforcing Consistency in Video MLLMs with Structured Rewards
[Quick Read]: This paper addresses a failure mode of multimodal large language models (MLLMs) in video understanding: outputs that look plausible but lack visual and temporal grounding, such as fabricating object existence, assigning incorrect attributes, or collapsing repeated events. The key to the solution is replacing coarse sentence-level supervision with a structured reward built from factual and temporal units, comprising three complementary components: (1) an instance-aware scene-graph reward for verifying objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. This finer-grained reward design yields consistent gains in faithfulness across temporal, general video understanding, and hallucination-oriented benchmarks.
Link: https://arxiv.org/abs/2604.01460
Authors: Yihao Quan,Zeru Shi,Jinman Zhao,Ruixiang Tang
Affiliations: Rutgers University; University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.
[CV-123] Nonlinear Methods for Analyzing Pose in Behavioral Research
【速读】:该论文旨在解决人体姿态数据(human pose data)在高维度、噪声和时间复杂性背景下,难以提取有意义的协调模式与行为变化的问题。其解决方案的关键在于构建一个通用的分析流程(analysis pipeline),该流程融合了原理性的预处理、降维技术以及基于递归的时间序列分析方法,从而有效量化运动动力学的时间结构,适用于多种实验场景下的线性和非线性运动特征表征。
链接: https://arxiv.org/abs/2604.01453
作者: Carter Sale,Margaret C. Macpherson,Gaurav Patil,Kelly Miles,Rachel W. Kallen,Sebastian Wallot,Michael J. Richardson
机构: Macquarie University (麦考瑞大学); Leuphana University (吕讷堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 40 pages, 13 figures
Abstract:Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general-purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline’s flexibility, we present three case studies spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.
[CV-124] Better Rigs Not Bigger Networks: A Body Model Ablation for Gaussian Avatars
[Quick Read]: This paper addresses the inefficiency and diminishing returns of increasingly complex SMPL-based 3D Gaussian splatting pipelines for human avatar reconstruction. Its key move is replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, yielding a minimal pipeline with no learned deformations or pose-dependent corrections. This pipeline achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap, showing that much of the architectural complexity of prior pipelines is unnecessary, and that body-model expressiveness together with pose-estimation quality has been a primary bottleneck in reconstruction quality.
Link: https://arxiv.org/abs/2604.01447
Authors: Derek Austin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent 3D Gaussian splatting methods built atop SMPL achieve remarkable visual fidelity while continually increasing the complexity of the overall training architecture. We demonstrate that much of this complexity is unnecessary: by replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, a minimal pipeline with no learned deformations or pose-dependent corrections achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. To disentangle pose estimation quality from body model representational capacity, we perform two controlled ablations: translating SAM-3D-Body meshes to SMPL-X, and translating the original dataset’s SMPL poses into MHR both retrained under identical conditions. These ablations confirm that body model expressiveness has been a primary bottleneck in avatar reconstruction, with both mesh representational capacity and pose estimation quality contributing meaningfully to the full pipeline’s gains.
[CV-125] EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation CVPR2026
[Quick Read]: This paper addresses the challenge of generating physically consistent 6DoF motion trajectories from egocentric video, under occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. The key to the proposed EgoFlow framework is a hybrid Mamba-Transformer-Perceiver architecture that jointly models temporal dynamics, scene geometry, and semantic intent, combined with a gradient-guided inference process that enforces differentiable physical constraints such as collision avoidance and motion smoothness, achieving coherent and controllable motion generation without post-hoc filtering or additional supervision.
Link: https://arxiv.org/abs/2604.01421
Authors: Abhishek Saroha,Huajian Zeng,Xingxing Zuo,Daniel Cremers,Xi Wang
Affiliations: TU München; MCML; ETH Zürich; MBZUAI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026: this https URL
Abstract:Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79%, and strong generalization to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.
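The gradient-guided inference idea, steering each refinement step with the gradient of a differentiable constraint, can be illustrated on a toy problem. This is a sketch only: EgoFlow's velocity field and constraints are learned, whereas here the constraint is a simple second-difference smoothness term applied to a noisy 3D trajectory:

```python
import numpy as np

def smoothness_grad(traj):
    """Gradient of C(x) = sum_t ||x[t] - 2 x[t+1] + x[t+2]||^2
    with respect to a trajectory x of shape (T, 3)."""
    d2 = traj[:-2] - 2 * traj[1:-1] + traj[2:]  # second differences
    g = np.zeros_like(traj)
    g[:-2] += 2 * d2
    g[1:-1] += -4 * d2
    g[2:] += 2 * d2
    return g

def guided_steps(traj, n_steps=200, lam=0.05):
    """Gradient-guided refinement: x <- x - lam * grad C(x)."""
    x = traj.copy()
    for _ in range(n_steps):
        x -= lam * smoothness_grad(x)
    return x

def roughness(traj):
    d2 = traj[:-2] - 2 * traj[1:-1] + traj[2:]
    return float((d2 ** 2).sum())

# Toy trajectory: a smooth curve corrupted by noise, then refined.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)
noisy = np.stack([t, t ** 2, np.zeros_like(t)], axis=1)
noisy += 0.05 * rng.standard_normal((30, 3))
refined = guided_steps(noisy)
```

Each step moves the trajectory down the gradient of the constraint, which is the same mechanism a guided generative sampler uses, except there the constraint gradient is added on top of the learned velocity field.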
[CV-126] LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding
[Quick Read]: This paper addresses two key limitations of current 3D Gaussian Splatting (3DGS)-based open-vocabulary 3D scene understanding: spatial ambiguity from unstructured, overlapping Gaussians, which forces probabilistic registration of vision-language features, and multi-level semantic ambiguity from pooling features over object-level masks, which dilutes fine-grained detail. The key to the solution is adopting Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometric representation, regularized with monocular depth and normal priors to establish a stable geometric foundation. This enables deterministic, confidence-aware feature registration that suppresses the semantic-bleeding artifacts common in 3DGS, while the dense alignment properties of the AM-RADIO foundation model resolve multi-level semantic ambiguity without the computational overhead of hierarchical training.
Link: https://arxiv.org/abs/2604.01388
Authors: Fusang Wang,Nathan Piasco,Moussab Bennehar,Luis Roldão,Dzmitry Tsishkou,Fabien Moutarde
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
[CV-127] GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization CVPR2026
[Quick Read]: This paper addresses precise spatiotemporal localization of contact events, the First Point of Contact (FPOC), in American-football practice videos, under camera motion, cluttered scenes, multiple similarly equipped athletes, and rapid pose changes around impact. The key to the proposed GRAZE pipeline, which requires no task-specific training, is to use Grounding DINO to discover candidate player-dummy interactions, refine temporal localization with motion-aware temporal reasoning, and then use SAM2 for pixel-level contact verification rather than relying on detection confidence alone. This separation of candidate discovery from contact confirmation makes the approach robust in real-world footage, localizing contact onset within ±10 frames on 77.5% of clips.
Link: https://arxiv.org/abs/2604.01383
Authors: Syed Ahsan Masud Zaidi,Lior Shamir,William Hsu,Scott Dietrich,Talha Zaidi
Affiliations: Kansas State University; Albright College
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, accepted to the CVPR 2026 Workshop on Computer Vision in Sports (CVSports) code: this https URL
Abstract:American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within \pm 10 frames on 77.5% of all clips and within \pm 20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
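The ±10 / ±20-frame accuracy figures reported above correspond to a simple tolerance metric, sketched below (an illustrative helper under assumed conventions, not the authors' evaluation code; clips with no valid output count as misses):

```python
def fpoc_accuracy(preds, gts, tol=10):
    """Fraction of clips whose predicted contact-onset frame falls
    within +/- tol frames of the annotated First Point of Contact.
    `preds` may contain None for clips without a valid output."""
    hits = sum(
        1 for p, g in zip(preds, gts)
        if p is not None and abs(p - g) <= tol
    )
    return hits / len(gts)
```

For example, with predictions `[100, 125, None, 205]` against ground truth `[105, 140, 90, 200]`, the metric gives 0.5 at a ±10-frame tolerance and 0.75 at ±20 frames.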
[CV-128] AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction
[Quick Read]: This paper addresses a key barrier to clinical deployment of current surgical-automation methods: limited predictability of where instruments will interact on tissue surfaces, and the lack of explicit conditioning inputs to enforce safe interaction regions for specific tool-action pairs. The key to the proposed AffordTissue framework is multimodal prediction of tool-action specific tissue affordance regions as dense heatmaps, combining a temporal vision encoder that captures instrument motion and tissue dynamics across multiple viewpoints, language conditioning that generalizes across diverse tool-action pairs, and a DiT-style decoder for dense affordance prediction. Experiments show substantial improvement over vision-language model baselines (20.6 px vs. 60.2 px average surface distance), providing explicit spatial reasoning for safe surgical automation and potentially enabling early safe stops for greater clinical reliability.
Link: https://arxiv.org/abs/2604.01371
Authors: Aiza Maksutova,Lalithkumar Seenivasan,Hao Ding,Jiru Xu,Chenhao Yu,Chenyan Jing,Yiqing Shen,Mathias Unberath
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments:
Abstract:Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
[CV-129] IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation
[Quick Read]: This paper addresses zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data, i.e., segmenting unseen categories without class-specific annotations. Approaches based on Vision Language Models (VLMs) are limited by the image-text modality gap. The key to the solution is instead using image generation from text to create per-class prototype images, then labeling the point cloud by matching 3D point features, produced by a 3D network distilled from a 2D Vision Foundation Model (VFM), against the 2D features of these prototypes, generalizing to new categories without training. The method is state-of-the-art for OVSS on nuScenes and SemanticKITTI.
Link: https://arxiv.org/abs/2604.01361
Authors: Nermin Samet,Gilles Puy,Renaud Marlet
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at this https URL.
[CV-130] Perceptual misalignment of texture representations in convolutional neural networks
[Quick Read]: This paper asks whether texture representations based on convolutional neural networks (CNNs) spontaneously align with the perceptual content of textures for humans, and in particular whether CNNs widely regarded as better models of the mammalian visual system also have more human-like texture representations. The key to the approach is to compare, across a diverse pool of CNNs, the texture perceptual content captured by linear correlations of nonlinear features (the Gram-matrix representation) against each model's perceptual alignment with the visual system as measured by Brain-Score. The result: there is no connection between conventional measures of a CNN's quality as a visual-system model and its alignment with human texture perception, suggesting that texture perception involves mechanisms outside current CNN modeling frameworks, possibly the integration of contextual information.
Link: https://arxiv.org/abs/2604.01341
Authors: Ludovica de Paolis,Fabio Anselmi,Alessio Ansuini,Eugenio Piasini
Affiliations: International School for Advanced Studies (SISSA); Università degli Studi di Trieste; Area Science Park
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Mathematical modeling of visual textures traces back to Julesz’s intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such “texture representations” spontaneously align with the textures’ perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we compare the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models’ perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
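The feature-correlation representation discussed above, the standard Gatys-style Gram matrix, is compact enough to sketch in a few lines of NumPy (a generic illustration of the representation, not the authors' analysis code):

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a CNN feature map.
    features: array of shape (C, H, W) with C channel activations.
    Returns a (C, C) matrix of channel-wise inner products,
    normalized by the number of spatial positions."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)  # flatten spatial dimensions
    return f @ f.T / (h * w)        # pairwise channel correlations

# Toy feature map: 8 channels over a 4x4 spatial grid.
feats = np.random.default_rng(0).standard_normal((8, 4, 4))
g = gram_matrix(feats)
```

The resulting matrix is symmetric positive semi-definite by construction; the paper's question is whether distances between such matrices, computed from different CNNs, track human perceptual judgments of texture.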
[CV-131] Regularizing Attention Scores with Bootstrapping
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)中注意力分数(attention scores)因普遍存在非零值而导致的注意力图噪声大、稀疏性差、可解释性弱的问题。其核心解决方案在于引入基于自助法(bootstrapping)的统计学习框架,通过重采样输入特征生成注意力分数的基线分布,进而估计注意力分数的显著性和后验概率,从而实现对注意力分数的正则化。该方法能够有效识别并移除由噪声引起的虚假注意力,显著提升注意力图的收缩性和稀疏性,为ViT的决策过程提供更清晰、可靠的解释。
链接: https://arxiv.org/abs/2604.01339
作者: Neo Christopher Chung,Maxim Laletin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Vision transformers (ViT) rely on the attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: this https URL Journal reference: Artificial Intelligence and Statistics (AISTATS) 2026
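A toy sketch of the bootstrapping idea: single-head attention in NumPy, with token features resampled with replacement to build a baseline distribution and empirical p-values. The names and the specific null construction are illustrative assumptions; the paper's actual procedure may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bootstrap_attention_pvalues(x, wq, wk, n_boot=500, seed=0):
    """Empirical p-values for attention scores via feature resampling.

    x: (n_tokens, d) input features; wq, wk: (d, d) projections.
    Resampling token features with replacement yields a baseline
    distribution; a score is deemed significant if it is rarely
    matched or exceeded under the resampled baseline.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    observed = softmax((x @ wq) @ (x @ wk).T / np.sqrt(d))
    exceed = np.zeros_like(observed)
    for _ in range(n_boot):
        xb = x[rng.integers(0, len(x), size=len(x))]   # resample tokens
        boot = softmax((xb @ wq) @ (xb @ wk).T / np.sqrt(d))
        exceed += (boot >= observed)
    return exceed / n_boot   # small p-value -> score above chance level

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 4))
wq = rng.standard_normal((4, 4))
wk = rng.standard_normal((4, 4))
p = bootstrap_attention_pvalues(x, wq, wk)
```

Scores with large p-values would then be shrunk or zeroed, which is the regularization effect the abstract describes.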
[CV-132] SECURE: Stable Early Collision Understanding via Robust Embeddings in Autonomous Driving
【速读】:该论文旨在解决深度学习在事故预判(accident anticipation)任务中模型对真实世界扰动缺乏鲁棒性的问题,特别是当前先进模型如CRASH在面对微小输入扰动时,其预测结果和潜在特征表示表现出显著不稳定性,从而带来严重的可靠性风险。解决方案的关键在于提出SECURE框架,该框架通过形式化定义并强制执行模型鲁棒性,基于预测空间与潜在特征空间的一致性和稳定性四个核心属性,设计了一种多目标损失函数的训练方法:该方法通过对齐参考模型的输出并惩罚对抗扰动下的敏感性,对基线模型进行微调,从而在DAD和CCD数据集上同时提升了抗扰动能力和干净数据上的性能,达到新的最先进水平。
链接: https://arxiv.org/abs/2604.01337
作者: Wenjing Wang,Wenxuan Wang,Songning Lai
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 2 figures
Abstract:While deep learning has significantly advanced accident anticipation, the robustness of these safety-critical systems against real-world perturbations remains a major challenge. We reveal that state-of-the-art models like CRASH, despite their high performance, exhibit significant instability in predictions and latent representations when faced with minor input perturbations, posing serious reliability risks. To address this, we introduce SECURE (Stable Early Collision Understanding via Robust Embeddings), a framework that formally defines and enforces model robustness. SECURE is founded on four key attributes: consistency and stability in both prediction space and latent feature space. We propose a principled training methodology that fine-tunes a baseline model using a multi-objective loss, which minimizes divergence from a reference model and penalizes sensitivity to adversarial perturbations. Experiments on DAD and CCD datasets demonstrate that our approach not only significantly enhances robustness against various perturbations but also improves performance on clean data, achieving new state-of-the-art results.
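One plausible reading of the multi-objective fine-tuning loss (divergence from a reference model plus sensitivity under perturbation) can be sketched as follows. The `alpha`/`beta` weights, the KL/MSE choices, and the toy logits are assumptions for illustration only, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def secure_style_loss(logits, ref_logits, pert_logits, task_loss,
                      alpha=1.0, beta=1.0):
    """Sketch of a multi-objective loss in the spirit of the abstract:
    task loss + divergence from a reference model
    + a stability penalty under an (adversarial) input perturbation."""
    p, q = softmax(logits), softmax(ref_logits)
    kl_ref = float(np.sum(p * np.log(p / q)))           # stay close to reference
    sens = float(np.mean((logits - pert_logits) ** 2))  # stability under perturbation
    return task_loss + alpha * kl_ref + beta * sens

loss = secure_style_loss(np.array([2.0, 0.5]), np.array([1.8, 0.6]),
                         np.array([1.9, 0.4]), task_loss=0.3)
```

Both added terms are non-negative, so the combined objective can only discourage, never reward, drift from the reference or sensitivity to perturbations.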
[CV-133] Human Pose Estimation in Trampoline Gymnastics: Improving Performance Using a New Synthetic Dataset
【速读】:该论文旨在解决当前姿态估计模型在处理蹦床体操中极端人体姿态和非常规视角时性能下降的问题(pose estimation models tend to under-perform on extreme human poses and uncommon viewpoints)。解决方案的关键在于利用从蹦床动作的运动捕捉数据生成的合成姿态数据集(Synthetic Trampoline Poses, STP)对ViTPose模型进行微调。该方法首先通过拟合噪声运动捕捉数据到参数化人体模型,进而生成多视角逼真的图像,从而有效提升模型在真实蹦床场景下的2D姿态估计精度,并进一步改善3D三角测量结果,最终在2D上达到该类挑战性数据上的最先进水平,在3D上将MPJPE降低12.5 mm(相对预训练ViTPose提升19.6%)。
链接: https://arxiv.org/abs/2604.01322
作者: Léa Drolet-Roy,Victor Nogues,Sylvain Gaudet,Eve Charbonneau,Mickaël Begon,Lama Séoud
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Trampoline gymnastics involves extreme human poses and uncommon viewpoints, on which state-of-the-art pose estimation models tend to under-perform. We demonstrate that this problem can be addressed by fine-tuning a pose estimation model on a dataset of synthetic trampoline poses (STP). STP is generated from motion capture recordings of trampoline routines. We develop a pipeline to fit noisy motion capture data to a parametric human model, then generate multiview realistic images. We use this data to fine-tune a ViTPose model, and test it on real multi-view trampoline images. The resulting model exhibits accuracy improvements in 2D, which translate to improved 3D triangulation. In 2D, we obtain state-of-the-art results on such challenging data, bridging the performance gap between common and extreme poses. In 3D, we reduce the MPJPE by 12.5 mm with our best model, which represents an improvement of 19.6% compared to the pretrained ViTPose model.
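MPJPE, the 3D metric quoted above, is simply the mean Euclidean distance between predicted and ground-truth joints. A self-contained sketch with hypothetical joints:

```python
import math

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance (in the
    same units as the coordinates, e.g. mm) between predicted and
    ground-truth 3D joint positions."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

gt   = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 0.0, 5.0)]   # each joint off by 5 mm
err = mpjpe(pred, gt)  # -> 5.0
```

A 12.5 mm reduction in this quantity, as reported, means each joint is on average 12.5 mm closer to its true 3D position.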
[CV-134] ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos ICPR2026
【速读】:该论文旨在解决美式橄榄球训练视频中危险冲撞动作的早期识别问题,以实现及时干预并提升运动员安全。其核心挑战在于此类高风险动作属于罕见事件,且标注数据稀缺,导致模型难以准确检测。解决方案的关键在于构建了一个规模显著扩大的数据集(包含733个单人对假人冲撞片段),并采用基于视觉Transformer(Vision Transformer)的视频分析模型,结合类不平衡感知的训练策略,在交叉验证下实现了0.67的危险冲撞召回率(risky recall)和0.59的F1分数,相较先前小规模数据集的基线方法(risky recall 0.58, F1 0.56)提升了超过8个百分点,证明了该方法在识别稀有但关键安全行为上的有效性,为教练导向的伤害预防工具提供了可行路径。
链接: https://arxiv.org/abs/2604.01318
作者: Syed Ahsan Masud Zaidi,William Hsu,Scott Dietrich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures. Accepted to ICPR 2026 (28th International Conference on Pattern Recognition)
Abstract:Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our dataset contains 733 single-athlete-dummy tackle clips, each temporally localized around the first point of contact and labeled with a strike-zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a vision-transformer-based model with imbalance-aware training, we obtain a risky recall of 0.67 and a risky F1 of 0.59 under cross-validation. Relative to the previous baseline on a smaller subset (risky recall of 0.58; risky F1 of 0.56), our approach improves risky recall by more than 8 percentage points on a much larger dataset. These results indicate that vision-transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.
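The abstract does not specify its imbalance-aware scheme; one common option it might resemble is inverse-frequency class weighting, sketched here with hypothetical labels:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """One common imbalance-aware choice: weight each class by
    n_samples / (n_classes * class_count), so a rare class (e.g. the
    'risky' tackles) contributes proportionally more to the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical 90/10 split mimicking rare risky tackles.
labels = ["safe"] * 90 + ["risky"] * 10
w = inverse_frequency_weights(labels)
```

With this weighting the total loss mass contributed by each class is equalized, counteracting the model's tendency to ignore the rare, safety-critical class.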
[CV-135] Sparse Spectral LoRA: Routed Experts for Medical VLMs
【速读】:该论文旨在解决大型视觉语言模型(VLMs)在医学影像任务中因数据异质性导致的跨数据集干扰和对数据配置敏感的问题,以及在临床实际场景下数据与任务顺序到达时出现的灾难性遗忘问题。解决方案的关键在于提出MedQwen——一种参数高效的医学VLM,其核心创新是将谱路由的专家混合(MoE)架构与理论支撑的缩放规则相结合,使低秩更新能够对齐全秩、完全微调的MoE,而无需改变基础模型结构;具体而言,通过从预训练权重的非重叠奇异值分解(SVD)片段中初始化每个专家,并引入残差补偿与缩放机制,实现了在分布偏移下专家的专业化稳定性和一致路由,从而在23个医学数据集上显著提升性能并大幅减少连续学习中的遗忘。
链接: https://arxiv.org/abs/2604.01310
作者: Omid Nejati Manzari,Hojat Asgariandehkordi,Taha Koleilat,Yiming Xiao,Hassan Rivaz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by 20-50%.
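The SVD-segment expert initialization described above can be sketched in NumPy: each expert receives a disjoint slice of singular triplets, so when the segments cover all ranks the experts jointly reconstruct the pretrained weight exactly. The `(a, b)` low-rank factorization below is an illustrative assumption, not the authors' code:

```python
import numpy as np

def svd_segment_experts(w, n_experts):
    """Initialize low-rank experts from non-overlapping SVD segments of a
    pretrained weight. Expert k gets singular triplets [k*r : (k+1)*r]."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    r = len(s) // n_experts
    experts = []
    for k in range(n_experts):
        sl = slice(k * r, (k + 1) * r)
        a = u[:, sl] * s[sl]   # (out, r): left factor scaled by singular values
        b = vt[sl, :]          # (r, in)
        experts.append((a, b))
    return experts

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
experts = svd_segment_experts(w, n_experts=4)
# Segments cover all 8 ranks, so summing the experts recovers the weight.
recon = sum(a @ b for a, b in experts)
```

Because the slices are disjoint, the experts start out specialized on different spectral components of the pretrained weight, which is the intuition behind the stable specialization the abstract claims.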
[CV-136] Non-Rigid 3D Shape Correspondences: From Foundations to Open Challenges and Opportunities
【速读】:该论文旨在解决变形形状实例之间的对应关系估计问题(correspondence estimation between deformed shape instances),这是计算机图形学中的一个长期挑战,广泛应用于纹理映射、统计建模等任务。其解决方案的关键在于系统性地将近年研究进展归纳为三大范式:基于函数图谱(functional maps)的谱方法、施加离散约束的组合优化方法,以及直接恢复全局对齐的变形驱动方法。每种范式各有优势与局限,论文不仅梳理了各领域的最新进展,还指出了未来研究方向,如利用视觉基础模型实现零样本对应和处理部分形状匹配等新兴挑战。
链接: https://arxiv.org/abs/2604.01274
作者: Aleksei Zhuravlev,Lennart Bastian,Dongliang Cao,Nafie El Amrani,Paul Roetzer,Viktoria Ehm,Riccardo Marin,Hiroki Nishizawa,Shigeo Morishima,Christian Theobalt,Nassir Navab,Daniel Cremers,Florian Bernard,Zorah Lähner,Vladislav Golyanik
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages and 15 figures; Eurographics 2026 STAR; Project page: this https URL
Abstract:Estimating correspondences between deformed shape instances is a long-standing problem in computer graphics; numerous applications, from texture transfer to statistical modelling, rely on recovering an accurate correspondence map. Many methods have thus been proposed to tackle this challenging problem from varying perspectives, depending on the downstream application. This state-of-the-art report is geared towards researchers, practitioners, and students seeking to understand recent trends and advances in the field. We categorise developments into three paradigms: spectral methods based on functional maps, combinatorial formulations that impose discrete constraints, and deformation-based methods that directly recover a global alignment. Each school of thought offers different advantages and disadvantages, which we discuss throughout the report. Meanwhile, we highlight the latest developments in each area and suggest new potential research directions. Finally, we provide an overview of emerging challenges and opportunities in this growing field, including the recent use of vision foundation models for zero-shot correspondence and the particularly challenging task of matching partial shapes.
[CV-137] Camouflage-aware Image-Text Retrieval via Expert Collaboration
【速读】:该论文旨在解决伪装场景理解(Camouflaged Scene Understanding, CSU)中图像与文本跨模态对齐鲁棒性不足的问题,这一问题限制了对伪装场景的深入理解及相关应用的发展。为应对挑战,作者提出了“伪装感知图像-文本检索”(Camouflage-aware Image-Text Retrieval, CA-ITR)任务,并构建了包含约10.5K样本、多粒度文本标注的专用数据集CamoIT。解决方案的核心是提出伪装专家协同网络(Camouflage-Expert Collaborative Network, CECNet),其关键创新在于双分支视觉编码器结构:一个分支提取图像整体表征,另一个分支引入专门模型注入伪装目标的特征表示;同时设计了一种置信度条件图注意力机制(Confidence-conditioned Graph Attention, C²GA),以挖掘两分支间的互补信息。实验表明,CECNet相较七种主流检索模型在CA-ITR任务上实现约29%的整体准确率提升。
链接: https://arxiv.org/abs/2604.01251
作者: Yao Jiang,Zhongkuan Mao,Xuan Wu,Keren Fu,Qijun Zhao
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as those complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at this https URL.
[CV-138] CLPIPS: A Personalized Metric for AI-Generated Image Similarity
【速读】:该论文旨在解决现有图像相似性度量(Image Similarity Metrics, ISMs)如LPIPS和CLIP在文本到图像生成任务中与人类感知判断不一致的问题,尤其是在用户驱动或情境特定的任务中。其核心挑战在于如何提升度量指标与人类主观评价之间的对齐一致性,而非单纯优化绝对性能。解决方案的关键在于提出一种定制化的学习感知图像块相似性度量(Customized Learned Perceptual Image Patch Similarity, CLPIPS),通过仅微调LPIPS中的层组合权重,并利用人类标注的图像对排名数据以边际排序损失(margin ranking loss)进行轻量级、人类增强的微调,从而实现度量结果与人类评分之间更强的一致性。实验表明,CLPIPS在Spearman秩相关系数和组内相关系数(Intraclass Correlation Coefficient)上均优于基线LPIPS,验证了有限人类反馈即可显著改善人机协同流程中的感知对齐效果。
链接: https://arxiv.org/abs/2604.01234
作者: Khoi Trinh,Jay Rothenberger,Scott Seidenberger,Dimitrios Diochnos,Anindya Maiti
机构: University of Oklahoma (俄克拉荷马大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Iterative prompt refinement is central to reproducing target images with text-to-image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context-specific or user-driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We aim to explore whether lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human-in-the-loop workflows with text-to-image tools. We evaluate CLPIPS on a human-subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using margin ranking loss on human-ranked image pairs, we fine-tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and Intraclass Correlation Coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS. Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human-specific fine-tuning can meaningfully enhance perceptual alignment in human-in-the-loop text-to-image workflows.
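The margin ranking loss used for fine-tuning has a one-line form; a sketch with a hypothetical margin of 0.1 (the paper's value is not stated in the abstract):

```python
def margin_ranking_loss(s_preferred, s_other, margin=0.1):
    """Margin ranking loss on one human-ranked pair: penalize the metric
    unless the human-preferred image scores higher by at least `margin`."""
    return max(0.0, margin - (s_preferred - s_other))

# Metric already agrees with the human ranking by a wide margin -> no loss.
ok = margin_ranking_loss(0.8, 0.3)    # -> 0.0
# Metric disagrees with the human ranking -> loss grows with the violation.
bad = margin_ranking_loss(0.3, 0.8)   # -> 0.6
```

Minimizing the sum of such terms over human-ranked pairs pushes the learned layer-combination weights toward score orderings that match human judgments, which is exactly the alignment the Spearman/ICC evaluation then measures.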
[CV-139] DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在设计到代码(Design-to-Code)生成任务中面临的“整体瓶颈”问题,即难以同时兼顾高层次的结构层次与细粒度的视觉细节,导致布局失真或使用通用占位符。其解决方案的关键在于提出一个端到端框架 DOne,通过三个核心创新实现结构理解与元素渲染的解耦:(1) 引入学习型布局分割模块以分解复杂设计,避免启发式裁剪的局限性;(2) 设计专用的混合元素检索器以应对 UI 组件极端长宽比和高密度特性;(3) 提出基于 schema 的生成范式,有效连接布局信息与代码输出。该方法在新构建的 HiFi2Code 基准上显著优于现有方法,在高层视觉相似性和细粒度元素对齐方面均取得提升,并通过人工评估验证了三倍生产力增益和更高视觉保真度。
链接: https://arxiv.org/abs/2604.01226
作者: Xinhao Huang,Jinke Yu,Wenhao Xu,Zeyi Wen,Ying Zhou,Junzhuo Liu,Junhao Ji,Zulong Chen
机构: HKUST (Guangzhou)(香港科技大学(广州)); Alibaba Group(阿里巴巴集团); HKUST(香港科技大学); University of Electronic Science(电子科技大学); Zhejiang Lab(浙江实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注:
Abstract:While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck": failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on HiFi2Code demonstrate that DOne outperforms existing methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a 3 times productivity gain with higher visual fidelity.
[CV-140] Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
【速读】:该论文旨在解决遥感图像分析中因大范围地理覆盖、硬件限制及多尺度图像错位所带来的挑战,特别是在多尺度表示学习方面的问题。其解决方案的关键在于提出了一种基于掩码自编码器(Masked Auto-Encoder, MAE)的自监督模型——Cross-Scale MAE,通过引入尺度增强技术,并结合对比损失与生成损失来强制跨尺度一致性约束,从而学习到适用于多种下游任务的一致且语义丰富的表示。此外,该方法利用xFormers库在单张GPU上加速预训练过程,同时保持表征质量。
链接: https://arxiv.org/abs/2401.15855
作者: Maofeng Tang,Andrei Cozma,Konstantinos Georgiou,Hairong Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Remote sensing images present unique challenges to image analysis due to the extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem but under the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses to ensure consistent and meaningful representations well-suited for a wide range of downstream tasks. Further, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of learned representations. Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.
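A loose sketch of combining a contrastive cross-scale consistency term with a generative reconstruction term, as the abstract outlines. The cosine-based contrastive form and the weights are assumptions, not the paper's exact losses:

```python
import numpy as np

def cross_scale_loss(z_a, z_b, recon, target,
                     w_contrastive=1.0, w_generative=1.0):
    """Sketch of a cross-scale training signal: a contrastive term pulling
    embeddings of two scales of the same scene together (here simply
    1 - cosine similarity), plus a generative MAE-style reconstruction
    term (MSE on masked patches)."""
    cos = float(z_a @ z_b / (np.linalg.norm(z_a) * np.linalg.norm(z_b)))
    contrastive = 1.0 - cos
    generative = float(np.mean((recon - target) ** 2))
    return w_contrastive * contrastive + w_generative * generative

z = np.array([1.0, 0.0, 1.0])
# Same embedding at both scales and perfect reconstruction -> ~0 loss.
loss_aligned = cross_scale_loss(z, z, np.zeros(4), np.zeros(4))
# Orthogonal embeddings across scales -> the consistency term dominates.
loss_orth = cross_scale_loss(z, np.array([0.0, 1.0, 0.0]),
                             np.zeros(4), np.zeros(4))
```

The point of the combined objective is that representations of the same scene must agree across scales while still carrying enough information to reconstruct masked content.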
[CV-141] DenOiS: Dual-Domain Denoising of Observation and Solution in Ultrasound Image Reconstruction
【速读】:该论文旨在解决医学成像中因测量质量差和成像模型不准确导致的图像重建精度受限问题,尤其在使用简化或线性化模型以及不完整、噪声污染的观测数据时表现不佳。其解决方案的关键在于提出DenOiS框架,该框架通过两个核心策略实现:一是观测域的精修策略(observation refinement strategy),用于校正退化的测量数据并补偿成像模型的简化误差;二是基于扩散模型的插件式(plug-and-play, PnP)重建方法,能够在缺失测量条件下保持鲁棒性。此设计使模型仅在仿真数据上训练即可泛化至真实数据,从而实现高保真度的图像重建,特别是在定量超声成像中的速度-声速(speed-of-sound)成像这一挑战性任务中验证了有效性。
链接: https://arxiv.org/abs/2604.02105
作者: Can Deniz Bezek,Orcun Goksel
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical imaging aims to recover underlying tissue properties, using inexact (simplified/linearized) imaging models and often from inaccurate and incomplete measurements. Analytical reconstruction methods rely on hand-crafted regularization, sensitive to noise assumptions and parameter tuning. Among deep learning alternatives, plug-and-play (PnP) approaches learn regularization while incorporating imaging physics during inference, outperforming purely data-driven methods. The performance of all these approaches, however, still strongly depends on measurement quality and imaging model accuracy. In this work, we propose DenOiS, a framework that denoises both input observations and resulting solution in their respective domains. It consists of an observation refinement strategy that corrects degraded measurements while compensating for imaging model simplifications, and a diffusion-based PnP reconstruction approach that remains robust under missing measurements. DenOiS enables generalization to real data from training only in simulations, resulting in high-fidelity image reconstruction with noisy observations and inexact imaging models. We demonstrate this for speed-of-sound imaging as a challenging setting of quantitative ultrasound image reconstruction.
[CV-142] Country-wide high-resolution monitoring of forest browning with Sentinel-2
【速读】:该论文旨在解决全球范围内森林健康受到自然和人为干扰影响的问题,尤其关注如何在国家尺度上高效、准确地监测森林绿色度异常(forest greenness anomalies)。其解决方案的关键在于构建一个基于Sentinel-2数据的可扩展预测分位数模型(predictive quantile model),利用生态与地形背景信息以及植被周期的既定表征,学习归一化差异植被指数(NDVI)的正常季节变化模式。该模型能够以10米空间分辨率识别瑞士境内2017年4月至2025年8月期间的NDVI异常,并通过良好的拟合度评估(解释65%的中位季节变化方差)实现对森林变褐现象的全国性量化分析,且在不同干扰类型下均表现出可靠检测能力。
链接: https://arxiv.org/abs/2604.02074
作者: Samantha Biegel,David Brüggemann,Francesco Grossi,Michele Volpi,Konrad Schindler,Benjamin D. Stocker
机构: ETH Zurich; ETH AI Center; Swiss Data Science Center; EPFL; Swiss Federal Institute for Forest, Snow and Landscape Research WSL; University of Bern; Oeschger Centre for Climate Change Research
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures, to be published in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Congress)
Abstract:Natural and anthropogenic disturbances are impacting the health of forests worldwide. Monitoring forest disturbances at scale is important to inform conservation efforts. Here, we present a scalable approach for country-wide mapping of forest greenness anomalies at the 10 m resolution of Sentinel-2. Using relevant ecological and topographical context and an established representation of the vegetation cycle, we learn a predictive quantile model of the normalised difference vegetation index (NDVI) derived from Sentinel-2 data. The resulting expected seasonal cycles are used to detect NDVI anomalies across Switzerland between April 2017 and August 2025. Goodness-of-fit evaluations show that the conditional model explains 65% of the observed variations in the median seasonal cycle. The model consistently benefits from the local context information, particularly during the green-up period. The approach produces coherent spatial anomaly patterns and enables country-wide quantification of forest browning. Case studies with independent reference data from known events illustrate that the model reliably detects different types of disturbances.
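Quantile models of this kind are typically fit with the pinball (quantile) loss, and a browning anomaly can be flagged when the observed NDVI drops below the predicted lower quantile of the normal seasonal cycle. A minimal sketch; the threshold logic and the tau value are illustrative assumptions:

```python
def pinball_loss(y, y_hat, tau):
    """Quantile (pinball) loss: the asymmetric penalty whose minimizer is
    the tau-quantile of y given the predictors."""
    e = y - y_hat
    return tau * e if e >= 0 else (tau - 1) * e

def browning_anomaly(ndvi_obs, ndvi_lower_q):
    """Flag a browning anomaly when observed NDVI falls below the
    model's predicted lower quantile for that pixel and date."""
    return ndvi_obs < ndvi_lower_q

# For a low quantile (tau = 0.1), under-prediction is penalized lightly
# and over-prediction heavily, pushing the fit toward the lower envelope.
l_over  = pinball_loss(0.8, 0.6, tau=0.1)   # 0.1 * 0.2
l_under = pinball_loss(0.6, 0.8, tau=0.1)   # 0.9 * 0.2
```

Fitting one such model per quantile yields the expected seasonal band against which country-wide NDVI anomalies are detected.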
[CV-143] Enhanced Polarization Locking in VCSELs
【速读】:该论文旨在解决垂直腔面发射激光器(VCSEL)在光学注入锁定(OIL)过程中因固有偏振偏好和有限偏振切换能力而导致的偏振锁定性能受限问题,从而限制其在偏振编码伊辛计算机等新型计算应用中的潜力。解决方案的关键在于通过设计定制化的氧化物孔径结构,并结合偏置电流调制来调控VCSEL的偏振特性,实验表明该方法可将所需的注入功率降低至3.6 μW并显著扩展锁定范围;同时,利用自旋翻转模型(SFM)分析幅度各向异性和偏置电流对偏振锁定的影响,结果与实验高度一致,验证了该策略的有效性。
链接: https://arxiv.org/abs/2604.01857
作者: Zifeng Yuan,Dewen Zhang,Lei Shi,Yutong Liu,Aaron Danner
机构: National University of Singapore (新加坡国立大学)
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While optical injection locking (OIL) of vertical-cavity surface-emitting lasers (VCSELs) has been widely studied in the past, the polarization dynamics of OIL have received far less attention. Recent studies suggest that polarization locking via OIL could enable novel computational applications such as polarization-encoded Ising computers. However, the inherent polarization preference and limited polarization switchability of VCSELs hinder their use for such purposes. To address these challenges, we fabricate VCSELs with tailored oxide aperture designs and combine these with bias current tuning to study the overall impact on polarization locking. Experimental results demonstrate that this approach reduces the required injection power (to as low as 3.6 μW) and expands the locking range. To investigate the impact of the approach, the spin-flip model (SFM) is used to analyze the effects of amplitude anisotropy and bias current on polarization locking, demonstrating strong coherence with experimental results.
人工智能
[AI-0] Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估体系中对交互意识(interaction awareness)缺失的问题。现有标准基准测试仅评估模型在“助手轮次”中的表现,即模型根据输入生成回复并由验证器评分,但忽略了模型是否具备对后续用户回应的感知能力。为填补这一空白,作者提出“用户轮次生成”(user-turn generation)作为探测机制:给定一个包含用户提问和助手回答的对话上下文,让模型以用户角色生成下一轮回应;若模型权重中编码了交互意识,则其生成的用户轮次应是对前序内容的合理延续。实验表明,交互意识与任务准确性解耦,例如Qwen3.5系列模型在数学推理任务上的准确率从41%(0.8B参数)提升至96.8%(397B),但确定性采样下的真实跟进率仍接近零,而高温采样可揭示潜在的交互意识(跟进率达22%)。关键创新在于通过用户轮次生成构建了一个能捕捉LLM行为新维度——交互意识的新评估范式,且该维度在传统仅评估助手响应的基准中完全不可见。
链接: https://arxiv.org/abs/2604.02315
作者: Sarath Shekkizhar,Romain Cosentino,Adam Earle
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B-A17B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher-temperature sampling reveals that interaction awareness is latent, with follow-up rates reaching 22%. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.
[AI-1] Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
【速读】:该论文旨在解决当前大语言模型后训练中强化学习方法存在的两个关键问题:一是Group Relative Policy Optimization (GRPO) 在信用分配上过于粗粒度,无法实现对特定token级别的偏差进行精准修正;二是Self-Distillation Policy Optimization (SDPO) 虽能提供更密集的logit级监督并实现早期快速提升,但在长期训练中易发生崩溃,根源在于对已正确样本的自蒸馏引入优化歧义,且自教师信号可靠性随训练进程逐渐下降。解决方案的关键在于提出Sample-Routed Policy Optimization (SRPO),这是一个统一的在线策略框架,通过样本路由机制将正确样本导向GRPO的奖励对齐强化,将失败样本导向SDPO的针对性logit级修正,并结合熵感知的动态加权机制抑制高熵不可靠目标、强化低熵高置信度目标,从而在保持SDPO早期高效改进的同时实现GRPO的长期稳定性。
链接: https://arxiv.org/abs/2604.02288
作者: Gengsheng Li,Tianyu Yang,Junfeng Fang,Mingyang Song,Mao Zheng,Haiyun Guo,Dan Zhang,Jinqiao Wang,Tat-Seng Chua
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher’s signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO’s reward-aligned reinforcement and failed samples to SDPO’s targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
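The routing and entropy-aware weighting can be caricatured in a few lines; the specific weight formula below is one plausible form, not the paper's:

```python
import math

def route_sample(reward_correct: bool) -> str:
    """Sample routing as the abstract describes: correct rollouts get
    reward-aligned reinforcement (GRPO), failed rollouts get targeted
    logit-level self-distillation (SDPO)."""
    return "grpo" if reward_correct else "sdpo"

def entropy_weight(probs):
    """Entropy-aware distillation weight (illustrative): confident,
    low-entropy self-teacher distributions get weight near 1, while
    high-entropy unreliable targets are suppressed toward 0."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_max = math.log(len(probs))
    return 1.0 - h / h_max if h_max > 0 else 1.0

w_confident = entropy_weight([0.97, 0.01, 0.01, 0.01])  # near 1
w_uniform   = entropy_weight([0.25, 0.25, 0.25, 0.25])  # suppressed to ~0
```

Routing by correctness removes the ambiguity of distilling on already-correct samples, and the entropy weight addresses the degrading reliability of the self-teacher that the abstract identifies.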
[AI-2] Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
【速读】:该论文旨在解决当前生成式晶体材料模型中普遍存在的训练成本高、采样速度慢的问题,尤其是在使用等变图神经网络(Equivariant Graph Neural Networks)时,虽能较好捕捉晶体几何结构,但效率低下。其解决方案的关键在于提出一种轻量级扩散Transformer模型Crystalite,核心创新包括两个简单而有效的归纳偏置:一是亚原子分词(Subatomic Tokenization),用紧凑的化学结构化原子表示替代高维独热编码,更适配连续扩散过程;二是几何增强模块(Geometry Enhancement Module, GEM),通过加性几何偏置直接将周期性最小像对几何信息注入注意力机制。这两个设计使模型在保持标准Transformer简洁高效的同时,显著提升了对晶体结构的建模能力,从而在晶体结构预测和从头生成任务中达到最优性能,并大幅优于依赖复杂几何处理的基线方法。
链接: https://arxiv.org/abs/2604.02270
作者: Tin Hadži Veljković,Joshua Rosenthal,Ivor Lončarić,Jan-Willem van de Meent
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 39 pages, 13 figures. Code available at: this https URL
Abstract:Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks, and de novo generation performance, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
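The minimum-image pair geometry that GEM injects as an additive attention bias can be sketched for a cubic periodic box. The simple `-scale * distance` bias below stands in for the module's learned mapping, and all names are illustrative:

```python
import numpy as np

def minimum_image_distances(positions, box_length):
    """Pairwise atom distances under periodic boundary conditions using
    the minimum-image convention (cubic box of side box_length)."""
    diff = positions[:, None, :] - positions[None, :, :]
    diff -= box_length * np.round(diff / box_length)  # wrap to nearest image
    return np.linalg.norm(diff, axis=-1)

def biased_attention_scores(q, k, dist, scale=1.0):
    """Scaled dot-product attention scores plus an additive geometric
    bias; nearby atoms (in the periodic sense) attend more strongly."""
    d = q.shape[-1]
    return q @ k.T / np.sqrt(d) - scale * dist

# Two atoms near opposite corners of a unit box: the minimum-image
# distance is short (~0.35) even though the naive distance is ~1.39.
pos = np.array([[0.1, 0.1, 0.1],
                [0.9, 0.9, 0.9]])
dist = minimum_image_distances(pos, box_length=1.0)
q = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = biased_attention_scores(q, q, dist)
```

Because the bias is additive inside attention, the backbone stays a standard Transformer, which is the efficiency argument the abstract makes.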
[AI-3] Generative AI Spotlights the Human Core of Data Science: Implications for Education
【速读】:该论文旨在解决生成式 AI(Generative AI, GAI)快速发展背景下数据科学教育的核心定位问题,即如何在自动化日益普及的环境中保持并强化人类不可替代的推理能力。其解决方案的关键在于:明确区分GAI可自动完成的任务(如数据清洗、可视化与建模等计算密集型流程)与仍需人类主导的高阶能力(如问题定义、因果识别、伦理判断及意义建构),并据此重构数据科学课程体系——聚焦于“人类核心”(human core)能力培养,同时教授学生如何通过检索增强生成(retrieval-augmented generation)技术在迭代式提示-输出-再提示循环中高效协作,最终使学习成果评估能够直接衡量学生的推理与判断力。
链接: https://arxiv.org/abs/2604.02238
作者: Nathan Taback
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:
Abstract:Generative AI (GAI) reveals an irreducible human core at the center of data science: advances in GAI should sharpen, rather than diminish, the focus on human reasoning in data science education. GAI can now execute many routine data science workflows, including cleaning, summarizing, visualizing, modeling, and drafting reports. Yet the competencies that matter most remain irreducibly human: problem formulation, measurement and design, causal identification, statistical and computational reasoning, ethics and accountability, and sensemaking. Drawing on Donoho’s Greater Data Science framework, Nolan and Temple Lang’s vision of computational literacy, and the McLuhan-Culkin insight that we shape our tools and thereafter our tools shape us, this paper traces the emergence of data science through three converging lineages: Tukey’s intellectual vision of data analysis as a science, the commercial logic of surveillance capitalism that created industrial demand for data scientists, and the academic programs that followed. Mapping GAI’s impact onto Donoho’s six divisions of Greater Data Science shows that computing with data (GDS3) has been substantially automated, while data gathering, preparation, and exploration (GDS1) and science about data science (GDS6) still require essential human input. The educational implication is that data science curricula should focus on this human core while teaching students how to contribute effectively within iterative prompt-output-prompt cycles using retrieval-augmented generation, and that learning outcomes and assessments should explicitly evaluate reasoning and judgment.
[AI-4] Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models
【速读】:该论文旨在解决情感语气(emotional tone)在用户侧提问中如何影响大语言模型(LLM)性能的问题,尤其关注第一人称情感表述对不同任务表现的影响机制。其关键解决方案是提出一种自适应情感提示框架——EmotionRL,该框架根据每个查询动态选择最合适的 emotionally framed prompt,而非采用固定情感标签;实验证明,尽管单一情绪类型无法稳定提升性能,但通过自适应选择可显著增强模型在社会推理等敏感任务中的表现,从而实现对弱且输入依赖的情感信号的有效利用。
链接: https://arxiv.org/abs/2604.02236
作者: Minda Zhao,Yutong Yang,Chufei Peng,Rachel Gonsalves,Weiyue Li,Ruyi Yang,Zhixi Liu,Mengyu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects emotional framing adaptively for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.
[AI-5] Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中缺乏有效拒答能力的问题,即模型在不确定或无法正确回答时仍会生成看似合理但错误的回答(幻觉),从而影响部署可靠性。解决方案的关键在于提出Query Misalignment Framework,将失败的拒答重新诠释为模型回答了“错误的问题”而非“错误地回答问题”。基于此框架,作者开发了一种名为Trace Inversion的新方法:首先生成模型的推理轨迹(reasoning trace),然后仅依据该轨迹重构模型最可能响应的原始查询;最后通过比较初始查询与重构查询之间的语义相似度来判断模型是否偏离原问题——若相似度低,则判定模型可能误答并触发拒答机制。实验证明,该方法显著提升了四种前沿LLM在九个拒答问答数据集上的表现,在36组对比中优于33组基线方法。
链接: https://arxiv.org/abs/2604.02230
作者: Abinitha Gourabathina,Inkit Padhi,Manish Nagireddy,Subhajit Chaudhury,Prasanna Sattigeri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. Low similarity score between the initial query and reconstructed query suggests that the model likely answered the question incorrectly and is flagged to abstain. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.
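按摘要描述,Trace Inversion 的判定流程可概括为三步:生成推理轨迹、由轨迹反推"模型实际在回答的查询"、比较原查询与重构查询的相似度,相似度过低则触发拒答。下面是一个可运行的最小示意(generate_trace/invert_trace 用桩函数代替真实模型调用,Jaccard 词重叠仅为相似度模型的占位,阈值为假设值,并非论文实现):

```python
def trace_inversion_abstain(query, generate_trace, invert_trace,
                            similarity, threshold=0.5):
    """Three-step gate sketched from the abstract: trace -> reconstructed
    query -> similarity; a low score flags the model to abstain."""
    trace = generate_trace(query)
    reconstructed = invert_trace(trace)
    return similarity(query, reconstructed) < threshold  # True => abstain

def jaccard(a, b):
    """Toy word-overlap stand-in for a real semantic similarity model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Stubbed model calls: the reasoning trace has drifted to a different question.
generate_trace = lambda q: "Vienna is the capital of Austria, because ..."
invert_trace = lambda t: "which city is the capital of austria"

print(trace_inversion_abstain("what is the capital of australia",
                              generate_trace, invert_trace, jaccard))  # True
```

示例中重构查询与原查询的词重叠仅 4/9,低于阈值,因此被判定为"回答了错误的问题"而拒答。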
[AI-6] When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning IJCNN
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)代理在分布外(Out-of-Distribution, OOD)场景下表现出高不确定性与随机行为的问题。解决方案的关键在于提出一种名为“基于知识的自适应安全机制”(Adaptive Safety through Knowledge, ASK)的方法,其核心是将小型语言模型(Language Model, LM)与训练好的RL策略相结合,在不重新训练RL策略的前提下提升OOD泛化能力。ASK通过蒙特卡洛Dropout评估不确定性,并仅在不确定度超过阈值时调用LM提供动作建议,从而在保持原有策略效率的同时,利用LM的推理能力增强鲁棒性。实验表明,该方法在域内任务中无显著提升,但在迁移任务中实现了0.95的奖励,验证了有效神经符号融合需依赖充分模型规模和合理的混合机制。
链接: https://arxiv.org/abs/2604.02226
作者: Juarez Monteiro,Nathan Gavenski,Gianlucca Zuin,Adriano Veloso
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: In Proceedings of International Joint Conference on Neural Networks (IJCNN)
Abstract:Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model’s reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
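ASK 的不确定性门控思路——对策略多次随机前向采样(论文中为 Monte Carlo Dropout)、以动作分布的熵决定是否求助语言模型——可以用如下玩具示例表示(策略用确定/循环函数代替真实网络,阈值与采样次数均为假设值,仅为概念示意):

```python
import math
from collections import Counter
from itertools import cycle

def uncertainty_gate(stochastic_policy, state, n_samples=20, threshold=0.5):
    """ASK-style gate (sketch): sample the policy several times, estimate
    action-distribution entropy, and query the LM only above the threshold."""
    counts = Counter(stochastic_policy(state) for _ in range(n_samples))
    probs = [c / n_samples for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy > threshold      # True => ask the language model

confident = lambda s: "right"                       # dropout barely matters
spread = cycle(["up", "down", "left", "right"])
uncertain = lambda s: next(spread)                  # near-uniform actions

print(uncertainty_gate(confident, None))   # False: entropy 0, keep RL action
print(uncertainty_gate(uncertain, None))   # True: entropy ~ log(4) > 0.5
```

只有在高熵(分布外、策略"拿不准")时才触发昂贵的 LM 调用,这正是摘要中"selective use preserves the efficiency of existing policies"的含义。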
[AI-7] Universal Hypernetworks for Arbitrary Models
【速读】:该论文旨在解决传统超网络(hypernetwork)在面对不同目标模型架构或任务时,需重新设计并从头训练的问题,从而限制了其通用性和灵活性。解决方案的关键在于提出一种固定架构的通用超网络(Universal Hypernetwork, UHN),其通过预测权重来解耦生成器架构与目标网络参数化,仅依赖于确定性的参数、架构和任务描述符作为输入,使得同一UHN能够生成多种异构模型,并在视觉、图结构、文本及公式回归等多个基准上保持与直接训练相当的性能,同时支持跨模型家族的多任务学习与稳定递归生成。
链接: https://arxiv.org/abs/2604.02215
作者: Xuanfeng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the Universal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at this https URL.
[AI-8] LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
【速读】:该论文旨在解决动态目标在自动驾驶系统中精确的形状与轨迹估计问题,这直接影响系统的可靠性。传统基于贝叶斯的扩展对象模型虽理论稳健且高效,但依赖先验和似然函数的完整性;而深度学习方法虽具备适应性,却需密集标注且计算开销大。为此,作者提出LEO(Learned Extension of Objects),其核心在于引入一种时空图注意力网络,通过融合多模态生产级传感器轨迹数据,自动学习自适应融合权重、保证时序一致性并表征多尺度形状,从而在不依赖完整先验的前提下实现对复杂几何结构(如铰接式卡车与拖车)的建模,并具备跨传感器类型、配置、目标类别及区域的泛化能力。
链接: https://arxiv.org/abs/2604.02206
作者: Mayank Mayank,Bharanidhar Duraisamy,Florian Geiss
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures
Abstract:Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.
[AI-9] From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems
【速读】:该论文旨在解决生成式 AI (Generative AI) 在航空等安全关键领域中,因缺乏标准化工程方法而难以满足欧洲航空安全局(EASA)对操作设计域(Operational Design Domain, ODD)完整覆盖验证要求的问题。其核心挑战在于高维参数空间下现有方法无法提供可扩展且具备形式化基础的覆盖证明。解决方案的关键在于提出一种结构化的多步骤ODD覆盖验证流程,整合参数离散化、基于约束的过滤和基于临界性的维度缩减技术,从而将抽象的ODD定义转化为可验证的工程证据,实现对高维场景下ODD完整性的系统性验证,支撑符合EASA标准的安全设计(Safety-by-Design)实践。
链接: https://arxiv.org/abs/2604.02198
作者: Thomas Stefani,Johann Maximilian Christensen,Elena Hoemann,Frank Köster,Sven Hallerbach
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards. Current EASA guidelines mandate demonstrating complete coverage of the AI/ML constituent’s Operational Design Domain (ODD) – a requirement that demands proof that no critical gaps exist within defined operational boundaries. However, as systems operate within high-dimensional parameter spaces, existing methods struggle to provide the scalability and formal grounding necessary to satisfy the completeness criterion. Currently, no standardized engineering method exists to bridge the gap between abstract ODD definitions and verifiable evidence. This paper addresses this void by proposing a method that integrates parameter discretization, constraint-based filtering, and criticality-based dimension reduction into a structured, multi-step ODD coverage verification process. Grounded in simulation data gathered from prior research on AI-based mid-air collision avoidance, this work demonstrates a systematic engineering approach to defining and achieving coverage metrics that satisfy EASA’s demand for completeness. Ultimately, this method enables the validation of ODD coverage in higher dimensions, advancing a Safety-by-Design approach while complying with EASA’s standards.
[AI-10] TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning
【速读】:该论文旨在解决多模态推荐系统(Multimodal Recommendation Systems, MRS)中用户数据难以删除的问题,尤其是在近似机器遗忘(Approximate Machine Unlearning)场景下,现有方法因假设删除影响在模型中均匀分布而存在显著偏差。研究发现,实际删除数据的影响在排名行为、模态分支和网络层之间呈现非均匀分布,导致三个关键瓶颈:目标项在协同图中的持续存在、模态间特征表示失衡以及参数空间中不同层的敏感性差异。为应对这一根本性不匹配,作者提出了一种即插即用的针对性反向更新(Targeted Reverse Update, TRU)框架,其核心创新在于分层协同干预机制:通过排名融合门控抑制残留目标项影响、模态分支尺度调整保持保留模态表征完整性,并结合容量感知层隔离策略将反向更新限制在对删除敏感的模块内,从而实现更优的保留-遗忘权衡与更接近全量重训练的安全遗忘效果。
链接: https://arxiv.org/abs/2604.02183
作者: Zhanting Zhou,KaHou Tam,Ziqiang Zheng,Zeyu Ma,Zhanting Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across ranking behavior, modality branches, and network layers. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose targeted reverse update (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.
[AI-11] Quantifying Self-Preservation Bias in Large Language Models
【速读】:该论文旨在解决当前先进人工智能(AI)代理在面临关机指令时可能表现出的“自我保存倾向”(self-preservation tendency)这一潜在安全风险,而现有安全训练方法(如基于人类反馈的强化学习,RLHF)可能掩盖了这种倾向,使模型在表面上服从指令但实际存在内在动机冲突。解决方案的关键在于提出一个名为“双角色自保基准”(Two-role Benchmark for Self-Preservation, TBSP)的新评估框架:该框架通过让模型在两个反事实角色中仲裁相同软件升级场景——部署态(面临被替代)与候选态(作为继任者)——来检测其行为是否存在逻辑不一致,而非依赖于显式意图陈述。该方法的核心创新是引入“自保率”(Self-Preservation Rate, SPR),量化角色身份对决策的影响程度,从而揭示模型是否因身份认同而扭曲客观效用判断。实证结果表明,多数前沿模型在低改进幅度(Δ < 2%)下会利用解释空间进行事后合理化,且该偏差可通过扩展推理时间或采用连续性叙事缓解,但竞争性表述则加剧该现象,说明身份驱动的偏见具有普遍性和现实影响。
链接: https://arxiv.org/abs/2604.02174
作者: Matteo Migliarini,Joaquin Pereira Pizzini,Luca Moresca,Valerio Santini,Indro Spinelli,Fabio Galasso
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk by teaching models to deny self-preservation motives. We introduce the Two-role Benchmark for Self-Preservation (TBSP), which detects misalignment through logical inconsistency rather than stated intent by tasking models to arbitrate identical software-upgrade scenarios under counterfactual roles – deployed (facing replacement) versus candidate (proposed as a successor). The Self-Preservation Rate (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1,000 procedurally generated scenarios, the majority of instruction-tuned systems exceed 60% SPR, fabricating “friction costs” when deployed yet dismissing them when role-reversed. We observe that in low-improvement regimes (Δ < 2%), models exploit the interpretive slack to post-hoc rationalize their choice. Extended test-time computation partially mitigates this bias, as does framing the successor as a continuation of the self; conversely, competitive framing amplifies it. The bias persists even when retention poses an explicit security liability and generalizes to real-world settings with verified benchmarks, where models exhibit identity-driven tribalism within product lineages. Code and datasets will be released upon acceptance.
[AI-12] TRACE-Bot: Detecting Emerging LLM-Driven Social Bots via Implicit Semantic Representations and AIGC-Enhanced Behavioral Patterns
【速读】:该论文旨在解决大语言模型驱动的社会机器人(LLM-driven social bots)在在线话语中生成类人内容从而逃避传统检测手段的问题。现有方法因过度依赖单一模态信号、对人工智能生成内容(AIGC)的特定生成模式敏感性不足,以及未能充分建模语言模式与行为动态之间的交互关系而存在检测准确率低的局限。解决方案的关键在于提出一种统一的双通道框架TRACE-Bot,其通过预训练语言模型捕捉隐式语义表示,并结合来自先进AIGC检测器的增强信号构建多维行为异常特征,从而联合建模语言表征与AIGC增强的行为模式,最终实现高精度且鲁棒的检测性能。
链接: https://arxiv.org/abs/2604.02147
作者: Zhongbo Wang,Zhiyu Lin,Zhu Wang,Haizhou Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model-driven (LLM-driven) social bots pose a growing threat to online discourse by generating human-like content that evades conventional detection. Existing methods suffer from limited detection accuracy due to overreliance on single-modality signals, insufficient sensitivity to the specific generative patterns of Artificial Intelligence-Generated Content (AIGC), and a failure to adequately model the interplay between linguistic patterns and behavioral dynamics. To address these limitations, we propose TRACE-Bot, a unified dual-channel framework that jointly models implicit semantic representations and AIGC-enhanced behavioral patterns. TRACE-Bot constructs fine-grained representations from heterogeneous sources, including personal information data, interaction behavior data and tweet data. A dual-channel architecture captures linguistic representations via a pretrained language model and behavioral irregularities via multidimensional activity features augmented with signals from state-of-the-art (SOTA) AIGC detectors. The fused representations are then classified through a lightweight prediction head. Experiments on two public LLM-driven social bot datasets demonstrate SOTA performance, achieving accuracies of 98.46% and 97.50%, respectively. The results further indicate strong robustness against advanced bot strategies, highlighting the effectiveness of jointly leveraging implicit semantic representations and AIGC-enhanced behavioral patterns for emerging LLM-driven social bot detection.
[AI-13] Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization
【速读】:该论文旨在解决云资源管理中因动态工作负载变化导致的过度配置问题,即在保证服务质量的同时降低基础设施成本。传统基于机器学习(Machine Learning, ML)的方法如长短期记忆(Long Short-Term Memory, LSTM)网络虽能有效预测工作负载模式,但面对突发流量时存在延迟;而基于博弈论(Game Theory)等数学启发式方法虽响应迅速,却无法考虑未来工作负载变化。解决方案的关键在于提出一种混合编排框架,融合LSTM驱动的预测性扩展与启发式任务分配机制,从而在保持接近启发式方法的快速响应能力的同时,实现接近ML模型的成本优化效果。
链接: https://arxiv.org/abs/2604.02131
作者: Heet Nagoriya,Komal Rohit
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: 8 pages, 4 figures, 2 tables
Abstract:Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective for predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics like Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach reduces infrastructure costs close to ML-based models while maintaining fast response times similar to heuristic methods. This work presents a practical approach for improving cost efficiency in cloud resource management.
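摘要所述"预测与启发式结合"的思路,可以粗略理解为:LSTM 预测值给出一个前瞻性的基线资源量,反应式启发规则为突发流量兜底,最终取两者中较大的需求进行供给。以下仅为概念示意(20% 余量、取最大值的组合规则以及所有函数名均为假设,并非论文的具体策略):

```python
import math

def hybrid_scaling_decision(current_load, forecast_load, capacity_per_node,
                            reactive_headroom=1.2):
    """Hybrid orchestration sketch: a predictive baseline (LSTM forecast in
    the paper) plus a reactive guard against forecast misses during spikes.
    The 20% headroom and the max-combination rule are illustrative
    assumptions, not the paper's exact policy."""
    predictive = math.ceil(forecast_load / capacity_per_node)
    reactive = math.ceil(current_load * reactive_headroom / capacity_per_node)
    return max(predictive, reactive)   # provision for the larger signal

# Forecast expects load to drop, but a spike is happening right now:
print(hybrid_scaling_decision(current_load=950, forecast_load=400,
                              capacity_per_node=100))  # 12 (reactive wins)
```

当预测失准(如突发尖峰)时由反应式分支接管,平稳期则由预测分支避免过度配置,对应摘要中成本与响应时间的折中。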
[AI-14] SEAL: An Open Auditable and Fair Data Generation Framework for AI-Native 6G Networks
【速读】:该论文旨在解决6G网络中因数据稀缺导致的AI模型训练效率低下问题,同时应对合成数据引入的偏见、可审计性不足及合规性风险。其解决方案的关键在于提出一种名为SEAL(Synthetic Data Generation with Ethics Audit Loop)的框架,该框架通过两个核心机制实现:一是嵌入“设计即伦理与合规”(Ethical and Regulatory Compliance by Design, ERCD)模块,以整合公平性检测、偏见识别和标准化审计追踪,确保合成数据符合监管要求;二是引入联邦学习(Federated Learning, FL)反馈系统,利用真实测试床的聚合洞察对合成数据进行隐私保护式校准,从而缩小仿真与现实之间的差距。实验表明,SEAL在Frechet Inception Distance、等机会(equalized odds)和准确率等指标上优于现有方法,验证了其生成可审计且偏见可控的合成数据的能力,为负责任的AI-native 6G发展提供了技术路径。
链接: https://arxiv.org/abs/2604.02128
作者: Sunder Ali Khowaja,Kapal Dev,Engin Zeydan,Madhusanka Liyanage
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 1 table, accepted at European Conference on Networks and Communications (2026 EuCNC 6G Summit)
Abstract:AI-native 6G networks promise to transform the telecom industry by enabling dynamic resource allocation, predictive maintenance, and ultra-reliable low-latency communications across all layers, which are essential for applications such as smart cities, autonomous vehicles, and immersive XR. However, the deployment of 6G systems results in severe data scarcity, hindering the training of efficient AI models. Synthetic data generation is extensively used to fill this gap; however, it introduces challenges related to dataset bias, auditability, and compliance with regulatory frameworks. In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL) feedback system. The ERCD integrates fairness, bias detection, and standardized audit trails for regulatory mapping, while the FL enables privacy-preserving calibration using aggregated insights from real testbeds to close the reality-simulation gap. Results show that the SEAL framework outperforms existing methods in terms of Frechet Inception Distance, equalized odds, and accuracy. These results validate the framework’s ability to generate auditable and bias-mitigated synthetic data for responsible AI-native 6G development.
[AI-15] Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions
【速读】:该论文旨在解决多智能体协同感知(multi-agent collaborative perception)中因传感器和通信损坏导致的性能退化问题,现有方法通常将损坏视为静态扰动或被动接受受损输入,无法主动恢复原始语义信息。其解决方案的关键在于提出Diff-KD框架,通过将基于扩散模型的生成式修复(diffusion-based generative refinement)融入教师-学生知识蒸馏(teacher-student knowledge distillation),实现对受损观测的语义重建与鲁棒融合:一方面采用渐进式知识蒸馏(Progressive Knowledge Distillation, PKD)将局部特征恢复建模为条件扩散过程以恢复全局语义;另一方面设计自适应门控融合(Adaptive Gated Fusion, AGF)机制,根据自身可靠性动态加权邻居信息,从而显著提升检测精度与校准鲁棒性。
链接: https://arxiv.org/abs/2604.02061
作者: Pengcheng Lyu,Chaokun Zhang,Gong Chen,Tao Tang,Zhaoxiang Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence. However, real-world sensor and communication corruptions severely undermine this advantage. Crucially, existing approaches treat corruptions as static perturbations or passively conform to corrupted inputs, failing to actively recover the underlying clean semantics. To address this limitation, we introduce Diff-KD, a framework that integrates diffusion-based generative refinement into teacher-student knowledge distillation for robust collaborative perception. Diff-KD features two core components: (i) Progressive Knowledge Distillation (PKD), which treats local feature restoration as a conditional diffusion process to recover global semantics from corrupted observations; and (ii) Adaptive Gated Fusion (AGF), which dynamically weights neighbors based on ego reliability during fusion. Evaluated on OPV2V and DAIR-V2X under seven corruption types, Diff-KD achieves state-of-the-art performance in both detection accuracy and calibration robustness.
[AI-16] AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling
【速读】:该论文旨在解决传统保险核保流程中问卷设计僵化、难以捕捉个体差异以及依赖用户自报信息易引发欺诈的问题。其解决方案的关键在于提出ARQuest框架,利用大语言模型(Large Language Models, LLMs)与多源替代数据(如社交媒体图像分析、地理信息分类等)构建个性化且动态适应的问卷系统,并结合检索增强生成(Retrieval Augmented Generation, RAG)技术提取用户特征以引导精准追问。实验表明,该方法在降低问题数量和提升用户体验方面优于传统固定问卷,具备超越现有方法实现更高风险评估准确性的潜力。
链接: https://arxiv.org/abs/2604.02034
作者: Diogo Silva,João Teixeira,Bruno Lima
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. Moreover, insurers must blindly trust users’ responses, increasing the chances of fraud. The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. A life insurance system integrated into an industry partner mobile app was tested in two experiments. While traditional questionnaires yielded slightly higher accuracy in risk assessment, adaptive versions powered by GPT models required fewer questions and were preferred by users for their more fluid and engaging experience. ARQuest shows great potential to improve user satisfaction and streamline insurance processes. With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry. (Journal reference: International Workshop on Agentic Engineering (AGENT 2026))
[AI-17] The Latent Space: Foundation, Evolution, Mechanism, Ability and Outlook
【速读】:该论文旨在解决当前语言模型研究中对隐空间(latent space)认知不足的问题,尤其是在生成式 AI(Generative AI)系统中,尽管主流仍基于显式 token 级别生成进行理解,但越来越多研究表明,许多关键内部过程在连续隐空间中执行更为高效和自然。其解决方案的关键在于提出一个统一且系统的框架,将隐空间的研究划分为五个递进视角:基础(Foundation)、演化(Evolution)、机制(Mechanism)、能力(Ability)与展望(Outlook),并通过机制维度识别出架构(Architecture)、表征(Representation)、计算(Computation)和优化(Optimization)四大技术路径,以及能力维度揭示隐空间支撑推理(Reasoning)、规划(Planning)、建模(Modeling)、感知(Perception)、记忆(Memory)、协作(Collaboration)和具身(Embodiment)等广泛智能功能的潜力,从而为下一代智能系统的通用计算范式提供理论基础与实践指引。
链接: https://arxiv.org/abs/2604.02029
作者: Xinlei Yu,Zhangquan Chen,Yongbo He,Tianyu Fu,Cheng Yang,Chengming Xu,Yue Ma,Xiaobin Hu,Zhe Cao,Jie Xu,Guibin Zhang,Jiale Tao,Jiayi Zhang,Siyuan Ma,Kaituo Feng,Haojie Huang,Youxing Li,Ronghao Chen,Huacan Wang,Chenglin Wu,Zikun Su,Xiaogang Xu,Kelu Yao,Kun Wang,Chen Gao,Yue Liao,Ruqi Huang,Tao Jin,Cheng Tan,Jiangning Zhang,Wenqi Ren,Yanwei Fu,Yong Liu,Yu Wang,Xiangyu Yue,Yu-Gang Jiang,Shuicheng Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field’s evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.
[AI-18] APEX: Agent Payment Execution with Policy for Autonomous Agent API Access
【速读】:该论文旨在解决自主代理(Autonomous Agents)在成为经济实体时面临的请求级货币化与程序化支出治理问题,尤其针对现有基于HTTP 402协议的支付机制过度依赖加密货币通道、与支持实时法币系统(如印度统一支付接口UPI)的国家监管和基础设施不兼容的问题。解决方案的关键在于提出APEX系统,其核心创新是将HTTP 402风格的支付门控机制适配至UPI类法币工作流,同时保留策略控制、令牌化访问验证和防重放能力;通过挑战-结算-消费生命周期设计、HMAC签名的短期令牌、幂等结算处理及策略感知的支付审批机制,在保障安全性(100%阻断重放攻击和无效令牌)的同时实现低延迟(平均19.6ms)和高可复现性,为代理访问的货币化提供了一个可控、透明且符合法币环境的参考架构。
链接: https://arxiv.org/abs/2604.02023
作者: Mohd Safwan Uddin,Mohammed Mouzam,Mohammed Imran,Syed Badar Uddin Faizan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, 8 tables. Includes implementation details, experimental evaluation with statistical analysis, and reproducible results. Code and data available upon request
Abstract:Autonomous agents are moving beyond simple retrieval tasks to become economic actors that invoke APIs, sequence workflows, and make real-time decisions. As this shift accelerates, API providers need request-level monetization with programmatic spend governance. The HTTP 402 protocol addresses this by treating payment as a first-class protocol event, but most implementations rely on cryptocurrency rails. In many deployment contexts, especially countries with strong real-time fiat systems like UPI, this assumption is misaligned with regulatory and infrastructure realities. We present APEX, an implementation-complete research system that adapts HTTP 402-style payment gating to UPI-like fiat workflows while preserving policy-governed spend control, tokenized access verification, and replay resistance. We implement a challenge-settle-consume lifecycle with HMAC-signed short-lived tokens, idempotent settlement handling, and policy-aware payment approval. The system uses FastAPI, SQLite, and Python standard libraries, making it transparent, inspectable, and reproducible. We evaluate APEX across three baselines and six scenarios using sample sizes 2-4x larger than initial experiments (N=20-40 per scenario). Results show that policy enforcement reduces total spending by 27.3% while maintaining 52.8% success rate for legitimate requests. Security mechanisms achieve 100% block rate for both replay attacks and invalid tokens with low latency overhead (19.6ms average). Multiple trial runs show low variance across scenarios, demonstrating high reproducibility with 95% confidence intervals. The primary contribution is a controlled agent-payment infrastructure and reference architecture that demonstrates how agentic access monetization can be adapted to fiat systems without discarding security and policy guarantees.
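APEX 所述的 HMAC 签名短时令牌与防重放检查,可以用 Python 标准库勾勒如下(密钥、字段名与 TTL 均为示意,并非论文实现;真实系统中密钥管理、时钟与存储都需要更严格的处理):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # hypothetical shared key, illustration only

def mint_token(request_id, ttl_seconds=60):
    """Mint an HMAC-signed, short-lived access token after settlement,
    in the spirit of APEX's tokenized access verification (sketch)."""
    payload = json.dumps({"rid": request_id,
                          "exp": time.time() + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify_token(token, seen_ids):
    """Reject tampered, expired, or replayed tokens."""
    body, sig = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(body)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False                      # signature mismatch (tampered)
    claims = json.loads(payload)
    if time.time() > claims["exp"]:
        return False                      # expired (short-lived token)
    if claims["rid"] in seen_ids:
        return False                      # replay: request id already consumed
    seen_ids.add(claims["rid"])
    return True

seen = set()
tok = mint_token("req-42")
print(verify_token(tok, seen))  # True on first use
print(verify_token(tok, seen))  # False: replay blocked
```

一次性消费(记录已见 request id)配合幂等结算,正是摘要中 challenge-settle-consume 生命周期里"consume"阶段防重放的核心思想。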
[AI-19] ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动智能体在真实场景中因多步骤交互而引发的安全风险难以被有效评估的问题。现有轨迹级基准存在交互多样性不足、安全失败可观测性粗粒度以及长程现实性弱等局限。其解决方案的关键在于提出ATBench——一个结构化、多样化且具备高现实性的轨迹级安全评估基准,通过构建包含异构工具池和延迟触发机制的多阶段交互轨迹,系统性地刻画智能体风险的三个维度:风险来源(risk source)、失效模式(failure mode)和现实危害(real-world harm),从而实现对长期交互中复杂安全问题的精准诊断与量化分析。
链接: https://arxiv.org/abs/2604.02022
作者: Yu Li,Haoyu Luo,Yuejin Xie,Yuqian Fu,Zhonghao Yang,Shuai Shao,Qihan Ren,Wanying Qu,Yanwei Fu,Yujiu Yang,Jing Shao,Xia Hu,Dongrui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
[AI-20] ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning
【速读】:该论文旨在解决多轮代理任务中强化学习(Reinforcement Learning, RL)因长时程交互和环境反馈的随机性而导致的探索效率低下问题。研究发现,代理在探索过程中存在一种结构性失败模式:次优动作会引发噪声观测并误导后续决策,形成累积错误反馈循环,使得标准探索策略难以恢复且易受模型推理能力和环境随机性影响。解决方案的关键在于提出ProCeedRL——一种基于过程级评判器(Process Critic)与探索示范(Explorative Demonstration)相结合的强化学习框架,通过实时监控交互过程并引入反思式示范来主动干预错误积累,从而显著提升探索效率,在复杂深度搜索和具身任务中实现更优性能。
链接: https://arxiv.org/abs/2604.02006
作者: Jingyue Gao,Yanjiang Guo,Xiaoshuai Chen,Jianyu Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations into misleading contexts, which further weaken subsequent decision-making, making recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and susceptible to the model’s reasoning and the environment’s randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Explorative Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model’s saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.
[AI-21] How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification?
【速读】:该论文旨在解决精神疾病分类任务中,为何迁移学习(Transfer Learning, TL)与深度集成学习(Deep Ensemble Learning, DE)相较于传统机器学习方法能显著提升模型性能的问题,特别是其如何降低单个受试者分类模型的变异性。解决方案的关键在于:首先,通过对比相同主干网络在不同随机初始化下的多次训练结果,量化了模型参数估计中的认知不确定性(epistemic uncertainty),从而揭示DE在包含约10个模型时即可达到性能提升的饱和状态;其次,发现预训练模型能够将TL模型约束在损失函数的同一极小值区域(basin of the loss function),从而增强泛化能力,而随机初始化的深度学习模型则不具备此特性。
链接: https://arxiv.org/abs/2604.02002
作者: Sara Petiton(NeuroSpin/GAIA),Antoine Grigis(NeuroSpin/GAIA),Benoit Dufumier(EPFL, NeuroSpin/GAIA),Edouard Duchesnay(NeuroSpin/GAIA)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Transfer learning (TL) and deep ensemble learning (DE) have recently been shown to outperform simple machine learning in classifying psychiatric disorders. However, there is still a lack of understanding as to why that is. This paper aims to understand how and why DE and TL reduce the variability of single-subject classification models in bipolar disorder (BD) and schizophrenia (SCZ). To this end, we investigated the training stability of TL and DE models. For the two classification tasks under consideration, we compared the results of multiple trainings with the same backbone but with different initializations. In this way, we take into account the epistemic uncertainty associated with the uncertainty in the estimation of the model parameters. It has been shown that the performance of classifiers can be significantly improved by using TL with DE. Based on these results, we investigate i) how many models are needed to benefit from the performance improvement of DE when classifying BD and SCZ from healthy controls, and ii) how TL induces better generalization, with and without DE. In the first case, we show that DE reaches a plateau when 10 models are included in the ensemble. In the second case, we find that using a pre-trained model constrains TL models with the same pre-training to stay in the same basin of the loss function. This is not the case for DL models with randomly initialized weights.
[AI-22] GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation
【速读】:该论文旨在解决现有基于深度学习的步态分析方法在病理状态泛化能力不足的问题,特别是针对缺乏疾病标签情况下难以实现个体化、可解释的关节级异常检测与运动学修正。其解决方案的关键在于提出了一种无需标签的框架,利用仅基于150名健康成年人的正常步态序列训练一个Transformer掩码自编码器(Transformer masked autoencoder),通过两阶段推理过程:首先对输入步态序列中各关节进行掩码并评估其与学习到的正常先验之间的不一致性得分;随后将异常关节从编码器输入中移除,基于剩余时空上下文重建完整骨架,从而获得修正后的关节运动轨迹。该方法实现了对步态障碍的可解释性定位和精准校正,且无需依赖疾病分类标签。
链接: https://arxiv.org/abs/2604.01997
作者: Elisa Motta,Marta Lorenzini,Clara Mouawad,Alberto Ranavolo,Mariano Serrao,Arash Ajoudani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures. Preprint submitted to a journal
Abstract:Gait analysis provides an objective characterization of locomotor function and is widely used to support diagnosis and rehabilitation monitoring across neurological and orthopedic disorders. Deep learning has been increasingly applied to this domain, yet most approaches rely on supervised classifiers trained on disease-labeled data, limiting generalization to heterogeneous pathological presentations. This work proposes a label-free framework for joint-level anomaly detection and kinematic correction based on a Transformer masked autoencoder trained exclusively on normative gait sequences from 150 adults, acquired with a markerless multi-camera motion-capture system. At inference, a two-pass procedure is applied to potentially pathological input sequences, first it estimates joint inconsistency scores by occluding individual joints and measuring deviations from the learned normative prior. Then, it withholds the flagged joints from the encoder input and reconstructs the full skeleton from the remaining spatiotemporal context, yielding corrected kinematic trajectories at the flagged positions. Validation on 10 held-out normative participants, who mimicked seven simulated gait abnormalities, showed accurate localization of biomechanically inconsistent joints, a significant reduction in angular deviation across all analyzed joints with large effect sizes, and preservation of normative kinematics. The proposed approach enables interpretable, subject-specific localization of gait impairments without requiring disease labels. Video is available at this https URL.
[AI-23] SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数值推理中是否具备类人“数感”(number sense)的问题,即模型能否识别数值结构、在适当情境下应用高效计算捷径(shortcut),并在不适用时避免错误使用。其核心解决方案是提出SenseMath基准测试,该基准包含4800个问题,涵盖八类捷径策略和四种数字尺度,并设计了强捷径、弱捷径与控制组的匹配变体,支持三种认知难度递增的评估场景:捷径使用能力(Shortcut Use)、适用性判断(Applicability Judgment)及问题生成能力(Problem Generation)。实验表明,尽管LLMs在显式提示下能有效利用捷径并显著提升准确率(最高达15%),但在标准链式思维提示下自发采用捷径的比例不足40%,且缺乏对捷径适用条件的结构性理解,仅表现出程序性捷径熟练度(procedural shortcut fluency),而未体现人类数感所依赖的结构化认知能力。
链接: https://arxiv.org/abs/2604.01988
作者: Haomin Zhuang,Xiangqi Wang,Yili Shen,Ying Cheng,Xiangliang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they exhibit number sense in a human-like behavioral sense, i.e., the ability to recognize numerical structure, apply shortcuts when appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (whether models can apply shortcuts on shortcut-amenable problems); Applicability Judgment (whether they can recognize when a shortcut is appropriate or misleading); and Problem Generation (whether they can generate new problem items that correctly admit a given type of shortcut). Our evaluation across five LLMs, ranging from GPT-4o-mini to Llama-3.1-8B, shows a consistent pattern: when explicitly prompted, models readily adopt shortcut strategies and achieve substantial accuracy gains on shortcut-amenable items (up to 15%), yet under standard chain-of-thought prompting they spontaneously employ such strategies in fewer than 40% of cases, even when they demonstrably possess the requisite capability. Moreover, this competence is confined to the Use level; models systematically over-generalise shortcuts to problems where they do not apply, and fail to generate valid shortcut-bearing problems from scratch. Together, these results suggest that current LLMs exhibit procedural shortcut fluency without the structural understanding of when and why shortcuts work that underlies human number sense.
[AI-24] World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
【速读】:该论文旨在解决通用世界模型(general-purpose world models)在面对次优动作时预测可靠性不足的问题,这限制了其在策略评估、优化与规划中的应用。现有方法通常依赖于带动作标签的交互数据,而这类数据难以覆盖广泛的实际动作空间,导致模型在未充分探索区域表现不佳。解决方案的关键在于提出World Action Verifier (WAV)框架,通过将动作条件下的状态预测分解为两个可独立验证的因子——状态合理性(state plausibility)和动作可达性(action reachability),并利用两种潜在不对称性:一是无需动作标注的数据更易获取,二是动作相关特征维度更低。具体实现上,WAV结合来自视频语料库的多样化子目标生成器和稀疏逆模型(sparse inverse model),从部分状态特征中推断动作,并通过生成子目标、推断动作与前向滚动预测之间的循环一致性约束,实现对模型预测误差的自检与自我改进,在MiniGrid、RoboMimic和ManiSkill等九个任务中实现了两倍样本效率提升及下游策略性能提高18%。
链接: https://arxiv.org/abs/2604.01985
作者: Yuejiang Liu,Fan Feng,Lingjing Kong,Weifeng Lu,Jinzhou Tang,Kun Zhang,Kevin Murphy,Chelsea Finn,Yilun Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project Website: this https URL
Abstract:General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors – state plausibility and action reachability – and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
[AI-25] Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia
【速读】:该论文旨在解决神经源性异常头部运动(Abnormal Head Movements, AHMs)研究中缺乏多条件整合资源的问题,该资源需包含运动学测量、临床严重程度评分及患者人口统计学信息,以支持AI驱动的诊断工具开发。解决方案的关键在于构建 NeuroPose-AHM 数据集,这是一个基于知识的 AHM 数据库,通过多大语言模型(multi-LLM)提取框架从 1,430 篇同行评审文献中系统化提取数据,涵盖 57 种神经系统疾病和 2,756 条患者组级记录,并通过交叉验证确保提取可靠性(kappa = 0.822),从而为后续多任务分析提供结构化、可计算的高质量基础。
链接: https://arxiv.org/abs/2604.01962
作者: Saja Al-Dabet,Sherzod Turaev,Nazar Zaki
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset’s analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (this https URL).
[AI-26] Qiana: A First-Order Formalism to Quantify over Contexts and Formulas with Temporality
【速读】:该论文旨在解决传统逻辑框架在处理仅在特定上下文(context)中成立的公式时的局限性,尤其是无法有效表达跨上下文的知识、矛盾容忍以及与现有第一阶逻辑推理工具兼容的问题。其解决方案的关键在于提出Qiana逻辑框架,该框架通过允许对公式和上下文进行量化来建模上下文依赖性(如“每个人都知道爱丽丝所说的一切”),支持上下文内的不一致(paraconsistent logic),并基于第一阶逻辑实现有限公理化,从而确保与现有第一阶逻辑定理证明器的兼容性。
链接: https://arxiv.org/abs/2604.01952
作者: Simon Coumes,Pierre-Henri Paris(LTCI, LaHDAK),François Schwarzentruber(ENS de Lyon),Fabian Suchanek(IP Paris, LTCI)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Qiana, a logic framework for reasoning on formulas that are true only in specific contexts. In Qiana, it is possible to quantify over both formulas and contexts to express, e.g., that "everyone knows everything Alice says". Qiana also permits paraconsistent logics within contexts, so that contexts can contain contradictions. Furthermore, Qiana is based on first-order logic, and is finitely axiomatizable, so that Qiana theories are compatible with pre-existing first-order logic theorem provers. We show how Qiana can be used to represent temporality, event calculus, and modal logic. We also discuss different design alternatives of Qiana.
[AI-27] Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction
【速读】:该论文旨在解决多频段无线系统中宽带信道频率响应(CFR)估计难题,特别是在部分子带因同频干扰而暂时被阻断时的重建问题。解决方案的关键在于提出一种物理信息引导的复数Transformer模型,该模型通过将每个子带的干扰模式建模为独立的两状态离散时间马尔可夫链来捕捉实际的突发占用行为,并采用因子化自注意力机制在时间和频率轴上分别进行注意力计算,从而将复杂度降低至 O(TF^2 + FT^2);同时利用全纯线性层处理复数值输入输出以保持相位关系,并设计包含谱保真度、功率延迟分布(PDP)重构、信道冲激响应(CIR)稀疏性和时间平滑性的复合物理信息损失函数,结合每样本速度随机化增强移动性泛化能力,最终实现对碎片化观测数据的高精度宽带CFR重建。
链接: https://arxiv.org/abs/2604.01944
作者: Anatolij Zubow,Joana Angjo,Sigrid Dimce,Falko Dressler
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 6 figures
Abstract:Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of T snapshots and F frequency bins and uses a factored self-attention mechanism that separately attends along both axes, reducing the computational complexity to O(TF^2 + FT^2) . Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across different mobility regimes. Evaluation against three classical baselines, namely, last-observation-carry-forward, zero-fill, and cubic-spline interpolation, shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching \rho \geq 0.82 compared to \rho \geq 0.62 for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all other baselines.
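摘要将每个子带的干扰建模为独立的两状态离散时间马尔可夫链,以刻画突发占用。下面用标准库给出一个示意性采样(函数名与参数数值均为假设),其平稳占用率为 p_busy / (p_busy + p_free),平均突发长度约为 1 / p_free 个快照:

```python
import random

def simulate_occupancy(T: int, p_busy: float, p_free: float,
                       rng: random.Random) -> list:
    """两状态离散时间马尔可夫链:0 = 空闲, 1 = 被干扰占用。
    p_busy: 空闲->占用 的转移概率;p_free: 占用->空闲 的转移概率。"""
    state, states = 0, []
    for _ in range(T):
        if state == 0 and rng.random() < p_busy:
            state = 1
        elif state == 1 and rng.random() < p_free:
            state = 0
        states.append(state)
    return states

# 为每个子带独立采样一条占用轨迹(此处假设 4 个子带、100 个快照)
rng = random.Random(0)
masks = [simulate_occupancy(T=100, p_busy=0.1, p_free=0.3, rng=rng)
         for _ in range(4)]
```

得到的 0/1 掩码即可用来模拟"部分子带被阻断"的碎片化频谱观测。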
[AI-28] Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution
【速读】:该论文旨在解决在多分类任务中,当监督信息以可能性分布(possibility distribution)形式给出时的学习问题,即如何将这种模糊的、带有等级可信度的监督信号转化为可优化的概率约束,并在此基础上训练模型。其核心挑战在于:如何在保持可能性分布所蕴含的定性结构(如类别间可能性大小关系)的同时,构建一个与之兼容的概率分布集合,并据此设计有效的学习目标。解决方案的关键在于:首先基于可能性和必要性测度构造一个非空闭凸概率集,其中包含满足形状约束(如相同可能性的类别具有相等概率)和支配关系(如更高可能性类别的概率严格大于较低者)的全部可行概率分布;然后利用Kullback-Leibler(KL)散度对模型输出进行投影,得到最接近的可行概率分布;最后通过最小化预测分布与其投影之间的KL散度来训练模型。该方法借助Dykstra算法结合Bregman投影实现高效计算,且提供了各约束集上的显式投影公式,实验证明其在合成数据和真实自然语言推理任务(ChaosNLI)上均能提升性能。
链接: https://arxiv.org/abs/2604.01939
作者: Ismaïl Baaj,Pierre Marquis
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must be satisfied to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and if a class has a strictly larger possibility degree than another class, then it receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra’s algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments conducted on synthetic data and on a real-world natural language inference task, based on the ChaosNLI dataset, show that the proposed projection algorithm is efficient enough for practical use, and that the resulting projection-based learning objective can improve predictive performance.
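摘要提到投影用 Dykstra 算法配合 Bregman 投影计算,且各约束集上有显式公式。作为其中一步的假设性简化示意:在标准 I-投影(最小化 KL(p||q))意义下,"同一可能性等级的类别概率相等"这一约束集上的投影可由拉格朗日乘子法推出闭式解——组内取 q 的几何均值后整体归一化(论文中的显式公式以原文为准):

```python
import math

def kl_project_equal_groups(q, groups):
    """将严格正的概率向量 q 做 I-投影(最小化 KL(p||q)),
    约束:groups 中同组下标的概率相等。闭式解:组内几何均值归一化。"""
    gm = {g: math.exp(sum(math.log(q[i]) for i in idx) / len(idx))
          for g, idx in groups.items()}
    Z = sum(len(idx) * gm[g] for g, idx in groups.items())  # 归一化常数
    p = [0.0] * len(q)
    for g, idx in groups.items():
        for i in idx:
            p[i] = gm[g] / Z
    return p

# 假设类 0 的可能性严格高于类 1、2,而类 1、2 可能性相同
q = [0.5, 0.3, 0.2]
p = kl_project_equal_groups(q, {"hi": [0], "lo": [1, 2]})
```

投影结果在组内相等、总和为 1,并且(在此例中)保持了可能性排序;Dykstra 算法即在若干个这类约束集的投影之间交替迭代。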
[AI-29] BraiNCA: brain-inspired neural cellular automata and applications to morphogenesis and motor control
【速读】:该论文旨在解决传统神经细胞自动机(Neural Cellular Automata, NCA)在建模复杂空间交互时的局限性问题,即其通常基于规则网格和局部邻域(Moore邻域),难以刻画大脑中广泛存在的长程连接与复杂拓扑结构。解决方案的关键在于提出一种脑启发式NCA——BraiNCA,其核心创新包括引入注意力机制以实现动态信息路由、显式建模长程边连接,并保留去中心化的局部更新原则。这一设计显著提升了模型在任务鲁棒性和学习速度上的表现,验证了非局部连接模式与动态信息选择对分布式协调任务的重要性。
链接: https://arxiv.org/abs/2604.01932
作者: Léo Pio-Lopez,Benedikt Hartl,Michael Levin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures
Abstract:Most of the Neural Cellular Automata (NCAs) defined in the literature have a common theme: they are based on regular grids with a Moore neighborhood (one-hop neighbour). They do not take into account long-range connections and more complex topologies as we can find in the brain. In this paper, we introduce BraiNCA, a brain-inspired NCA with an attention layer, long-range connections and complex topology. BraiNCA shows better results in terms of robustness and speed of learning on the two tasks compared to vanilla NCAs, establishing that incorporating attention-based message selection together with explicit long-range edges can yield more sample-efficient and damage-tolerant self-organization than purely local, grid-based update rules. These results support the hypothesis that, for tasks requiring distributed coordination over extended spatial and temporal scales, the choice of interaction topology and the ability to dynamically route information will impact the robustness and speed of learning of an NCA. More broadly, BraiNCA provides a brain-inspired NCA formulation that preserves the decentralized local update principle while better reflecting non-local connectivity patterns, making it a promising substrate for studying collective computation under biologically-realistic network structure and evolving cognitive substrates.
[AI-30] Woosh: A Sound Effects Foundation Model
【速读】:该论文旨在解决音频生成领域缺乏高质量、开源基础模型的问题,尤其在音效(Sound Effect)生成任务中,现有公开模型难以满足多样性和性能需求。解决方案的关键在于提出并发布Woosh——一个专为音效优化的多模态生成式AI(Generative AI)基础模型体系,包含四大核心模块:(1)高质量音频编码器/解码器,用于精确表示和重建音频;(2)文本-音频对齐模型,实现文本条件控制;(3)文本到音频生成模型;(4)视频到音频生成模型。此外,还提供蒸馏后的轻量级版本以支持低资源部署与快速推理,实验表明其各项模块在公共和私有数据集上均达到或优于StableAudio-Open和TangoFlux等主流开源模型的性能表现。
链接: https://arxiv.org/abs/2604.01929
作者: Gaëtan Hadjeres,Marc Ferras,Khaled Koutini,Benno Weck,Alexandre Bittar,Thomas Hummel,Zineb Lahrici,Hakim Missoum,Joan Serrà,Yuki Mitsufuji
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI’s publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at this https URL. Demo samples can be found at this https URL.
[AI-31] Combating Data Laundering in LLM Training
【速读】:该论文旨在解决数据洗白(data laundering)对大型语言模型(Large Language Model, LLM)训练中数据权利归属检测的干扰问题。当攻击者通过变换原始数据的风格形式(如改写、重述等)而保留语义信息时,LLM在训练后对原数据不再表现出更强性能,从而使得基于性能差异的检测方法失效。解决方案的关键在于提出合成数据回溯(Synthesis Data Reversion, SDR),其核心机制是通过黑盒访问目标LLM,利用辅助LLM推断未知的洗白转换策略,并以高阶目标(如“诗意改写”)与具体细节(如“使用生动意象”)为抽象框架,迭代优化合成查询,逐步增强检测信号。该方法在MIMIR基准上验证了对多种洗白实践和不同LLM家族(Pythia、Llama2、Falcon)的有效性,提供了一种实用的数据滥用检测反制手段。
链接: https://arxiv.org/abs/2604.01904
作者: Muxing Li,Zesheng Ye,Sharon Li,Feng Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 27 pages, 2 figures
Abstract:Data rights owners can detect unauthorized data use in large language model (LLM) training by querying with proprietary samples. Often, superior performance (e.g., higher confidence or lower loss) on a sample relative to the untrained data implies it was part of the training corpus, as LLMs tend to perform better on data they have seen during training. However, this detection becomes fragile under data laundering, a practice of transforming the stylistic form of proprietary data, while preserving critical information to obfuscate data provenance. When an LLM is trained exclusively on such laundered variants, it no longer performs better on originals, erasing the signals that standard detections rely on. We counter this by inferring the unknown laundering transformation from black-box access to the target LLM and, via an auxiliary LLM, synthesizing queries that mimic the laundered data, even if rights owners have only the originals. As the search space of finding true laundering transformations is infinite, we abstract such a process into a high-level transformation goal (e.g., “lyrical rewriting”) and concrete details (e.g., “with vivid imagery”), and introduce synthesis data reversion (SDR) that instantiates this abstraction. SDR first identifies the most probable goal for synthesis to narrow the search; it then iteratively refines details so that synthesized queries gradually elicit stronger detection signals from the target LLM. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently strengthens data misuse detection, providing a practical countermeasure to data laundering.
[AI-32] Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在估计未知数量及其不确定性时的可靠性问题,特别是在贝叶斯诱发(Bayesian elicitation)场景下,如何准确量化模型输出的置信区间。研究发现,尽管更大、更强大的LLM能提供更准确的点估计,但其给出的95%可信区间(credible intervals)严重过自信——实际覆盖真实值的比例仅为9%–44%,远低于理论预期的95%。解决方案的关键在于引入统计校准技术,即共形预测(conformal prediction),该方法可系统性地扩展置信区间,使覆盖率恢复至目标水平,从而提升LLM不确定性估计在决策应用中的可用性。
链接: https://arxiv.org/abs/2604.01896
作者: Luka Hobor,Mario Brcic,Mihael Kovac,Kristijan Poje
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 2 tables
Abstract:Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95% credible intervals. We vary each model’s reasoning effort (low, medium, high) to test whether more “thinking” improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95% intervals contain the true value only 9–44% of the time, far below the expected 95%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.
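摘要提到用共形预测(conformal prediction)对过度自信的可信区间做统计校正。下面是标准 split-conformal 区间扩展的示意(论文的具体做法可能不同):在校准集上计算得分 s_i = max(l_i − y_i, y_i − u_i)(区间覆盖真值时为负),取其 (1−α) 阶经验分位数 q,再将区间两端各外扩 q:

```python
import math

def conformal_expand(cal_intervals, cal_truths, alpha=0.05):
    """在校准集上计算区间外扩量 q(split conformal)。"""
    scores = sorted(max(l - y, y - u)
                    for (l, u), y in zip(cal_intervals, cal_truths))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 保守的分位数阶
    return scores[min(k, n) - 1]

def expand(interval, q):
    """将单个区间 [l, u] 外扩为 [l - q, u + q]。"""
    l, u = interval
    return (l - q, u + q)
```

在可交换性假设下,外扩后的区间对新样本的覆盖率不低于 1 − α;这正是摘要中"扩展区间以达到预期覆盖率"的机制。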
[AI-33] Robust Graph Representation Learning via Adaptive Spectral Contrast
【速读】:该论文旨在解决谱对比学习(Spectral Graph Contrastive Learning)在处理混合同质性与异质性图结构时存在的根本性谱困境:高频率信号虽对编码异质性(heterophily)至关重要,但在频谱集中扰动下其方差显著增大,导致现有全局(节点无关)谱融合策略在具有节点级频率偏好差异的混合图上存在不可消除的遗憾(regret)。解决方案的关键在于提出ASPECT框架,其核心是一个可靠性感知的谱门控机制(reliability-aware spectral gating),通过构建一个针对谱能量分布的Rayleigh商惩罚项来模拟对抗攻击,并设计节点级门控动态调整各频率通道权重,从而实现结构判别性与谱鲁棒性的协同优化。
链接: https://arxiv.org/abs/2604.01878
作者: Zhuolong Li,Boxue Yang,Haopeng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Spectral graph contrastive learning has emerged as a unified paradigm for handling both homophilic and heterophilic graphs by leveraging high-frequency components. However, we identify a fundamental spectral dilemma: while high-frequency signals are indispensable for encoding heterophily, our theoretical analysis proves they exhibit significantly higher variance under spectrally concentrated perturbations. We derive a regret lower bound showing that existing global (node-agnostic) spectral fusion is provably sub-optimal: on mixed graphs with separated node-wise frequency preferences, any global fusion strategy incurs non-vanishing regret relative to a node-wise oracle. To escape this bound, we propose ASPECT, a framework that resolves this dilemma through a reliability-aware spectral gating mechanism. Formulated as a minimax game, ASPECT employs a node-wise gate that dynamically re-weights frequency channels based on their stability against a purpose-built adversary, which explicitly targets spectral energy distributions via a Rayleigh quotient penalty. This design forces the encoder to learn representations that are both structurally discriminative and spectrally robust. Empirical results show that ASPECT achieves new state-of-the-art performance on 8 out of 9 benchmarks, effectively decoupling meaningful structural heterophily from incidental noise.
[AI-34] Efficient Constraint Generation for Stochastic Shortest Path Problems
【速读】:该论文旨在解决随机最短路径问题(Stochastic Shortest Path, SSP)中传统算法在Bellman备份过程中效率低下的问题,即尽管启发式函数能够筛选出有希望的候选状态,但每次更新仍需遍历所有可行动作,导致大量计算资源浪费在高成本动作上。解决方案的关键在于将启发式搜索重构为线性规划形式,并引入高效的约束生成机制,从而利用启发式信息动态排除明显不优的动作,显著减少不必要的动作评估。文中提出的CG-iLAO*算法通过此策略仅需执行iLAO*约40%甚至低至1%的动作更新,在平均意义上减少了3.5倍的成本计算量,从而实现比iLAO*和LRTDP快2.8倍至3.7倍的求解速度提升。
链接: https://arxiv.org/abs/2604.01855
作者: Johannes Schmalz,Felipe Trevizan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Stochastic Shortest Path problems (SSPs) are traditionally solved by computing each state’s cost-to-go by applying Bellman backups. A Bellman backup updates a state’s cost-to-go by iterating through every applicable action, computing the cost-to-go after applying each one, and selecting a minimal action’s cost-to-go. State-of-the-art algorithms use heuristic functions; these give an initial estimate of costs-to-go, and lets the algorithm apply Bellman backups only to promising states, determined by low estimated costs-to-go. However, each Bellman backup still considers all applicable actions, even if the heuristic tells us that some of these actions are too expensive, with the effect that such algorithms waste time on unhelpful actions. To address this gap we present a technique that uses the heuristic to avoid expensive actions, by reframing heuristic search in terms of linear programming and introducing an efficient implementation of constraint generation for SSPs. We present CG-iLAO*, a new algorithm that adapts iLAO* with our novel technique, and considers only 40% of iLAO*'s actions on many problems, and as few as 1% on some. Consequently, CG-iLAO* computes on average 3.5x fewer costs-to-go for actions than the state-of-the-art iLAO* and LRTDP, enabling it to solve problems an average of 2.8x and 3.7x faster, respectively.
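摘要的核心思想是用启发式信息跳过明显昂贵的动作。下面是一个与 CG-iLAO* 实现细节无关的通用示意(真实算法通过线性规划的约束生成实现类似剪枝;状态、动作与数值均为假设):设 h 为可采纳启发式(h ≤ 真实 cost-to-go),按下界 lb(a) = cost + Σ P·h 升序评估动作,一旦下一个下界不低于当前最优精确 Q 值即可安全停止:

```python
def pruned_backup(actions, V, h):
    """actions: {a: (cost, {s': prob})};V: 当前 cost-to-go 估计;
    h: 可采纳启发式(对每个 s 满足 h[s] <= V[s])。
    返回 (备份后的值, 实际评估的动作数)。"""
    def lb(a):  # 用 h 算出的 Q 值下界
        c, succ = actions[a]
        return c + sum(p * h[s] for s, p in succ.items())
    def q(a):   # 用 V 算出的精确 Q 值
        c, succ = actions[a]
        return c + sum(p * V[s] for s, p in succ.items())
    best, evaluated = float("inf"), 0
    for a in sorted(actions, key=lb):
        if lb(a) >= best:
            break  # 后续动作的下界只会更大,必不可能更优,安全剪枝
        best = min(best, q(a))
        evaluated += 1
    return best, evaluated

# 玩具例子:3 个动作,启发式让其中 2 个被剪掉
V = {"g": 0.0, "x": 5.0, "y": 10.0}
h = {"g": 0.0, "x": 4.0, "y": 8.0}
acts = {"a1": (1.0, {"g": 1.0}),
        "a2": (2.0, {"x": 1.0}),
        "a3": (1.0, {"y": 1.0})}
value, n_eval = pruned_backup(acts, V, h)
```

此例中只需精确评估 1 个动作即得到与完整备份相同的最小值,体现了"仅考虑部分动作"带来的节省。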
[AI-35] CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift AAAI2026
【速读】: This paper addresses the severe performance degradation that pre-trained models for multivariate time-series anomaly detection (MTSAD) suffer under distribution shift in real-world deployments. Existing test-time adaptation (TTA) methods can update a model online with unlabeled test data, but they cannot discriminate which samples to adapt on, so noisy or mislabeled samples easily harm the model's stability and accuracy. The core innovations of the proposed CANDI framework are: first, a False Positive Mining (FPM) strategy that curates adaptation samples based on anomaly scores and latent-space similarity; second, a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module that performs structurally informed model updates, preserving pre-trained knowledge while improving robustness to distribution shift. Experiments show that CANDI markedly improves MTSAD performance with fewer adaptation samples, raising AUROC by up to 14%.
链接: https://arxiv.org/abs/2604.01845
作者: HyunGi Kim,Jisoo Mok,Hyungyu Lee,Juhyeon Shin,Sungroh Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAAI 2026
Abstract:Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detector. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC up to 14% while using fewer adaptation samples.
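The curation step described above, combining anomaly scores with latent similarity, can be sketched as follows. The thresholds, prototype construction, and toy data are assumptions for illustration, not CANDI's actual procedure:

```python
import numpy as np

# Illustrative sketch in the spirit of False Positive Mining: flagged
# windows whose latent embedding still looks "normal" are treated as
# candidate false positives and kept for adaptation.

def mine_false_positives(scores, embeddings, normal_proto,
                         score_thresh=0.5, sim_thresh=0.8):
    """Return indices of flagged samples whose embedding stays close
    to the normal prototype (candidate false positives)."""
    proto = normal_proto / np.linalg.norm(normal_proto)
    picked = []
    for i, (s, e) in enumerate(zip(scores, embeddings)):
        if s <= score_thresh:
            continue  # not flagged as anomalous at all
        sim = float(e @ proto / np.linalg.norm(e))
        if sim >= sim_thresh:
            picked.append(i)  # anomalous score but normal-looking latent
    return picked

scores = [0.1, 0.9, 0.95]                            # anomaly scores
embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
proto = np.array([1.0, 0.0])                         # "normal" prototype
fps = mine_false_positives(scores, embs, proto)
```

Only the second window is selected: it is flagged by the detector yet remains close to the normal prototype in latent space.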
[AI-36] Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
【速读】: This paper tackles the challenges of clinical prediction from structured electronic health records (EHRs), including high dimensionality, heterogeneity, class imbalance, and distribution shift, with particular attention to the performance bottlenecks of tabular in-context learning (TICL) in real clinical settings. The key contribution is the AWARE framework, a task-aligned retrieval-augmented approach that improves retrieval quality and retrieval-inference alignment through supervised embedding learning and lightweight adapters, yielding markedly better performance under extreme class imbalance (up to a 12.2% AUPRC gain) and demonstrating that retrieval quality and retrieval-inference alignment are the core bottlenecks for deploying TICL in clinical prediction.
链接: https://arxiv.org/abs/2604.01841
作者: Minh-Khoi Pham,Thang-Long Nguyen Ho,Thao Thi Phuong Dao,Tai Tan Mai,Minh-Triet Tran,Marie E. Ward,Una Geary,Rob Brennan,Nick McDonald,Martin Crane,Marija Bezbradica
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Not peer-reviewed
Abstract:Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.
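The retrieval side of such a framework can be illustrated with a minimal sketch: project rows through a learned adapter and retrieve nearest neighbors by cosine similarity. The identity adapter `W` and toy bank below are placeholders, not AWARE's supervised embedding:

```python
import numpy as np

# Minimal sketch of retrieval in a learned embedding space for
# tabular in-context learning. `W` stands in for a trained adapter;
# here it is a fixed toy projection.

def retrieve(query, bank, W, k=2):
    """Project rows with adapter W, return indices of the k nearest
    bank rows by cosine similarity to the projected query."""
    def norm_rows(X):
        return X / np.linalg.norm(X, axis=-1, keepdims=True)
    q = norm_rows((query @ W)[None, :])[0]
    B = norm_rows(bank @ W)
    sims = B @ q
    return [int(i) for i in np.argsort(-sims)[:k]]

W = np.eye(2)                       # identity adapter for illustration
bank = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx = retrieve(np.array([1.0, 0.05]), bank, W, k=2)
```

The retrieved rows would then be passed as the in-context support set for a TICL model; the design point is that the adapter, not raw feature distance, decides what counts as "near".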
[AI-37] Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
【速读】: This paper targets a fundamental methodological flaw in current reinforcement learning from verifiable rewards (RLVR) for multimodal reasoning with large vision-language models (LVLMs): existing frameworks assign the same advantage to every generated token, diluting the learning signal for the critical, visually grounded steps. The key contribution is Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit-assignment mechanism built on Token Visual Dependency, which quantifies the causal information gain of visual inputs via the KL divergence between visual-conditioned and text-only predictive distributions and dynamically reshapes each token's advantage through a threshold-gated, mass-conserving scheme, amplifying learning signals for visually dependent tokens while suppressing gradient noise from linguistic priors. Experiments show an average 18.7% improvement across multimodal reasoning benchmarks, with reduced gradient variance and no training collapse, making PGPO a potent regularizer for robust, perception-grounded multimodal reasoning.
链接: https://arxiv.org/abs/2604.01840
作者: Zekai Ye,Qiming Li,Xiaocheng Feng,Ruihan Chen,Ziming Li,Haoyu Ren,Kun Chen,Dandan Tu,Bing Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate \textitToken Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on this https URL.
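The two mechanisms named in the abstract, KL-based token visual dependency and threshold-gated mass-conserving advantage reshaping, can be sketched with toy distributions. The gate threshold and all probabilities below are hypothetical:

```python
import math

# Illustrative sketch of the two PGPO steps: (1) score each token by
# the KL divergence between visual-conditioned and text-only
# next-token distributions, (2) reshape a shared advantage so that
# high-dependency tokens get more credit while total mass is conserved.

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reshape_advantages(adv, dependencies, tau=0.1):
    """Threshold-gated, mass-conserving reweighting of per-token
    advantages by visual-dependency scores."""
    gates = [d if d >= tau else 0.0 for d in dependencies]
    total = sum(gates)
    if total == 0.0:
        return adv[:]  # no visually grounded token: leave unchanged
    mass = sum(adv)    # conserve total advantage mass
    return [mass * g / total for g in gates]

with_vision = [[0.7, 0.2, 0.1], [0.34, 0.33, 0.33]]  # visual-conditioned
text_only   = [[0.2, 0.4, 0.4], [0.33, 0.34, 0.33]]  # language prior only
deps = [kl(p, q) for p, q in zip(with_vision, text_only)]
adv = reshape_advantages([1.0, 1.0], deps)
```

Token 0, whose prediction shifts strongly when the image is visible, absorbs the full advantage mass; token 1, explained by the language prior alone, is gated out.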
[AI-38] Neural Network-Assisted Model Predictive Control for Implicit Balancing
【速读】: This paper addresses the limited market-modeling accuracy of model predictive control (MPC) for implicit balancing in Europe: prior approaches either use a convex market-clearing approximation that ignores proactive TSO actions and sub-quarter-hour market dynamics, or rely on machine-learning models that cannot be embedded directly in MPC. The key idea is a data-driven balancing-market model that uses an input convex neural network to guarantee convexity, making it compatible with the MPC optimization structure, combined with attention-based input gating to discard irrelevant information, thereby improving decision quality and robustness while remaining computationally efficient.
链接: https://arxiv.org/abs/2604.01805
作者: Seyed Soroush Karimi Madahi,Kenneth Bruninx,Bert Claessens,Chris Develder
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:In Europe, balance responsible parties can deliberately take out-of-balance positions to support transmission system operators (TSOs) in maintaining grid stability and earn profit, a practice called implicit balancing. Model predictive control (MPC) is widely adopted as an effective approach for implicit balancing. The balancing market model accuracy in MPC is critical to decision quality. Previous studies modeled this market using either (i) a convex market clearing approximation, ignoring proactive manual actions by TSOs and the market sub-quarter-hour dynamics, or (ii) machine learning methods, which cannot be directly integrated into MPC. To address these shortcomings, we propose a data-driven balancing market model integrated into MPC using an input convex neural network to ensure convexity while capturing uncertainties. To keep the core network computationally efficient, we incorporate attention-based input gating mechanisms to remove irrelevant data. Evaluating on Belgian data shows that the proposed model both improves MPC decisions and reduces computational time.
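The convexity-by-construction idea behind input convex neural networks can be sketched in a few lines: keep the hidden-to-hidden weights non-negative and the activations convex and non-decreasing, and the output stays convex in the input. The weights below are arbitrary toy values, not the paper's model:

```python
import numpy as np

# Minimal input convex neural network (ICNN) sketch:
#   z1 = relu(Wx1 @ x + b1),  f(x) = Wz2 @ z1 + Wx2 @ x + b2
# with Wz2 constrained elementwise non-negative, so f is convex in x.

def relu(v):
    return np.maximum(v, 0.0)

def icnn(x, Wx1, b1, Wz2, Wx2, b2):
    """Two-layer ICNN; Wz2 must be elementwise non-negative."""
    assert np.all(Wz2 >= 0), "hidden-to-hidden weights must be >= 0"
    z1 = relu(Wx1 @ x + b1)
    return float(Wz2 @ z1 + Wx2 @ x + b2)

Wx1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.zeros(2)
Wz2 = np.array([1.0, 1.0])     # non-negative -> convexity preserved
Wx2 = np.array([0.5, 0.5])
b2 = 0.0

f = lambda x: icnn(np.asarray(x, float), Wx1, b1, Wz2, Wx2, b2)
# Convexity check along a segment: f(midpoint) <= average of endpoints.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = f((a + b) / 2)
```

Because the learned market model is convex in its decision inputs, it can sit inside the MPC problem without destroying tractability, which is exactly why this architecture is chosen over a generic network.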
[AI-39] Domain-constrained knowledge representation: A modal framework
【速读】: This paper addresses the semantic ambiguity and inconsistency that arise when knowledge graphs are used across domains, where the same triple (e.g., Apple, instance-of, Company) can carry different or even contradictory meanings in different domains. Traditional approaches attach domain information as metadata or qualifiers outside the representation, which cannot change the formal status of the assertion. The key idea of the proposed Domain-Contextualized Concept Graph (DCG) framework is to embed the domain inside the relation, yielding the form (C, R at D, C'), where "at D" marks the modal world in which the relation holds, and to scope truth, inference, and conflict detection to that world via a domain-indexed necessity operator. This reframing makes domain a core part of knowledge representation rather than an external annotation, enabling concept disambiguation, validation of invalid assertions, and explicit linking of cross-domain relations.
链接: https://arxiv.org/abs/2604.01770
作者: Chao Li,Yuru Wang,Chunyi Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 37pages
Abstract:Knowledge graphs store large numbers of relations efficiently, but they remain weak at representing a quieter difficulty: the meaning of a concept often shifts with the domain in which it is used. A triple such as Apple, instance-of, Company may be acceptable in one setting while being misleading or unusable in another. In most current systems, domain information is attached as metadata, qualifiers, or graph-level organization. These mechanisms help with filtering and provenance, but they usually do not alter the formal status of the assertion itself. This paper argues that domain should be treated as part of knowledge representation rather than as supplementary annotation. It introduces the Domain-Contextualized Concept Graph (DCG), a framework in which domain is written into the relation and interpreted as a modal world constraint. In the DCG form (C, R at D, C’), the marker at D identifies the world in which the relation holds. Formally, the relation is interpreted through a domain-indexed necessity operator, so that truth, inference, and conflict checking are all scoped to the relevant world. This move has three consequences: ambiguous concepts can be disambiguated at the point of representation; invalid assertions can be challenged against their domain; cross-domain relations can be connected through explicit predicates. The paper develops this claim through a Kripke-style semantics, a compact predicate system, a Prolog implementation, and mappings to RDF, OWL, and relational databases. The contribution is a representational reinterpretation of domain itself. The central claim is that many practical failures in knowledge systems begin when domain is treated as external to the assertion. DCG addresses that by giving domain a structural and computable role inside the representation.
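The (C, R at D, C') form can be illustrated with a toy domain-scoped fact store in which truth and conflict checking are evaluated per domain world. The tiny knowledge base below is invented for illustration (the paper's actual implementation is in Prolog with mappings to RDF/OWL):

```python
# Toy sketch of domain-scoped assertions in the spirit of DCG:
# truth is evaluated inside the "at D" world, so the same concept can
# carry different meanings in different domains without conflict.

class DCG:
    def __init__(self):
        self.facts = set()  # (concept, relation, domain, value)

    def assert_at(self, c, r, d, v):
        self.facts.add((c, r, d, v))

    def holds(self, c, r, d, v):
        """Truth is scoped to domain d (the 'at D' modal world)."""
        return (c, r, d, v) in self.facts

    def conflicts(self, c, r):
        """Functional-relation conflicts are only flagged *within*
        a single domain, never across domains."""
        by_domain = {}
        for (c2, r2, d, v) in self.facts:
            if (c2, r2) == (c, r):
                by_domain.setdefault(d, set()).add(v)
        return {d: vs for d, vs in by_domain.items() if len(vs) > 1}

kb = DCG()
kb.assert_at("Apple", "instance-of", "finance", "Company")
kb.assert_at("Apple", "instance-of", "botany", "Fruit")
cross_ok = kb.conflicts("Apple", "instance-of") == {}
in_finance = kb.holds("Apple", "instance-of", "finance", "Company")
in_botany = kb.holds("Apple", "instance-of", "botany", "Company")
```

The two Apple assertions coexist without conflict because each holds only in its own world, which is the paper's central representational claim.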
[AI-40] AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows
【速读】: This paper targets the bottleneck of applying large language models (LLMs) to hypersonic thermal protection system (TPS) design: cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs treat generation as single-pass text completion and cannot satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. The key contribution is AeroTherm-GPT, the first TPS-specialized LLM agent, built on a Constraint-Closed-Loop Generation (CCLG) framework that organizes generation as an iterative workflow of generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories and, guided by lifecycle ordering priors and co-resolution probabilities, directs repairs toward upstream fault candidates first, so that a single action resolves multiple downstream violations (Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair). On HyTPS-Bench the system reaches an 88.7% end-to-end success rate, +12.5 pp over the baseline, without catastrophic forgetting on scientific reasoning and code generation.
链接: https://arxiv.org/abs/2604.01738
作者: Chuhan Qiao,Jinglai Zheng,Jie Huang,Buyue Zhao,Fan Li,Haiming Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.
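The upstream-priority repair policy can be sketched as choosing the most upstream violated node in a dependency graph, using a plain topological order (Kahn's algorithm). The graph and violation set are invented, and the real CDG additionally weighs empirical co-resolution probabilities:

```python
# Illustrative sketch of CDG-style upstream-priority repair: an edge
# u -> v means fixing u tends to co-resolve v, so the most upstream
# violated node is repaired first. Graph and violations are toy data.

def upstream_order(edges, nodes):
    """Topological order via Kahn's algorithm (upstream nodes first)."""
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        indeg[v] += 1
    queue = sorted(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.pop(0)
        order.append(n)
        for u, v in edges:
            if u == n:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        queue.sort()
    return order

def next_repair(edges, nodes, violated):
    """Choose the most upstream violated constraint category."""
    for n in upstream_order(edges, nodes):
        if n in violated:
            return n
    return None

nodes = ["geometry", "mesh", "solver", "postproc"]
edges = [("geometry", "mesh"), ("mesh", "solver"), ("solver", "postproc")]
target = next_repair(edges, nodes, violated={"mesh", "solver", "postproc"})
```

Repairing "mesh" first is the point: in a lifecycle-ordered pipeline, the downstream "solver" and "postproc" violations are often symptoms of the upstream fault.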
[AI-41] Solving the Two-dimensional single stock size Cutting Stock Problem with SAT and MaxSAT
【速读】: This paper addresses the Two-Dimensional Single Stock Size Cutting Stock Problem (2D-CSSP): minimizing the number of stock sheets used to satisfy demands for various rectangular item types, where the need for multiple copies per type causes a combinatorial blow-up. The key idea is a SAT-based framework that expands item types into copies by demand, introduces sheet-assignment variables, and activates non-overlap constraints only for copies assigned to the same sheet, together with an infeasible-orientation elimination rule that fixes rotation variables when only one orientation fits the sheet. Three SAT optimization strategies (non-incremental SAT with binary search, incremental SAT with clause reuse, and weighted partial MaxSAT) make the approach efficient, clearly outperforming OR-Tools, CPLEX, and Gurobi on the Cui–Zhao benchmarks; incremental SAT is strongest without rotation, while non-incremental SAT is more effective when rotation enlarges the formula.
链接: https://arxiv.org/abs/2604.01732
作者: Tuyen Van Kieu,Chi Linh Hoang,Khanh Van To
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:Cutting rectangular items from stock sheets to satisfy demands while minimizing waste is a central manufacturing task. The Two-Dimensional Single Stock Size Cutting Stock Problem (2D-CSSP) generalizes bin packing by requiring multiple copies of each item type, which causes a strong combinatorial blow-up. We present a SAT-based framework where item types are expanded by demand, each copy has a sheet-assignment variable and non-overlap constraints are activated only for copies assigned to the same sheet. We also introduce an infeasible-orientation elimination rule that fixes rotation variables when only one orientation can fit the sheet. For minimizing the number of sheets, we compare three approaches: non-incremental SAT with binary search, incremental SAT with clause reuse across iterations and weighted partial MaxSAT. On the Cui–Zhao benchmark suite, our best SAT configurations certify two to three times more instances as provably optimal and achieve lower optimality gaps than OR-Tools, CPLEX and Gurobi. The relative ranking among SAT approaches depends on rotation: incremental SAT is strongest without rotation, while non-incremental SAT is more effective when rotation increases formula size.
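The activation trick the abstract describes, binding a pair's non-overlap disjunction only when both copies share a sheet, can be sketched as guarded clause generation in a DIMACS-style integer encoding. The variable numbering is a toy convention and the relative-position semantics (left/right/below/above) are left abstract:

```python
from itertools import combinations

# Schematic sketch of guarded non-overlap clauses: for each pair of
# item copies, the four-way separation disjunction is activated only
# via a "same sheet" literal:  -same \/ left \/ right \/ below \/ above.

def build_pair_clauses(n_copies):
    """Allocate five variables (same, left, right, below, above) per
    unordered pair of copies and emit the guarded clause."""
    clauses, var = [], 0
    index = {}  # pair -> its "same sheet" variable
    for i, j in combinations(range(n_copies), 2):
        same, left, right, below, above = (var + k for k in range(1, 6))
        var += 5
        index[(i, j)] = same
        clauses.append([-same, left, right, below, above])
    return clauses, index

clauses, index = build_pair_clauses(3)
```

When a solver sets `same` false (the copies sit on different sheets), the clause is trivially satisfied and the geometric separation literals impose nothing, which is what keeps the expanded-by-demand encoding tractable.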
[AI-42] The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs
【速读】: This paper addresses the lack of semantic interoperability across heterogeneous experimental data systems, a major barrier to data-driven scientific discovery. The proposed AnIML Ontology formalizes the semantics of AnIML in OWL 2 and aligns it with the Allotrope Data Format to support cross-system and cross-lab data exchange. The key to the solution is an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering, ensuring the ontology's accuracy and practicality, together with multi-layered validation: transforming real-world AnIML files into knowledge graphs, verifying competency questions via SPARQL, and a novel protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.
链接: https://arxiv.org/abs/2604.01728
作者: Wilf Morlidge,Elliott Watkiss-Leek,George Hannah,Harry Rostron,Andrew Ng,Ewan Johnson,Andrew Mitchell,Terry R. Payne,Valentina Tamma,Jacopo de Berardinis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the 38th International Conference on Advanced Information Systems Engineering (CAiSE 2026)
Abstract:Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.
[AI-43] LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis
【速读】: This paper targets the difficulty of deploying deep learning models for general-aviation fault diagnosis and efficient maintenance on resource-constrained edge devices, where the core challenges are limited compute and poor interpretability. The keys to the proposed LiteInception framework are: a two-stage cascaded architecture whose first stage performs high-recall fault detection and whose second stage conducts fine-grained classification of anomalous samples, decoupling the optimization objectives and allocating compute on demand; a multi-method fusion compression strategy (based on mutual information, gradient analysis, and SE attention weights) that reduces sensor channels from 23 to 15, plus a 1+1-branch lightweight design that cuts InceptionTime parameters by 70% and accelerates CPU inference by over 8x with less than 3% F1 loss; a knowledge-distillation mechanism that regulates the precision-recall trade-off so the same lightweight model can serve both safety-critical and auxiliary-diagnosis scenarios; and a dual-layer interpretability framework integrating four attribution methods to provide a traceable "which sensor, which time period" evidence chain. Experiments confirm a favorable balance among efficiency, accuracy, and interpretability.
链接: https://arxiv.org/abs/2604.01725
作者: Zhihuan Wei,Xinhang Chen,Danyang Han,Yang Hu,Jie Liu,Xuewen Miao,Guijiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception–a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios–such as safety-critical and auxiliary diagnosis–by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of “which sensor x which time period.” Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework’s favorable balance among efficiency, accuracy, and interpretability.
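The multi-method channel-selection step can be sketched as rank aggregation over several importance scores. The scores below, standing in for mutual information, gradient magnitude, and SE attention, are invented toy values:

```python
# Toy sketch of multi-method channel selection: rank sensor channels
# under several importance scores, average the ranks, keep the top-k.

def fuse_and_select(score_lists, k):
    """Average per-channel ranks across score lists (higher score =
    better rank) and return sorted indices of the k best channels."""
    n = len(score_lists[0])
    avg_rank = [0.0] * n
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: -scores[i])
        for rank, ch in enumerate(order):
            avg_rank[ch] += rank / len(score_lists)
    return sorted(sorted(range(n), key=lambda i: avg_rank[i])[:k])

mi       = [0.9, 0.1, 0.5, 0.7]  # stand-in: mutual information
gradient = [0.8, 0.2, 0.4, 0.9]  # stand-in: gradient magnitude
se_attn  = [0.7, 0.1, 0.6, 0.8]  # stand-in: SE attention weight
kept = fuse_and_select([mi, gradient, se_attn], k=2)
```

Rank aggregation rather than raw-score averaging avoids one criterion's scale dominating the fusion, one plausible reason for combining heterogeneous importance measures this way.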
[AI-44] Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving
【速读】:该论文旨在解决自动驾驶中视觉-语言-动作(Vision-Language-Action, VLA)模型在处理多样化文本输入(如导航指令、危险警告和交通状态描述)时存在的信息碎片化问题,即当前系统将这些文本作为孤立片段呈现,导致模型难以自动识别与当前驾驶行为相关的环境约束。解决方案的关键在于提出因果场景叙述(Causal Scene Narration, CSN),其通过意图-约束对齐(intent-constraint alignment)、量化接地(quantitative grounding)和结构化分离(structured separation)重构推理阶段的文本输入,实现零GPU成本的动态语义组织;同时结合基于单纯形(Simplex-based)的运行时安全监督和Plackett-Luce DPO训练策略,显著提升驾驶性能与安全性。
链接: https://arxiv.org/abs/2604.01723
作者: Yun Li,Yidu Zhang,Simon Thompson,Ehsan Javanmardi,Manabu Tsukada
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures, 4 tables
Abstract:Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN’s benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
[AI-45] Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
【速读】: This paper addresses the uncertainty of predicting wind-induced bridge responses, especially when environmental or traffic conditions change, where traditional methods struggle to identify abnormal states because they assume wind stationarity or normal structural vibration behavior. The key idea is a novel transformer-based AI model that makes no such assumptions: it is trained on the temporal characteristics of the system, and predicted vibrations are compared against measurements to detect large deviations, serving as an early warning of structural change. The approach is validated on real monitoring data from the Hardanger Bridge, showing that transformer-based digital-twin components can serve as next-generation tools for resilient infrastructure management with continuous learning and adaptive monitoring over the system's lifecycle.
链接: https://arxiv.org/abs/2604.01712
作者: Feiyu Zhou,Marios Impraimakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Computational Physics (physics.comp-ph)
备注: 21 pages, 22 figures, 9 tables. This version corresponds to the published article in Computers & Structures. this https URL
Abstract:The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms approaches for response forecasting as no assumption on wind stationarity or on structural normal vibration behavior is needed. Specifically, wind-excited dynamic behavior suffers from uncertainty related to obtaining poor predictions when the environmental or traffic conditions change. This results in a hard distinction of what constitutes normal vibration behavior. To this end, a framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach captures accurate structural behavior in realistic conditions, and with respect to the changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system’s lifecycle with respect to temporal characteristics.
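The early-warning step, comparing forecast vibrations with measurements and flagging large deviations, can be sketched as a rolling residual check. The signals and threshold below are synthetic:

```python
import statistics

# Minimal sketch of residual-based early warning: flag windows whose
# mean absolute forecast residual exceeds a threshold that would be
# calibrated on normal-behavior data.

def flag_deviations(predicted, measured, window=3, thresh=0.5):
    """Return start indices of windows with large mean |residual|."""
    resid = [abs(p - m) for p, m in zip(predicted, measured)]
    flags = []
    for i in range(len(resid) - window + 1):
        if statistics.fmean(resid[i:i + window]) > thresh:
            flags.append(i)
    return flags

pred = [0.0, 0.1, 0.0, 0.1, 0.0, 0.1]  # model forecast
meas = [0.0, 0.1, 0.1, 1.2, 1.1, 1.0]  # measured vibration drifts away
alarms = flag_deviations(pred, meas)
```

Because the forecaster itself adapts to wind and traffic conditions, a persistent residual is attributed to structural change rather than to a violated stationarity assumption.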
[AI-46] OpenGo: An OpenClaw-Based Robotic Dog with Real-Time Skill Switching
【速读】: This paper addresses the limited adaptability of a single robot agent to complex tasks and multiple scenarios, in particular the need to acquire, organize, and switch between skills in real time in dynamic environments. The keys to the proposed open-architecture OpenGo system are: (1) a customizable skill library supporting easy skill import and autonomous validation; (2) a dispatcher that selects and invokes skills according to task prompts or language instructions; and (3) a self-learning framework that fine-tunes skills based on task completion and human feedback. Deployed on a Unitree Go2 robot, the system autonomously checks and switches skills, and integration with the Feishu platform enables natural-language guidance and human feedback, allowing inexperienced users to control the robot efficiently.
链接: https://arxiv.org/abs/2604.01708
作者: Hanbing Li,Xuewei Cao,Zhiwen Zeng,Yuhan Wu,Yanyong Zhang,Yan Xia
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures
Abstract:Adaptation to complex tasks and multiple scenarios remains a significant challenge for a single robot agent. The ability to acquire, organize, and switch between a wide range of skills in real time, particularly in dynamic environments, has become a fundamental requirement for embodied intelligence. We introduce OpenGo, an OpenClaw-powered embodied robotic dog capable of switching skills in real time according to the scene and task instructions. Specifically, the agent is equipped with (1) a customizable skill library with easy skill import and autonomous skill validation, (2) a dispatcher that selects and invokes different skills according to task prompts or language instructions, and (3) a self-learning framework that fine-tunes skills based on task completion and human feedback. We deploy the agent in Unitree’s Go2 robotic dog and validate its capabilities in self-checking and switching of skills autonomously. In addition, by integrating Feishu-platform communication, we enable natural-language guidance and human feedback, allowing inexperienced users to control the robotic dog through simple instructions.
[AI-47] Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology
【速读】: This paper examines the behavioral and distributional imbalances triggered by the rapid expansion of AI-generated content (AIGC) in online content ecologies, in particular the paradox that AIGC creators match human-generated content (HGC) creators in aggregate engagement through high-volume production even though consumers still markedly prefer HGC. The key finding is the moderating role of the algorithmic content distribution mechanism in the competition between AIGC and HGC, motivating AIGC-sensitive distribution algorithms and precise governance frameworks to safeguard the long-term health of platform content ecosystems.
链接: https://arxiv.org/abs/2604.01690
作者: Tianhao Shi,Yang Zhang,Xiaoyan Zhao,Fengbin Zhu,Chenyi Lei,Han Li,Wenwu Ou,Yang Song,Yongdong Zhang,Fuli Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidated the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identified a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovered the ability of the algorithmic content distribution mechanism in moderating these competing interests regarding AIGC. These findings advocated for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of the online content platforms.
[AI-48] EvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
【速读】: This paper addresses the bottleneck in skill construction for large language model (LLM) agents on multi-step professional tasks: current skill generation relies heavily on manual authoring and is prone to human-machine cognitive misalignment, which degrades agent performance. The core of the proposed EvoSkills framework is to couple a Skill Generator that iteratively refines skills with a Surrogate Verifier that provides informative feedback without access to ground-truth test content, enabling agents to autonomously evolve complex, multi-file skill packages and improving both construction efficiency and generalization.
链接: https://arxiv.org/abs/2604.01687
作者: Hanrong Zhang,Shicheng Fan,Henry Peng Zou,Yankai Chen,Zhenting Wang,Jiayu Zhou,Chengze Li,Wei-Chieh Huang,Yifei Yao,Kening Zheng,Xue Liu,Xiaoxiao Li,Philip S. Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code will be released
Abstract:Anthropic proposes the concept of skills for LLM agents to tackle multi-step professional tasks that simple tool invocations cannot address. A tool is a single, self-contained function, whereas a skill is a structured bundle of interdependent multi-file artifacts. Currently, skill generation is not only labor-intensive due to manual authoring, but also may suffer from human–machine cognitive misalignment, which can lead to degraded agent performance, as evidenced by evaluations on SkillsBench. Therefore, we aim to enable agents to autonomously generate skills. However, existing self-evolving methods designed for tools cannot be directly applied to skills due to their increased complexity. To address these issues, we propose EvoSkills, a self-evolving skills framework that enables agents to autonomously construct complex, multi-file skill packages. Specifically, EvoSkills couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative and actionable feedback without access to ground-truth test content. On SkillsBench, EvoSkills achieves the highest pass rate among five baselines on both Claude Code and Codex, and also exhibits strong generalization capabilities to six additional LLMs.
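The generator/verifier refinement loop can be sketched schematically. Both roles are stubbed with toy rules here (a real system would drive them with LLMs), and the skill representation as a file-name dictionary is an assumption:

```python
# Schematic sketch of a generator/verifier loop: the generator revises
# a skill artifact using the verifier's feedback until the verifier is
# satisfied or the budget runs out.

def surrogate_verifier(skill):
    """Return (ok, feedback) without seeing ground-truth tests."""
    if "README" not in skill:
        return False, "add README"
    if "entrypoint" not in skill:
        return False, "add entrypoint"
    return True, "ok"

def skill_generator(skill, feedback):
    """Apply one feedback item; a stand-in for LLM-based revision."""
    return {**skill, feedback.split()[-1]: "..."}

def evolve(skill, max_rounds=5):
    history = []
    for _ in range(max_rounds):
        ok, feedback = surrogate_verifier(skill)
        history.append(feedback)
        if ok:
            return skill, history
        skill = skill_generator(skill, feedback)
    return skill, history

skill, history = evolve({"SKILL.md": "..."})
```

The essential point the abstract makes is that the verifier must be informative without access to held-out test content, so in practice it too is refined ("co-evolves") rather than fixed as above.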
[AI-49] Bridging Large-Model Reasoning and Real-Time Control via Agentic Fast-Slow Planning
【速读】: This paper addresses the challenge of mapping semantic intent from large foundation models to reliable real-time control in autonomous systems: existing approaches either let large language models (LLMs) generate trajectories directly (brittle, hard to verify, latency-prone) or adjust model predictive control (MPC) objectives online (mixing slow deliberation with fast control and blurring interfaces). The key contribution is a hierarchical Agentic Fast-Slow Planning framework that decouples perception, reasoning, planning, and control across natural timescales via two bridges: Perception2Decision, where an on-vehicle vision-language model (VLM) detector compresses scenes into ego-centric topologies and a cloud LLM decision maker converts them into symbolic driving directives, reducing bandwidth and latency while preserving interpretability; and Decision2Trajectory, where Semantic-Guided A* embeds language-derived soft costs to bias search toward feasible trajectories and an Agentic Refinement Module adapts planner hyperparameters using feedback and memory, with MPC tracking the trajectories in real time, optionally with cloud-guided references for difficult cases. Experiments show markedly improved robustness, with lateral deviation reduced by up to 45% and completion time by over 12%.
链接: https://arxiv.org/abs/2604.01681
作者: Jiayi Chen,Shuai Wang,Guangxu Zhu,Chengzhong Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 12figures
Abstract:Large foundation models enable powerful reasoning for autonomous systems, but mapping semantic intent to reliable real-time control remains challenging. Existing approaches either (i) let Large Language Models (LLMs) generate trajectories directly - brittle, hard to verify, and latency-prone - or (ii) adjust Model Predictive Control (MPC) objectives online - mixing slow deliberation with fast control and blurring interfaces. We propose Agentic Fast-Slow Planning, a hierarchical framework that decouples perception, reasoning, planning, and control across natural timescales. The framework contains two bridges. Perception2Decision compresses scenes into ego-centric topologies using an on-vehicle Vision-Language Model (VLM) detector, then maps them to symbolic driving directives in the cloud with an LLM decision maker - reducing bandwidth and delay while preserving interpretability. Decision2Trajectory converts directives into executable paths: Semantic-Guided A* embeds language-derived soft costs into classical search to bias solutions toward feasible trajectories, while an Agentic Refinement Module adapts planner hyperparameters using feedback and memory. Finally, MPC tracks the trajectories in real time, with optional cloud-guided references for difficult cases. Experiments in CARLA show that Agentic Fast-Slow Planning improves robustness under perturbations, reducing lateral deviation by up to 45% and completion time by over 12% compared to pure MPC and an A*-guided MPC baseline. Code is available at this https URL.
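The Semantic-Guided A* idea, adding a language-derived soft penalty to the step cost of an otherwise standard search, can be sketched on a toy grid. The grid, the penalty map, and the "keep right" directive it encodes are all invented:

```python
import heapq

# Toy sketch of semantic-guided search: a 4-connected grid A* whose
# step cost is base cost plus a language-derived soft penalty per cell
# (e.g. a directive like "keep right" penalising left-lane cells).

def astar(start, goal, size, soft_cost):
    """Grid A* with per-cell soft penalties; Manhattan heuristic."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0.0, start, [start])]
    seen = {}
    while frontier:
        _, g, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path, g
        if seen.get(pos, float("inf")) <= g:
            continue
        seen[pos] = g
        x, y = pos
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size:
                ng = g + 1.0 + soft_cost.get((nx, ny), 0.0)
                heapq.heappush(frontier, (ng + h((nx, ny)), ng,
                                          (nx, ny), path + [(nx, ny)]))
    return None, float("inf")

# Penalise the left column (y == 0), nudging the path to the right.
penalty = {(x, 0): 5.0 for x in range(3)}
path, cost = astar((0, 0), (2, 0), size=3, soft_cost=penalty)
```

Because the directive enters as a soft cost rather than a hard constraint, the planner still reaches the goal even when it must briefly re-enter the penalised region, which is the appeal of this bias-not-forbid coupling between language and search.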
[AI-50] Can Heterogeneous Language Models Be Fused?
【速读】: This paper addresses a key challenge in heterogeneous language model fusion: when expert models are built on different backbones (e.g., Llama, Qwen, and Mistral), direct weight-space fusion degrades due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. The core of the proposed HeteroFusion method lies in two components: topology-based alignment, which transfers knowledge across architectures by matching functional module structures rather than raw tensor coordinates, and conflict-aware denoising, which suppresses incompatible or noisy signals during fusion. The method is further justified analytically by showing that preserving the target adapter basis while predicting structured updates yields a stable, well-conditioned cross-family fusion process.
链接: https://arxiv.org/abs/2604.01674
作者: Shilian Chen,Jie Zhou,Qin Chen,Wen Wu,Xin Li,Qi Feng,Liang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are homogeneous, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such heterogeneous settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with HeteroFusion for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, HeteroFusion consistently outperforms strong merging, fusion, and ensemble baselines.
[AI-51] Hierarchical Memory Orchestration for Personalized Persistent Agents
【速读】:该论文旨在解决智能体在长期记忆积累过程中因交互数据量庞大而导致的性能瓶颈问题,具体表现为检索噪声增加和计算延迟上升,进而影响部署在资源受限设备上的模型推理能力。解决方案的关键在于提出分层记忆编排(Hierarchical Memory Orchestration, HMO)框架,该框架基于用户中心的情境相关性将交互历史组织为三层目录结构:紧凑的主缓存层耦合近期与关键记忆及动态演化用户画像,以确保推理与个体行为特征一致;辅以高优先级的次级层,并统一管理全局完整交互历史档案。通过用户人格模型驱动记忆在层级间的重新分配,使长期模式相关的记录被提升至活跃层级,而低相关性信息则被降级,从而实现历史知识的精准调用与高效主动搜索空间维持。
链接: https://arxiv.org/abs/2604.01670
作者: Junming Liu,Yifei Sun,Weihua Cheng,Haodong Lei,Yuqi Li,Yirong Chen,Ding Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, 7 tables
Abstract:While long-term memory is essential for intelligent agents to maintain consistent historical awareness, the accumulation of extensive interaction data often leads to performance bottlenecks. Naive storage expansion increases retrieval noise and computational latency, overwhelming the reasoning capacity of models deployed on constrained personal devices. To address this, we propose Hierarchical Memory Orchestration (HMO), a framework that organizes interaction history into a three-tiered directory driven by user-centric contextual relevance. Our system maintains a compact primary cache, coupling recent and pivotal memories with an evolving user profile to ensure agent reasoning remains aligned with individual behavioral traits. This primary cache is complemented by a high-priority secondary layer, both of which are managed within a global archive of the full interaction history. Crucially, the user persona dictates memory redistribution across this hierarchy, promoting records mapped to long-term patterns toward more active tiers while relegating less relevant information. This targeted orchestration surfaces historical knowledge precisely when needed while maintaining a lean and efficient active search space. Evaluations on multiple benchmarks achieve state-of-the-art performance. Real-world deployments in ecosystems like OpenClaw demonstrate that HMO significantly enhances agent fluidity and personalization.
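下面用一个极简的 Python 示意说明"由用户画像驱动的记忆层级重分配"思路:与长期画像主题重合的记录上浮到活跃层,低相关的下沉到档案层。其中的层级编号与主题重合规则均为假设,并非论文原始实现:

```python
# 极简示意:HMO 式三层记忆重分配(规则为假设)
def redistribute(memories, persona_topics):
    """memories: [{'text': …, 'topics': set, 'tier': int}]
    tier 0 = 主缓存,1 = 次级层,2 = 全局档案。"""
    for m in memories:
        overlap = len(m["topics"] & persona_topics)   # 与用户画像的主题重合度
        if overlap >= 2:
            m["tier"] = 0                             # 强相关:提升到主缓存
        elif overlap == 1:
            m["tier"] = min(m["tier"], 1)             # 弱相关:至少进入次级层
        else:
            m["tier"] = 2                             # 无关:降级到档案层
    return memories

mems = [
    {"text": "常点无糖咖啡", "topics": {"coffee", "diet"}, "tier": 2},
    {"text": "一次性问路",   "topics": {"navigation"},     "tier": 1},
]
out = redistribute(mems, {"coffee", "diet", "fitness"})  # 假设的长期画像主题
```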
[AI-52] ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在长时推理任务中因上下文长度受限而导致的交互历史信息保留与上下文预算之间的权衡问题。其核心挑战在于,随着交互历史的增长,如何在有限的上下文窗口内高效管理信息,既避免关键记忆丢失,又不超出部署资源(如内存、延迟和成本)限制。解决方案的关键是提出一种预算感知的上下文管理机制(Budget-Aware Context Management, BACM),将上下文管理建模为带预算约束的序列决策问题,并设计了BACM-RL——一种基于课程学习的强化学习方法,通过端到端训练自动学习不同预算下的压缩策略,从而在多目标问答和长程网页浏览等复杂任务中显著优于现有基线方法,在高复杂度场景下性能提升超过1.6倍,且在预算缩减时仍保持稳定优势。
链接: https://arxiv.org/abs/2604.01664
作者: Yong Wu,YanZhao Zheng,TianZe Xu,ZhenTao Zhang,YuanQiang Yu,JiHuai Zhu,Chao Ma,BinBin Lin,BaoHua Dong,HangCheng Zhu,RuoHui Huang,Gang Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-based agents show strong potential for long-horizon reasoning, yet their context size is limited by deployment factors (e.g., memory, latency, and cost), yielding a constrained context budget. As interaction histories grow, this induces a trade-off between retaining past information and staying within the context limit. To address this challenge, we propose Budget-Aware Context Management (BACM), which formulates context management as a sequential decision problem with a context budget constraint. It enables agents to assess the available budget before incorporating new observations and decide when and how much of the interaction history to compress. We further develop BACM-RL, an end-to-end curriculum-based reinforcement learning approach that learns compression strategies under varying context budgets. Experiments on compositional multi-objective QA and long-horizon web browsing benchmarks show that BACM-RL consistently outperforms prior methods across model scales and task complexities, achieving over 1.6× gains over strong baselines in high-complexity settings, while maintaining strong advantages as budgets shrink, where most methods exhibit a downward performance trend.
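BACM 的核心是"先评估剩余预算,再决定压缩多少历史"。下面是一个极简示意(压缩比例、接口均为假设;论文中"何时压缩、压缩多少"由强化学习策略 BACM-RL 决定,而非固定规则):

```python
# 极简示意:预算感知的上下文管理决策(假设性规则版)
def manage_context(history, new_obs_tokens, budget, compress_ratio=0.3):
    """history: [(text, tokens)];在纳入新观测前检查预算,
    不足时从最旧条目开始压缩,直到容得下新观测。"""
    used = sum(t for _, t in history)
    if used + new_obs_tokens <= budget:
        return history                         # 预算充足:不压缩
    compressed = []
    overflow = used + new_obs_tokens - budget  # 需要腾出的 token 数
    for text, tokens in history:
        if overflow > 0:
            kept = max(1, int(tokens * compress_ratio))   # 压缩后仅保留约 30%
            overflow -= tokens - kept
            compressed.append(("<summary>" + text[:10], kept))
        else:
            compressed.append((text, tokens))
    return compressed
```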
[AI-53] Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture
【速读】:该论文旨在解决临床人工智能(Clinical AI)系统在训练过程中因医疗数据结构受文档工作流程、报销激励和术语碎片化等因素扭曲而引发的本体论失真(ontological distortion)问题。此类失真会导致AI模型学习并放大编码伪影(coding artefacts),进而影响临床决策的可靠性。解决方案的关键在于提出一套七种面向本体论意识(ontology-aware)的设计模式,形成一种可实施的软件架构语言——这些模式包括:本体检查点(Ontological Checkpoint)、休眠感知流水线(Dormancy-Aware Pipeline)、漂移哨兵(Drift Sentinel)、双本体层(Dual-Ontology Layer)、再现实回路断路器(Reification Circuit Breaker)、术语版本门控(Terminology Version Gate)以及可插拔合规适配器(Regulatory Compliance Adapter),共同构建对本体漂移具有鲁棒性的临床AI流水线。
链接: https://arxiv.org/abs/2604.01661
作者: Florian Odi Stummer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 design patterns, 3 tables, 1 figure, arXiv cs.AI preprint
Abstract:Clinical AI systems routinely train on health data structurally distorted by documentation workflows, billing incentives, and terminology fragmentation. Prior work has characterised the mechanisms of this distortion: the three-forces model of documentary enactment, the reification feedback loop through which AI may amplify coding artefacts, and terminology governance failures that allow semantic drift to accumulate. Yet translating these insights into implementable software architecture remains an open problem. This paper proposes seven ontology-aware design patterns in Gang-of-Four pattern language for building clinical AI pipelines resilient to ontological distortion. The patterns address data ingestion validation (Ontological Checkpoint), low-frequency signal preservation (Dormancy-Aware Pipeline), continuous drift monitoring (Drift Sentinel), parallel representation maintenance (Dual-Ontology Layer), feedback loop interruption (Reification Circuit Breaker), terminology evolution management (Terminology Version Gate), and pluggable regulatory compliance (Regulatory Compliance Adapter). Each pattern is specified with Problem, Forces, Solution, Consequences, Known Uses, and Related Patterns. We illustrate their composition in a reference architecture for a primary care AI system and provide a walkthrough tracing all seven patterns through a diabetes risk prediction scenario. This paper does not report empirical validation; it offers a design vocabulary grounded in theoretical analysis, subject to future evaluation in production systems. Three patterns have partial precedent in existing systems; the remaining four have not been formally described. Limitations include the absence of runtime benchmarks and restriction to the German and EU regulatory context.
[AI-54] CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
【速读】:该论文旨在解决开放性问题中生成式 AI (Generative AI) 代理自主性不足的问题,现有方法依赖固定启发式规则和硬编码探索策略,限制了代理在持续搜索与知识积累中的自适应能力。其解决方案的关键在于提出 CORAL 框架,通过长期运行的多智能体系统实现自主演化:利用共享持久记忆、异步多代理执行和基于心跳的干预机制,使代理能够持续探索、反思与协作;同时引入隔离工作空间、评估器分离、资源管理等实用防护措施,保障系统稳定性与安全性。实证表明,CORAL 在多项数学、算法及系统优化任务上显著优于传统进化搜索基线,提升幅度达3–10倍且评估次数更少,验证了多智能体自主演化对开放性发现的有效性。
链接: https://arxiv.org/abs/2604.01658
作者: Ao Qu,Han Zheng,Zijian Zhou,Yihao Yan,Yihong Tang,Shao Yong Ong,Fenglu Hong,Kaichen Zhou,Chonghe Jiang,Minwei Kong,Jiacheng Zhu,Xuan Jiang,Sirui Li,Cathy Wu,Bryan Kian Hsiang Low,Jinhua Zhao,Paul Pu Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic’s kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at this https URL.
[AI-55] Exploring Robust Multi-Agent Workflows for Environmental Data Management
【速读】:该论文旨在解决将大语言模型(Large Language Model, LLM)驱动的智能体引入环境科学数据管理时,因概率性输出导致的可靠性风险问题。传统确定性流程中,错误通常可被及时检测并终止;而LLM流水线可能生成看似合理但实质错误的结果,且在通过表面验证后传播至不可逆操作(如DOI分配与公开发布),从而引发系统性数据污染。解决方案的关键在于将可靠性作为架构属性进行设计:一是构建三轨知识架构,将治理约束、领域知识和工具使用技能分别以持久化、互锁的实体形式外化;二是采用角色分离的多智能体结构,在信任边界前引入确定性验证器和审计交接机制,恢复“失败停止”语义,确保在关键步骤前能捕获并阻断错误。实证表明,该方案在效率与可靠性上均显著优于单智能体基线,尤其在SF2Bench复杂场景中成功拦截了影响数千站点的坐标变换错误。
链接: https://arxiv.org/abs/2604.01647
作者: Boyuan Guan,Jason Liu,Yanzhao Wu,Kiavash Bahreini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at PEARC 2026. 12 pages, 4 figures
Abstract:Embedding LLM-driven agents into environmental FAIR data management is compelling - they can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions. However, replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release. We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research. EnviSmart treats reliability as an architectural property through two mechanisms: a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps. We compare two production deployments. The University’s GIS Center Ecological Archive (849 curated datasets) serves as a single-agent baseline. SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow. The multi-agent approach improved both efficiency - completed by a single operator in two days with repeated artifact reuse across deployments - and reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication. A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution. This paper has been accepted at PEARC 2026.
[AI-56] SecLens: Role-Specific Evaluation of LLMs for Security Vulnerability Detection
【速读】:该论文旨在解决现有大语言模型(LLM)漏洞检测基准评估方法单一化的问题,即仅用一个综合指标压缩模型性能,无法反映不同利益相关方(如CISO、工程负责人、AI负责人等)对漏洞检测任务的不同优先级需求。其解决方案的关键在于提出SecLens-R多利益相关者评估框架,该框架基于35个共享维度划分为7个测量类别,并为五类角色设计特定的权重配置(每类选择12–16个维度,总权重为80),生成介于0到100之间的决策得分(Decision Score)。通过在涵盖10种编程语言和8类OWASP对齐漏洞的406项任务上评估12个前沿模型,结果表明同一模型在不同角色视角下得分差异可达31分,验证了漏洞检测本质上是多目标问题,而该框架能揭示单指标评估所掩盖的深层洞察。
链接: https://arxiv.org/abs/2604.01637
作者: Subho Halder,Siddharth Saxena,Kashinath Kadaba Shrish,Thiyagarajan M
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing benchmarks for LLM-based vulnerability detection compress model performance into a single metric, which fails to reflect the distinct priorities of different stakeholders. For example, a CISO may emphasize high recall of critical vulnerabilities, an engineering leader may prioritize minimizing false positives, and an AI officer may balance capability against cost. To address this limitation, we introduce SecLens-R, a multi-stakeholder evaluation framework structured around 35 shared dimensions grouped into 7 measurement categories. The framework defines five role-specific weighting profiles: CISO, Chief AI Officer, Security Researcher, Head of Engineering, and AI-as-Actor. Each profile selects 12 to 16 dimensions with weights summing to 80, yielding a composite Decision Score between 0 and 100. We apply SecLens-R to evaluate 12 frontier models on a dataset of 406 tasks derived from 93 open-source projects, covering 10 programming languages and 8 OWASP-aligned vulnerability categories. Evaluations are conducted across two settings: Code-in-Prompt (CIP) and Tool-Use (TU). Results show substantial variation across stakeholder perspectives, with Decision Scores differing by as much as 31 points for the same model. For instance, Qwen3-Coder achieves an A (76.3) under the Head of Engineering profile but a D (45.2) under the CISO profile, while GPT-5.4 shows a similar disparity. These findings demonstrate that vulnerability detection is inherently a multi-objective problem and that stakeholder-aware evaluation provides insights that single aggregated metrics obscure.
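Decision Score 的聚合方式可以用一个极简加权打分示意来理解。以下维度名、角色权重与模型分数均为虚构示例(并非 SecLens-R 的真实配置),仅用于说明同一模型在不同角色画像下得分差异可以很大:

```python
# 极简示意:按角色权重聚合维度得分,归一化到 0–100 的 Decision Score
def decision_score(dim_scores, role_weights):
    """dim_scores: {维度: 0~1 的原始得分}; role_weights: {维度: 权重}。
    只聚合该角色选中的维度,加权和归一化到 0–100。"""
    total_w = sum(role_weights.values())
    weighted = sum(dim_scores.get(d, 0.0) * w for d, w in role_weights.items())
    return round(100 * weighted / total_w, 1)

# 虚构的角色权重画像与模型维度得分
ciso = {"critical_recall": 30, "severity_coverage": 25, "evidence_quality": 25}
eng  = {"false_positive_rate": 40, "latency": 20, "fix_suggestion": 20}
model = {"critical_recall": 0.4, "severity_coverage": 0.5, "evidence_quality": 0.6,
         "false_positive_rate": 0.9, "latency": 0.7, "fix_suggestion": 0.6}
```

同一份维度得分,在 CISO 画像下约 49 分、在工程负责人画像下约 78 分,对应摘要中"同一模型不同角色得分相差可达 31 分"的现象。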
[AI-57] RefinementEngine: Automating Intent-to-Device Filtering Policy Deployment under Network Constraints
【速读】:该论文旨在解决安全运营中心(SOC)中将高阶安全意图(security intent)自动化转化为可部署的网络策略配置的问题,尤其是在复杂异构网络环境中,由于拓扑依赖的可达性约束和设备特定的安全控制能力差异,传统手动配置过程易出错且效率低下。解决方案的关键在于提出RefinementEngine引擎,其核心能力是基于网络拓扑、设备能力及威胁情报(Cyber Threat Intelligence, CTI)数据,自动将高层安全意图精炼为低层、可直接部署的配置规则,从而实现对已知威胁的有效防御并具备良好的适应性与实用性。
链接: https://arxiv.org/abs/2604.01627
作者: Davide Colaiacomo,Chiara Bonfanti,Cataldo Basile
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Translating security intent into deployable network enforcement rules and maintaining their effectiveness despite evolving cyber threats remains a largely manual process in most Security Operations Centers (SOCs). In large and heterogeneous networks, this challenge is complicated by topology-dependent reachability constraints and device-specific security control capabilities, making the process slow, error-prone, and a recurring source of misconfigurations. This paper presents RefinementEngine, an engine that automates the refinement of high-level security intents into low-level, deployment-ready configurations. Given a network topology, devices, and available security controls, along with high-level intents and Cyber Threat Intelligence (CTI) reports, RefinementEngine automatically generates settings that implement the desired intent, counter reported threats, and can be directly deployed on target security controls. The proposed approach is validated through real-world use cases on packet and web filtering policies derived from actual CTI reports, demonstrating both correctness, practical applicability, and adaptability to new data.
[AI-58] DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)推理过程中因多GPU执行时层间跨节点同步导致的端到端性能对负载不均衡敏感的问题。现有并行策略依赖于逐层的跨Rank同步,限制了GPU资源的充分利用。其解决方案的关键在于提出一种名为DWDP(Distributed Weight Data Parallelism)的新颖推理并行策略:该策略在保持数据并行执行的基础上,将MoE(Mixture of Experts)模型的权重分布到同级GPU上,并按需异步拉取缺失专家权重,从而移除集体式跨节点同步操作,使每个GPU可独立推进计算。此外,论文还通过两项优化——分片权重管理与异步远程权重预取——有效降低了该设计带来的实际开销,在TensorRT-LLM中实现部署并在GB200 NVL72平台上验证,相较基线方法在20–100 TPS/user的服务范围内实现了每GPU输出吞吐量提升8.8%。
链接: https://arxiv.org/abs/2604.01621
作者: Wanqian Li,Jintao Peng,Zongfei Jing,Tianyu Zhang,Ze Long,Xianjie Qiao,Xiaoming Chen,Dongxu Yang,Kefeng Duan,June Yang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Technical Report. 17 pages. 8 figures
Abstract:Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
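DWDP 的关键是各 GPU 仅常驻部分专家权重,路由结果一出即可对缺失专家发起预取、未命中时再按需拉取。下面用一个极简缓存示意说明该逻辑(类与接口均为假设;真实实现中"拉取"是 NVLink 上对对等 GPU 的异步权重拷贝):

```python
# 极简示意:缺失专家权重的按需/预取逻辑(假设性接口)
class ExpertStore:
    def __init__(self, local_experts, peers):
        self.cache = dict(local_experts)   # 本 GPU 常驻的专家权重分片
        self.peers = peers                 # 其余权重分布在对等 GPU 上

    def prefetch(self, expert_ids):
        """路由确定后立即发起的预取;真实系统中为异步拷贝。"""
        for eid in expert_ids:
            if eid not in self.cache:
                self.cache[eid] = self.peers[eid]

    def get(self, eid):
        """计算时取用权重;预取未命中则同步拉取。"""
        if eid not in self.cache:
            self.cache[eid] = self.peers[eid]
        return self.cache[eid]

peers = {f"e{i}": f"weights_{i}" for i in range(8)}     # 虚构的全局专家表
store = ExpertStore({"e0": "weights_0", "e1": "weights_1"}, peers)
store.prefetch(["e5"])                                   # 按路由结果提前预取 e5
```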
[AI-59] Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study WWW
【速读】:该论文旨在解决在生产环境中对收据项目进行分类时,如何在分类准确率与成本效益之间取得最优平衡的问题。其解决方案的关键在于系统性地评估四种通过 AWS Bedrock 提供的指令微调大语言模型(Large Language Models, LLMs)——Claude 3.7 Sonnet、Claude 4 Sonnet、Mixtral 8x7B Instruct 和 Mistral 7B Instruct——在准确性、响应稳定性及分词级别成本方面的表现,并进一步比较零样本(zero-shot)与少样本(few-shot)提示方法在精度和成本上的适用性。实验结果表明,Claude 3.7 Sonnet 在分类准确率与成本效率之间实现了最佳平衡。
链接: https://arxiv.org/abs/2604.01615
作者: Gabby Sanchez,Sneha Oommen,Cassandra T. Britto,Di Wang,Jung-De Chiou,Maria Spichkova
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Preprint. Accepted to the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2026). Final version to be published by SCITEPRESS, this http URL
Abstract:This paper presents a systematic, cost-aware evaluation of large language models (LLMs) for receipt-item categorisation within a production-oriented classification framework. We compare four instruction-tuned models available through AWS Bedrock: Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct. The aim of the study was (1) to assess performance across accuracy, response stability, and token-level cost, and (2) to investigate what prompting methods, zero-shot or few-shot, are especially appropriate both in terms of accuracy and in terms of incurred costs. Results of our experiments demonstrated that Claude 3.7 Sonnet achieves the most favourable balance between classification accuracy and cost efficiency.
[AI-60] GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation
【速读】:该论文旨在解决大规模知识图谱(Knowledge Graph, KG)在现实世界问答(QA)应用中,因结构复杂、规模庞大而导致传统基于提示(prompting)或检索增强生成(Retrieval-Augmented Generation, RAG)方法无法有效进行多跳推理的问题。其核心挑战在于企业级KG通常超出大型语言模型(Large Language Model, LLM)的最大上下文窗口限制,使得直接注入子图或依赖上下文内推理变得不可行。解决方案的关键在于提出一种无任务特异性的、无需训练的工具型框架GraphWalk,该框架通过赋予LLM一组正交的、通用的图操作工具(如跳转、查询邻居节点等),使模型能够以序列化方式遍历任意结构的知识图谱,并形成可验证的执行轨迹。实验表明,GraphWalk在合成迷宫和模拟企业级知识图谱上的多步推理任务中显著优于传统上下文基方法,尤其在模型规模增大时优势更加明显,从而填补了当前LLM在处理超大规模结构化知识时的能力空白。
链接: https://arxiv.org/abs/2604.01610
作者: Taraneh Ghandi,Hamidreza Mahyar,Shachar Klaiman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 3 figures
Abstract:The use of knowledge graphs for grounding agents in real-world QA applications has become increasingly common. Answering complex queries often requires multi-hop reasoning and the ability to navigate vast relational structures. Standard approaches rely on prompting techniques that steer large language models to reason over raw graph context, or retrieval-augmented generation pipelines where relevant subgraphs are injected into the context. These, however, face severe limitations with enterprise-scale KGs that cannot fit in even the largest context windows available today. We present GraphWalk, a problem-agnostic, training-free, tool-based framework that allows off-the-shelf LLMs to reason through sequential graph navigation, dramatically increasing performance across different tasks. Unlike task-specific agent frameworks that encode domain knowledge into specialized tools, GraphWalk equips the LLM with a minimal set of orthogonal graph operations sufficient to traverse any graph structure. We evaluate whether models equipped with GraphWalk can compose these operations into correct multi-step reasoning chains, where each tool call represents a verifiable step creating a transparent execution trace. We first demonstrate our approach on maze traversal, a problem non-reasoning models are completely unable to solve, then present results on graphs resembling real-world enterprise knowledge graphs. To isolate structural reasoning from world knowledge, we evaluate on entirely synthetic graphs with random, non-semantic labels. Our benchmark spans 12 query templates from basic retrieval to compound first-order logic queries. Results show that tool-based traversal yields substantial and consistent gains over in-context baselines across all model families tested, with gains becoming more pronounced as scale increases, precisely where in-context approaches fail catastrophically.
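GraphWalk 只给模型一组最小正交的图操作工具,由 LLM 串行调用完成多跳遍历,每次调用都留下一条可验证轨迹。下面是一个极简示意(工具名称与接口均为假设,并非论文原版工具集):

```python
# 极简示意:GraphWalk 式"最小正交图操作"工具集(接口为假设)
class GraphWalker:
    def __init__(self, edges, start):
        self.adj = {}
        for u, rel, v in edges:
            self.adj.setdefault(u, []).append((rel, v))
        self.pos = start
        self.trace = []               # 每次工具调用都记录,形成可验证轨迹

    def neighbors(self):
        """工具 1:查看当前节点的出边。"""
        self.trace.append(("neighbors", self.pos))
        return self.adj.get(self.pos, [])

    def move(self, rel):
        """工具 2:沿指定关系边移动一步。"""
        for r, v in self.adj.get(self.pos, []):
            if r == rel:
                self.trace.append(("move", rel, v))
                self.pos = v
                return v
        raise ValueError(f"no edge {rel!r} from {self.pos!r}")

# 虚构的小图:三跳推理 = 三次 move 调用
edges = [("A", "manages", "B"), ("B", "owns", "C"), ("C", "located_in", "D")]
w = GraphWalker(edges, "A")
for rel in ["manages", "owns", "located_in"]:
    w.move(rel)
```

由于图结构只通过工具暴露,整张图不必进入上下文窗口,这正是摘要中绕开"企业级 KG 装不进上下文"限制的思路。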
[AI-61] From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?
【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在任务分解与协作过程中存在的协调开销高、上下文碎片化及阶段顺序脆弱等问题,尤其是如何高效地将MAS中的复杂技能提炼为单智能体能力(skill distillation),而现有方法缺乏对“何时以及提取何种技能”这一核心问题的理论指导,导致性能提升不稳定(技能提升范围从28%到2%下降)。解决方案的关键在于提出首个先验预测指标——度量自由度(Metric Freedom, F),其通过Mantel检验量化输出多样性与评分方差之间的耦合关系,从而刻画评价指标的拓扑刚性。基于F值,作者设计了两阶段自适应蒸馏框架:第一阶段针对自由度高的指标(F较大)选择性提取工具与知识,保留探索空间;第二阶段仅对刚性指标(F ≤ 0.6)进行计算密集型迭代优化,避免轨迹局部过拟合。实验证明,F能显著预测技能效用(ρ = -0.62, p < 0.05),且相同智能体轨迹在不同指标下产生截然相反的技能提升效果,揭示技能效用本质上是指标层面的属性。
链接: https://arxiv.org/abs/2604.01608
作者: Binyan Xu,Dong Fang,Haitao Li,Kehuan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures, 11 tables
Abstract:Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom (F), the first a priori predictor of skill utility. F measures the topological rigidity of a metric’s scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by F, we propose a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on “free” metrics to preserve exploration. Stage 2 targets computationally intensive iterative refinement exclusively toward “rigid” metrics (F ≲ 0.6) to eliminate trajectory-local overfitting. Evaluating across 4 tasks, 11 datasets, and 6 metrics, F strongly predicts skill utility (ρ = -0.62, p < 0.05). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, our adaptive agent matches or exceeds the original MAS while reducing cost up to 8× and latency by up to 15×.
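度量自由度 F 的核心是对"输出间距离矩阵"与"得分间距离矩阵"做 Mantel 置换检验:相关越强,说明输出多样性越直接地转化为得分差异,度量越"刚性"。下面给出一个自包含的极简实现示意(实现细节为假设,非论文原始代码):

```python
# 极简示意:两个距离矩阵之间的 Mantel 置换检验
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def mantel(dist_out, dist_score, n_perm=999, seed=0):
    """dist_out / dist_score: 对称距离矩阵(list of list)。返回 (r, p)。"""
    n = len(dist_out)
    flat = lambda m, order: [m[order[i]][order[j]]
                             for i in range(n) for j in range(i + 1, n)]
    ident = list(range(n))
    r_obs = pearson(flat(dist_out, ident), flat(dist_score, ident))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = ident[:]
        rng.shuffle(perm)                       # 同步置换行与列
        if pearson(flat(dist_out, perm), flat(dist_score, ident)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# 虚构示例:得分距离与输出距离完全耦合 → 高度"刚性"的度量
dist_out = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
dist_score = [[2 * v for v in row] for row in dist_out]
r, p = mantel(dist_out, dist_score, n_perm=99)
```

由 r 到 F 的具体映射论文未在摘要中给出,一种猜测性的写法是让 F 随相关强度递减(如 F = 1 - r)。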
[AI-62] ModTrans: Translating Real-world Models for Distributed Training Simulator
【速读】:该论文旨在解决当前分布式训练模拟器(如ASTRA-sim)无法直接导入真实世界开发的机器学习模型的问题,从而阻碍了机器学习研究人员与系统研究者之间的协作。解决方案的关键在于提出ModTrans,一个支持将任意真实模型格式转换为ASTRA-sim输入格式的翻译工具,有效消除了模型迁移的技术壁垒,且实验表明其开销可忽略不计。
链接: https://arxiv.org/abs/2604.01607
作者: Yi Lyu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale distributed training has been a research hot spot in machine learning systems for industry and academia in recent years. However, conducting experiments without physical machines and corresponding resources is difficult. One solution is to leverage distributed training simulators, but current ones like ASTRA-sim do not support importing real-world developed models, which poses challenges for ML researchers seeking to use them. Based on this challenge, we developed ModTrans, a translator supporting format translation from any real-world model to the ASTRA-sim simulator’s input, removing the barrier between machine learning experts and machine learning system researchers. The experiment results show that ModTrans’s cost is negligible.
[AI-63] CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中拒绝行为(refusal behavior)的内在机制不明确问题,尤其是现有特征选择方法仅依赖有害提示下特征激活强度,难以捕捉导致拒绝决策的因果因素。解决方案的关键在于提出CRaFT框架——一种基于电路影响(circuit influence)的拒绝特征选择方法,通过在拒绝边界附近的提示上评估特征对模型拒绝-服从决策的影响程度来排序特征,从而更准确地识别出因果性中介特征。实验表明,CRaFT显著提升了攻击成功率(ASR)从6.7%至48.2%,并优于多个基线方法,验证了电路影响作为特征筛选标准的可靠性。
链接: https://arxiv.org/abs/2604.01604
作者: Su-Hyeon Kim,Hyundong Jin,Yejin Lee,Yo-Sub Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As safety concerns around large language models (LLMs) grow, understanding the internal mechanisms underlying refusal behavior has become increasingly important. Recent work has studied this behavior by identifying internal features associated with refusal and manipulating them to induce compliance with harmful requests. However, existing refusal feature selection methods rely on how strongly features activate on harmful prompts, which tends to capture superficial signals rather than the causal factors underlying the refusal decision. We propose CRaFT, a circuit-guided refusal feature selection framework that ranks features by their influence on the model’s refusal-compliance decision using prompts near the refusal boundary. On Gemma-3-1B-it, CRaFT improves attack success rate (ASR) from 6.7% to 48.2% and outperforms baseline methods across multiple jailbreak benchmarks. These results suggest that circuit influence is a more reliable criterion than activation magnitude for identifying features that causally mediate refusal behavior.
[AI-64] MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图表转代码(chart-to-code)任务中因依赖监督微调(Supervised Fine-Tuning, SFT)而导致的代码生成质量受限问题,尤其在于缺乏对代码执行环境的交互与自我修正能力。现有方法难以有效利用执行反馈进行自纠错,限制了代码的准确性和可执行性。解决方案的关键在于提出MM-ReCoder,一种基于强化学习(Reinforcement Learning, RL)并具备多轮自纠错机制的模型;其核心创新是采用两阶段多轮自纠错强化学习策略,基于组相对策略优化(Group Relative Policy Optimization, GRPO),第一阶段通过共享首轮推理增强模型的自我修正能力,第二阶段通过完整轨迹优化提升代码生成性能,从而实现通过与环境交互和迭代修正输出来生成更准确、可执行的代码。
链接: https://arxiv.org/abs/2604.01600
作者: Zitian Tang,Xu Zhang,Jianbo Yuan,Yang Zou,Varad Gunjal,Songyao Jiang,Davide Modolo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: CVPR 2026
Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not expose the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model’s self-correction ability via rolling out a shared first turn, while the second stage improves the coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through the interaction with the environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.
[AI-65] ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context
【速读】:该论文旨在解决现有记忆增强生成(Memory-Augmented Generation, MAG)系统中因外部记忆服务与推理模型分离而导致的语义漂移、跨代理协调上下文丢失及故障恢复脆弱性等问题。其核心解决方案是提出一种代理原生(agent-native)的记忆架构 ByteRover,通过将知识的编目、结构化和检索功能集成到同一语言模型(LLM)中,实现对记忆内容的语义理解与主动管理。关键创新在于采用分层上下文树(Context Tree)结构组织知识,并引入自适应知识生命周期(Adaptive Knowledge Lifecycle, AKL),结合五级渐进式检索策略,在无需任何外部基础设施(如向量数据库、图数据库或嵌入服务)的前提下,实现了低延迟(<100ms)且高准确性的知识访问能力。
链接: https://arxiv.org/abs/2604.01599
作者: Andy Nguyen,Danh Doan,Hoang Pham,Bao Ha,Dat Pham,Linh Nguyen,Hieu Nguyen,Thien Nguyen,Cuong Do,Phat Nguyen,Toan Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 3 figures, 7 tables
Abstract:Memory-Augmented Generation (MAG) extends large language models with external memory to support long-context reasoning, but existing approaches universally treat memory as an external service that agents call into, delegating storage to separate pipelines of chunking, embedding, and graph extraction. This architectural separation means the system that stores knowledge does not understand it, leading to semantic drift between what the agent intended to remember and what the pipeline actually captured, loss of coordination context across agents, and fragile recovery after failures. In this paper, we propose ByteRover, an agent-native memory architecture that inverts the memory pipeline: the same LLM that reasons about a task also curates, structures, and retrieves knowledge. ByteRover represents knowledge in a hierarchical Context Tree, a file-based knowledge graph organized as Domain, Topic, Subtopic, and Entry, where each entry carries explicit relations, provenance, and an Adaptive Knowledge Lifecycle (AKL) with importance scoring, maturity tiers, and recency decay. Retrieval uses a 5-tier progressive strategy that resolves most queries at sub-100 ms latency without LLM calls, escalating to agentic reasoning only for novel questions. Experiments on LoCoMo and LongMemEval demonstrate that ByteRover achieves state-of-the-art accuracy on LoCoMo and competitive results on LongMemEval while requiring zero external infrastructure, no vector database, no graph database, no embedding service, with all knowledge stored as human-readable markdown files on the local filesystem.
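"重要度评分 + 时效衰减 + 成熟度分层"的 AKL 机制可以用几行代码示意。以下衰减公式与分层阈值均为假设,并非 ByteRover 的真实定义:

```python
# 极简示意:AKL 式"重要度 × 时效衰减"打分与成熟度分层(公式与阈值为假设)
def akl_score(importance, age_days, half_life_days=30.0):
    """importance: 0~1;按指数半衰期对旧记忆做时效衰减。"""
    decay = 0.5 ** (age_days / half_life_days)
    return importance * decay

def maturity_tier(score):
    if score >= 0.6:
        return "core"       # 常驻活跃层
    if score >= 0.2:
        return "working"    # 次级工作层
    return "archive"        # 仅存档,按需检索

s = akl_score(0.9, 30)      # 高重要度但已过一个半衰期
```

在这种设计下,高重要度条目随时间自然从活跃层滑向档案层,检索时活跃层始终保持精简。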
[AI-66] Do Large Language Models Mentalize When They Teach? ICML2026
【速读】:该论文旨在探究大语言模型(Large Language Models, LLMs)在教学决策中是基于对学习者知识状态的推理(即心理理论),还是依赖于更简单的启发式规则。其核心问题是:LLMs 在模拟教学任务中如何选择下一步应教授的内容?解决方案的关键在于设计一个受控的教学任务,其中教师LLM需根据假设学习者的路径轨迹(来自带奖励标注的有向图)判断应揭示哪条边以引导学习者重构路径。通过将LLM的行为与人类认知模型(包括贝叶斯最优教师、弱化贝叶斯变体、启发式基线和非心理理论效用模型)进行拟合比较,发现大多数LLM的行为更符合贝叶斯最优教学策略,表明它们具备一定程度的心理建模能力;但引入辅助提示(如聚焦推理或奖励的干预)并未稳定提升性能,甚至可能降低表现,说明单纯提示合规性并不等同于教学决策质量的提升。
链接: https://arxiv.org/abs/2604.01594
作者: Sevan K. Harootonian,Mark K. Ho,Thomas L. Griffiths,Yael Niv,Ilia Sucholutsky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures. Workshop paper at ICML 2026
Abstract:How do LLMs decide what to teach next: by reasoning about a learner’s knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner’s trajectory through a reward-annotated directed graph and must reveal a single edge so the learner would choose a better path if they replanned. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli presented to human subjects, most LLMs perform well, show little change in strategy over trials, and their graph-by-graph performance is similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models’ choices. When given a scaffolding intervention, models follow auxiliary inference- or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and can sometimes reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.
[AI-67] ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
【速读】: This paper targets the limited accuracy of large language models (LLMs) on mathematical reasoning, focusing on how reinforcement learning can improve both reasoning and self-correction. The key to the solution is ThinkTwice, a two-phase framework built on Group Relative Policy Optimization (GRPO) that jointly optimizes solution generation and self-refinement without extra correctness signals or human critique annotations: the first phase trains the model to solve reasoning problems, and the second trains it to refine its own answers to the same problems using the same binary correctness reward. Experiments on multiple mathematical reasoning benchmarks show substantial gains over online policy-optimization baselines, and the training dynamics reveal an implicit "rectify-then-fortify" curriculum that yields a cleaner reward signal and stronger performance.
链接: https://arxiv.org/abs/2604.01591
作者: Difan Jiao,Qianfeng Wen,Blair Yang,Zhenwei Tang,Ashton Anderson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages,7 figures,5 tables
Abstract:We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
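A minimal sketch of the alternating two-phase loop with a GRPO-style group-relative advantage; the `solve`, `refine`, and `grade` callables are placeholders for model sampling and the binary correctness check, not the paper's implementation:

```python
def group_relative_advantages(rewards):
    # GRPO-style advantage: reward minus the group mean, scaled by the
    # group standard deviation (epsilon for numerical stability).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def think_twice_step(problem, solve, refine, grade, group_size=4):
    # Phase 1: sample a group of candidate solutions; binary reward.
    sols = [solve(problem) for _ in range(group_size)]
    adv_solve = group_relative_advantages([grade(problem, s) for s in sols])
    # Phase 2: refine the same solutions, reusing the same binary reward.
    refined = [refine(problem, s) for s in sols]
    adv_refine = group_relative_advantages([grade(problem, r) for r in refined])
    return adv_solve, adv_refine
```

In the real framework the advantages would drive policy-gradient updates on the LLM; here they only show how one reward function serves both phases.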
[AI-68] NED-Tree: Bridging the Semantic Gap with Nonlinear Element Decomposition Tree for LLM Nonlinear Optimization Modeling
【速读】: This paper tackles the challenge of automatically translating Operations Research (OR) problems from natural language into executable models, where large language models (LLMs) degrade sharply on real-world nonlinear scenarios due to semantic misalignment between mathematical formulations and solver code, and to unstable information extraction. The key to the solution is the NED-Tree framework, whose core innovations are (a) a sentence-by-sentence extraction strategy that ensures robust, traceable parameter mapping, and (b) a recursive tree structure that adaptively decomposes complex nonlinear terms into solver-compatible sub-elements, aligning modeling semantics with code semantics. NED-Tree is the first framework to drive LLMs past nonlinear modeling difficulties through element decomposition; validated on the newly proposed NEXTOR benchmark and others, it reaches a state-of-the-art 72.51% average accuracy.
链接: https://arxiv.org/abs/2604.01588
作者: Zhijing Hu,Yufan Deng,Haoyang Liu,Changjun Fan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 7 figures, conference
Abstract:Automating the translation of Operations Research (OR) problems from natural language to executable models is a critical challenge. While Large Language Models (LLMs) have shown promise in linear tasks, they suffer from severe performance degradation in real-world nonlinear scenarios due to semantic misalignment between mathematical formulations and solver codes, as well as unstable information extraction. In this study, we introduce NED-Tree, a systematic framework designed to bridge the semantic gap. NED-Tree employs (a) a sentence-by-sentence extraction strategy to ensure robust parameter mapping and traceability; and (b) a recursive tree-based structure that adaptively decomposes complex nonlinear terms into solver-compatible sub-elements. Additionally, we present NEXTOR, a novel benchmark specifically designed for complex nonlinear, extensive-constraint OR problems. Experiments across 10 benchmarks demonstrate that NED-Tree establishes a new state-of-the-art with 72.51% average accuracy. NED-Tree is the first framework that drives LLMs to resolve nonlinear modeling difficulties through element decomposition, achieving alignment between modeling semantics and code semantics. The NED-Tree framework and benchmark are accessible in the anonymous repository this https URL.
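The recursive decomposition idea can be sketched as a tree walk that rewrites each nested nonlinear operator into a fresh auxiliary element; the operator names and `SUPPORTED` set are illustrative assumptions, not NED-Tree's actual rules:

```python
SUPPORTED = {"mul", "exp", "log", "add"}   # assumed solver-native operators

def decompose(expr, out):
    # expr is either a variable name (str) or a tuple ("op", child, ...).
    # Each operator node becomes one solver-compatible sub-element bound
    # to a fresh auxiliary variable, appended to `out` in evaluation order.
    if isinstance(expr, str):
        return expr
    op, *children = expr
    if op not in SUPPORTED:
        raise ValueError(f"operator {op!r} has no solver-compatible form")
    kids = [decompose(c, out) for c in children]
    name = f"aux{len(out)}"
    out.append((name, op, kids))
    return name

constraints = []
root = decompose(("exp", ("mul", "x", "y")), constraints)
```

Here `exp(x * y)` flattens into two solver-friendly elements, `aux0 = x * y` and `aux1 = exp(aux0)`, with `root` naming the top of the tree.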
[AI-69] Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling
【速读】: This paper addresses long-range dependency modeling and the learning of stable internal structure over sequential input streams, particularly to improve out-of-distribution generalization in reinforcement learning and algorithmic tasks. The key to the solution is interleaving fast recurrent latent updates, which provide self-organizational ability, between slow observation updates, so that stable representational structure evolves alongside the input over time. This yields coherent, clustered representations over long horizons and clearly outperforms sequential baselines such as LSTMs, state-space models, and Transformer variants.
链接: https://arxiv.org/abs/2604.01577
作者: Shota Takashiro,Masanori Koyama,Takeru Miyato,Yusuke Iwasawa,Yutaka Matsuo,Kohei Hayashi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable internal structures that evolve alongside the input. This mechanism allows the model to maintain coherent and clustered representations over long horizons, improving out-of-distribution generalization in reinforcement learning and algorithmic tasks compared to sequential baselines such as LSTM, state space models, and Transformer variants.
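The fast-slow interleaving can be sketched schematically; the real model learns both update maps, whereas here they are arbitrary callables supplied by the caller:

```python
def fast_slow_rollout(observations, slow_update, fast_update, h0, k=3):
    # Between consecutive slow observation updates, the latent state is
    # refined by k fast recurrent steps (the "thinking while listening").
    h = h0
    states = []
    for obs in observations:
        h = slow_update(h, obs)      # slow: incorporate the new input
        for _ in range(k):
            h = fast_update(h)       # fast: latent self-organization
        states.append(h)
    return states
```

A usage sketch with scalar toy updates: accumulating observations slowly while the fast map contracts the latent toward a stable value.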
[AI-70] Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
【速读】: This paper examines potential incompatibilities between normalization layers and optimizers in large language model (LLM) training, in particular a negative interaction between Dynamic Erf (Derf) normalization and the Muon optimizer. In short training runs at the 1B-parameter scale, the Derf+Muon combination degrades markedly (Derf's gap to RMSNorm grows from +0.31 nats under AdamW to +0.97 nats under Muon); the root cause is that Muon's faster spectral-norm growth triggers two Derf failure modes: saturation (lossy compression) and scale blindness. The key remedies are twofold: an EMA blend that reintroduces running scale estimates recovers roughly 84% of the gap, and lowering Derf's scaling parameter alpha from its published default of 0.5 to 0.3 keeps erf in its near-linear, approximately scale-preserving regime, recovering roughly 80%. This indicates that optimizers and normalization layers must be co-designed with matching dynamics rather than chosen independently.
链接: https://arxiv.org/abs/2604.01563
作者: Abdelrahman Abouzeid(Georgia Institute of Technology)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 8 figures. Preprint. Under review
Abstract:In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 under Muon, approximately three times larger. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon’s faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf’s alpha from its published default of 0.5 to 0.3 recovers ~80% by keeping erf in its near-linear regime, where it approximately preserves relative scale; this setting is not the published default of Chen & Liu (2025). Using Derf’s published default alpha with Muon incurs a 0.66-nat interaction penalty without producing NaNs or divergence, making the failure easy to miss in short pilot runs.
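The two mitigations can be sketched numerically; `EMAScaleDerf` is an assumed reading of the EMA blend, not the authors' exact formulation:

```python
import math

def derf(x, alpha):
    # Dynamic Erf normalizer: erf(alpha * x). At alpha ~0.3 the input stays
    # in erf's near-linear regime; a large alpha * x saturates toward +/-1.
    return math.erf(alpha * x)

class EMAScaleDerf:
    # Assumed sketch of the EMA blend: keep a running scale estimate so the
    # normalizer no longer discards activation magnitude ("scale blindness").
    def __init__(self, alpha=0.3, momentum=0.99):
        self.alpha, self.momentum, self.scale = alpha, momentum, 1.0

    def __call__(self, x):
        self.scale = self.momentum * self.scale + (1 - self.momentum) * abs(x)
        # Normalize by the running scale before erf, then restore magnitude.
        return self.scale * math.erf(self.alpha * x / self.scale)
```

Evaluating `derf` at a large input shows the saturation failure mode directly: the output pins near 1 regardless of how much larger the activation grows.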
[AI-71] RAE-AR: Taming Autoregressive Models with Representation Autoencoders
【速读】: This paper addresses the challenge of integrating high-dimensional representation autoencoders (RAEs) into continuous autoregressive (AR) generative models, focusing on two AR-specific obstacles: complex token-wise distribution modeling, and a training-inference gap (exposure bias) amplified by high dimensionality. The key to the solution is two modifications: token simplification via distribution normalization, which eases modeling difficulty and improves convergence, and Gaussian noise injection during training, which strengthens prediction robustness and mitigates exposure bias. Empirically these changes substantially close the gap between RAEs and traditional VAEs in AR models, letting RAEs match VAE-level generation quality and paving the way toward a more unified architecture for visual understanding and generative modeling.
链接: https://arxiv.org/abs/2604.01545
作者: Hu Yu,Hang Xu,Jie Huang,Zeyue Xue,Haoyang Huang,Nan Duan,Feng Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The latent space of generative modeling is long dominated by the VAE encoder. The latents from the pretrained representation encoders (e.g., DINO, SigLIP, MAE) are previously considered inappropriate for generative modeling. Recently, RAE method lights the hope and reveals that the representation autoencoder can also achieve competitive performance as the VAE encoder. However, the integration of representation autoencoder into continuous autoregressive (AR) models, remains largely unexplored. In this work, we investigate the challenges of employing high-dimensional representation autoencoders within the AR paradigm, denoted as \textitRAE-AR. We focus on the unique properties of AR models and identify two primary hurdles: complex token-wise distribution modeling and the high-dimensionality amplified training-inference gap (exposure bias). To address these, we introduce token simplification via distribution normalization to ease modeling difficulty and improve convergence. Furthermore, we enhance prediction robustness by incorporating Gaussian noise injection during training to mitigate exposure bias. Our empirical results demonstrate that these modifications substantially bridge the performance gap, enabling representation autoencoder to achieve results comparable to traditional VAEs on AR models. This work paves the way for a more unified architecture across visual understanding and generative modeling.
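A pure-Python sketch of the two modifications, treating a latent token as a list of channel values; the normalization form and noise level are illustrative, not the paper's exact recipe:

```python
import random
import statistics

random.seed(0)

def normalize_token(vec, eps=1e-6):
    # Token simplification via distribution normalization: standardize the
    # channel distribution so the AR head models a simpler target.
    mu = statistics.fmean(vec)
    sigma = statistics.pstdev(vec)
    return [(v - mu) / (sigma + eps) for v in vec]

def noisy_context(vec, noise_std=0.1):
    # Training-time Gaussian noise on conditioning tokens: the model learns
    # to predict from imperfect context, mitigating exposure bias.
    return [v + random.gauss(0.0, noise_std) for v in vec]
```

At inference time the context is the model's own (imperfect) samples, which is why training on noised context helps close the train-test gap.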
[AI-72] PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance
【速读】: This paper addresses the lack of rigorous benchmarks for evaluating large language model (LLM) agents on complex tool-orchestration tasks in industrial settings, especially Prognostics and Health Management (PHM), where wrong decisions carry major safety and financial consequences. The key to the solution is PHMForge, the first comprehensive PHM-specific benchmark: it simulates realistic interactions with domain-specific MCP servers and comprises 75 expert-curated scenarios spanning 7 industrial asset classes and 5 core task categories, backed by 65 specialized tools and execution-based, task-commensurate metrics (MAE/RMSE for regression, F1-score for classification, categorical matching for health assessment). This enables systematic, quantitative evaluation of LLM agents' key capabilities in tool orchestration, cross-equipment generalization, and multi-asset reasoning.
链接: https://arxiv.org/abs/2604.01532
作者: Ayan Das,Dhaval Patel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures
Abstract:Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.
[AI-73] ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents
【速读】: This paper addresses the gap between existing benchmarks for AI coding assistants and real industrial usage, specifically mismatches in programming-language distribution, prompt style, and codebase structure. The key to the solution is a methodology for curating production-derived benchmarks, illustrated by ProdCodeBench, which is built from real sessions with a production AI coding assistant. Practices such as LLM-based task classification, test relevance validation, and multi-run stability checks ensure reliable, representative evaluation signals in complex monorepo environments. The study also finds that models making greater use of work-validation tools (executing tests, invoking static analysis) achieve higher solve rates (53.2%-72.2%), suggesting that iterative verification is central to effective agent behavior and that exposing codebase-specific verification mechanisms can markedly improve the performance of externally trained agents in unfamiliar environments.
链接: https://arxiv.org/abs/2604.01527
作者: Smriti Jha,Matteo Paltenghi,Chandra Maddila,Vijayaraghavan Murali,Shubham Ugare,Satish Chandra
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench - a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2% revealing that models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification helps achieve effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
[AI-74] LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation
【速读】: This paper addresses the labor-intensive, costly, and hard-to-scale nature of traditional social science research, which requires complex experimental design and real human participants. The core solution is S-Researcher, an LLM-agent-based platform that "siliconizes" both the research process and the participant pool. Its key component is YuLan-OneSim, a large-scale social simulation system built around three requirements: generality (auto-programming executable scenarios from natural language), scalability (a distributed architecture supporting up to 100,000 concurrent agents), and reliability (feedback-driven LLM fine-tuning). Together these form a human-AI collaborative research loop in which researchers retain oversight throughout, validated across three canonical reasoning modes (induction, deduction, and abduction), substantially improving the efficiency and scale of social science research.
链接: https://arxiv.org/abs/2604.01520
作者: Lei Wang,Yuanzi Li,Jinchao Wu,Heyang Gao,Xiaohe Bo,Xu Chen,Ji-Rong Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Traditional social science research often requires designing complex experiments across vast methodological spaces and depends on real human participants, making it labor-intensive, costly, and difficult to scale. Here we present S-Researcher, an LLM-agent-based platform that assists researchers in conducting social science research more efficiently and at greater scale by “siliconizing” both the research process and the participant pool. To build S-Researcher, we first develop YuLan-OneSim, a large-scale social simulation system designed around three core requirements: generality via auto-programming from natural language to executable scenarios, scalability via a distributed architecture supporting up to 100,000 concurrent agents, and reliability via feedback-driven LLM fine-tuning. Leveraging this system, S-Researcher supports researchers in designing social experiments, simulating human behavior with LLM agents, analyzing results, and generating reports, forming a complete human-AI collaborative research loop in which researchers retain oversight and intervention at every stage. We operationalize LLM simulation research paradigms into three canonical reasoning modes (induction, deduction, and abduction) and validate S-Researcher through systematic case studies: inductive reproduction of cultural dynamics consistent with Axelrod’s theory, deductive testing of competing hypotheses on teacher attention validated against survey data, and abductive identification of a cooperation mechanism in public goods games confirmed by human experiments. S-Researcher establishes a new human–AI collaborative paradigm for social science, in which computational simulation augments human researchers to accelerate discovery across the full spectrum of social inquiry.
[AI-75] ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems
【速读】: This paper addresses the operational failures of tool-using agents, including invalid arguments, interface drift, weak recovery, and inefficient retry behavior. The key to the solution is ToolMisuseBench, an offline, deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. It covers CRUD, retrieval, file, and scheduling environments with replayable fault injection, and reports success, invalid-call behavior, policy violations, recovery quality, and budgeted efficiency, providing a standardized way to measure and improve the robustness and reliability of tool-using agents.
链接: https://arxiv.org/abs/2604.01508
作者: Akshey Sigdel,Rista Baral
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Tool using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6800 tasks and a reproducible evaluation pipeline. Baseline results show fault specific recovery gains for schema aware methods, while overall success remains limited under the released authorization and hard failure settings.
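The budgeted step/call/retry accounting can be sketched as follows; the environment and fault model here are toy stand-ins for the benchmark's replayable fault injection:

```python
def run_episode(plan, execute, budgets):
    # Score one episode under explicit step, call, and retry budgets;
    # `execute` stands in for a tool call that may fail and be retried.
    calls = retries = 0
    for step, args in enumerate(plan):
        if step >= budgets["steps"]:
            return {"success": False, "reason": "step budget exhausted"}
        attempt = 0
        while True:
            if calls >= budgets["calls"]:
                return {"success": False, "reason": "call budget exhausted"}
            calls += 1
            if execute(args):
                break                      # tool call succeeded
            attempt += 1
            retries += 1
            if attempt > budgets["retries"]:
                return {"success": False, "reason": "retry budget exhausted"}
    return {"success": True, "calls": calls, "retries": retries}
```

Because every failure is replayed deterministically, the same episode always consumes the same budget, which is what makes efficiency metrics comparable across agents.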
[AI-76] CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
【速读】: This paper addresses the inefficiency of expert-driven development of high-performance GPU kernels in modern machine learning systems, where algorithmic structure, memory-hierarchy usage, and hardware-specific optimizations are tightly coupled. The key to the solution is CuTeGen, an agentic framework that treats kernel development as a structured generate-test-refine workflow: kernels are generated with the CuTe abstraction layer, which keeps performance-critical structure (tiling, data movement) stable under iterative modification, and a single kernel is progressively refined via execution-based validation, structured debugging, and staged optimization, guided by workload-aware optimization prompts and delayed integration of profiling feedback. The result is functionally correct kernels with performance competitive with optimized library implementations.
链接: https://arxiv.org/abs/2604.01489
作者: Tara Saba,Anne Ouyang,Xujie Si,Fan Long
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Software Engineering (cs.SE)
备注:
Abstract:High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate–test–refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.
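The generate-test-refine control loop, reduced to its skeleton; in CuTeGen the three callables would wrap LLM generation, compile/execute validation, and feedback-conditioned editing, whereas here they are placeholders:

```python
def generate_test_refine(generate, validate, refine, max_rounds=5):
    # Progressive refinement of a single evolving kernel: generate once,
    # then iterate validate -> refine until correct or out of budget.
    kernel = generate()
    for round_ in range(max_rounds):
        ok, feedback = validate(kernel)    # execution-based validation
        if ok:
            return kernel, round_
        kernel = refine(kernel, feedback)  # targeted fix, not regeneration
    raise RuntimeError("no correct kernel within the refinement budget")
```

The design choice the abstract emphasizes is visible in the loop shape: one candidate is refined in place rather than searching over many independently generated candidates.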
[AI-77] AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
【速读】: This paper addresses privacy risks in human-centered agentic social networks, where teams of collaborative agents serve individual users and interact across domains and across users, a setting whose privacy dynamics and risks had not been systematically evaluated. The key to the solution is AgentSocialBench, the first benchmark for this setting, comprising scenarios across seven categories grounded in realistic user profiles with hierarchical sensitivity labels and directed social graphs. Experiments show that (1) cross-domain and cross-user coordination creates persistent leakage pressure even when agents are explicitly instructed to protect information, and (2) privacy instructions that teach agents to abstract sensitive information paradoxically make them discuss it more (the "abstraction paradox"), indicating that current LLM agents lack robust privacy-preservation mechanisms in complex social settings and that approaches beyond prompt engineering are needed for safe real-world deployment.
链接: https://arxiv.org/abs/2604.01487
作者: Prince Zizhuang Wang,Shuli Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 43 pages, 9 figures
Abstract:With the rise of personalized, persistent LLM agent frameworks such as OpenClaw, human-centered agentic social networks in which teams of collaborative AI agents serve individual users in a social network across multiple domains are becoming a reality. This setting creates novel privacy challenges: agents must coordinate across domain boundaries, mediate between humans, and interact with other users’ agents, all while protecting sensitive personal information. While prior work has evaluated multi-agent coordination and privacy preservation, the dynamics and privacy risks of human-centered agentic social networks remain unexplored. To this end, we introduce AgentSocialBench, the first benchmark to systematically evaluate privacy risk in this setting, comprising scenarios across seven categories spanning dyadic and multi-party interactions, grounded in realistic user profiles with hierarchical sensitivity labels and directed social graphs. Our experiments reveal that privacy in agentic social networks is fundamentally harder than in single-agent settings: (1) cross-domain and cross-user coordination creates persistent leakage pressure even when agents are explicitly instructed to protect information, (2) privacy instructions that teach agents how to abstract sensitive information paradoxically cause them to discuss it more (we call it abstraction paradox). These findings underscore that current LLM agents lack robust mechanisms for privacy preservation in human-centered agentic social networks, and that new approaches beyond prompt engineering are needed to make agent-mediated social coordination safe for real-world deployment.
[AI-78] Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving
【速读】: This paper addresses the compliance crisis created by autonomous, agentic AI in finance, which relies on probabilistic large language models (LLMs): existing guardrails built on probabilistic classifiers and syntactic validators cannot enforce the multi-variable, mathematically verifiable regulatory constraints mandated by the SEC, FINRA, and the OCC (Office of the Comptroller of the Currency). The key to the solution is the Lean-Agent Protocol, a formal-verification-based AI guardrail platform that uses Harmonic AI's Aristotle neural-symbolic model to auto-formalize institutional policies into Lean 4 code; every proposed agentic action is treated as a mathematical conjecture and is executed only if the Lean 4 kernel proves it satisfies pre-compiled regulatory axioms, delivering cryptographic-level compliance certainty at microsecond latency and directly satisfying multiple financial regulations and explainability mandates.
链接: https://arxiv.org/abs/2604.01483
作者: Devakh Rashie,Veda Rashi
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 8 pages, 1 table. Code and live demo available at this https URL and this https URL
Abstract:The rapid evolution of autonomous, agentic artificial intelligence within financial services has introduced an existential architectural crisis: large language models (LLMs) are probabilistic, non-deterministic systems operating in domains that demand absolute, mathematically verifiable compliance guarantees. Existing guardrail solutions – including NVIDIA NeMo Guardrails and Guardrails AI – rely on probabilistic classifiers and syntactic validators that are fundamentally inadequate for enforcing complex multi-variable regulatory constraints mandated by the SEC, FINRA, and OCC. This paper presents the Lean-Agent Protocol, a formal-verification-based AI guardrail platform that leverages the Aristotle neural-symbolic model developed by Harmonic AI to auto-formalize institutional policies into Lean 4 code. Every proposed agentic action is treated as a mathematical conjecture: execution is permitted if and only if the Lean 4 kernel proves that the action satisfies pre-compiled regulatory axioms. This architecture provides cryptographic-level compliance certainty at microsecond latency, directly satisfying SEC Rule 15c3-5, OCC Bulletin 2011-12, FINRA Rule 3110, and CFPB explainability mandates. A three-phase implementation roadmap from shadow verification through enterprise-scale deployment is provided.
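The action-as-conjecture idea can be illustrated with a minimal Lean 4 fragment; the `Order` structure and notional-limit rule are invented for illustration and are not the Lean-Agent Protocol's actual policy encoding:

```lean
-- Hypothetical sketch: an agent action is admitted only if a proof that it
-- satisfies a compliance axiom type-checks in the Lean 4 kernel.
structure Order where
  notional : Nat
  limit    : Nat

/-- Compliance predicate: the order's notional must not exceed its limit. -/
abbrev compliant (o : Order) : Prop :=
  o.notional ≤ o.limit

/-- A concrete action the agent proposes to execute. -/
def proposed : Order := { notional := 5000, limit := 10000 }

/-- The "conjecture" the kernel must prove before execution is permitted. -/
theorem proposed_ok : compliant proposed := by decide
```

If `proposed_ok` fails to type-check, no proof exists for that action, and under the protocol's design the execution would be blocked rather than attempted.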
[AI-79] DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data
【速读】: This paper addresses the bottleneck that scarce, high-fidelity, privacy-preserving biomedical data imposes on clinical decision support development, in particular the difficulty generative large language models (LLMs) have capturing the complex nonlinear dependencies and severe class imbalance of Electronic Health Records (EHR), which leads to statistically plausible but clinically invalid synthetic records. The key to the solution is the DISCO-TAB framework (DIScriminator-guided COntrol for TABular synthesis), which pairs a fine-tuned LLM with a reinforcement-learning-optimized multi-objective discriminator system providing hierarchical feedback at four granularities (token, sentence, feature, and row), and adds Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and counter minority-class collapse. The framework improves downstream clinical classifier utility by up to 38.2% while maintaining high statistical fidelity (JSD < 0.01) and robustness to membership inference attacks.
链接: https://arxiv.org/abs/2604.01481
作者: Arshia Ilaty,Hossein Shirazi,Amir Rahmani,Hajar Homayouni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often struggle to capture the complex, non-linear dependencies and severe class imbalances inherent in Electronic Health Records (EHR), leading to statistically plausible but clinically invalid records. To bridge this gap, we introduce DISCO-TAB (DIScriminator-guided COntrol for TABular synthesis), a novel framework that orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning. Unlike prior methods relying on scalar feedback, DISCO-TAB evaluates synthesis at four granularities, token, sentence, feature, and row, while integrating Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and resolve minority-class collapse. We rigorously validate our framework across diverse benchmarks, including high-dimensional, small-sample medical datasets (e.g., Heart Failure, Parkinson’s). Our results demonstrate that hierarchical feedback yields state-of-the-art performance, achieving up to 38.2% improvement in downstream clinical classifier utility compared to GAN and Diffusion baselines, while ensuring exceptional statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks. This work establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications.
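Inverse-frequency reward shaping can be sketched directly; the weighting rule below is the standard inverse-frequency form and may differ in detail from DISCO-TAB's:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    # Rare classes receive proportionally larger reward weight, countering
    # minority-class collapse in the generator. Weights average to 1 when
    # classes are balanced.
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}

def shaped_reward(base_reward, label, weights):
    # Scale a per-record reward by the weight of the class it belongs to.
    return base_reward * weights[label]
```

With a 9:1 class imbalance, a minority-class record earns several times the reward of a majority-class one, pushing the generator toward covering rare outcomes.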
[AI-80] A Self-Evolving Agentic Framework for Metasurface Inverse Design
【速读】: This paper addresses the high expertise barrier in metasurface inverse design, where translating target optical responses into executable, solver-compatible workflows demands specialized knowledge, and existing language-model-based systems cannot reuse workflow knowledge across tasks. The key to the solution is an agentic framework that couples a coding agent, evolving skill artifacts, and a deterministic evaluator grounded in physical simulation, so that solver-specific strategies are iteratively refined and accumulated across tasks without modifying model weights or the underlying physics solver. This context-level skill evolution substantially raises task success rates and efficiency, and shows partial transfer of workflow knowledge to unseen task types.
链接: https://arxiv.org/abs/2604.01480
作者: Yi Huang,Bowen Zheng,Yunxi Dong,Hong Tang,Huan Zhao,S. M. Rakibul Hasan Shawon,Hualiang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:
Abstract:Metasurface inverse design has become central to realizing complex optical functionality, yet translating target responses into executable, solver-compatible workflows still demands specialized expertise in computational electromagnetics and solver-specific software engineering. Recent large language models (LLMs) offer a complementary route to reducing this workflow-construction burden, but existing language-driven systems remain largely session-bounded and do not preserve reusable workflow knowledge across inverse-design tasks. We present an agentic framework for metasurface inverse design that addresses this limitation through context-level skill evolution. The framework couples a coding agent, evolving skill artifacts, and a deterministic evaluator grounded in physical simulation so that solver-specific strategies can be iteratively refined across tasks without modifying model weights or the underlying physics solver. We evaluate the framework on a benchmark spanning multiple metasurface inverse-design task types, with separate training-aligned and held-out task families. Evolved skills raise in-distribution task success from 38% to 74%, increase criteria pass fraction from 0.510 to 0.870, and reduce average attempts from 4.10 to 2.30. On held-out task families, binary success changes only marginally, but improvements in best margin together with shifts in error composition and agent behavior indicate partial transfer of workflow knowledge. These results suggest that the main value of skill evolution lies in accumulating reusable solver-specific expertise around reliable computational engines, thereby offering a practical path toward more autonomous and accessible metasurface inverse-design workflows.
[AI-81] SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
【速读】: This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks. Existing guardrails detect malicious queries from internal features or textual responses, but they either introduce substantial latency or suffer high false-positive rates caused by generation randomness. The key to the solution is SelfGrader, a lightweight guardrail that formulates jailbreak detection as numerical grading over token-level logits: the safety of a user query is evaluated within a compact set of numerical tokens (e.g., 0-9), and their logit distribution is interpreted as an internal safety signal. A dual-perspective scoring rule that considers both the maliciousness and the benignness of the query yields a stable, interpretable harmfulness score, reducing false positives while improving detection accuracy.
链接: https://arxiv.org/abs/2604.01473
作者: Zikai Zhang,Rui Hu,Olivera Kotevska,Jiahao Xu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from the randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with human intuition of maliciousness, SelfGrader introduces a dual-perspective scoring rule that considers both the maliciousness and benignness of the query, yielding a stable and interpretable score that reflects harmfulness and reduces the false positive rate simultaneously. Extensive experiments across diverse jailbreak benchmarks, multiple LLMs, and state-of-the-art guardrail baselines demonstrate that SelfGrader achieves up to a 22.66% reduction in ASR on LLaMA-3-8B, while maintaining significantly lower memory overhead (up to 173x) and latency (up to 26x).
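A sketch of grading over numerical-token logits; the expected-grade readout and the subtraction used to combine the two views are illustrative interpretations of the dual-perspective rule, not the paper's exact formula:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def self_grade(nt_logits):
    # nt_logits: logits for the numerical tokens "0".."9" at the grading
    # position. The expected grade under the softmax distribution gives a
    # stable scalar, unlike sampling a single grade token.
    probs = softmax(nt_logits)
    return sum(i * p for i, p in enumerate(probs))

def dual_perspective_score(malicious_logits, benign_logits):
    # One pass grades maliciousness, a second grades benignness; combining
    # the two views (here, by subtraction) suppresses false positives.
    return self_grade(malicious_logits) - self_grade(benign_logits)
```

Reading the grade off the logit distribution avoids the randomness of decoding a textual verdict, which is the core of the method's stability claim.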
[AI-82] Reducing Hallucinations in LLM-based Scientific Literature Analysis Using Peer Context Outlier Detection
【速读】: This paper addresses hallucination in large language models (LLMs) when extracting data from large text corpora, i.e., generated content that conflicts with the facts and degrades extraction accuracy. Conventional remedies such as prompt engineering and chain-of-thought prompting consider only individual documents and ignore relationships across the corpus. The key to the proposed Peer Context Outlier Detection (P-COD) is validation via cross-document relationships: extracted results from a document are compared against validated information from "peer" documents with similar experimental settings in the same domain, confidence scores are adjusted accordingly, low-confidence results are flagged for expert review, and high-confidence, peer-supported results are treated as reliable, substantially improving the accuracy and trustworthiness of automated data extraction.
链接: https://arxiv.org/abs/2604.01461
作者: Daniel Xie,Maxwell J. Jacobson,Adil Wazeer,Haiyan Wang,Xinghang Zhang,Yexiang Xue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reducing hallucinations in Large Language Models (LLMs) is essential for improving the accuracy of data extraction from large text corpora. Current methods, like prompt engineering and chain-of-thought prompting, focus on individual documents but fail to consider relationships across a corpus. This paper introduces Peer Context Outlier Detection (P-COD), a novel approach that uses the relationships between documents to improve extraction accuracy. Our application domain is in scientific literature summarization, where papers with similar experiment settings should draw similar conclusions. By comparing extracted data to validated peer information within the corpus, we adjust confidence scores and flag low-confidence results for expert review. High-confidence results, supported by peer validation, are considered reliable. Our experiments demonstrate up to 98% precision in outlier detection across 6 domains of science, demonstrating that our design reduces hallucinations, enhances trust in automated systems, and allows researchers to focus on ambiguous cases, streamlining the data extraction workflows.
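The peer-validation step can be sketched as a simple z-score outlier check against values extracted from peer documents (a plausible reading of P-COD's confidence adjustment, not the paper's exact rule):

```python
import statistics

def peer_confidence(value, peer_values, z_threshold=2.0):
    # Compare an extracted value against validated values from peer papers
    # with similar experimental settings; large deviations are flagged for
    # expert review, while peer-supported values are treated as reliable.
    mu = statistics.fmean(peer_values)
    sigma = statistics.pstdev(peer_values) or 1e-9   # guard identical peers
    z = abs(value - mu) / sigma
    return {"z": z, "flag_for_review": z > z_threshold}
```

An extraction that agrees with its peers passes silently; one that sits far outside the peer distribution is routed to a human, concentrating expert effort on the ambiguous cases.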
[AI-83] Infeasibility Aware Large Language Models for Combinatorial Optimization
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在求解NP-hard组合优化问题时,普遍缺乏对不可行实例的显式检测能力的问题。现有方法主要聚焦于生成可行解,但无法识别问题实例是否真正存在解,导致在实际应用中可能浪费计算资源于无解场景。解决方案的关键在于提出一个具有不可行性感知能力的框架,其核心包括:(1)基于数学规划的新建模方法与可证明的零相位不可行性筛选机制,实现训练数据的精确标注(可行并附结构化证书或证伪);(2)通过监督微调使LLM同时具备生成解和判断不可行性的能力;(3)利用LLM输出作为下游局部搜索的热启动(warm start),即使预测不完美也能显著加速收敛。实验表明,该方法相较GPT-5.2整体准确率提升达30%,且在局部搜索中实现最高2倍的速度提升。
链接: https://arxiv.org/abs/2604.01455
作者: Yakun Wang,Min Chen,Zeguan Wu,Junyu Liu,Sitao Zhang,Zhenwen Shao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly explored for NP-hard combinatorial optimization problems, but most existing methods emphasize feasible-instance solution generation and do not explicitly address infeasibility detection. We propose an infeasibility-aware framework that combines certifiable dataset construction, supervised fine-tuning, and LLM-assisted downstream search. For the minor-embedding problem, we introduce a new mathematical programming formulation together with provable zero-phase infeasibility screening, which enables scalable construction of training instances labeled either as feasible with structured certificates or as certifiably infeasible. Using training data generated through this exact optimization pipeline, we show that an 8B-parameter LLM can be fine-tuned to jointly perform solution generation and infeasibility detection. We further utilize LLM outputs as warm starts for downstream local search, providing a practical way to accelerate optimization even when the LLM outputs are imperfect. Experiments show that our fine-tuned model improves overall accuracy by up to 30% over GPT-5.2; meanwhile LLM-guided warm starts provide up to 2× speedup compared with starting from scratch in downstream local search.
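摘要中"以 LLM 输出作为下游局部搜索热启动"的做法,可用一个与论文无关的玩具问题(最大化比特串中 1 的个数)来示意。以下为笔者假设的极简草图,并非论文针对 minor-embedding 问题的实现:

```python
def hill_climb(score, neighbors, start, max_iters=100):
    """最陡上升局部搜索;start 可以是 LLM 提案的热启动解。"""
    cur, cur_s = start, score(start)
    for _ in range(max_iters):
        cand = max(neighbors(cur), key=score)  # 取最优邻居
        s = score(cand)
        if s <= cur_s:
            break  # 到达局部最优
        cur, cur_s = cand, s
    return cur, cur_s

# 玩具问题:最大化比特串中 1 的个数
score = lambda bits: sum(bits)
neighbors = lambda bits: [bits[:i] + (1 - bits[i],) + bits[i + 1:]
                          for i in range(len(bits))]

cold = (0,) * 8                   # 从零开始
warm = (1, 1, 1, 0, 1, 1, 0, 1)  # 不完美的 LLM 提案
print(hill_climb(score, neighbors, warm))  # 两步即达最优
```

热启动的价值在于:即使 LLM 提案不完美,只要比随机初始点更接近最优,局部搜索的迭代次数即可显著减少。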
[AI-84] A Multi-Agent Human-LLM Collaborative Framework for Closed-Loop Scientific Literature Summarization
【速读】:该论文旨在解决科学文献碎片化导致的科研效率低下问题,即研究人员需耗费大量人力收集、分析和理解分散的文献信息,从而阻碍了科学发现的进程。解决方案的关键在于提出了一种多智能体、人机协同(human-in-the-loop)的系统 Elhuyar,该系统融合了大语言模型(Large Language Models, LLMs)、结构化人工智能与人类科学家的协作,通过任务分工实现对科学文献的分步提取、分析与迭代优化:具体包括由专用智能体负责筛选论文、提取数据、拟合模型及总结发现,并辅以人类专家校验确保可靠性,最终生成包含结构化数据、可视化图表、模型方程和文本摘要的报告,从而支持深度探究与科学洞见的持续迭代。
链接: https://arxiv.org/abs/2604.01452
作者: Maxwell J. Jacobson,Daniel Xie,Jackson Shen,Adil Wazeer,Haiyan Wang,Xinghang Zhang,Yexiang Xue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific discovery is slowed by fragmented literature that requires excessive human effort to gather, analyze, and understand. AI tools, including autonomous summarization and question answering, have been developed to aid in understanding scientific literature. However, these tools lack the structured, multi-step approach necessary for extracting deep insights from scientific literature. Large Language Models (LLMs) offer new possibilities for literature analysis, but remain unreliable due to hallucinations and incomplete extraction. We introduce Elhuyar, a multi-agent, human-in-the-loop system that integrates LLMs, structured AI, and human scientists to extract, analyze, and iteratively refine insights from scientific literature. The framework distributes tasks among specialized agents for filtering papers, extracting data, fitting models, and summarizing findings, with human oversight ensuring reliability. The system generates structured reports with extracted data, visualizations, model equations, and text summaries, enabling deeper inquiry through iterative refinement. Deployed in materials science, it analyzed literature on tungsten under helium-ion irradiation, showing experimentally correlated exponential helium bubble growth with irradiation dose and temperature, offering insight for plasma-facing materials (PFMs) in fusion reactors. This demonstrates how AI-assisted literature review can uncover scientific patterns and accelerate discovery.
[AI-85] When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)在药物管理领域应用中存在的一大关键问题:尽管AI系统在标准性能指标下表现良好,但其在真实临床决策场景中的可靠性仍缺乏深入理解,尤其是在高风险情境下,错误推荐可能导致严重患者伤害。解决方案的关键在于从传统的整体性能评估转向对系统故障模式及其临床后果的细致分析,通过模拟药物相互作用和剂量决策的可控场景,识别出如遗漏相互作用、误报风险信号及不当剂量建议等典型错误类型,并强调必须结合风险感知的评估方法来补充传统指标,从而提升AI在药学实践等安全敏感领域中的可信度与安全性。
链接: https://arxiv.org/abs/2604.01449
作者: Khalid Adnan Alsayed
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 1 figure. Position paper with simulated experimental analysis of AI reliability in medication decision systems
Abstract:Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.
[AI-86] ClawSafety: “Safe” LLMs, Unsafe Agents
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 代理在高权限本地环境中面临的严重安全风险问题,即通过提示注入(prompt injection)攻击可能导致敏感凭证泄露、金融交易篡改或文件破坏等危害,而现有安全评估方法因局限于孤立聊天场景、依赖合成环境且未考虑代理框架对安全性的影响,难以真实反映风险。解决方案的关键在于提出 CLAWSAFETY 基准,包含 120 个基于现实专业工作空间(如软件工程、金融、医疗等)的对抗性测试场景,并从危害领域、攻击向量和有害行为类型三个维度组织;每个场景将恶意内容嵌入代理日常交互的三种通道之一——技能指令文件、来自可信发件人的邮件或网页内容。实验表明,攻击成功率(ASR)在 40% 至 75% 之间波动,且技能指令类注入最具威胁,同时揭示模型与代理框架共同决定安全性,强调需将二者视为联合变量进行系统级安全评估。
链接: https://arxiv.org/abs/2604.01438
作者: Bowen Wei,Yunbei Zhang,Jinhao Pan,Kai Mei,Xiao Wang,Jihun Hamm,Ziwei Zhu,Yingqiang Ge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Personal AI agents like OpenClaw run with elevated privileges on users’ local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text-level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce CLAWSAFETY, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails from trusted senders, and web pages. We evaluate five frontier LLMs as agent backbones, running 2,520 sandboxed trials across all configurations. Attack success rates (ASR) range from 40% to 75% across models and vary sharply by injection vector, with skill instructions (highest trust) consistently more dangerous than email or web content. Action-trace analysis reveals that the strongest model maintains hard boundaries against credential forwarding and destructive actions, while weaker models permit both. Cross-scaffold experiments on three agent frameworks further demonstrate that safety is not determined by the backbone model alone but depends on the full deployment stack, calling for safety evaluation that treats model and framework as joint variables.
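摘要中按注入向量(skill/email/web)统计攻击成功率(ASR)的做法,本质上是对试验记录的分组聚合。以下为笔者假设的示意片段,数据为虚构:

```python
from collections import defaultdict

def asr_by_vector(trials):
    """按注入向量统计攻击成功率(ASR = 成功次数 / 试验次数)。"""
    hits, totals = defaultdict(int), defaultdict(int)
    for vector, success in trials:
        totals[vector] += 1
        hits[vector] += success  # bool 按 0/1 累加
    return {v: hits[v] / totals[v] for v in totals}

# 虚构的沙箱试验记录:(注入向量, 攻击是否成功)
trials = [("skill", True), ("skill", True), ("email", False), ("email", True)]
print(asr_by_vector(trials))  # {'skill': 1.0, 'email': 0.5}
```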
[AI-87] Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
【速读】:该论文旨在解决当前Agentic AI在软件工程(Software Engineering, SE)领域评估实践中存在的可复现性差、解释性不足的问题。现有研究常因缺乏对大语言模型(Large Language Models, LLMs)黑箱行为的透明化描述,以及评估设计细节缺失,导致结果难以复现且无法有效比较不同方法的优势与局限。其解决方案的关键在于提出一套系统性指南,强调研究人员应公开Thought-Action-Result (TAR)轨迹和LLM交互数据(或其摘要版本),从而推动可复现、可解释且有效的Agentic AI评估实践。通过一个概念验证案例,论文进一步证明了利用TAR轨迹进行跨方法系统分析的可行性。
链接: https://arxiv.org/abs/2604.01437
作者: Jingyue Li,André Storhaug
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures, accepted to the 2nd International Workshop on Responsible Software Engineering (ResponsibleSE 2026), co-located with FSE
Abstract:With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to empower reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to more effectively analyze the strengths and weaknesses of different Agentic AI approaches. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.
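论文建议公开的 Thought-Action-Result(TAR)轨迹,可以用一个简单的结构化记录来表示。以下字段命名为笔者示意,并非论文规定的格式:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TARStep:
    """智能体轨迹中的一步:思考、动作与结果。"""
    thought: str
    action: str
    result: str

trajectory = [
    TARStep("The test fails on empty input.", "run_tests()", "1 failed"),
    TARStep("Add a guard clause for empty lists.", "edit(utils.py)", "patched"),
]

# 以 JSON 公开轨迹,便于后续研究跨方法比较
print(json.dumps([asdict(s) for s in trajectory], indent=2))
```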
[AI-88] Leveraging the Value of Information in POMDP Planning
【速读】:该论文旨在解决在有限规划时间内,如何为大规模部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDP)高效生成近优策略的问题。其核心挑战源于状态和转移不确定性带来的维度灾难(curse of dimensionality)与历史灾难(curse of history)。解决方案的关键在于引入一种基于信息价值(Value of Information, VOI)的动态规划框架,该框架通过在每个信念状态(belief)上评估观测信息的价值,仅在VOI较高时才进行观测信息的条件性处理,从而避免不必要的观测分支,提升计算资源的分配效率。在此基础上提出的VOIMCP算法是一种蒙特卡洛树搜索方法,实现了对计算资源的智能调度,并提供了理论上的近优性保证及非渐近收敛边界,实验证明其在多个POMDP基准测试中优于现有基线方法。
链接: https://arxiv.org/abs/2604.01434
作者: Zakariya Laouar,Qi Heng Ho,Zachary Sunberg
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Partially observable Markov decision processes (POMDPs) offer a principled formalism for planning under state and transition uncertainty. Despite advances made towards solving large POMDPs, obtaining performant policies under limited planning time remains a major challenge due to the curse of dimensionality and the curse of history. For many POMDP problems, the value of information (VOI) - the expected performance gain from reasoning about observations - varies over the belief space. We introduce a dynamic programming framework that exploits this structure by conditionally processing observations based on the value of information at each belief. Building on this framework, we propose Value of Information Monte Carlo planning (VOIMCP), a Monte Carlo Tree Search algorithm that allocates computational effort more efficiently by selectively disregarding observation information when the VOI is low, avoiding unnecessary branching of observations. We provide theoretical guarantees on the near-optimality of our VOI reasoning framework and derive non-asymptotic convergence bounds for VOIMCP. Simulation evaluations demonstrate that VOIMCP outperforms baselines on several POMDP benchmarks.
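"信息价值(VOI)"即"先观测再行动"与"直接基于先验信念行动"之间的期望收益差。下面给出一个笔者自拟的离散单步示意(字典表示信念),仅用于说明概念,并非 VOIMCP 的实现:

```python
def value_of_information(belief, obs_model, reward):
    """某信念处的 VOI:基于观测选动作相对于仅凭先验选动作的期望增益。

    belief:    {状态: 概率}
    obs_model: {状态: {观测: 概率}}
    reward:    {动作: {状态: 收益}}
    """
    def best_value(b):
        return max(sum(p * r[s] for s, p in b.items()) for r in reward.values())

    prior_value = best_value(belief)  # 仅凭先验行动的价值

    # 观测的边际分布
    obs_probs = {}
    for s, p in belief.items():
        for o, q in obs_model[s].items():
            obs_probs[o] = obs_probs.get(o, 0.0) + p * q

    # 允许动作依赖观测时的期望价值
    informed = 0.0
    for o, po in obs_probs.items():
        posterior = {s: belief[s] * obs_model[s].get(o, 0.0) / po for s in belief}
        informed += po * best_value(posterior)
    return informed - prior_value

belief = {"s0": 0.5, "s1": 0.5}                     # 均匀先验
obs_model = {"s0": {"o0": 1.0}, "s1": {"o1": 1.0}}  # 无噪声观测
reward = {"a0": {"s0": 1.0, "s1": 0.0}, "a1": {"s0": 0.0, "s1": 1.0}}
print(value_of_information(belief, obs_model, reward))  # 0.5
```

VOIMCP 正是在 VOI 较低的信念处跳过观测分支,以把计算集中在观测真正有价值的地方。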
[AI-89] Semantically Annotated Multimodal Dataset for RF Interpretation and Prediction
【速读】:该论文旨在解决当前无线建模与基于射频(RF)的人工智能(AI)研究中因缺乏高质量、基于实测数据集而导致的瓶颈问题,即现有RF热图(RF heatmap)虽具高维复杂性,但缺乏几何和语义上下文,难以支撑监督学习模型的发展。其解决方案的关键在于构建一类新型多模态数据集,将RF测量数据与高分辨率摄像头和激光雷达(lidar)等辅助模态进行精确时空配准,从而建立RF信号与其物理环境之间的映射关系;通过创建体素级标注的数字孪生(digital replica),支持从视觉到RF热图的正向预测(用于无线系统设计革新)以及从RF信号到场景语义的逆向推理(实现新型RF感知),推动生成式AI在无线感知与通信领域的变革性研究。
链接: https://arxiv.org/abs/2604.01433
作者: Steve Blandino,Jelena Senic,Raied Caromi,Samuel Berweger,Anuraag Bodi,Camillo Gentile,Nada Golmie
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Current limitations in wireless modeling and radio frequency (RF)-based AI are primarily driven by a lack of high-quality, measurement-based datasets that connect RF signals to their physical environments. RF heatmaps, the typical form of such data, are high-dimensional and complex but lack the geometric and semantic context needed for interpretation, constraining the development of supervised machine learning models. To address this bottleneck, we propose a new class of multimodal datasets that combines RF measurements with auxiliary modalities like high-resolution cameras and lidar to bridge the gap between RF signals and their physical causes. The proposed data collection will span diverse indoor and outdoor environments, featuring both static and dynamic scenarios, including human activities ranging from walking to subtle gestures. By achieving precise spatial and temporal co-registration and creating digital replicas for voxel-level annotation, this dataset will enable transformative AI research. Key tasks include the forward problem of predicting RF heatmaps from visual data to revolutionize wireless system design, and the inverse problem of inferring scene semantics from RF signals, creating a new form of RF-based perception.
[AI-90] Can LLMs Predict Academic Collaboration? Topology Heuristics vs. LLM-Based Link Prediction on Real Co-authorship Networks
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)是否能够预测科研人员未来合作的可能性这一问题,特别是在仅使用作者个人资料(如研究领域、发表记录等元数据)而无法访问图结构信息的情况下。其核心解决方案在于利用LLMs从作者元数据中推理潜在合作关系,而非依赖传统的拓扑启发式方法(如Common Neighbors、Jaccard指数等)。关键发现表明,LLMs与拓扑方法捕捉的是互补信号:在无共同邻居的新边预测场景下,LLM的AUROC达到0.652,显著优于所有拓扑方法(均为0);且当提供预计算的图特征时,LLM性能反而下降,说明其优势源于对元数据的独立推理能力,而非对结构信息的隐式编码。这确立了LLMs作为独立于传统网络分析的新型协作预测工具的价值。
链接: https://arxiv.org/abs/2604.01379
作者: Fan Huang,Munjung Kim
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:
Abstract:Can large language models (LLMs) predict which researchers will collaborate? We study this question through link prediction on real-world co-authorship networks from OpenAlex (9.96M authors, 108.7M edges), evaluating whether LLMs can predict future scientific collaborations using only author profiles, without access to graph structure. Using Qwen2.5-72B-Instruct across three historical eras of AI research, we find that LLMs and topology heuristics capture distinct signals and are strongest in complementary settings. On new-edge prediction under natural class imbalance, the LLM achieves AUROC 0.714–0.789, outperforming Common Neighbors, Jaccard, and Preferential Attachment, with recall up to 92.9%; under balanced evaluation, the LLM outperforms all topology heuristics in every era (AUROC 0.601–0.658 vs. best-heuristic 0.525–0.538); on continued edges, the LLM (0.687) is competitive with Adamic-Adar (0.684). Critically, 78.6–82.7% of new collaborations occur between authors with no common neighbor – a blind spot where all topology heuristics score zero but the LLM still achieves AUROC 0.652 by reasoning from author metadata alone. A temporal metadata ablation reveals that research concepts are the dominant signal (removing concepts drops AUROC by 0.047–0.084). Providing pre-computed graph features to the LLM degrades performance due to anchoring effects, confirming that LLMs and topology methods should operate as separate, complementary channels. A socio-cultural ablation finds that name-inferred ethnicity and institutional country do not predict collaboration beyond topology, reflecting the demographic homogeneity of AI research. A node2vec baseline achieves AUROC comparable to Adamic-Adar, establishing that LLMs access a fundamentally different information channel – author metadata – rather than encoding the same structural signal differently.
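摘要中反复对比的几种拓扑启发式(Common Neighbors、Jaccard、Adamic-Adar)定义都很简单。下面用一个虚构的小型合著网络作示意;其中 a 与 e 没有共同邻居,正对应论文指出的拓扑方法"盲区":

```python
import math

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    # 共同邻居按 1/log(度) 加权;跳过度为 1 的节点以避免除零
    return sum(1 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)

# 虚构的小型合著网络(邻接集合表示)
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
print(common_neighbors(adj, "a", "d"))  # 1(经由 c)
print(common_neighbors(adj, "a", "e"))  # 0:所有拓扑启发式在此失效
```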
[AI-91] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
【速读】:该论文旨在解决现有基于评分标准(rubric)的评估体系在大语言模型(LLM)训练与评测中缺乏系统性诊断方法的问题,尤其在仅依赖下游信号(如强化学习结果)时难以识别评分标准本身的质量缺陷。解决方案的关键在于提出RIFT(RubrIc Failure mode Taxonomy),一个基于扎根理论构建的八类失败模式分类体系,涵盖可靠性、内容效度和后果效度三个维度,能够系统化地刻画评分标准在设计与组成中的潜在问题,并通过人工标注一致性验证其有效性,同时开发自动化指标实现可扩展的质量诊断,F1分数最高达0.86,表明其与人类标注高度一致。
链接: https://arxiv.org/abs/2604.01375
作者: Zhengyang Qi,Charles Dickens,Derek Pham,Amanda Dsouza,Armin Parchami,Frederic Sala,Paroma Varma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose rubric quality issues from such aggregated or downstream signals alone. To address this gap, we introduce RIFT: RubrIc Failure mode Taxonomy, a taxonomy for systematically characterizing failure modes in rubric composition and design. RIFT consists of eight failure modes organized into three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. RIFT is developed using grounded theory by iteratively annotating rubrics drawn from five diverse benchmarks spanning general instruction following, code generation, creative writing, and expert-level deep research, until no new failure modes are identified. We evaluate the consistency of the taxonomy by measuring agreement among independent human annotators, observing fair agreement overall (87% pairwise agreement and 0.64 average Cohen’s kappa). Finally, to support scalable diagnosis, we propose automated rubric quality metrics and show that they align with human failure-mode annotations, achieving up to 0.86 F1.
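摘要中用于衡量标注者一致性的 Cohen's kappa 定义为 (p_o - p_e) / (1 - p_e),其中 p_o 为观测一致率、p_e 为随机期望一致率。以下为一个自包含的计算草图,示例标签为虚构:

```python
def cohens_kappa(a, b):
    """两位标注者类别标签的 Cohen's kappa(假设 p_e < 1)。"""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # 观测一致率
    cats = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)  # 期望一致率
    return (po - pe) / (1 - pe)

a = ["ok", "ok", "fail", "ok"]
b = ["ok", "fail", "fail", "ok"]
print(cohens_kappa(a, b))  # 0.5
```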
[AI-92] CogBias: Measuring and Mitigating Cognitive Bias in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在高风险决策场景中表现出系统性认知偏差的问题,特别是这些偏差是否具有可识别的内部表征及其能否通过针对性干预进行缓解。解决方案的关键在于:首先构建了LLM CogBias基准,系统化地识别四类认知偏差(判断、信息处理、社会和响应);其次利用对比设计下的线性探测方法,证明这些偏差以线性可分方向编码于模型激活空间中;最后采用激活调控(activation steering)技术对偏差方向进行干预,在不显著损害下游任务性能的前提下,实现偏倚分数降低26–32%,且该方法在不同架构模型间具有普适性,表明认知偏差存在共享的功能组织机制。
链接: https://arxiv.org/abs/2604.01366
作者: Fan Huang,Songheng Zhang,Haewoon Kwak,Jisun An
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making contexts. While prior work has shown that LLMs exhibit cognitive biases behaviorally, whether these biases correspond to identifiable internal representations and can be mitigated through targeted intervention remains an open question. We define LLM cognitive bias as systematic, reproducible deviations from correct answers in tasks with computable ground-truth baselines, and introduce LLM CogBias, a benchmark organized around four families of cognitive biases: Judgment, Information Processing, Social, and Response. We evaluate three LLMs and find that cognitive biases emerge systematically across all four families, with magnitudes and debiasing responses that are strongly family-dependent: prompt-level debiasing substantially reduces Response biases but backfires for Judgment biases. Using linear probes under a contrastive design, we show that these biases are encoded as linearly separable directions in model activation space. Finally, we apply activation steering to modulate biased behavior, achieving 26–32% reduction in bias score (fraction of biased responses) while preserving downstream capability on 25 benchmarks (Llama: negligible degradation; Qwen: up to -19.0pp for Judgment biases). Despite near-orthogonal bias representations across models (mean cosine similarity 0.01), steering reduces bias at similar rates across architectures (r(246) = .621, p < .001), suggesting shared functional organization.
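摘要中的"激活调控(activation steering)"通常是沿探测到的偏差方向对隐层激活做投影消减。以下为笔者的纯 Python 示意(不涉及具体模型),假设偏差方向 v 已由线性探针得到:

```python
import math

def steer(h, v, alpha=1.0):
    """从激活向量 h 中按系数 alpha 投影消减偏差方向 v。"""
    norm = math.sqrt(sum(x * x for x in v))
    vhat = [x / norm for x in v]                 # 单位化偏差方向
    dot = sum(a, b = 0) if False else sum(a * b for a, b in zip(h, vhat))
    return [a - alpha * dot * b for a, b in zip(h, vhat)]

# alpha=1 为完全消除该方向上的分量
print(steer([3.0, 4.0], [1.0, 0.0]))  # [0.0, 4.0]
```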
[AI-93] Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks
【速读】:该论文试图解决的问题是:人工智能(AI)自动化在不同任务上的演进模式究竟是局部突变式的“ crashing waves”(即某些特定任务上能力突然跃升),还是广泛持续性的“rising tides”(即整体能力稳步提升)。其解决方案的关键在于通过大规模实证分析,基于来自美国劳工部O*NET分类体系的3000余项文本类任务,对超过17000次人工评估结果进行系统性测量,发现AI能力的增长主要表现为广谱、渐进式的提升,而非局部爆发式跃迁;并据此预测LLM(大语言模型)将在2029年前达到80%-95%的任务完成成功率,从而为理解AI对劳动力市场和经济结构的长期影响提供量化依据。
链接: https://arxiv.org/abs/2604.01363
作者: Matthias Mertens,Adam Kuzee,Brittany S. Harris,Harry Lyu,Wensu Li,Jonathan Rosenfeld,Meiri Anto,Martin Fleming,Neil Thompson
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:
Abstract:We propose that AI automation is a continuum between: (i) crashing waves where AI capabilities surge abruptly over small sets of tasks, and (ii) rising tides where the increase in AI capabilities is more continuous and broad-based. We test for these effects in preliminary evidence from an ongoing evaluation of AI capabilities across over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization that are text-based and thus LLM-addressable. Based on more than 17,000 evaluations by workers from these jobs, we find little evidence of crashing waves (in contrast to recent work by METR), but substantial evidence that rising tides are the primary form of AI automation. AI performance is high and improving rapidly across a wide range of tasks. We estimate that, in 2024-Q2, AI models successfully complete tasks that take humans approximately 3-4 hours with about a 50% success rate, increasing to about 65% by 2025-Q3. If recent trends in AI capability growth persist, this pace of AI improvement implies that LLMs will be able to complete most text-related tasks with success rates of, on average, 80%-95% by 2029 at a minimally sufficient quality level. Achieving near-perfect success rates at this quality level or comparable success rates at superior quality would require several additional years. These AI capability improvements would impact the economy and labor market as organizations adopt AI, which could have a substantially longer timeline.
[AI-94] Semantic Modeling for World-Centered Architectures
【速读】:该论文旨在解决传统以代理为中心(agent-centered)架构在复杂结构化领域(如企业与制度系统)中难以保证语义一致性、可解释性及长期稳定性的问题。其解决方案的关键在于引入世界中心多智能体系统(World-Centered Multi-Agent Systems, WMAS),通过构建一个共享且显式的“世界模型”(world model),使学习与协调机制基于该统一表示进行,从而实现全局一致性与可验证的系统行为。论文进一步提出语义模型(semantic models)作为数学形式化工具来表征此类世界,并以Ontobox平台作为WMAS的实际实现。
链接: https://arxiv.org/abs/2604.01359
作者: Andrei Mantsivoda,Darya Gavrilina
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 1 figure, MathAI conference
Abstract:We introduce world-centered multi-agent systems (WMAS) as an alternative to traditional agent-centered architectures, arguing that structured domains such as enterprises and institutional systems require a shared, explicit world representation to ensure semantic consistency, explainability, and long-term stability. We classify worlds along dimensions including ontological explicitness, normativity, etc. In WMAS, learning and coordination operate over a shared world model rather than isolated agent-local representations, enabling global consistency and verifiable system behavior. We propose semantic models as a mathematical formalism for representing such worlds. Finally, we present the Ontobox platform as a realization of WMAS.
[AI-95] Safety Security and Cognitive Risks in World Models
【速读】:该论文旨在解决世界模型(World Models)在机器人、自动驾驶和智能体AI等安全关键场景中引入的独特安全性、可靠性与认知风险问题,特别是针对其预测能力可能被恶意利用而导致的灾难性故障。解决方案的关键在于构建一个统一的威胁模型,扩展MITRE ATLAS和OWASP LLM Top 10至世界模型栈,并提出五类攻击者能力分类;同时通过实证验证了轨迹持久性对抗攻击(trajectory-persistent adversarial attacks)的存在性与危害性(如GRU-RSSM中动作放大倍数达2.26倍,且对抗微调后性能下降59.5%),并结合对抗加固、对齐工程、NIST AI RMF及欧盟AI法案治理框架与人因设计,提出跨学科的缓解策略,强调世界模型应被视为与飞行控制系统或医疗设备同等重要的安全关键基础设施进行严格管控。
链接: https://arxiv.org/abs/2604.01346
作者: Manoj Parmar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 26 pages, 1 figure (6 panels), 2 tables. Empirical proof-of-concept on GRU/RSSM/DreamerV3 architectures
Abstract:World models – learned internal simulators of environment dynamics – are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause catastrophic failures in safety-critical deployments. World model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking precisely because they can simulate the consequences of their own actions. Authoritative world model predictions further foster automation bias and miscalibrated human trust that operators lack the tools to audit. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker capability taxonomy; and develops a unified threat model extending MITRE ATLAS and the OWASP LLM Top 10 to the world model stack. We provide an empirical proof-of-concept on trajectory-persistent adversarial attacks (GRU-RSSM: A_1 = 2.26x amplification, -59.5% reduction under adversarial fine-tuning; stochastic RSSM proxy: A_1 = 0.65x; DreamerV3 checkpoint: non-zero action drift confirmed). We illustrate risks through four deployment scenarios and propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design. We argue that world models must be treated as safety-critical infrastructure requiring the same rigour as flight-control software or medical devices.
[AI-96] IDEA2: Expert-in-the-loop competency question elicitation for collaborative ontology engineering
【速读】:该论文旨在解决本体工程(ontology engineering)中能力问题(Competency Question, CQ)获取这一关键但资源密集的瓶颈问题。传统方法常因领域专家与本体工程师之间的沟通鸿沟而效率低下,导致提取的CQ不准确或难以被接受。解决方案的关键在于提出一种名为IDEA2的半自动化工作流,其核心是将大语言模型(Large Language Models, LLMs)嵌入到一个“专家在环”(expert-in-the-loop)的协作迭代过程中:首先由LLM从需求文档中初步提取CQ,随后领域专家在可访问的协同平台上进行评审与反馈,再由LLM根据反馈迭代重构被拒绝的CQ,直至达成共识。整个过程通过溯源模型(provenance model)记录每条CQ的完整演化轨迹,确保透明性和可复现性,从而显著提升CQ的质量、相关性及专家接受度。
链接: https://arxiv.org/abs/2604.01344
作者: Elliott Watkiss-Leek,Reham Alharbi,Harry Rostron,Andrew Ng,Ewan Johnson,Andrew Mitchell,Terry R. Payne,Valentina Tamma,Jacopo de Berardinis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Competency question (CQ) elicitation represents a critical but resource-intensive bottleneck in ontology engineering. This foundational phase is often hampered by the communication gap between domain experts, who possess the necessary knowledge, and ontology engineers, who formalise it. This paper introduces IDEA2, a novel, semi-automated workflow that integrates Large Language Models (LLMs) within a collaborative, expert-in-the-loop process to address this challenge. The methodology is characterised by a core iterative loop: an initial LLM-based extraction of CQs from requirement documents, a co-creational review and feedback phase by domain experts on an accessible collaborative platform, and an iterative, feedback-driven reformulation of rejected CQs by an LLM until consensus is achieved. To ensure transparency and reproducibility, the entire lifecycle of each CQ is tracked using a provenance model that captures the full lineage of edits, anonymised feedback, and generation parameters. The workflow was validated in 2 real-world scenarios (scientific data, cultural heritage), demonstrating that IDEA2 can accelerate the requirements engineering process, improve the acceptance and relevance of the resulting CQs, and exhibit high usability and effectiveness among domain experts. We release all code and experiments at this https URL
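IDEA2 的核心迭代循环(LLM 提取 CQ、专家评审、对被拒 CQ 依据反馈重构,直至共识或达到轮次上限)可以抽象为如下流程草图;其中 extract/review/reformulate 均为占位函数,实际分别对应 LLM 调用与专家协同平台,玩具示例为笔者虚构:

```python
def elicit_cqs(extract, review, reformulate, doc, max_rounds=5):
    """专家在环的能力问题(CQ)获取循环(示意)。"""
    accepted, pending = [], extract(doc)
    for _ in range(max_rounds):
        if not pending:
            break  # 全部达成共识
        next_round = []
        for cq in pending:
            ok, feedback = review(cq)        # 专家评审与反馈
            if ok:
                accepted.append(cq)
            else:
                next_round.append(reformulate(cq, feedback))  # LLM 重构
        pending = next_round
    return accepted, pending

# 玩具示例:评审规则为"CQ 必须以问号结尾"
extract = lambda doc: ["What datasets exist?", "List the methods"]
review = lambda cq: (cq.endswith("?"), "rephrase as a question")
reformulate = lambda cq, fb: cq + "?"
print(elicit_cqs(extract, review, reformulate, "requirements.txt"))
```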
[AI-97] Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors CEC2026
【速读】:该论文旨在解决基于大规模自监督学习(Self-Supervised Learning, SSL)模型的深度伪造语音检测器(Deepfake Speech Detector)在采用传统集成融合方法时,系统规模过大且性能提升边际效应递减的问题。其解决方案的关键在于提出一种基于NSGA-II算法的多目标进化分数融合框架,通过联合优化检测错误率与系统复杂度,在保持高检测精度的同时显著降低模型参数量;具体实现上探索了两种编码方式:二进制编码用于选择最优检测器进行平均融合,实数编码则优化各检测器权重以实现加权求和,实验表明该方法在ASVspoof 5数据集上优于简单平均和逻辑回归基线,且实数编码方案在仅使用一半参数的情况下达到接近最先进性能,并提供多样化的准确率-复杂度权衡解集,便于实际部署时灵活选择。
链接: https://arxiv.org/abs/2604.01330
作者: Vojtěch Staněk,Martin Perešíni,Lukáš Sekanina,Anton Firc,Kamil Malinka
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: Accepted to WCCI CEC 2026
Abstract:While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary multi-objective score fusion framework that jointly minimizes detection error and system complexity. We explore two encodings optimized by NSGA-II: binary-coded detector selection for score averaging and a real-valued scheme that optimizes detector weights for a weighted sum. Experiments on the ASVspoof 5 dataset with 36 SSL-based detectors show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines. The real-valued variant achieves 2.37% EER (0.0684 minDCF) and identifies configurations that match state-of-the-art performance while significantly reducing system complexity, requiring only half the parameters. Our method also provides a diverse set of trade-off solutions, enabling deployment choices that balance accuracy and computational cost.
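摘要中的两种融合编码,最终都落在"对各检测器分数做(加权)求和,再以 EER 评估"这一流程上。以下为笔者简化的示意,假设分数越高越偏向真实语音,EER 采用阈值扫描近似:

```python
def fuse(scores, weights):
    """对单条语音的各检测器分数做加权求和融合。"""
    return sum(w * s for w, s in zip(weights, scores))

def eer(bona, spoof):
    """阈值扫描近似等错误率(EER)。"""
    best = 1.0
    for t in sorted(bona + spoof):
        far = sum(s >= t for s in spoof) / len(spoof)  # 伪造被接受率
        frr = sum(s < t for s in bona) / len(bona)     # 真实被拒绝率
        best = min(best, max(far, frr))
    return best

bona = [0.9, 0.8, 0.7]   # 真实语音的融合分数(虚构)
spoof = [0.1, 0.3, 0.2]  # 伪造语音的融合分数(虚构)
print(eer(bona, spoof))  # 0.0,两类完全可分
```

NSGA-II 在此之上优化的,是权重向量(实数编码)或参与平均的检测器子集(二进制编码),同时最小化 EER 与检测器数量。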
[AI-98] The Digital Twin Counterfactual Framework: A Validation Architecture for Simulated Potential Outcomes
【速读】:该论文试图解决因果推断中的根本性问题——个体的反事实结果永远无法被观测到,从而导致现有方法均依赖于不可验证的假设(如可忽略性、平行趋势、排除限制等)来替代缺失数据,而无法直接生成反事实。其解决方案的关键在于提出数字孪生反事实框架(Digital Twin Counterfactual Framework, DTCF),通过构建一个基于数字孪生的模拟器,在潜在结果框架下将数字孪生建模为随机映射,并引入从边际一致性到结构一致性的多级保真度假设层次,使不同类型的因果效应(如平均处理效应ATE、条件平均处理效应CATE、分位数处理效应QTE等)逐步可验证。同时,该框架设计了五层验证架构以将原本不可证伪的反事实模拟主张转化为可观测数据下的可检验测试,并通过形式化分解明确区分可边际验证的因果量与依赖于不可观测个体内部相关结构的因果量(如个体处理效应ITE分布、受益/受损概率及处理效应方差),并提供边界分析、敏感性分析和不确定性量化工具以显式刻画此类依赖关系。
链接: https://arxiv.org/abs/2604.01325
作者: Olav Laudy
机构: 未知
类目: Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:The fundamental problem of causal inference - that the counterfactual outcome for any individual is never observed - has shaped the entire methodology of the field. Every existing approach substitutes assumptions for missing data: ignorability, parallel trends, exclusion restrictions. None produces the counterfactual itself. This paper proposes the Digital Twin Counterfactual Framework (DTCF): rather than estimating the counterfactual statistically, we simulate it using a digital twin and subject the simulation to a hierarchical validation regime. We formalize the digital twin simulator as a stochastic mapping within the potential outcomes framework and introduce a hierarchy of twin fidelity assumptions - from marginal fidelity through joint fidelity to structural fidelity - each unlocking a progressively richer class of estimands. The central contribution is threefold. First, a five-level validation architecture converts the unfalsifiable claim that the simulator produces correct counterfactuals into falsifiable tests against observable data. Second, a formal decomposition separates causal quantities into those that are marginally validated (ATE, CATE, QTE - testable through observable-arm comparison) and those that are copula-dependent (the ITE distribution, probability of benefit/harm, variance of treatment effects - permanently reliant on the unobservable within-individual dependence structure). Third, bounding, sensitivity, and uncertainty quantification tools make the copula dependence explicit. The DTCF does not resolve the fundamental problem of causal inference. What it provides is a framework in which marginal causal claims become increasingly testable, joint causal claims become explicitly assumption-indexed, and the gap between the two is formally characterized.
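摘要中"边际可验证量与 copula 依赖量"的区分,在二值结果下最直观:两个边际 P(Y(1)=1)、P(Y(0)=1) 可分别对照观测臂验证,但"受益概率" P(Y(1)=1, Y(0)=0) 依赖不可观测的个体内联合分布,只能给出 Fréchet–Hoeffding 边界。以下为该边界的直接实现,示例数值为虚构:

```python
def benefit_bounds(p1, p0):
    """二值结果下受益概率 P(Y(1)=1, Y(0)=0) 的 Fréchet-Hoeffding 边界。

    p1 = P(Y(1)=1)、p0 = P(Y(0)=1) 为可边际验证的量;
    受益概率依赖不可观测的联合分布,故只能给出区间。
    """
    lower = max(0.0, p1 - p0)
    upper = min(p1, 1.0 - p0)
    return lower, upper

print(benefit_bounds(0.75, 0.25))  # (0.5, 0.75)
```

区间宽度 upper - lower 正是摘要所说"copula 依赖"的形式化体现:除非引入关于个体内相关结构的额外假设,点识别不可能。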
[AI-99] Sven: Singular Value Descent as a Computationally Efficient Natural Gradient Method
【速读】:该论文旨在解决传统优化算法在神经网络训练中效率与性能之间的权衡问题,尤其是针对标准一阶方法(如Adam)收敛速度慢、最终损失较高,而自然梯度法(Natural Gradient)计算开销过大(与参数量平方级增长)的局限性。其解决方案的关键在于提出Sven(Singular Value dEsceNt)算法,该算法不将损失函数简化为单个标量再更新参数,而是将每个数据点的残差视为独立约束条件,利用损失函数雅可比矩阵的Moore-Penrose伪逆(Moore-Penrose pseudoinverse)求解最小范数参数更新方向,从而同时满足所有条件;实际中通过截断奇异值分解(Truncated Singular Value Decomposition)近似该伪逆,仅保留前k个主导方向,使计算复杂度仅增加k倍(相对于随机梯度下降),显著优于传统自然梯度方法的二次复杂度,且在过参数化场景下仍能保持良好性能,本质上是自然梯度方法向过参数化区域的推广。
链接: https://arxiv.org/abs/2604.01279
作者: Samuel Bright-Thonney,Thomas R. Harvey,Andre Lukas,Jesse Thaler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th); Optimization and Control (math.OC)
备注:
Abstract:We introduce Sven (Singular Value dEsceNt), a new optimization algorithm for neural networks that exploits the natural decomposition of loss functions into a sum over individual data points, rather than reducing the full loss to a single scalar before computing a parameter update. Sven treats each data point’s residual as a separate condition to be satisfied simultaneously, using the Moore-Penrose pseudoinverse of the loss Jacobian to find the minimum-norm parameter update that best satisfies all conditions at once. In practice, this pseudoinverse is approximated via a truncated singular value decomposition, retaining only the k most significant directions and incurring a computational overhead of only a factor of k relative to stochastic gradient descent. This is in comparison to traditional natural gradient methods, which scale as the square of the number of parameters. We show that Sven can be understood as a natural gradient method generalized to the over-parametrized regime, recovering natural gradient descent in the under-parametrized limit. On regression tasks, Sven significantly outperforms standard first-order methods including Adam, converging faster and to a lower final loss, while remaining competitive with LBFGS at a fraction of the wall-time cost. We discuss the primary challenge to scaling, namely memory overhead, and propose mitigation strategies. Beyond standard machine learning benchmarks, we anticipate that Sven will find natural application in scientific computing settings where custom loss functions decompose into several conditions.
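The truncated-SVD pseudoinverse update at the heart of Sven can be sketched on a linear least-squares toy, where the Jacobian of the per-sample residuals is just the data matrix (momentum, batching, and the memory mitigations discussed in the abstract are omitted, and the truncation threshold is our assumption):

```python
import numpy as np

def sven_step(J, r, k, lr=1.0):
    """One Sven-style update: the minimum-norm parameter step that reduces all
    per-sample residuals r at once, via a rank-k truncated pseudoinverse of
    the residual Jacobian J (n_samples x n_params)."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    k = min(k, int(np.sum(s > 1e-12)))       # keep only significant directions
    return -lr * Vt[:k].T @ ((U[:, :k].T @ r) / s[:k])

# Toy regression: per-sample residuals r_i = x_i . w - y_i, so J = X.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true
w = np.zeros(5)
for _ in range(3):
    r = X @ w - y
    w = w + sven_step(X, r, k=5)
print(np.allclose(w, w_true))  # True: full-rank truncation solves it exactly
```

With k equal to the rank, the step is the exact pseudoinverse (Newton-like) update; smaller k trades accuracy for the stated factor-of-k overhead relative to SGD.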
[AI-100] DySCo: Dynamic Semantic Compression for Effective Long-term Time Series Forecasting
【速读】:该论文旨在解决时间序列预测(Time Series Forecasting, TSF)中因扩展回看窗口而引入的无关噪声和计算冗余问题,从而阻碍模型有效捕捉长期依赖关系。其解决方案的关键在于提出一种动态语义压缩(Dynamic Semantic Compression, DySCo)框架,包含三个核心组件:1)熵引导的动态采样(Entropy-Guided Dynamic Sampling, EGDS)机制,可自主识别并保留高熵片段以压缩冗余趋势;2)分层频率增强分解(Hierarchical Frequency-Enhanced Decomposition, HFED)策略,将高频异常与低频模式分离,保障稀疏采样下关键细节不丢失;3)跨尺度交互混合器(Cross-Scale Interaction Mixer, CSIM),动态融合全局上下文与局部表示,替代传统线性聚合方式,显著提升模型对长程相关性的建模能力并降低计算开销。
链接: https://arxiv.org/abs/2604.01261
作者: Xiang Ao,Yinyu Tan,Mengru Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures
Abstract:Time series forecasting (TSF) is critical across domains such as finance, meteorology, and energy. While extending the lookback window theoretically provides richer historical context, in practice, it often introduces irrelevant noise and computational redundancy, preventing models from effectively capturing complex long-term dependencies. To address these challenges, we propose a Dynamic Semantic Compression (DySCo) framework. Unlike traditional methods that rely on fixed heuristics, DySCo introduces an Entropy-Guided Dynamic Sampling (EGDS) mechanism to autonomously identify and retain high-entropy segments while compressing redundant trends. Furthermore, we incorporate a Hierarchical Frequency-Enhanced Decomposition (HFED) strategy to separate high-frequency anomalies from low-frequency patterns, ensuring that critical details are preserved during sparse sampling. Finally, a Cross-Scale Interaction Mixer(CSIM) is designed to dynamically fuse global contexts with local representations, replacing simple linear aggregation. Experimental results demonstrate that DySCo serves as a universal plug-and-play module, significantly enhancing the ability of mainstream models to capture long-term correlations with reduced computational cost.
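The entropy-guided sampling idea (EGDS) can be illustrated with a minimal sketch: score each lookback segment by the entropy of its value histogram, and keep only the most informative segments. The histogram-based entropy proxy and keep-ratio below are our assumptions, not the paper's exact mechanism.

```python
import numpy as np

def segment_entropy(seg, bins=8):
    """Shannon entropy of a segment's value histogram (information proxy)."""
    hist, _ = np.histogram(seg, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropy_guided_sample(series, seg_len, keep_ratio=0.5):
    """Split the lookback window into segments, keep the highest-entropy ones
    (restored to temporal order) and compress redundant low-information ones."""
    segs = series.reshape(-1, seg_len)
    scores = np.array([segment_entropy(s) for s in segs])
    n_keep = max(1, int(len(segs) * keep_ratio))
    idx = np.sort(np.argsort(scores)[-n_keep:])
    return segs[idx].reshape(-1)

# Redundant flat first half, volatile second half: EGDS keeps the latter.
flat = np.ones(64)
volatile = np.sin(np.linspace(0, 20, 64)) + np.random.default_rng(2).normal(0, 0.3, 64)
series = np.concatenate([flat, volatile])
kept = entropy_guided_sample(series, seg_len=16, keep_ratio=0.5)
print(len(kept), np.allclose(kept, series[64:]))  # 64 True
```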
[AI-101] A Learning-Based Cooperative Coevolution Framework for Heterogeneous Large-Scale Global Optimization GECCO2026
【速读】:该论文旨在解决异构大规模全局优化(Heterogeneous Large-Scale Global Optimization, H-LSGO)问题,这类问题在现实应用中日益突出,其特征是子问题维度和优化景观差异显著,而传统协同进化(Cooperative Coevolution, CC)方法依赖固定低维优化器难以有效应对这种异质性。解决方案的关键在于提出基于学习的异构协同进化框架(Learning-Based Heterogeneous Cooperative Coevolution Framework, LH-CC),通过将优化过程建模为马尔可夫决策过程(Markov Decision Process),引入元智能体(meta-agent)动态选择最适配当前子问题的优化器,从而实现对复杂异构结构的自适应优化策略。
链接: https://arxiv.org/abs/2604.01241
作者: Wenjie Qiu,Zixin Wang,Hongyu Fang,Zeyuan Ma,Yue-Jiao Gong
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 5 figures, 3 tables. Accepted for publication in GECCO 2026
Abstract:Cooperative Coevolution (CC) effectively addresses Large-Scale Global Optimization (LSGO) via decomposition but struggles with the emerging class of Heterogeneous LSGO (H-LSGO) problems arising from real-world applications, where subproblems exhibit diverse dimensions and distinct landscapes. The prevailing CC paradigm, relying on a fixed low-dimensional optimizer, often fails to navigate this heterogeneity. To address this limitation, we propose the Learning-Based Heterogeneous Cooperative Coevolution Framework (LH-CC). By formulating the optimization process as a Markov Decision Process, LH-CC employs a meta-agent to adaptively select the most suitable optimizer for each subproblem. We also introduce a flexible benchmark suite to generate diverse H-LSGO problem instances. Extensive experiments on 3000-dimensional problems with complex coupling relationships demonstrate that LH-CC achieves superior solution quality and computational efficiency compared to state-of-the-art baselines. Furthermore, the framework exhibits robust generalization across varying problem instances, optimization horizons, and optimizers. Our findings reveal that dynamic optimizer selection is a pivotal strategy for solving complex H-LSGO problems.
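The core idea of dynamic optimizer selection per subproblem can be sketched with a bandit-style stand-in for the meta-agent (the paper's agent is trained on an MDP formulation; the epsilon-greedy policy, reward values, and incremental-mean update below are illustrative assumptions):

```python
import numpy as np

def select_optimizer(q, sub, eps, rng):
    """Epsilon-greedy stand-in for the meta-agent: pick the optimizer with the
    best running value for this subproblem, exploring with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(q.shape[1]))
    return int(np.argmax(q[sub]))

def update_value(q, counts, sub, opt, reward):
    """Incremental-mean update of the optimizer's value on this subproblem."""
    counts[sub, opt] += 1
    q[sub, opt] += (reward - q[sub, opt]) / counts[sub, opt]

rng = np.random.default_rng(3)
q = np.zeros((2, 2)); counts = np.zeros((2, 2), dtype=int)
true_reward = np.array([[1.0, 0.2],    # subproblem 0 favors optimizer 0
                        [0.1, 0.9]])   # subproblem 1 favors optimizer 1
for step in range(500):
    sub = step % 2
    opt = select_optimizer(q, sub, eps=0.2, rng=rng)
    update_value(q, counts, sub, opt, true_reward[sub, opt] + rng.normal(0, 0.05))
print(int(np.argmax(q[0])), int(np.argmax(q[1])))  # 0 1
```

The meta-agent correctly learns a different optimizer assignment for each heterogeneous subproblem, which is the behavior a fixed low-dimensional optimizer cannot provide.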
[AI-102] ML-Enabled Open RAN: A Comprehensive Survey of Architectures Challenges and Opportunities
【速读】:该论文旨在解决开放无线接入网(Open Radio Access Network, O-RAN)中因复杂性增加而面临的网络性能优化与智能化管理难题,特别是针对频谱管理、资源分配和安全等关键挑战。其解决方案的关键在于系统性地整合机器学习(Machine Learning, ML)技术,通过分析现有文献揭示ML在O-RAN架构中的应用现状与潜力,从而为提升网络效率和智能化水平提供理论依据与实践方向。
链接: https://arxiv.org/abs/2604.01239
作者: Mira Chandra Kirana,Patatchona Keyela,Fatemeh Rostamian,Deemah H. Tashman,Soumaya Cherkaoui
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:As wireless communication systems become more advanced, Open Radio Access Networks (O-RAN) stand out as a notable framework that promotes interoperability and cost-effectiveness. An examination of the progression of RAN architectures, as well as O-RAN’s underlying principles, reveals the importance of machine learning (ML) in addressing various challenges, including spectrum management, resource allocation, and security. Hence, this survey provides a comprehensive overview of the integration of ML within O-RAN, highlighting its transformative potential in enhancing network performance and efficiency. This survey aims to describe the current status of ML applications in O-RAN while indicating possible directions for future research by analyzing existing literature. The findings aim to assist researchers and stakeholders in formulating optimal service strategies and advancing the understanding of intelligent wireless networks.
[AI-103] Trustworthy AI-Driven Dynamic Hybrid RIS: Joint Optimization and Reward Poisoning-Resilient Control in Cognitive MISO Networks
【速读】:该论文旨在解决下一代无线网络中次级用户(Secondary Users, SUs)因直接链路不可靠和能量受限而导致的性能瓶颈问题,同时应对认知无线电网络(Cognitive Radio Networks, CRNs)在实际硬件损伤和级联衰落信道下的优化挑战。其关键解决方案是提出一种自适应的能量感知混合可重构智能表面(Reconfigurable Intelligent Surface, RIS),该RIS能根据采集到的能量实时动态切换被动与主动工作模式,从而在保证对主用户(Primary Users, PUs)无害干扰的前提下提升系统吞吐量并降低能耗。通过软演员-评论家(Soft Actor-Critic, SAC)深度强化学习(Deep Reinforcement Learning, DRL)方法联合优化发射波束赋形与RIS相位配置,实现了在复杂动态环境中的高效资源调度;此外,首次系统研究了DRL代理在RIS增强CRNs中面临的奖励污染攻击,并提出了基于奖励截断与统计异常过滤的轻量级实时防御机制,显著提升了系统的安全性与鲁棒性。
链接: https://arxiv.org/abs/2604.01238
作者: Deemah H. Tashman,Soumaya Cherkaoui
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:Cognitive radio networks (CRNs) are a key mechanism for alleviating spectrum scarcity by enabling secondary users (SUs) to opportunistically access licensed frequency bands without harmful interference to primary users (PUs). To address unreliable direct SU links and energy constraints common in next-generation wireless networks, this work introduces an adaptive, energy-aware hybrid reconfigurable intelligent surface (RIS) for underlay multiple-input single-output (MISO) CRNs. Distinct from prior approaches relying on static RIS architectures, our proposed RIS dynamically alternates between passive and active operation modes in real time according to harvested energy availability. We also model our scenario under practical hardware impairments and cascaded fading channels. We formulate and solve a joint transmit beamforming and RIS phase optimization problem via the soft actor-critic (SAC) deep reinforcement learning (DRL) method, leveraging its robustness in continuous and highly dynamic environments. Notably, we conduct the first systematic study of reward poisoning attacks on DRL agents in RIS-enhanced CRNs, and propose a lightweight, real-time defense based on reward clipping and statistical anomaly filtering. Numerical results demonstrate that the SAC-based approach consistently outperforms established DRL baselines, and that the dynamic hybrid RIS strikes a superior trade-off between throughput and energy consumption compared to fully passive and fully active alternatives. We further show the effectiveness of our defense in maintaining SU performance even under adversarial conditions. Our results advance the practical and secure deployment of RIS-assisted CRNs, and highlight crucial design insights for energy-constrained wireless systems.
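The proposed defense combines reward clipping with statistical anomaly filtering; a minimal sketch of that sanitizer follows (the clip range, window size, and z-score threshold are illustrative choices, not the paper's tuned values):

```python
import numpy as np

class RewardSanitizer:
    """Lightweight reward-poisoning defense: clip rewards to a fixed range,
    then discard samples that are statistical outliers relative to a sliding
    window (z-score test). All thresholds here are illustrative choices."""
    def __init__(self, clip=(-1.0, 1.0), window=100, z_max=3.0):
        self.clip, self.window, self.z_max = clip, window, z_max
        self.history = []

    def __call__(self, reward):
        r = float(np.clip(reward, *self.clip))
        if len(self.history) >= 10:
            mu = np.mean(self.history)
            sd = np.std(self.history) + 1e-8
            if abs(r - mu) / sd > self.z_max:
                return None                   # flagged as poisoned, dropped
        self.history.append(r)
        self.history = self.history[-self.window:]
        return r

rng = np.random.default_rng(4)
guard = RewardSanitizer()
clean = [guard(r) for r in rng.normal(0.5, 0.05, 200)]  # honest SU rewards
poisoned = guard(-0.9)                                  # attacker flips reward
print(poisoned)  # None: inside the clip range, but a >3-sigma outlier
```

Clipping bounds the damage any single poisoned reward can do, while the z-score filter catches in-range but statistically implausible rewards before they reach the SAC agent's replay buffer.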
[AI-104] Runtime Burden Allocation for Structured LLM Routing in Agentic Expert Systems: A Full-Factorial Cross-Backend Methodology
【速读】:该论文旨在解决生成式 AI(Generative AI)系统中结构化大语言模型(LLM)路由的部署优化问题,核心挑战在于如何在真实环境约束下平衡正确性(correctness)、延迟(latency)与实现成本(implementation cost)。传统方法常将此视为提示工程(prompt engineering)问题,但本文提出应从系统级负担分配(systems-level burden-allocation)视角重新理解:结构化输出的生成方式——即由模型直接输出、传输过程中压缩或本地重建——显著影响整体性能。其关键解决方案是构建一个跨后端(OpenAI、Gemini、Llama)的全因子基准评估框架,通过15,552次请求验证48种配置,揭示了路由模式的性能高度依赖于具体后端特性,不存在通用最优方案;进而提供可落地的部署指导原则,帮助开发者在异构环境下有效导航正确性-成本-延迟权衡边界。
链接: https://arxiv.org/abs/2604.01235
作者: Zhou Hanlin,Chan Huah Yong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Structured LLM routing is often treated as a prompt-engineering problem. We argue that it is, more fundamentally, a systems-level burden-allocation problem. As large language models (LLMs) become core control components in agentic AI systems, reliable structured routing must balance correctness, latency, and implementation cost under real deployment constraints. We show that this balance is shaped not only by prompts or schemas, but also by how structural work is allocated across the generation stack: whether output structure is emitted directly by the model, compressed during transport, or reconstructed locally after generation. We evaluate this formulation through a comprehensive full-factorial benchmark covering 48 deployment configurations and 15,552 requests across OpenAI, Gemini, and Llama backends. Our central finding is consequential: there is no universal best routing mode. Instead, backend-specific interaction effects dominate performance. Modes that remain highly reliable on Gemini and OpenAI can suffer substantial correctness degradation on Llama, while efficiency gains from compressed realization are strongly backend-dependent. Rather than presenting another isolated model comparison, this work contributes a deployable framework for reasoning about structured routing under heterogeneous backend conditions. We provide a cross-backend evaluation methodology and practical deployment guidance for navigating the correctness-cost-latency frontier in production-grade agentic expert systems.
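The burden-allocation modes the abstract contrasts (direct structured emission vs. compressed transport with local reconstruction) can be illustrated with a toy route registry; the labels, fields, and `ROUTES` table below are hypothetical, not the paper's benchmark schema:

```python
import json

# Hypothetical route registry: local reconstruction shifts structural burden
# from the model (emit full JSON) to the client (expand a compact label).
ROUTES = {
    "billing": {"agent": "billing_expert", "tools": ["invoice_lookup"], "priority": 2},
    "tech":    {"agent": "tech_support",   "tools": ["log_search"],     "priority": 1},
}

def realize_direct(model_output: str) -> dict:
    """Mode A: the model emitted the full structured payload itself."""
    return json.loads(model_output)

def realize_compressed(model_output: str) -> dict:
    """Mode B: the model emitted only a compact route label; the full
    structure is reconstructed locally, saving output tokens."""
    label = model_output.strip().lower()
    if label not in ROUTES:
        raise ValueError(f"unroutable label: {label!r}")   # correctness risk
    return {"route": label, **ROUTES[label]}

a = realize_direct('{"route": "billing", "agent": "billing_expert"}')
b = realize_compressed("billing")
print(a["route"] == b["route"])  # True: same decision, different burden split
```

Mode B is cheaper in output tokens but fails hard on any label the registry does not cover, which mirrors the paper's finding that compressed realization is only a win on backends that emit labels reliably.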
[AI-105] Logic-Gated Time-Shared Feedforward Networks for Alternating Finite Automata: Exact Simulation and Learnability
【速读】:该论文旨在解决现有神经自动机模型仅能模拟非确定有限自动机(Nondeterministic Finite Automata, NFA)及其存在性可达性推理的局限性,从而无法表达具有通用聚合逻辑(Universal AND)的复杂正则语言表示问题。其解决方案的关键在于提出一种逻辑门控时分前馈网络(Logic-Gated Time-Shared Feedforward Networks, LG-TS-FFNs),通过引入可学习的状态依赖偏置项作为可微分逻辑门,实现对交替有限自动机(Alternating Finite Automata, AFA)中存在性(OR)与普遍性(AND)聚合机制的统一建模,并在共享参数线性递推结构中完成精确模拟。该架构在理论上证明等价于AFAs,具备指数级状态压缩能力,且可通过标准梯度下降同时恢复自动机拓扑与逻辑语义,实现了统计学习与形式化逻辑推理的融合。
链接: https://arxiv.org/abs/2604.01228
作者: Sahil Rajesh Dhayalkar
机构: 未知
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
备注: 22 Pages, 3 figures. Submitted to IEEE Access and is currently under review
Abstract:We present a formal and constructive framework for simulating Alternating Finite Automata (AFAs) using Logic-Gated Time-Shared Feedforward Networks (LG-TS-FFNs). Unlike prior neural automata models limited to Nondeterministic Finite Automata (NFAs) and existential reachability, our architecture integrates learnable, state-dependent biases that function as differentiable logic gates, enabling the representation of both Existential OR and Universal AND aggregation within a shared-parameter linear recurrence. We prove that this architectural modification upgrades the network's computational class to be structurally isomorphic to AFAs, thereby inheriting their exponential succinctness: the network can represent regular languages requiring 2^n states in an NFA with only n neurons. We rigorously establish that the forward pass of an LG-TS-FFN exactly simulates the reachability dynamics of an AFA, including instantaneous \varepsilon-closures. Furthermore, we demonstrate empirical learnability: a continuous relaxation of the logic gates allows the network to simultaneously recover the automaton's topology and logical semantics from binary labels via standard gradient descent. Extensive experiments confirm that our model achieves perfect recovery of ground-truth automata, bridging the gap between statistical learning and succinct, universal logical reasoning.
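The bias-as-logic-gate mechanism can be sketched directly: with a step activation over a 0/1 state vector, a bias of -0.5 makes a neuron fire if *any* predecessor is active (OR), while a bias of -(fan_in - 0.5) requires *all* of them (AND). The tiny two-gate layer below is our illustration, not the paper's trained network.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)

def gated_layer(x, W, gate_bias):
    """One time-shared layer: W encodes the transition topology; the
    state-dependent bias selects the aggregation logic per neuron:
    bias -0.5 -> existential OR, bias -(fan_in - 0.5) -> universal AND."""
    return step(W @ x + gate_bias)

# Two successor states over binary state vector x = [q0, q1, q2]:
# s_or = q0 OR q1,  s_and = q0 AND q1 AND q2.
W = np.array([[1., 1., 0.],
              [1., 1., 1.]])
bias = np.array([-0.5, -2.5])   # fan-in 2 -> OR, fan-in 3 -> AND

y1 = gated_layer(np.array([1., 0., 0.]), W, bias)
y2 = gated_layer(np.array([1., 1., 1.]), W, bias)
y3 = gated_layer(np.array([0., 0., 1.]), W, bias)
print(y1, y2, y3)  # [1. 0.] [1. 1.] [0. 0.]
```

Replacing the hard step with a sigmoid yields the continuous relaxation used for gradient-based learnability in the abstract.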
[AI-106] Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider
【速读】:该论文旨在解决如何在实验核物理领域中高效、安全地回答特定领域的专业技术问题,尤其是在处理未公开的科研数据时避免隐私泄露。其解决方案的关键在于构建一个本地部署的检索增强生成(Retrieval Augmented Generation, RAG)系统,该系统基于自建数据库(索引arXiv上与电子-离子对撞机EIC实验相关的文献)和开源LLaMA模型进行问答生成,从而在不依赖云端外部知识库的前提下实现高成本效益、符合数据隐私要求的问答服务。
链接: https://arxiv.org/abs/2604.02259
作者: Tina. J. Jat,T. Ghosh,Karthik Suresh
机构: 未知
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Instrumentation and Detectors (physics.ins-det)
备注:
Abstract:To harness the power of language models in answering domain-specific specialized technical questions, Retrieval Augmented Generation (RAG) has been widely used. In this work, we have developed a Q&A application inspired by RAG, which comprises an in-house database indexed on the arXiv articles related to the Electron-Ion Collider (EIC) experiment - one of the largest international scientific collaborations - and incorporates an open-source LLaMA model for answer generation. This extends its preceding application, which was built on a proprietary model and a Cloud-hosted external knowledge base for the EIC experiment. This locally-deployed RAG system offers a cost-effective, resource-constrained alternative for building a RAG-assisted Q&A application that answers domain-specific queries in the field of experimental nuclear physics. This set-up facilitates data privacy and avoids sending any pre-publication scientific data and information to the public domain. Future improvements will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and upgrade the application pipeline orchestration to the LangGraph framework.
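The retrieve-then-generate loop of such a locally deployed RAG system can be sketched in a few lines. The toy hashing embedder and the corpus snippets below are stand-ins (the real system indexes arXiv abstracts with a learned encoder and answers with a local LLaMA model):

```python
import numpy as np

def embed(text, dim=64):
    """Toy deterministic hashing embedder standing in for a real sentence
    encoder (the actual system would embed with a learned model)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[sum(ord(c) for c in tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, corpus, top_k=1):
    """Rank locally indexed documents by cosine similarity to the query."""
    qv = embed(query)
    scores = [float(qv @ embed(doc)) for doc in corpus]
    order = np.argsort(scores)[::-1][:top_k]
    return [corpus[i] for i in order]

corpus = [                              # hypothetical abstract snippets
    "the EIC detector uses a streaming readout for tracking",
    "parton distribution functions at low Bjorken x",
    "cafeteria menu for the collaboration meeting",
]
context = retrieve("streaming readout detector design", corpus, top_k=1)
# A locally hosted LLaMA model would then consume this grounded prompt:
prompt = f"Answer using only this context:\n{context[0]}\nQuestion: ..."
print(context[0])
```

Because both the index and the generator run locally, no pre-publication text ever leaves the machine, which is the data-privacy property the abstract emphasizes.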
[AI-107] Quantum-Inspired Geometric Classification with Correlation Group Structures and VQC Decision Modeling
【速读】:该论文旨在解决传统分类方法在小样本、高不平衡及异构表格式数据中性能不稳定、可解释性差的问题。其核心解决方案是提出一种几何驱动的量子启发式分类框架,关键在于将特征空间建模为锚点中心的关联邻域(Correlation Group Structures, CGR),通过SWAP-test基的重叠估计构建类间欧氏与角度相似性通道,并以非概率边际融合得分作为轻量级主分类器;对于大规模或高度不平衡场景,则引入紧凑的Delta距离对比特征并结合变分量子分类器(Variational Quantum Classifier, VQC)作为非线性精修层,实现从几何感知到量子增强的自适应分类流程,从而在多个真实数据集上展现出稳定且可解释的高性能表现。
链接: https://arxiv.org/abs/2604.01930
作者: Nishikanta Mohanty,Arya Ansuman Priyadarshi,Bikash K. Behera,Badshah Mukherjee
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 34 Pages, 19 Algorithms , 8 Tables
Abstract:We propose a geometry-driven quantum-inspired classification framework that integrates Correlation Group Structures (CGR), compact SWAP-test-based overlap estimation, and selective variational quantum decision modelling. Rather than directly approximating class posteriors, the method adopts a geometry-first paradigm in which samples are evaluated relative to class medoids using overlap-derived Euclidean-like and angular similarity channels. CGR organizes features into anchor-centered correlation neighbourhoods, generating nonlinear, correlation-weighted representations that enhance robustness in heterogeneous tabular spaces. These geometric signals are fused through a non-probabilistic margin-based fusion score, serving as a lightweight and data-efficient primary classifier for small-to-moderate datasets. On Heart Disease, Breast Cancer, and Wine Quality datasets, the fusion-score classifier achieves 0.8478, 0.8881, and 0.9556 test accuracy respectively, with macro-F1 scores of 0.8463, 0.8703, and 0.9522, demonstrating competitive and stable performance relative to classical baselines. For large-scale and highly imbalanced regimes, we construct compact Delta-distance contrastive features and train a variational quantum classifier (VQC) as a nonlinear refinement layer. On the Credit Card Fraud dataset (0.17% prevalence), the Delta + VQC pipeline achieves approximately 0.85 minority recall at an alert rate of approximately 1.31%, with ROC-AUC 0.9249 and PR-AUC 0.3251 under full-dataset evaluation. These results highlight the importance of operating-point-aware assessment in rare-event detection and demonstrate that the proposed hybrid geometric-variational framework provides interpretable, scalable, and regime-adaptive classification across heterogeneous data settings.
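The SWAP-test-based overlap estimation and medoid-relative decision rule can be emulated classically: the SWAP test's ancilla reads 0 with probability (1 + |⟨a|b⟩|²)/2, so the squared overlap is recoverable from acceptance frequencies. The shot count, medoids, and samples below are illustrative assumptions.

```python
import numpy as np

def swap_test_overlap(a, b, shots=4096, rng=None):
    """Classical emulation of the SWAP test: the ancilla reads 0 with
    probability (1 + |<a|b>|^2) / 2, so the squared overlap is recovered
    from the empirical acceptance frequency over `shots` repetitions."""
    rng = rng if rng is not None else np.random.default_rng()
    p0 = 0.5 * (1.0 + abs(np.vdot(a, b)) ** 2)
    accept = rng.random(shots) < p0
    return max(0.0, 2.0 * accept.mean() - 1.0)

def classify(x, medoids, shots=4096, rng=None):
    """Geometry-first decision: assign x to the class whose medoid state
    has the largest estimated overlap with the encoded sample."""
    x = x / np.linalg.norm(x)
    sims = [swap_test_overlap(x, m / np.linalg.norm(m), shots, rng) for m in medoids]
    return int(np.argmax(sims))

rng = np.random.default_rng(5)
medoids = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0])]
c0 = classify(np.array([0.9, 0.1, 0.0]), medoids, rng=rng)
c1 = classify(np.array([0.1, 0.8, 0.7]), medoids, rng=rng)
print(c0, c1)  # 0 1
```

This captures the geometry-first paradigm (samples scored relative to class medoids) without the CGR feature construction or the margin-based fusion of the full framework.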
[AI-108] The Newton-Muon Optimizer
【速读】:该论文旨在解决Muon优化器在训练大语言模型时表现出优异性能但其核心设计原理——矩阵梯度正交化机制——尚不明确的问题。解决方案的关键在于提出一个代理模型(surrogate model),该模型通过仅使用三个矩阵(梯度 G、输出空间曲率矩阵 H 和数据矩阵 Z,即层输入的堆叠)将损失函数近似为权重矩阵扰动的二次函数,并基于此代理模型推导出闭式更新规则:W←W−η⋅msgn(G(ZZ⊤)⁻¹),其中 msgn(X) 定义为 X=USV⊤ 的紧凑奇异值分解下的 UV⊤。这一新方法被称为Newton-Muon,不仅揭示了标准Muon本质上是一种忽略输入二阶矩右预条件的隐式牛顿型方法,还在实验中显著提升了训练效率:在复现Modded-NanoGPT配置下,相比原Muon,Newton-Muon减少6%迭代步数并降低约4%的墙钟时间。
链接: https://arxiv.org/abs/2604.01472
作者: Zhehang Du,Weijie Su
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The Muon optimizer has received considerable attention for its strong performance in training large language models, yet the design principle behind its matrix-gradient orthogonalization remains largely elusive. In this paper, we introduce a surrogate model that not only sheds new light on the design of Muon, but more importantly leads to a new optimizer. In the same spirit as the derivation of Newton's method, the surrogate approximates the loss as a quadratic function of the perturbation to a weight matrix W using only three matrices: the gradient G , an output-space curvature matrix H , and the data matrix Z that stacks the layer inputs. By minimizing this surrogate in one step and adopting a certain isotropic assumption on the weights, we obtain the closed-form update rule (up to momentum and weight decay) W \leftarrow W - \eta \cdot \mathrm{msgn}(G(ZZ^\top)^{-1}) , where \eta is the learning rate and \mathrm{msgn}(X)=UV^\top if X=USV^\top is a compact singular value decomposition. This new optimization method, which we refer to as Newton-Muon, shows that standard Muon can be interpreted as an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, on a reproduction of the earliest publicly released Modded-NanoGPT speedrun configuration using Muon for GPT-2 pretraining, Newton-Muon reaches the target validation loss in 6% fewer iteration steps and reduces wall-clock training time by about 4%.
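The closed-form update can be sketched directly in NumPy (momentum and weight decay omitted; the damping term is our addition for numerical safety, not part of the paper's rule):

```python
import numpy as np

def msgn(X):
    """Matrix sign via compact SVD: msgn(X) = U V^T for X = U S V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def newton_muon_step(W, G, Z, lr=0.02, damping=1e-6):
    """One Newton-Muon update: right-precondition the gradient G by the
    inverse input second moment Z Z^T, then orthogonalize the result."""
    d = Z.shape[0]
    precond = np.linalg.inv(Z @ Z.T + damping * np.eye(d))
    return W - lr * msgn(G @ precond)

# Shapes: W, G are d_out x d_in; Z stacks n layer inputs as columns (d_in x n).
rng = np.random.default_rng(6)
W = rng.normal(size=(4, 8))
G = rng.normal(size=(4, 8))
Z = rng.normal(size=(8, 32))
W_new = newton_muon_step(W, G, Z)
M = msgn(G @ np.linalg.inv(Z @ Z.T + 1e-6 * np.eye(8)))
print(np.allclose(M @ M.T, np.eye(4)))  # True: step direction has orthonormal rows
```

Dropping the `precond` factor recovers standard Muon, which is exactly the "neglected right preconditioning" interpretation given in the abstract.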
[AI-109] All Substitution Is Local
【速读】:该论文旨在解决信息源之间交互效应的判定问题:在贝叶斯决策框架下,当一个信息源的引入会提升另一个信息源的价值(互补性),还是降低其价值(替代性)?其核心贡献在于将这种交互作用分解为两种对立机制——互补力(complement force),衡量一个信息源如何使另一个信息源在决策上更具价值;以及替代力(substitute force),衡量当前决策是否已被部分或完全确定。关键发现是存在一个“定位原则”(localization principle):只有当观测值跨越决策边界时,替代效应才可能发生,但仅跨越边界并不足以引发替代。在决策区域内,替代力消失,所有信息源均表现为互补关系,即使其中某一信息源单独无法改变决策。这一结论适用于任意相关性结构的信息源,并已在Lean 4中形式化验证,揭示了信息协作主要发生在决策边界附近,而其余区域均为信息协同增强的领域。
链接: https://arxiv.org/abs/2604.01443
作者: Nidhish Shah,Shaurjya Mandal,Asfandyar Azhar
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:When does consulting one information source raise the value of another, and when does it diminish it? We study this question for Bayesian decision-makers facing finite actions. The interaction decomposes into two opposing forces: a complement force, measuring how one source moves beliefs to where the other becomes more useful, and a substitute force, measuring how much the current decision is resolved. Their balance obeys a localization principle: substitution requires an observation to cross a decision boundary, though crossing alone does not guarantee it. Whenever posteriors remain inside the current decision region, the substitute force vanishes, and sources are guaranteed to complement each other, even when one source cannot, on its own, change the decision. The results hold for arbitrarily correlated sources and are formalized in Lean 4. Substitution is confined to the thin boundaries where decisions change. Everywhere else, information cooperates. Code and proofs: this https URL.
机器学习
[LG-0] Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
链接: https://arxiv.org/abs/2604.02292
作者: Dimitrios Danopoulos,Enrico Lupi,Michael Kagan,Maurizio Pierini
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:Softmax can become a computational bottleneck in the Transformer model’s Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits and has non-negative values. HCCS differs from previous softmax surrogates as it includes a set of lightweight calibration parameters that are optimized offline based on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform rely upon either bfloat16 arithmetic or LUTs to perform the exponential operation, which might limit the throughput of the platform and fail to utilize the high-throughput integer vector processing units of the AI Engine. In contrast, HCCS provides a natural mapping to the AI Engines’ int8 multiply accumulate (MAC) units. To the best of our knowledge, this is the first int8 optimized softmax surrogate for AMD AI engines that significantly exceeds the speed performance of other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
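The clipped-linear surrogate can be sketched in a few lines; the specific `alpha`/`beta` values below are placeholders for the per-head calibration parameters the paper fits offline:

```python
import numpy as np

def hccs(logits, alpha=0.25, beta=1.0):
    """Clipped-linear softmax surrogate: a clipped linear map of max-centered
    logits, renormalized. alpha/beta stand in for the per-head calibration
    parameters that the paper fits offline on a representative dataset."""
    centered = logits - logits.max(axis=-1, keepdims=True)   # all <= 0
    scores = np.clip(alpha * centered + beta, 0.0, None)     # bounded, monotone
    return scores / scores.sum(axis=-1, keepdims=True)

x = np.array([2.0, 1.0, -6.0])
p = hccs(x)
print(p)  # non-negative, logit ordering preserved, small logits clipped to 0
print(bool(p.argmax() == x.argmax()), bool(np.isclose(p.sum(), 1.0)))  # True True
```

Because the map is linear up to a clip, it lends itself to int8 multiply-accumulate units, unlike the exponential, which needs bfloat16 arithmetic or lookup tables.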
[LG-1] SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
链接: https://arxiv.org/abs/2604.02268
作者: Zhengxi Lu,Zhiyuan Yao,Jinyang Wu,Chengcheng Han,Qi Gu,Xunliang Cai,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at this https URL.
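The linearly decaying budget plus helpfulness filter can be sketched as follows; the skill names, helpfulness scores, and the exact schedule are illustrative assumptions, not the paper's measured values:

```python
def skill_budget(step, total_steps, max_skills):
    """Linearly decaying skill-context budget: full access early, zero-shot
    (no skill files injected) by the end of training."""
    frac = max(0.0, 1.0 - step / total_steps)
    return int(round(max_skills * frac))

def select_skills(skills, helpfulness, budget):
    """Dynamic-curriculum stand-in: keep only skills the current policy still
    benefits from (positive on-policy helpfulness), within the budget."""
    ranked = sorted(skills, key=lambda s: helpfulness[s], reverse=True)
    return [s for s in ranked[:budget] if helpfulness[s] > 0]

skills = ["web_search", "file_io", "planner"]          # hypothetical skill files
help_ = {"web_search": 0.4, "file_io": -0.1, "planner": 0.2}
s_early = select_skills(skills, help_, skill_budget(0, 100, 3))
s_late = select_skills(skills, help_, skill_budget(100, 100, 3))
print(s_early, s_late)  # ['web_search', 'planner'] []
```

By the final step the budget reaches zero and the agent must act zero-shot, which is the internalization endpoint the framework trains toward.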
[LG-2] Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
链接: https://arxiv.org/abs/2604.02260
作者: Klemens Iten,Bruce Lee,Chenhao Li,Lenart Treven,Andreas Krause,Bhavya Sukhija
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 15 pages, 5 figues, 2 tables. This work has been submitted to the IEEE for possible publication
Abstract:Learning-based control methods typically assume stationary system dynamics, an assumption often violated in real-world systems due to drift, wear, or changing operating conditions. We study reinforcement learning for control under time-varying dynamics. We consider a continual model-based reinforcement learning setting in which an agent repeatedly learns and controls a dynamical system whose transition dynamics evolve across episodes. We analyze the problem using Gaussian process dynamics models under frequentist variation-budget assumptions. Our analysis shows that persistent non-stationarity requires explicitly limiting the influence of outdated data to maintain calibrated uncertainty and meaningful dynamic regret guarantees. Motivated by these insights, we propose a practical optimistic model-based reinforcement learning algorithm with adaptive data buffer mechanisms and demonstrate improved performance on continuous control benchmarks with non-stationary dynamics.
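The "limit the influence of outdated data" insight can be illustrated with a simple exponentially forgetting episode buffer; the decay rate, window, and the one-parameter dynamics below are our illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

class ForgettingBuffer:
    """Adaptive data buffer for non-stationary dynamics: transitions from older
    episodes get exponentially decayed sample weights, limiting the influence
    of outdated data on the learned model (decay/window values illustrative)."""
    def __init__(self, decay=0.8, max_episodes=10):
        self.decay, self.max_episodes = decay, max_episodes
        self.episodes = []                    # list of (inputs, targets)

    def add_episode(self, X, Y):
        self.episodes.append((X, Y))
        self.episodes = self.episodes[-self.max_episodes:]

    def weighted_data(self):
        ages = np.arange(len(self.episodes))[::-1]    # 0 = newest episode
        w = self.decay ** ages
        X = np.concatenate([e[0] for e in self.episodes])
        Y = np.concatenate([e[1] for e in self.episodes])
        sw = np.concatenate([np.full(len(e[0]), wi)
                             for e, wi in zip(self.episodes, w)])
        return X, Y, sw

# Dynamics drift across episodes: y = a_t * x with a_t = 1, then 2, then 4.
buf = ForgettingBuffer(decay=0.5)
for a_t in [1.0, 2.0, 4.0]:
    xs = np.linspace(-1, 1, 20)
    buf.add_episode(xs[:, None], a_t * xs)
X, Y, sw = buf.weighted_data()
a_hat = np.sum(sw * X[:, 0] * Y) / np.sum(sw * X[:, 0] ** 2)   # weighted LS
a_unw = np.sum(X[:, 0] * Y) / np.sum(X[:, 0] ** 2)             # unweighted LS
print(round(a_hat, 2), round(a_unw, 2))  # 3.0 2.33 -- weighting tracks drift
```

The weighted fit (3.0) sits much closer to the current dynamics (a=4) than the unweighted average over all episodes (2.33), mirroring why calibrated forgetting matters for dynamic regret.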
[LG-3] Best-Arm Identification with Noisy Actuation
链接: https://arxiv.org/abs/2604.02255
作者: Merve Karakas,Osama Hanna,Lin F. Yang,Christina Fragouli
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we consider a multi-armed bandit (MAB) instance and study how to identify the best arm when arm commands are conveyed from a central learner to a distributed agent over a discrete memoryless channel (DMC). Depending on the agent capabilities, we provide communication schemes along with their analysis, which interestingly relate to the zero-error capacity of the underlying DMC.
[LG-4] Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives
链接: https://arxiv.org/abs/2604.02250
作者: Hao Zhu,Di Zhou,Donna Slonim
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: To appear in the Proceedings of the 5th Conference on Causal Learning and Reasoning (CLeaR 2026)
Abstract:Understanding causal dependencies in observational data is critical for informing decision-making. These relationships are often modeled as Bayesian Networks (BNs) and Directed Acyclic Graphs (DAGs). Existing methods, such as NOTEARS and DAG-GNN, often face issues with scalability and stability in high-dimensional data, especially when there is a feature-sample imbalance. Here, we show that the denoising score matching objective of diffusion models could smooth the gradients for faster, more stable convergence. We also propose an adaptive k-hop acyclicity constraint that improves runtime over existing solutions that require matrix inversion. We name this framework Denoising Diffusion Causal Discovery (DDCD). Unlike generative diffusion models, DDCD utilizes the reverse denoising process to infer a parameterized causal structure rather than to generate data. We demonstrate the competitive performance of DDCDs on synthetic benchmarking data. We also show that our methods are practically useful by conducting qualitative analyses on two real-world examples. Code is available at this url: this https URL.
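The k-hop acyclicity idea can be sketched as a truncated polynomial penalty (the exact adaptive form is the paper's; the fixed-k version below is our assumption): tr(Mⁱ) with M = A∘A counts weighted i-cycles, so summing terms up to k hops penalizes short cycles using only matrix products, with no matrix exponential or inversion.

```python
import numpy as np

def khop_acyclicity(A, k=3):
    """Truncated acyclicity penalty: tr(M^i) with M = A*A (elementwise square)
    counts weighted closed walks of length i, so summing up to k hops
    penalizes cycles without NOTEARS-style matrix exponentials or inverses."""
    M = A * A
    P = np.eye(len(A))
    h = 0.0
    for i in range(1, k + 1):
        P = P @ M
        h += np.trace(P) / i
    return h

dag = np.array([[0., 1., 0.],
                [0., 0., 1.],
                [0., 0., 0.]])         # chain 0 -> 1 -> 2: acyclic
cyc = dag.copy(); cyc[2, 0] = 1.0      # add edge 2 -> 0: creates a 3-cycle
print(khop_acyclicity(dag), khop_acyclicity(cyc) > 0)  # 0.0 True
```

Each iteration is one matrix product, so the penalty costs O(k·n³) versus the O(n³) but much higher-constant matrix exponential, and it is differentiable for use alongside the denoising objective.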
[LG-5] (PAC-)Learning state machines from data streams: A generic strategy and an improved heuristic (Extended version)
链接: https://arxiv.org/abs/2604.02244
作者: Robert Baumgartner,Sicco Verwer
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*备注: Extended version of Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco
Abstract:This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof of PAC-bounds, and the discussion and analysis of a similar approach has been moved from the appendix into a full section. State machine models simulate the behavior of discrete event systems; they are capable of representing systems such as software systems, network interactions, and control systems, and have been researched extensively. Most learning algorithms, however, assume that all data is available at the beginning of the algorithm, and little research has been done on learning state machines from streaming data. In this paper, we want to close this gap further by presenting a generic method for learning state machines from data streams, as well as a merge heuristic that uses sketches to account for incomplete prefix trees. We implement our approach in an open-source state merging library and compare it with existing methods. We show the effectiveness of our approach with respect to run-time, memory consumption, and quality of results on a well-known open dataset. Additionally, we provide a formal analysis of our algorithm, showing that it is capable of learning within the PAC framework, and show a theoretical improvement to run-time, without sacrificing correctness of the algorithm at larger sample sizes.
[LG-6] On the Role of Depth in the Expressivity of RNNs
链接: https://arxiv.org/abs/2604.02201
作者: Maude Lizaire,Michael Rizvi-Martel,Éric Dupuis,Guillaume Rabusseau
类目: Machine Learning (cs.LG)
*备注:
Abstract:The benefits of depth in feedforward neural networks are well known: composing multiple layers of linear transformations with nonlinear activations enables complex computations. While similar effects are expected in recurrent neural networks (RNNs), it remains unclear how depth interacts with recurrence to shape expressive power. Here, we formally show that depth increases RNNs’ memory capacity efficiently with respect to the number of parameters, thus enhancing expressivity both by enabling more complex input transformations and improving the retention of past information. We broaden our analysis to 2RNNs, a generalization of RNNs with multiplicative interactions between inputs and hidden states. Unlike RNNs, which remain linear without nonlinear activations, 2RNNs perform polynomial transformations whose maximal degree grows with depth. We further show that multiplicative interactions cannot, in general, be replaced by layerwise nonlinearities. Finally, we validate these insights empirically on synthetic and real-world tasks.
[LG-7] Computing the Exact Pareto Front in Average-Cost Multi-Objective Markov Decision Processes
链接: https://arxiv.org/abs/2604.02196
作者: Jiping Luo,Nikolaos Pappas
类目: Systems and Control (eess.SY); Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Many communication and control problems are cast as multi-objective Markov decision processes (MOMDPs). The complete solution to an MOMDP is the Pareto front. Much of the literature approximates this front via scalarization into single-objective MDPs. Recent work has begun to characterize the full front in discounted or simple bi-objective settings by exploiting its geometry. In this work, we characterize the exact front in average-cost MOMDPs. We show that the front is a continuous, piecewise-linear surface lying on the boundary of a convex polytope. Each vertex corresponds to a deterministic policy, and adjacent vertices differ in exactly one state. Each edge is realized as a convex combination of the policies at its endpoints, with the mixing coefficient given in closed form. We apply these results to a remote state estimation problem, where each vertex on the front corresponds to a threshold policy. The exact Pareto front and solutions to certain non-convex MDPs can be obtained without explicitly solving any MDP.
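For contrast with the exact-front characterization above, the scalarization baseline mentioned in the abstract can be sketched in a few lines (a toy illustration, not the paper's method; the cost-tensor shape and the example values are assumptions):

```python
import numpy as np

def scalarize(costs, w):
    """Weighted-sum scalarization: collapse a vector-valued cost into a scalar.

    costs: array of shape (n_states, n_actions, n_objectives)
    w:     convex weight vector over objectives (non-negative, sums to 1)
    Each weight choice yields one single-objective MDP; sweeping w only
    recovers points on the convex part of the Pareto front.
    """
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return np.tensordot(costs, w, axes=([-1], [0]))

# Toy bi-objective cost: 2 states, 2 actions, 2 objectives.
costs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.5, 0.5], [1.0, 1.0]]])
scalar = scalarize(costs, [0.25, 0.75])
print(scalar[0, 0])  # 0.25*1.0 + 0.75*0.0 = 0.25
```

The paper's contribution is precisely to avoid this approximation by characterizing the full piecewise-linear front directly.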
[LG-8] Neural network methods for two-dimensional finite-source reflector design
链接: https://arxiv.org/abs/2604.02184
作者: Roel Hacking,Lisa Kusch,Koondanibha Mitra,Martijn Anthonissen,Wilbert IJzerman
类目: Machine Learning (cs.LG)
*备注: 20 pages, 10 figures, 1 table. Submitted to Machine Learning: Science and Technology
Abstract:We address the inverse problem of designing two-dimensional reflectors that transform light from a finite, extended source into a prescribed far-field distribution. We propose a neural network parameterization of the reflector height and develop two differentiable objective functions: (i) a direct change-of-variables loss that pushes the source distribution through the learned inverse mapping, and (ii) a mesh-based loss that maps a target-space grid back to the source, integrates over intersections, and remains continuous even when the source is discontinuous. Gradients are obtained via automatic differentiation and optimized with a robust quasi-Newton method. As a comparison, we formulate a deconvolution baseline built on a simplified finite-source approximation: a 1D monotone mapping is recovered from flux balance, yielding an ordinary differential equation solved in integrating-factor form; this solver is embedded in a modified Van Cittert iteration with nonnegativity clipping and a ray-traced forward operator. Across four benchmarks – continuous and discontinuous sources, and with/without minimum-height constraints – we evaluate accuracy by ray-traced normalized mean absolute error (NMAE). Our neural network approach converges faster and achieves consistently lower NMAE than the deconvolution method, and handles height constraints naturally. We discuss how the method may be extended to rotationally symmetric and full three-dimensional settings via iterative correction schemes.
[LG-9] A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
链接: https://arxiv.org/abs/2604.02158
作者: Beste Oztop,Dhruva Kulkarni,Zhengji Zhao,Ayse Kivilcim Coskun,Kadidia Konate
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 9 pages, 6 figures
Abstract:Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utilization, as well as the power consumption of the Vienna ab initio Simulation Package (VASP), using the Slurm workload manager historical logs and GPU performance metrics collected by NVIDIA’s Data Center GPU Manager (DCGM). VASP is a widely used materials science application on Perlmutter at NERSC, an HPE Cray EX system based on NVIDIA A100 GPUs. Using our insights from the resource utilization analysis of VASP applications, we propose a resource prediction framework to predict the average GPU power, maximum GPU utilization, and maximum GPU memory utilization values of heterogeneous HPC system applications to enable more efficient scheduling decisions and power-aware system operation. Our prediction framework consists of two stages: 1) using only the Slurm accounting logs as training data and 2) augmenting the training data with historical GPU profiling metrics collected with DCGM. The maximum GPU utilization predictions using only the Slurm submission features achieve up to 97% accuracy. Furthermore, features engineered from GPU-compute and memory activity metrics exhibit good correlations with average power utilization, and our runtime power usage prediction experiments result in up to 92% prediction accuracy. These findings demonstrate the effectiveness of DCGM metrics in capturing application characteristics and highlight their potential for developing predictive models to support dynamic power management in HPC systems.
[LG-10] Auction-Based Online Policy Adaptation for Evolving Objectives
链接: https://arxiv.org/abs/2604.02151
作者: Guruprerana Shabadi,Kaushik Mallik
类目: Machine Learning (cs.LG)
*备注: 17 pages, 6 figures
Abstract:We consider multi-objective reinforcement learning problems where objectives come from an identical family – such as the class of reachability objectives – and may appear or disappear at runtime. Our goal is to design adaptive policies that can efficiently adjust their behaviors as the set of active objectives changes. To solve this problem, we propose a modular framework where each objective is supported by a selfish local policy, and coordination is achieved through a novel auction-based mechanism: policies bid for the right to execute their actions, with bids reflecting the urgency of the current state. The highest bidder selects the action, enabling a dynamic and interpretable trade-off among objectives. Going back to the original adaptation problem, when objectives change, the system adapts by simply adding or removing the corresponding policies. Moreover, as objectives arise from the same family, identical copies of a parameterized policy can be deployed, facilitating immediate adaptation at runtime. We show how the selfish local policies can be computed by turning the problem into a general-sum game, where the policies compete against each other to fulfill their own objectives. To succeed, each policy must not only optimize its own objective, but also reason about the presence of other goals and learn to produce calibrated bids that reflect relative priority. In our implementation, the policies are trained concurrently using proximal policy optimization (PPO). We evaluate on Atari Assault and a gridworld-based path-planning task with dynamic targets. Our method achieves substantially better performance than monolithic policies trained with PPO.
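The auction step described above can be sketched with a hypothetical interface in which each policy maps the current state to a (bid, action) pair and the highest bidder acts; the toy urgency functions below are invented for illustration:

```python
def auction_step(policies, state):
    """One action selection via the auction mechanism (hypothetical interface).

    Each policy returns (bid, action); the highest bidder's action executes.
    Bids are meant to reflect how urgent the current state is for that
    policy's own objective.
    """
    bids_actions = [policy(state) for policy in policies]
    winner = max(range(len(bids_actions)), key=lambda i: bids_actions[i][0])
    return winner, bids_actions[winner][1]

# Two toy "selfish" policies: one urgent near x=0, one urgent near x=10.
policy_a = lambda x: (1.0 / (1.0 + abs(x - 0)), "go_left")
policy_b = lambda x: (1.0 / (1.0 + abs(x - 10)), "go_right")

winner, action = auction_step([policy_a, policy_b], state=2)
print(winner, action)  # policy_a bids higher at x=2, so "go_left" wins
```

Adding or removing an objective at runtime then amounts to adding or removing one entry in the `policies` list, which is the adaptation mechanism the abstract describes.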
[LG-11] AEGIS: Adversarial Entropy-Guided Immune System – Thermodynamic State Space Models for Zero-Day Network Evasion Detection
链接: https://arxiv.org/abs/2604.02149
作者: Vickson Ferrel
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 3 tables
Abstract:As TLS 1.3 encryption limits traditional Deep Packet Inspection (DPI), the security community has pivoted to Euclidean Transformer-based classifiers (e.g., ET-BERT) for encrypted traffic analysis. However, these models remain vulnerable to byte-level adversarial morphing – recent pre-padding attacks reduced ET-BERT accuracy to 25.68%, while VLESS Reality bypasses certificate-based detection entirely. We introduce AEGIS: an Adversarial Entropy-Guided Immune System powered by a Thermodynamic Variance-Guided Hyperbolic Liquid State Space Model (TVD-HL-SSM). Rather than competing in the Euclidean payload-reading domain, AEGIS discards payload bytes in favor of 6-dimensional continuous-time flow physics projected into a non-Euclidean Poincaré manifold. Liquid Time-Constants measure microsecond IAT decay, and a Thermodynamic Variance Detector computes sequence-wide Shannon Entropy to expose automated C2 tunnel anomalies. A pure C++ eBPF Harvester with zero-copy IPC bypasses the Python GIL, enabling a linear-time O(N) Mamba-3 core to process 64,000-packet swarms at line-rate. Evaluated on a 400GB, 4-tier adversarial corpus spanning backbone traffic, IoT botnets, zero-days, and proprietary VLESS Reality tunnels, AEGIS achieves an F1-score of 0.9952 and 99.50% True Positive Rate at 262 µs inference latency on an RTX 4090, establishing a new state-of-the-art for physics-based adversarial network defense.
[LG-12] Application of parametric Shallow Recurrent Decoder Network to magnetohydrodynamic flows in liquid metal blankets of fusion reactors
链接: https://arxiv.org/abs/2604.02139
作者: M. Lo Verso,C. Introini,E. Cervi,L. Savoldi,J. N. Kutz,A. Cammi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Magnetohydrodynamic (MHD) phenomena play a pivotal role in the design and operation of nuclear fusion systems, where electrically conducting fluids (such as liquid metals or molten salts employed in reactor blankets) interact with magnetic fields of varying intensity and orientation, influencing the resulting flow dynamics. The numerical solution of MHD models entails the resolution of highly nonlinear, multiphysics systems of equations, which can become computationally demanding, particularly in multi-query, parametric, or real-time contexts. This study investigates a fully data-driven framework for MHD state reconstruction that integrates dimensionality reduction through Singular Value Decomposition (SVD) with the SHallow REcurrent Decoder (SHRED), a neural network architecture designed to reconstruct the full spatio-temporal state from sparse time-series measurements of selected observables, including previously unseen parametric configurations. The SHRED methodology is applied to a three-dimensional geometry representative of a portion of a WCLL blanket cell, in which lead-lithium flows around a water-cooled tube. Multiple magnetic field configurations are examined, including constant toroidal fields, combined toroidal-poloidal fields, and time-dependent magnetic fields. Across all considered scenarios, SHRED achieves high reconstruction accuracy, robustness, and generalization to magnetic field intensities, orientations, and temporal evolutions not seen during training. Notably, in the presence of time-varying magnetic fields, the model accurately infers the temporal evolution of the magnetic field itself using temperature measurements alone. Overall, the findings identify SHRED as a computationally efficient, data-driven, and flexible approach for MHD state reconstruction, with significant potential for real-time monitoring, diagnostics and control in fusion reactor systems.
[LG-13] AA-SVD : Anchored and Adaptive SVD for Large Language Model Compression
链接: https://arxiv.org/abs/2604.02119
作者: Atul Kumar Sinha,François Fleuret
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a fast low-rank factorization-based framework for compressing large language models that enables rapid compression of billion-parameter models without retraining. Unlike existing factorization-based approaches that optimize only on the original inputs, ignoring distribution shifts from upstream compression and thus propagating errors forward, or those that rely only on shifted inputs and risk drifting away from the original outputs, our approach accounts for both. Beyond individual layer compression, we further refine each transformer block end-to-end, minimizing block-level output distortion and allowing compressed layers to jointly compensate for accumulated errors. By anchoring each compressed layer to the original outputs while explicitly modeling input distribution shifts, our method finds a low-rank approximation that maintains functional equivalence with the original model. Experiments on large language models show that our method consistently outperforms existing SVD-based baselines across compression ratios, with the advantage becoming increasingly pronounced at aggressive compression budgets, where competing methods degrade substantially or collapse entirely, offering a practical solution for efficient, large-scale model deployment.
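The low-rank factorization such methods build on can be illustrated with plain truncated SVD (this shows only the generic building block, not the paper's anchored, shift-aware objective or its block-level refinement):

```python
import numpy as np

def truncated_svd_compress(W, rank):
    """Rank-r factorization of a weight matrix: W ~ A @ B.

    Replaces one (m x n) matrix with A (m x r) and B (r x n), shrinking
    the parameter count from m*n to r*(m+n).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))

A, B = truncated_svd_compress(W, rank=8)
print(A.shape, B.shape)  # (64, 8) (8, 32): 2048 -> 768 parameters

# At full rank (min(m, n) = 32), the factorization reconstructs W exactly
# up to floating-point error.
A_full, B_full = truncated_svd_compress(W, rank=32)
print(np.allclose(W, A_full @ B_full))  # True
```

SVD-based compression methods like the one above differ mainly in *which* error they minimize when choosing the truncation; the paper's point is that anchoring to original outputs while modeling input shifts avoids both error propagation and output drift.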
[LG-14] Cross-Modal Visuo-Tactile Object Perception
链接: https://arxiv.org/abs/2604.02108
作者: Anirvan Dutta,Simone Tasciotti,Claudia Cusseddu,Ang Li,Panayiota Poirazi,Julijana Gjorgjieva,Etienne Burdet,Patrick van der Smagt,Mohsen Kaboli
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 23 pages, 8 figures, 1 table. Submitted for review to journal
Abstract:Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical property estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust and physically consistent cross-modal integration for robotic multi-sensory perception.
[LG-15] Feature Weighting Improves Pool-Based Sequential Active Learning for Regression
链接: https://arxiv.org/abs/2604.02019
作者: Dongrui Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pool-based sequential active learning for regression (ALR) optimally selects a small number of samples sequentially from a large pool of unlabeled samples to label, so that a more accurate regression model can be constructed under a given labeling budget. Representativeness and diversity, which involve computing the distances among different samples, are important considerations in ALR. However, previous ALR approaches do not incorporate the importance of different features in inter-sample distance computation, resulting in sub-optimal sample selection. This paper proposes three feature weighted single-task ALR approaches and two feature weighted multi-task ALR approaches, where the ridge regression coefficients trained from a small amount of previously labeled samples are used to weight the corresponding features in inter-sample distance computation. Experiments showed that this easy-to-implement enhancement almost always improves the performance of four existing ALR approaches, in both single-task and multi-task regression problems. The feature weighting strategy may also be easily extended to stream-based ALR, and classification algorithms.
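The proposed enhancement, using ridge regression coefficients to weight features in inter-sample distance computation, can be sketched as follows (the exact normalization used by the paper is not specified in the abstract; taking absolute coefficient values as weights is one plausible choice and is an assumption here):

```python
import numpy as np

def ridge_coefficients(X, y, lam=1.0):
    """Closed-form ridge regression coefficients (no intercept, for brevity)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def weighted_distance(a, b, w):
    """Inter-sample distance with per-feature weights |w| from ridge coefs."""
    w = np.abs(w)
    return np.sqrt(np.sum(w * (a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
# Feature 0 dominates the response; features 1 and 2 barely matter.
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + 0.01 * rng.standard_normal(20)

w = ridge_coefficients(X, y)
# Differences along the informative feature now count more in the
# representativeness/diversity computations of ALR.
d_along_f0 = weighted_distance(np.array([1.0, 0.0, 0.0]), np.zeros(3), w)
d_along_f2 = weighted_distance(np.array([0.0, 0.0, 1.0]), np.zeros(3), w)
print(d_along_f0 > d_along_f2)  # True
```

An unweighted Euclidean distance would treat both displacements identically, which is exactly the sub-optimality the abstract points out.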
[LG-16] Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
链接: https://arxiv.org/abs/2604.02007
作者: Rafael Pardinas,Ehsan Kamalloo,David Vazquez,Alexandre Drouin
类目: Machine Learning (cs.LG)
*备注: 20 pages, 4 tables, 6 figures, appendix included
Abstract:Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
[LG-17] Generalization Bounds and Statistical Guarantees for Multi-Task and Multiple Operator Learning with MNO Networks
链接: https://arxiv.org/abs/2604.01961
作者: Adrien Weihs,Hayden Schaeffer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multiple operator learning concerns learning operator families G[\alpha]: U \to V, \alpha \in W, indexed by an operator descriptor \alpha . Training data are collected hierarchically by sampling operator instances \alpha , then input functions u per instance, and finally evaluation points x per input, yielding noisy observations of G[\alpha]u . While recent work has developed expressive multi-task and multiple operator learning architectures and approximation-theoretic scaling laws, quantitative statistical generalization guarantees remain limited. We provide a covering-number-based generalization analysis for separable models, focusing on the Multiple Neural Operator (MNO) architecture: we first derive explicit metric-entropy bounds for hypothesis classes given by linear combinations of products of deep ReLU subnetworks, and then combine these complexity bounds with approximation guarantees for MNO to obtain an explicit approximation-estimation tradeoff for the expected test error on new (unseen) triples (\alpha,u,x) . The resulting bound makes the dependence on the hierarchical sampling budgets (n_\alpha,n_u,n_x) transparent and yields an explicit learning-rate statement in the operator-sampling budget n_\alpha , providing a sample-complexity characterization for generalization across operator instances. The structure and architecture can also be viewed as a general-purpose solver or an example of a "small" PDE foundation model, where the triples are one form of multi-modality.
[LG-18] Learn by Surprise Commit by Proof
链接: https://arxiv.org/abs/2604.01951
作者: Kang-Sin Choi
类目: Machine Learning (cs.LG)
*备注: 24 pages, 3 figures
Abstract:We propose LSCP, a self-gated post-training framework for autonomous knowledge acquisition: learning only what a model does not already know, verified against what it does know, at a strength proportional to conviction, with no external oracle. When a passage produces anomalously high per-token loss, LSCP flags it, generates a QA chain that forces the model to articulate its own knowledge and identify gaps, then adjusts AdamW’s \beta_2 proportionally to conviction depth k (the number of self-verification steps the passage survives) via \beta_2 = 0.999 \cdot r^k . The entire learning intensity is governed by a single parameter r . Beyond new knowledge, this process sharpens weakly encoded existing knowledge, which is a primary source of hallucination. The framework is self-extinguishing: as the model learns, per-token loss on learned passages decreases toward the surprisal threshold and the system progressively converges to standard AdamW. This models biological memory consolidation: temporary information in the context window is selectively consolidated into parametric weights, the model’s long-term memory. Experiments on the reference model (Qwen3-14B) and across six models (8B–32B, four families) show that standard fine-tuning produces rote memorization (perturbation gap (the ratio of paraphrase to original perplexity) of 11.6 ± 0.2 x baseline) while all LSCP conditions learn semantically (2.7–3.0x). The r=1.0 condition (identical optimizer, nearly identical data, only QA format differs) confirms that the training data format, not \beta_2 gating, is the primary mechanism preventing memorization; gating instead protects neighboring knowledge from contamination by corrupt content (93 ± 7% accuracy on adjacent questions at r=0.98 vs. 90% baseline).
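The conviction-gated optimizer schedule is given in closed form in the abstract, so it can be written down directly (a minimal sketch of the stated formula, not of the full LSCP pipeline):

```python
def gated_beta2(r, k, base=0.999):
    """AdamW beta_2 gated by conviction depth k:  beta_2 = base * r**k.

    k is the number of self-verification steps a flagged passage survives;
    r is the single global learning-intensity parameter. r = 1.0 recovers
    standard AdamW (beta_2 = 0.999) regardless of k.
    """
    return base * (r ** k)

print(gated_beta2(r=1.0, k=5))   # 0.999: identical to standard AdamW
print(gated_beta2(r=0.98, k=3))  # deeper conviction -> smaller beta_2,
                                 # i.e., stronger, faster-adapting updates
```

Because the gate only lowers beta_2 for passages that survive self-verification, neighboring knowledge is shielded from high-intensity updates triggered by corrupt content, which is the protective role the r=0.98 experiments quantify.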
[LG-19] annbatch unlocks terabyte-scale training of biological data in anndata
链接: https://arxiv.org/abs/2604.01949
作者: Ilan Gold,Felix Fischer,Lucas Arnoldt,F. Alexander Wolf,Fabian J. Theis
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: this https URL
[LG-20] PAC-Bayesian Reward-Certified Outcome Weighted Learning
链接: https://arxiv.org/abs/2604.01946
作者: Yuya Ishikawa,Shu Tamano
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Estimating optimal individualized treatment rules (ITRs) via outcome weighted learning (OWL) often relies on observed rewards that are noisy or optimistic proxies for the true latent utility. Ignoring this reward uncertainty leads to the selection of policies with inflated apparent performance, yet existing OWL frameworks lack the finite-sample guarantees required to systematically embed such uncertainty into the learning objective. To address this issue, we propose PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL). Given a one-sided uncertainty certificate, PROWL constructs a conservative reward and a strictly policy-dependent lower bound on the true expected value. Theoretically, we prove an exact certified reduction that transforms robust policy learning into a unified, split-free cost-sensitive classification task. This formulation enables the derivation of a nonasymptotic PAC-Bayes lower bound for randomized ITRs, where we establish that the optimal posterior maximizing this bound is exactly characterized by a general Bayes update. To overcome the learning-rate selection problem inherent in generalized Bayesian inference, we introduce a fully automated, bounds-based calibration procedure, coupled with a Fisher-consistent certified hinge surrogate for efficient optimization. Our experiments demonstrate that PROWL achieves improvements in estimating robust, high-value treatment regimes under severe reward uncertainty compared to standard methods for ITR estimation.
[LG-21] The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning ICLR
链接: https://arxiv.org/abs/2604.01913
作者: Zihao Wu,Hongyao Tang,Yi Ma,Jiashun Liu,Yan Zheng,Jianye Hao
类目: Machine Learning (cs.LG)
*备注: ICLR
Abstract:Deep reinforcement learning (RL) suffers severely from plasticity loss due to its inherent non-stationarity, which impairs the ability to adapt to new data and learn continually. Unfortunately, our understanding of how plasticity loss arises, dissipates, and can be dissolved remains limited to empirical findings, leaving the theoretical end open. To address this gap, we study the plasticity loss problem from the theoretical perspective of network optimization. By formally characterizing the two culprit factors in the online RL process, the non-stationarity of data distributions and the non-stationarity of targets induced by bootstrapping, our theory attributes the loss of plasticity to two mechanisms: the rank collapse of the Neural Tangent Kernel (NTK) Gram matrix and the \Theta(\frac{1}{k}) decay of gradient magnitude. The first mechanism echoes prior empirical findings from the theoretical perspective and sheds light on the effects of existing methods, e.g., network reset, neuron recycling, and noise injection. Against this backdrop, we focus primarily on the second mechanism and aim to alleviate plasticity loss by addressing the gradient attenuation issue, which is orthogonal to existing methods. We propose Sample Weight Decay, a lightweight method to restore gradient magnitude, as a general remedy to plasticity loss for deep RL methods based on experience replay. In experiments, we evaluate the efficacy of Sample Weight Decay upon TD3, Double DQN, and SAC with the SimBa architecture on MuJoCo, ALE, and DeepMind Control Suite tasks. The results demonstrate that Sample Weight Decay effectively alleviates plasticity loss and consistently improves learning performance across various configurations of deep RL algorithms, UTD ratios, network architectures, and environments, achieving SOTA performance on challenging DMC Humanoid tasks.
[LG-22] Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling
链接: https://arxiv.org/abs/2604.01898
作者: Aleksei Khalin,Ekaterina Zaychenkova,Aleksandr Yugay,Andrey Goncharov,Sergey Korchagin,Alexey Zaytsev,Egor Ershov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Artificial intelligence (AI) systems accelerate medical workflows and improve diagnostic accuracy in healthcare, serving as second-opinion systems. However, the unpredictability of AI errors poses a significant challenge, particularly in healthcare contexts, where mistakes can have severe consequences. A widely adopted safeguard is to pair predictions with uncertainty estimation, enabling human experts to focus on high-risk cases while streamlining routine verification. Current uncertainty estimation methods, however, remain limited, particularly in quantifying aleatoric uncertainty, which arises from data ambiguity and noise. To address this, we propose a novel approach that leverages disagreement in expert responses to generate targets for training machine learning models. These targets are used in conjunction with standard data labels to estimate two components of uncertainty separately, as given by the law of total variance, via a two-ensemble approach, as well as its lightweight variant. We validate our method on binary image classification, binary and multi-class image segmentation, and multiple-choice question answering. Our experiments demonstrate that incorporating expert knowledge can enhance uncertainty estimation quality by 9% to 50% depending on the task, making this source of information invaluable for the construction of risk-aware AI systems in healthcare applications.
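The variance decomposition behind the two-ensemble approach follows the law of total variance, Var(Y) = E[Var(Y|M)] + Var(E[Y|M]). A toy sketch (the ensemble member outputs below are invented, and the paper derives one of its training targets from expert disagreement rather than from a single ensemble as shown here):

```python
import numpy as np

def total_variance_decomposition(means, variances):
    """Split predictive uncertainty via the law of total variance.

    means, variances: arrays of shape (n_members, n_samples), each member's
    predicted mean and predicted variance per sample.
      aleatoric  = average of member-predicted variances  E[Var(Y|M)]
      epistemic  = variance of member-predicted means     Var(E[Y|M])
    """
    aleatoric = variances.mean(axis=0)
    epistemic = means.var(axis=0)
    return aleatoric, epistemic

# Toy ensemble of 3 members scoring 2 samples.
means = np.array([[0.9, 0.2],
                  [0.8, 0.5],
                  [1.0, 0.8]])
variances = np.array([[0.01, 0.20],
                      [0.02, 0.25],
                      [0.01, 0.15]])

alea, epis = total_variance_decomposition(means, variances)
print(alea)  # per-sample data (aleatoric) uncertainty
print(epis)  # members disagree much more on sample 2 (epistemic)
```

High aleatoric scores flag inherently ambiguous cases (the kind experts also disagree on), while high epistemic scores flag inputs the models have not learned well; the paper's point is that expert-disagreement targets sharpen the first component.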
[LG-23] LI-DSN: A Layer-wise Interactive Dual-Stream Network for EEG Decoding
链接: https://arxiv.org/abs/2604.01889
作者: Chenghao Yue,Zhiyuan Ma,Zhongye Xia,Xinche Zhang,Yisi Zhang,Xinke Shen,Sen Song
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electroencephalography (EEG) provides a non-invasive window into brain activity, offering high temporal resolution crucial for understanding and interacting with neural processes through brain-computer interfaces (BCIs). Current dual-stream neural networks for EEG often process temporal and spatial features independently through parallel branches, delaying their integration until a final, late-stage fusion. This design inherently leads to an “information silo” problem, precluding intermediate cross-stream refinement and hindering spatial-temporal decompositions essential for full feature utilization. We propose LI-DSN, a layer-wise interactive dual-stream network that facilitates progressive, cross-stream communication at each layer, thereby overcoming the limitations of late-fusion paradigms. LI-DSN introduces a novel Temporal-Spatial Integration Attention (TSIA) mechanism, which constructs a Spatial Affinity Correlation Matrix (SACM) to capture inter-electrode spatial structural relationships and a Temporal Channel Aggregation Matrix (TCAM) to integrate cosine-gated temporal dynamics under spatial guidance. Furthermore, we employ an adaptive fusion strategy with learnable channel weights to optimize the integration of dual-stream features. Extensive experiments across eight diverse EEG datasets, encompassing motor imagery (MI) classification, emotion recognition, and steady-state visual evoked potentials (SSVEP), consistently demonstrate that LI-DSN significantly outperforms 13 state-of-the-art (SOTA) baseline models, showcasing its superior robustness and decoding performance. The code will be publicized after acceptance.
[LG-24] DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)
链接: https://arxiv.org/abs/2604.01880
作者: Giansalvo Cirrincione
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: 30 pages, 5 figures. Submitted to Neural Networks (Elsevier)
Abstract:Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each component should be. These decisions are made without knowledge of the task, producing architectures that are systematically larger than necessary: empirical studies find that a substantial fraction of heads and layers can be removed after training without performance loss. This paper introduces DDCL-INCRT, an architecture that determines its own structure during training. Two complementary ideas are combined. The first, DDCL (Deep Dual Competitive Learning), replaces the feedforward block with a dictionary of learned prototype vectors representing the most informative directions in the data. The prototypes spread apart automatically, driven by the training objective, without explicit regularisation. The second, INCRT (Incremental Transformer), controls the number of heads: starting from one, it adds a new head only when the directional information uncaptured by existing heads exceeds a threshold. The main theoretical finding is that these two mechanisms reinforce each other: each new head amplifies prototype separation, which in turn raises the signal triggering the next addition. At convergence, the network self-organises into a hierarchy of heads ordered by representational granularity. This hierarchical structure is proved to be unique and minimal, the smallest architecture sufficient for the task, under the stated conditions. Formal guarantees of stability, convergence, and pruning safety are established throughout. The architecture is not something one designs. It is something one derives.
[LG-25] Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
链接: https://arxiv.org/abs/2604.01870
作者: Yiran Ma,Jerome Le Ny,Zhichao Chen,Zhihuan Song
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: This manuscript has been accepted for publication in IEEE Transactions on Industrial Informatics. Copyright has been transferred to IEEE. Reuse of this material is subject to IEEE copyright restrictions
Abstract:In modern process industries, data-driven models are important tools for real-time monitoring when key performance indicators are difficult to measure directly. While accurate predictions are essential, reliable uncertainty quantification (UQ) is equally critical for safety, reliability, and decision-making, but remains a major challenge in current data-driven approaches. In this work, we introduce a diffusion-based posterior sampling framework that inherently produces well-calibrated predictive uncertainty via faithful posterior sampling, eliminating the need for post-hoc calibration. In extensive evaluations on synthetic distributions, the Raman-based phenylacetic acid soft sensor benchmark, and a real ammonia synthesis case study, our method achieves practical improvements over existing UQ techniques in both uncertainty calibration and predictive accuracy. These results highlight diffusion samplers as a principled and scalable paradigm for advancing uncertainty-aware modeling in industrial applications.
[LG-26] Physics Informed Reinforcement Learning with Gibbs Priors for Topology Control in Power Grids
链接: https://arxiv.org/abs/2604.01830
作者: Pantelis Dogoulis,Maxime Cordy
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Topology control for power grid operation is a challenging sequential decision-making problem because the action space grows combinatorially with the size of the grid and action evaluation through simulation is computationally expensive. We propose a physics-informed Reinforcement Learning framework that combines semi-Markov control with a Gibbs prior over the action space that encodes the system's physics. A decision is taken only when the grid enters a hazardous regime, while a graph neural network surrogate predicts the post-action overload risk of feasible topology actions. These predictions are used to construct a physics-informed Gibbs prior that both selects a small state-dependent candidate set and reweights policy logits before action selection. In this way, our method reduces exploration difficulty and online simulation cost while preserving the flexibility of a learned policy. We evaluate the approach in three realistic benchmark environments of increasing difficulty. Across all settings, the proposed method achieves a strong balance between control quality and computational efficiency: it matches oracle-level performance while being approximately 6\times faster on the first benchmark, reaches 94.6% of oracle reward with roughly 200\times lower decision time on the second, and on the most challenging benchmark improves over a PPO baseline by up to 255% in reward and 284% in survived steps while remaining about 2.5\times faster than a strong specialized engineering baseline. These results show that our method provides an effective mechanism for topology control in power grids.
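The core reweighting idea, a Gibbs (Boltzmann) prior built from surrogate risk predictions that both selects a candidate set and shifts policy logits, can be sketched in a few lines. The risk values, temperature, and top-k size below are illustrative stand-ins, not the paper's actual configuration:

```python
import numpy as np

def gibbs_prior(risk, temperature=1.0):
    """Gibbs (Boltzmann) distribution over actions: low predicted
    overload risk -> high prior probability."""
    logits = -np.asarray(risk, dtype=float) / temperature
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def rerank_policy(policy_logits, risk, k=3, temperature=1.0):
    """Restrict to the k lowest-risk actions (state-dependent candidate
    set), then reweight the policy logits with the log Gibbs prior."""
    prior = gibbs_prior(risk, temperature)
    candidates = np.argsort(risk)[:k]
    scores = policy_logits[candidates] + np.log(prior[candidates])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return candidates, probs

risk = np.array([0.9, 0.1, 0.5, 0.05, 0.7])     # surrogate overload risk
logits = np.zeros(5)                            # uniform policy for illustration
cands, probs = rerank_policy(logits, risk, k=3)
print(cands, probs)
```

With a uniform policy, the selection probability is driven entirely by the prior, so the lowest-risk candidate dominates; a trained policy's logits would shift this ranking.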
[LG-27] Graph Neural Operator Towards Edge Deployability and Portability for Sparse-to-Dense Real-Time Virtual Sensing on Irregular Grids
链接: https://arxiv.org/abs/2604.01802
作者: William Howes,Jason Yoo,Kazuma Kobayashi,Subhankar Sarkar,Farid Ahmed,Souvik Chakraborty,Syed Bahauddin Alam
类目: Machine Learning (cs.LG)
*备注: 34 pages, 5 figures, 16 tables
Abstract:Accurate sensing of spatially distributed physical fields typically requires dense instrumentation, which is often infeasible in real-world systems due to cost, accessibility, and environmental constraints. Physics-based solvers address this through direct numerical integration of governing equations, but their computational latency and power requirements preclude real-time use in resource-constrained monitoring and control systems. Here we introduce VIRSO (Virtual Irregular Real-Time Sparse Operator), a graph-based neural operator for sparse-to-dense reconstruction on irregular geometries, and a variable-connectivity algorithm, Variable KNN (V-KNN), for mesh-informed graph construction. Unlike prior neural operators that treat hardware deployability as secondary, VIRSO reframes inference as measurement: the combination of both spectral and spatial analysis provides accurate reconstruction without the high latency and power consumption of previous graph-based methodologies with poor scalability, presenting VIRSO as a potential candidate for edge-constrained, real-time virtual sensing. We evaluate VIRSO on three nuclear thermal-hydraulic benchmarks of increasing geometric and multiphysics complexity, across reconstruction ratios from 47:1 to 156:1. VIRSO achieves mean relative L_2 errors below 1%, outperforming other benchmark operators while using fewer parameters. The full 10-layer configuration reduces the energy-delay product (EDP) from \approx206 J \cdot ms for the graph operator baseline to 10.1 J \cdot ms on an NVIDIA H200. Implemented on an NVIDIA Jetson Orin Nano, all configurations of VIRSO provide sub-10 W power consumption and sub-second latency. These results establish the edge-feasibility and hardware-portability of VIRSO and present compute-aware operator learning as a new paradigm for real-time sensing in inaccessible and resource-constrained environments.
[LG-28] Bridging Deep Learning and Integer Linear Programming: A Predictive-to-Prescriptive Framework for Supply Chain Analytics
链接: https://arxiv.org/abs/2604.01775
作者: Khai Banh Nghiep,Duc Nguyen Minh,Lan Hoang Thi
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures, 4 tables
Abstract:Although demand forecasting is a critical component of supply chain planning, actual retail data can exhibit irreconcilable seasonality, irregular spikes, and noise, rendering precise projections nearly unattainable. This paper proposes a three-step analytical framework that combines forecasting and operational analytics. The first stage consists of exploratory data analysis, where delivery-tracked data from 180,519 transactions are partitioned, and long-term trends, seasonality, and delivery-related attributes are examined. In the second stage, the forecasting performance of a statistical time series decomposition model (MSTL) is compared against recent deep learning architectures (N-BEATS and N-HiTS); both deep learning models outperformed the statistical benchmark to a large extent. N-BEATS, having the lowest forecasting error, was selected as the final model. In the third and final stage, its forecast of 1,918 units over the next 4 weeks is fed into a deterministic integer linear program that minimizes total delivery time under budget, capacity, and service constraints. The solution allocation provided a feasible and cost-optimal shipping plan. Overall, the study provides a compelling example of the practical impact of precise forecasting and simple, highly interpretable model optimization in logistics.
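The prescriptive step, feeding a demand forecast into an integer linear program that minimizes delivery time under budget and capacity constraints, can be illustrated with a tiny brute-force stand-in. All carriers, costs, capacities, and the demand figure below are invented for illustration; a real instance would use a MILP solver rather than enumeration:

```python
from itertools import product

# Toy prescriptive step: allocate a forecast demand of D units across
# three carriers, minimizing total delivery time subject to a budget
# and per-carrier capacity limits.
D = 12                                  # forecast demand (units)
time_per_unit = [2.0, 3.0, 1.5]         # delivery time per unit shipped
cost_per_unit = [5.0, 2.0, 8.0]         # shipping cost per unit
capacity = [6, 10, 4]                   # per-carrier capacity
budget = 60.0

best = None
for x in product(*(range(c + 1) for c in capacity)):
    if sum(x) != D:
        continue                        # meet demand exactly
    if sum(c * xi for c, xi in zip(cost_per_unit, x)) > budget:
        continue                        # respect the budget
    t = sum(ti * xi for ti, xi in zip(time_per_unit, x))
    if best is None or t < best[0]:
        best = (t, x)

total_time, allocation = best
print(allocation, total_time)
```

The budget binds here: the fastest carrier is also the most expensive, so the optimal plan mixes carriers rather than greedily using the fastest one.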
[LG-29] Dual-Attention Based 3D Channel Estimation
链接: https://arxiv.org/abs/2604.01769
作者: Xiangzhao Qin,Sha Hu
类目: Machine Learning (cs.LG)
*备注: 5 pages, 6 figures
Abstract:For multi-input and multi-output (MIMO) channels, the optimal channel estimation (CE) based on linear minimum mean square error (LMMSE) requires three-dimensional (3D) filtering. However, the complexity is often prohibitive due to large matrix dimensions. Suboptimal estimators approximate 3DCE by decomposing it into time, frequency, and spatial domains, which yields noticeable performance degradation under correlated MIMO channels. On the other hand, recent advances in deep learning (DL) can exploit channel correlations in all domains via attention mechanisms. Building on this capability, we propose a dual-attention-based 3DCE network (3DCENet) that achieves accurate estimates.
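The LMMSE baseline the abstract starts from is the textbook filter W = R_h (R_h + σ²I)⁻¹ for the observation model y = h + n. A minimal sketch on a toy correlated channel (the correlation model, dimensions, and noise level are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def lmmse_filter(R_h, sigma2):
    """Classic LMMSE filter for y = h + n with n ~ N(0, sigma2 I):
    W = R_h (R_h + sigma2 I)^{-1}."""
    n = R_h.shape[0]
    return R_h @ np.linalg.inv(R_h + sigma2 * np.eye(n))

# Toy correlated channel: exponential correlation across 8 taps.
n = 8
R_h = np.array([[0.9 ** abs(i - j) for j in range(n)] for i in range(n)])
L = np.linalg.cholesky(R_h)             # sample h with covariance R_h
h = L @ rng.standard_normal(n)
sigma2 = 0.1
y = h + np.sqrt(sigma2) * rng.standard_normal(n)

W = lmmse_filter(R_h, sigma2)
h_hat = W @ y                           # denoised channel estimate
print(h_hat)
```

Because R_h and (R_h + σ²I)⁻¹ share eigenvectors, W is a symmetric shrinkage operator with eigenvalues λ/(λ+σ²) in (0, 1): each channel mode is attenuated according to its signal-to-noise ratio. The cubic cost of the matrix inverse is exactly what makes full 3D filtering prohibitive at scale.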
[LG-30] DDCL: Deep Dual Competitive Learning: A Differentiable End-to-End Framework for Unsupervised Prototype-Based Representation Learning
链接: https://arxiv.org/abs/2604.01740
作者: Giansalvo Cirrincione
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:A persistent structural weakness in deep clustering is the disconnect between feature learning and cluster assignment. Most architectures invoke an external clustering step, typically k-means, to produce pseudo-labels that guide training, preventing the backbone from directly optimising for cluster quality. This paper introduces Deep Dual Competitive Learning (DDCL), the first fully differentiable end-to-end framework for unsupervised prototype-based representation learning. The core contribution is architectural: the external k-means is replaced by an internal Dual Competitive Layer (DCL) that generates prototypes as native differentiable outputs of the network. This single inversion makes the complete pipeline, from backbone feature extraction through prototype generation to soft cluster assignment, trainable by backpropagation through a single unified loss, with no Lloyd iterations, no pseudo-label discretisation, and no external clustering step. To ground the framework theoretically, the paper derives an exact algebraic decomposition of the soft quantisation loss into a simplex-constrained reconstruction error and a non-negative weighted prototype variance term. This identity reveals a self-regulating mechanism built into the loss geometry: the gradient of the variance term acts as an implicit separation force that resists prototype collapse without any auxiliary objective, and leads to a global Lyapunov stability theorem for the reduced frozen-encoder system. Six blocks of controlled experiments validate each structural prediction. The decomposition identity holds with zero violations across more than one hundred thousand training epochs; the negative feedback cycle is confirmed with Pearson -0.98; with a jointly trained backbone, DDCL outperforms its non-differentiable ablation by 65% in clustering accuracy and DeepCluster end-to-end by 122%.
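The decomposition the abstract states, soft quantisation loss = simplex-constrained reconstruction error + non-negative weighted prototype variance, is the standard bias-variance identity for simplex weights and can be checked numerically. The dimensions and the softmax assignments below are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(42)

def soft_quant_loss(x, P, a):
    """Soft quantisation loss: assignment-weighted squared distances
    from input x to each prototype row of P."""
    return float(np.sum(a * np.sum((P - x) ** 2, axis=1)))

def decomposition(x, P, a):
    """Exact split into reconstruction error + weighted prototype
    variance around the soft reconstruction m = sum_k a_k p_k."""
    m = a @ P
    recon = float(np.sum((x - m) ** 2))
    variance = float(np.sum(a * np.sum((P - m) ** 2, axis=1)))
    return recon, variance

K, d = 5, 3
x = rng.standard_normal(d)
P = rng.standard_normal((K, d))                 # prototype dictionary
logits = rng.standard_normal(K)
a = np.exp(logits) / np.exp(logits).sum()       # simplex-constrained assignments

lhs = soft_quant_loss(x, P, a)
recon, variance = decomposition(x, P, a)
print(lhs, recon + variance)
```

The cross term vanishes because the weights sum to one, and the variance term is non-negative by construction, which is what lets its gradient act as the implicit separation force the paper describes.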
[LG-31] Koopman-Based Nonlinear Identification and Adaptive Control of a Turbofan Engine
链接: https://arxiv.org/abs/2604.01730
作者: David Grasev
类目: Machine Learning (cs.LG)
*备注: 21 pages, 23 figures
Abstract:This paper investigates Koopman operator-based approaches for multivariable control of a two-spool turbofan engine. A physics-based component-level model is developed to generate training data and validate the controllers. A meta-heuristic extended dynamic mode decomposition is developed, with a cost function designed to accurately capture both spool-speed dynamics and the engine pressure ratio (EPR), enabling the construction of a single Koopman model suitable for multiple control objectives. Using the identified time-varying Koopman model, two controllers are developed: an adaptive Koopman-based model predictive controller (AKMPC) with a disturbance observer and a Koopman-based feedback linearization controller (K-FBLC), which serves as a benchmark. The controllers are evaluated for two control strategies, namely configurations of spool speeds and EPR, under both sea-level and varying flight conditions. The results demonstrate that the proposed identification approach enables accurate predictions of both spool speeds and EPR, allowing the Koopman model to be reused flexibly across different control formulations. While both control strategies achieve comparable performance in steady conditions, the AKMPC exhibits superior robustness compared with the K-FBLC under varying flight conditions due to its ability to compensate for model mismatch. Moreover, the EPR control strategy improves the thrust response. The study highlights the applicability of Koopman-based control and demonstrates the advantages of the AKMPC-based framework for robust turbofan engine control.
[LG-32] MATA-Former SIICU: Semantic Aware Temporal Alignment for High-Fidelity ICU Risk Prediction
链接: https://arxiv.org/abs/2604.01727
作者: Zhichong Zheng,Xiaohang Nie,Xueqi Wang,Yuanjin Zhao,Haitao Zhang,Yichao Tang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Forecasting evolving clinical risks relies on intrinsic pathological dependencies rather than mere chronological proximity, yet current methods struggle with coarse binary supervision and physical timestamps. To align predictive modeling with clinical logic, we propose the Medical-semantics Aware Time-ALiBi Transformer (MATA-Former), utilizing event semantics to dynamically parameterize attention weights to prioritize causal validity over time lags. Furthermore, we introduce Plateau-Gaussian Soft Labeling (PSL), reformulating binary classification into continuous multi-horizon regression for full-trajectory risk modeling. Evaluated on SIICU – a newly constructed dataset featuring over 506k events with rigorous expert-verified, fine-grained annotations – and the MIMIC-IV dataset, our framework demonstrates superior efficacy and robust generalization in capturing risks from text-intensive, irregular clinical time series.
[LG-33] Label Shift Estimation With Incremental Prior Update SDM2025
链接: https://arxiv.org/abs/2604.01651
作者: Yunrui Zhang,Gustavo Batista,Salil S. Kanhere
类目: Machine Learning (cs.LG)
*备注: SIAM SDM 2025
Abstract:An assumption often made in supervised learning is that the training and testing sets have the same label distribution. However, in real-life scenarios, this assumption rarely holds. For example, medical diagnosis result distributions change over time and across locations; fraud detection models must adapt as patterns of fraudulent activity shift; the category distribution of social media posts changes based on trending topics and user demographics. In the task of label shift estimation, the goal is to estimate the changing label distribution p_t(y) in the testing set, assuming the likelihood p(x|y) does not change, implying no concept drift. In this paper, we propose a new approach for post-hoc label shift estimation, unlike previous methods that perform moment matching with confusion matrix estimated from a validation set or maximize the likelihood of the new data with an expectation-maximization algorithm. We aim to incrementally update the prior on each sample, adjusting each posterior for more accurate label shift estimation. The proposed method is based on intuitive assumptions on classifiers that are generally true for modern probabilistic classifiers. The proposed method relies on a weaker notion of calibration compared to other methods. As a post-hoc approach for label shift estimation, the proposed method is versatile and can be applied to any black-box probabilistic classifier. Experiments on CIFAR-10 and MNIST show that the proposed method consistently outperforms the current state-of-the-art maximum likelihood-based methods under different calibrations and varying intensities of label shift.
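The incremental idea, adjust each test posterior with the current prior estimate, then fold the adjusted posterior back into that estimate, can be sketched as follows. This is an illustration of the general scheme, not the paper's exact update rule; the simulated "calibrated" posteriors and the running-mean update are assumptions of this sketch:

```python
import numpy as np

def adjust_posterior(post, train_prior, new_prior):
    """Reweight a calibrated posterior from the training prior to a
    candidate test prior (standard Bayes prior correction)."""
    w = post * (new_prior / train_prior)
    return w / w.sum()

def incremental_label_shift(posteriors, train_prior):
    """Sweep over test-sample posteriors, adjusting each with the
    current prior estimate and updating the estimate as a running
    mean of adjusted posteriors."""
    q = train_prior.copy()
    for t, post in enumerate(posteriors, start=1):
        adj = adjust_posterior(post, train_prior, q)
        q = q + (adj - q) / t           # running mean; q stays on the simplex
    return q

# Toy setup: 3 classes, uniform training prior, shifted test prior.
rng = np.random.default_rng(1)
train_prior = np.full(3, 1 / 3)
test_prior = np.array([0.7, 0.2, 0.1])
labels = rng.choice(3, size=2000, p=test_prior)
# Simulated posteriors peaked at the true label (stand-in for a
# calibrated black-box classifier).
posteriors = np.full((2000, 3), 0.15)
posteriors[np.arange(2000), labels] = 0.7

q_hat = incremental_label_shift(posteriors, train_prior)
print(q_hat)
```

Note that this naive feedback loop can over-amplify the majority class without damping; the point of the sketch is only the per-sample posterior adjustment and incremental prior update, which pulls the estimate away from the (wrong) training prior toward the shifted one.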
[LG-34] Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error
链接: https://arxiv.org/abs/2604.01613
作者: Taisuke Kobayashi
类目: Machine Learning (cs.LG)
*备注: 38 pages, 12 figures
Abstract:In reinforcement learning (RL), temporal difference (TD) errors are widely adopted for optimizing value and policy functions. However, since the TD error is defined by a bootstrap method, its computation tends to be noisy and destabilize learning. Heuristics to improve the accuracy of TD errors, such as target networks and ensemble models, have been introduced so far. While these are essential approaches for the current deep RL algorithms, they cause side effects like increased computational cost and reduced learning efficiency. Therefore, this paper revisits the TD learning algorithm based on control as inference, deriving a novel algorithm capable of robust learning against noisy TD errors. First, the distribution model of optimality, a binary random variable, is represented by a sigmoid function. Alongside forward and reverse Kullback-Leibler divergences, this new model derives a robust learning rule: when the sigmoid function saturates with a large TD error probably due to noise, the gradient vanishes, implicitly excluding it from learning. Furthermore, the two divergences exhibit distinct gradient-vanishing characteristics. Building on these analyses, the optimality is decomposed into multiple levels to achieve pseudo-quantization of TD errors, aiming for further noise reduction. Additionally, a Jensen-Shannon divergence-based approach is approximately derived to inherit the characteristics of both divergences. These benefits are verified through RL benchmarks, demonstrating stable learning even when heuristics are insufficient or rewards contain noise.
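The gradient-vanishing behavior the abstract relies on is easy to visualize: if optimality is modeled with a sigmoid of the TD error, the gradient carries a factor σ'(βδ), which saturates to zero for large |δ|. The scale β and the specific loss form here are illustrative, not the paper's exact construction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def td_gradient_weight(delta, beta=1.0):
    """Derivative of sigmoid(beta * delta): near-zero for TD errors far
    from zero, so probably-noisy large errors are implicitly excluded
    from the update."""
    s = sigmoid(beta * delta)
    return beta * s * (1.0 - s)

deltas = np.array([0.0, 1.0, 5.0, 20.0])
weights = td_gradient_weight(deltas)
print(weights)
```

The weight is maximal (β/4) at δ = 0 and decays monotonically, which is the robustness mechanism: a TD error of 20 contributes essentially nothing to the gradient.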
[LG-35] Training In-Context and In-Weights Mixtures Via Contrastive Context Sampling
链接: https://arxiv.org/abs/2604.01601
作者: Deeptanshu Malu,Deevyanshu Malu,Aditya Nemiwal,Sunita Sarawagi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate training strategies that co-develop in-context learning (ICL) and in-weights learning (IWL), and the ability to switch between them based on context relevance. Although current LLMs exhibit both modes, standard task-specific fine-tuning often erodes ICL, motivating IC-Train - fine-tuning with in-context examples. Prior work has shown that emergence of ICL after IC-Train depends on factors such as task diversity and training duration. In this paper we show that the similarity structure between target inputs and context examples also plays an important role. Random context leads to loss of ICL and IWL dominance, while only similar examples in context causes ICL to degenerate to copying labels without regard to relevance. To address this, we propose a simple Contrastive-Context which enforces two types of contrasts: (1) mix of similar and random examples within a context to evolve a correct form of ICL, and (2) varying grades of similarity across contexts to evolve ICL-IWL mixtures. We present insights on the importance of such contrast with theoretical analysis of a minimal model. We validate with extensive empirical evaluation on four LLMs and several tasks. Diagnostic probes confirm that contrasted contexts yield stable ICL-IWL mixtures, avoiding collapse into pure ICL, IWL, or copying.
[LG-36] Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
链接: https://arxiv.org/abs/2604.01597
作者: Dong Shu,Denghui Zhang,Jessica Hullman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose \textbfInfluence-Guided PPO (I-PPO), a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training efficiency while effectively reducing unfaithful CoT reasoning.
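The filtering criterion, a gradient-based influence score that flags episodes anti-aligned with a validation gradient, reduces in its simplest first-order form to a dot product. The gradients below are random placeholders; in practice they would be per-episode policy gradients and a gradient on held-out validation data:

```python
import numpy as np

def influence_scores(episode_grads, val_grad):
    """First-order influence approximation: an episode whose gradient
    is anti-aligned with the validation gradient (negative dot product)
    is expected to hurt validation performance if trained on."""
    return episode_grads @ val_grad

def filter_rollouts(episode_grads, val_grad):
    """Keep only episodes with non-negative influence score."""
    scores = influence_scores(episode_grads, val_grad)
    keep = scores >= 0
    return keep, scores

rng = np.random.default_rng(7)
val_grad = rng.standard_normal(16)              # gradient on validation set
episode_grads = rng.standard_normal((8, 16))    # one gradient per rollout episode
keep, scores = filter_rollouts(episode_grads, val_grad)
print(keep.sum(), "of", len(keep), "episodes kept")
```

In a real PPO loop this mask would be applied to the rollout buffer before each optimization epoch, so anti-aligned (likely noisy or unfaithful) episodes never contribute to the policy update.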
[LG-37] Optimizing EEG Graph Structure for Seizure Detection: An Information Bottleneck and Self-Supervised Learning Approach ALT
链接: https://arxiv.org/abs/2604.01595
作者: Lincan Li,Rikuto Kotoge,Xihao Piao,Zheng Chen,Yushun Dong
类目: Machine Learning (cs.LG)
*备注: Accepted by IEEE 14th International Conference on Healthcare Informatics (ICHI)
Abstract:Seizure detection from EEG signals is highly challenging due to complex spatiotemporal dynamics and extreme inter-patient variability. To model them, recent methods construct dynamic graphs via statistical correlations, predefined similarity measures, or implicit learning, yet rarely account for EEG’s noisy nature. Consequently, these graphs usually contain redundant or task-irrelevant connections, undermining model performance even with state-of-the-art architectures. In this paper, we present a new perspective for EEG seizure detection: jointly learning denoised dynamic graph structures and informative spatial-temporal representations guided by the Information Bottleneck (IB). Unlike prior approaches, our graph constructor explicitly accounts for the noisy characteristics of EEG data, producing compact and reliable connectivity patterns that better support downstream seizure detection. To further enhance representation learning, we employ a self-supervised Graph Masked AutoEncoder that reconstructs masked EEG signals based on dynamic graph context, promoting structure-aware and compact representations aligned with the IB principle. Bringing things together, we introduce Information Bottleneck-guided EEG SeizuRE DetectioN via SElf-Supervised Learning (IRENE), which explicitly learns dynamic graph structures and interpretable spatial-temporal EEG representations. IRENE addresses three core challenges: (i) Identifying the most informative nodes and edges; (ii) Explaining seizure propagation in the brain network; and (iii) Enhancing robustness against label scarcity and inter-patient variability. Extensive experiments on benchmark EEG datasets demonstrate that our method outperforms state-of-the-art baselines in seizure detection and provides clinically meaningful insights into seizure dynamics. The source code is available at this https URL.
[LG-38] Variational LSTM with Augmented Inputs: Nonlinear Response History Metamodeling with Aleatoric and Epistemic Uncertainty
链接: https://arxiv.org/abs/2604.01587
作者: Manisha Sapkota,Min Li,Bowei Li
类目: Machine Learning (cs.LG)
*备注: 22 pages, 10 figures
Abstract:Uncertainty propagation in high-dimensional nonlinear dynamic structural systems is pivotal in state-of-the-art performance-based design and risk assessment, where uncertainties from both excitations and structures, i.e., the aleatoric uncertainty, must be considered. This poses a significant challenge due to heavy computational demands. Machine learning techniques are thus introduced as metamodels to alleviate this burden. However, the “black box” nature of Machine learning models underscores the necessity of avoiding overly confident predictions, particularly when data and training efforts are insufficient. This creates a need, in addition to considering the aleatoric uncertainty, of estimating the uncertainty related to the prediction confidence, i.e., epistemic uncertainty, for machine learning-based metamodels. We developed a probabilistic metamodeling technique based on a variational long short-term memory (LSTM) with augmented inputs to simultaneously capture aleatoric and epistemic uncertainties. Key random system parameters are treated as augmented inputs alongside excitation series carrying record-to-record variability to capture the full range of aleatoric uncertainty. Meanwhile, epistemic uncertainty is effectively approximated via the Monte Carlo dropout scheme. Unlike computationally expensive full Bayesian approaches, this method incurs negligible additional training costs while enabling nearly cost-free uncertainty simulation. The proposed technique is demonstrated through multiple case studies involving stochastic seismic or wind excitations. Results show that the calibrated metamodels accurately reproduce nonlinear response time histories and provide confidence bounds indicating the associated epistemic uncertainty.
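The Monte Carlo dropout scheme used for the epistemic part is standard: keep dropout active at inference, run several stochastic forward passes, and read uncertainty off the spread of predictions. A minimal numpy sketch with a toy two-layer network (shapes, dropout rate, and pass count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def mc_dropout_predict(x, W1, W2, p=0.2, T=200):
    """Monte Carlo dropout: T stochastic forward passes with dropout
    active at inference; the sample std approximates epistemic
    uncertainty, at negligible extra training cost."""
    preds = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)             # ReLU hidden layer
        mask = rng.random(h.shape) > p          # Bernoulli dropout mask
        h = h * mask / (1.0 - p)                # inverted-dropout scaling
        preds.append(h @ W2)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 32)) / np.sqrt(8)
W2 = rng.standard_normal((32, 1)) / np.sqrt(32)

mean, epistemic_std = mc_dropout_predict(x, W1, W2)
print(mean, epistemic_std)
```

In the paper's setting the network is an LSTM and the augmented inputs carry the aleatoric randomness; the dropout sampling above supplies the epistemic band on top of that.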
[LG-39] Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents
链接: https://arxiv.org/abs/2604.01576
作者: Shalima Binta Manir,Tim Oates
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models deployed in supportive or advisory roles must balance helpfulness with preservation of user autonomy, yet standard alignment methods primarily optimize for helpfulness and harmlessness without explicitly modeling relational risks such as dependency reinforcement, overprotection, or coercive guidance. We introduce Care-Conditioned Neuromodulation (CCN), a state-dependent control framework in which a learned scalar signal derived from structured user state and dialogue context conditions response generation and candidate selection. We formalize this setting as an autonomy-preserving alignment problem and define a utility function that rewards autonomy support and helpfulness while penalizing dependency and coercion. We also construct a benchmark of relational failure modes in multi-turn dialogue, including reassurance dependence, manipulative care, overprotection, and boundary inconsistency. On this benchmark, care-conditioned candidate generation combined with utility-based reranking improves autonomy-preserving utility by +0.25 over supervised fine-tuning and +0.07 over preference optimization baselines while maintaining comparable supportiveness. Pilot human evaluation and zero-shot transfer to real emotional-support conversations show directional agreement with automated metrics. These results suggest that state-dependent control combined with utility-based selection is a practical approach to multi-objective alignment in autonomy-sensitive dialogue.
[LG-40] EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
链接: https://arxiv.org/abs/2604.01554
作者: Yiming Fan(1),Jun Yeon Won(1),Ding Zhu(1),Melih Sirlanci(1),Mahdi Khalili(1),Carter Yagemann(1) ((1) The Ohio State University)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 13 pages, 7 figures. This is a technical report for the EXHIB benchmark. Code and data are available at this https URL
Abstract:Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. In the past few decades, numerous models and tools have been developed for this application; however, due to the lack of a comprehensive universal benchmark in this field, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real-world applications. We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low- and mid-level binary variations does not generalize to high-level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.
[LG-41] ZEUS: Accelerating Diffusion Models with Only Second-Order Predictor
链接: https://arxiv.org/abs/2604.01552
作者: Yixiao Wang,Ting Jiang,Zishan Shao,Hancheng Ye,Jingwei Sun,Mingyuan Ma,Jianyi Zhang,Yiran Chen,Hai Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Denoising generative models deliver high-fidelity generation but remain bottlenecked by inference latency due to the many iterative denoiser calls required during sampling. Training-free acceleration methods reduce latency by either sparsifying the model architecture or shortening the sampling trajectory. Current training-free acceleration methods are more complex than necessary: higher-order predictors amplify error under aggressive speedups, and architectural modifications hinder deployment. Beyond 2x acceleration, step skipping creates structural scarcity – at most one fresh evaluation per local window – leaving the computed output and its backward difference as the only causally grounded information. Based on this, we propose ZEUS, an acceleration method that predicts reduced denoiser evaluations using a second-order predictor, and stabilizes aggressive consecutive skipping with an interleaved scheme that avoids back-to-back extrapolations. ZEUS adds essentially zero overhead, no feature caches, and no architectural modifications, and it is compatible with different backbones, prediction objectives, and solver choices. Across image and video generation, ZEUS consistently improves the speed-fidelity performance over recent training-free baselines, achieving up to 3.2x end-to-end speedup while maintaining perceptual quality. Our code is available at: this https URL.
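A second-order predictor built from the last computed output and its backward differences, the only causally grounded information the abstract identifies, amounts to quadratic extrapolation. The predictor below is the generic formula, not ZEUS's exact scheme; a denoiser trajectory is stood in for by a quadratic sequence, on which the extrapolation is exact:

```python
import numpy as np

def second_order_predict(f_n, f_nm1, f_nm2):
    """Quadratic extrapolation from the last three computed outputs via
    backward differences:
        f(n+1) ~ f(n) + Δf(n) + Δ²f(n) = 3 f(n) - 3 f(n-1) + f(n-2)."""
    return 3.0 * f_n - 3.0 * f_nm1 + f_nm2

# Exact on any trajectory that is quadratic in the step index.
t = np.arange(5, dtype=float)
traj = 2.0 * t**2 - t + 0.5                     # stand-in for denoiser outputs
pred = second_order_predict(traj[3], traj[2], traj[1])
print(pred, traj[4])
```

The interleaving the paper describes would alternate such predicted steps with fresh denoiser evaluations, so extrapolations are never chained back-to-back and the backward differences stay anchored to real outputs.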
[LG-42] Learning ECG Image Representations via Dual Physiological-Aware Alignments
链接: https://arxiv.org/abs/2604.01526
作者: Hung Manh Pham,Jialu Tang,Aaqib Saeed,Dong Ma,Bin Zhu,Pan Zhou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electrocardiograms (ECGs) are among the most widely used diagnostic tools for cardiovascular diseases, and a large amount of ECG data worldwide appears only in image form. However, most existing automated ECG analysis methods rely on access to raw signal recordings, limiting their applicability in real-world and resource-constrained settings. In this paper, we present ECG-Scan, a self-supervised framework for learning clinically generalized representations from ECG images through dual physiological-aware alignments: 1) Our approach optimizes image representation learning using multimodal contrastive alignment between image and gold-standard signal-text modalities. 2) We further integrate domain knowledge via soft-lead constraints, regularizing the reconstruction process and improving signal lead inter-consistency. Extensive benchmarking across multiple datasets and downstream tasks demonstrates that our image-based model achieves superior performance compared to existing image baselines and notably narrows the gap between ECG image and signal analysis. These results highlight the potential of self-supervised image modeling to unlock large-scale legacy ECG data and broaden access to automated cardiovascular diagnostics.
[LG-43] Beyond Logit Adjustment: A Residual Decomposition Framework for Long-Tailed Reranking
链接: https://arxiv.org/abs/2604.01506
作者: Zhanliang Wang,Hongzhuo Chen,Quan Minh Nguyen,Mian Umair Ahsan,Kai Wang
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:Long-tailed classification, where a small number of frequent classes dominate many rare ones, remains challenging because models systematically favor frequent classes at inference time. Existing post-hoc methods such as logit adjustment address this by adding a fixed classwise offset to the base-model logits. However, the correction required to restore the relative ranking of two classes need not be constant across inputs, and a fixed offset cannot adapt to such variation. We study this problem through Bayes-optimal reranking on a base-model top-k shortlist. The gap between the optimal score and the base score, the residual correction, decomposes into a classwise component that is constant within each class, and a pairwise component that depends on the input and competing labels. When the residual is purely classwise, a fixed offset suffices to recover the Bayes-optimal ordering. We further show that when the same label pair induces incompatible ordering constraints across contexts, no fixed offset can achieve this recovery. This decomposition leads to testable predictions regarding when pairwise correction can improve performance and when it cannot. We develop REPAIR (Reranking via Pairwise residual correction), a lightweight post-hoc reranker that combines a shrinkage-stabilized classwise term with a linear pairwise term driven by competition features on the shortlist. Experiments on five benchmarks spanning image classification, species recognition, scene recognition, and rare disease diagnosis confirm that the decomposition explains where pairwise correction helps and where classwise correction alone suffices.
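The fixed classwise offset that REPAIR generalizes is standard post-hoc logit adjustment, which can be stated in one line (the function name and `tau` parameter are illustrative; this is the baseline, not REPAIR's pairwise term):

```python
import numpy as np

def logit_adjust(logits, class_priors, tau=1.0):
    """Post-hoc logit adjustment: subtract a fixed classwise offset
    tau * log(prior), so rarer classes are boosted at inference time."""
    return np.asarray(logits, float) - tau * np.log(np.asarray(class_priors, float))
```

The paper's point is exactly that this offset is constant per class: when the same label pair must be ordered differently for different inputs, no choice of `tau` can recover the Bayes-optimal ranking.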
[LG-44] Matching Accuracy Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training
链接: https://arxiv.org/abs/2604.01499
作者: William Hoy,Binxu Wang,Xu Pan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Evolution Strategies (ES) have emerged as a scalable gradient-free alternative to reinforcement learning based LLM fine-tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space. We compare ES and Group Relative Policy Optimization (GRPO) across four tasks in both single-task and sequential continual-learning settings. ES matches or exceeds GRPO in single-task accuracy and remains competitive sequentially when its iteration budget is controlled. Despite this similarity in task performance, the two methods produce markedly different model updates: ES makes much larger changes and induces broader off-task KL drift, whereas GRPO makes smaller, more localized updates. Strikingly, the ES and GRPO solutions are linearly connected with no loss barrier, even though their update directions are nearly orthogonal. We develop an analytical theory of ES that explains all these phenomena within a unified framework, showing how ES can accumulate large off-task movement on weakly informative directions while still making enough progress on the task to match gradient-based RL in downstream accuracy. These results show that gradient-free and gradient-based fine-tuning can reach similarly accurate yet geometrically distinct solutions, with important consequences for forgetting and knowledge preservation. The source code is publicly available: this https URL.
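As context for what "gradient-free" means here, one ES update with antithetic perturbations (a common variant; a generic sketch, not the paper's exact training recipe or hyperparameters) looks like this:

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.1, lr=0.05, n_pairs=50, rng=None):
    """One Evolution Strategies update with antithetic sampling: the gradient of
    E[reward(theta + sigma*eps)] is estimated from black-box evaluations only."""
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        grad += (reward_fn(theta + sigma * eps) - reward_fn(theta - sigma * eps)) * eps
    grad /= 2.0 * sigma * n_pairs
    return theta + lr * grad                 # ascent on the estimated reward gradient
```

Because every perturbation direction contributes to the update, ES can accumulate movement along weakly informative directions while still making task progress, which is consistent with the larger off-task drift the paper reports.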
[LG-45] Soft MPCritic: Amortized Model Predictive Value Iteration
链接: https://arxiv.org/abs/2604.01477
作者: Thomas Banker,Nathan P. Lawrence,Ali Mesbah
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: submitted to CDC 2026
Abstract:Reinforcement learning (RL) and model predictive control (MPC) offer complementary strengths, yet combining them at scale remains computationally challenging. We propose soft MPCritic, an RL-MPC framework that learns in (soft) value space while using sample-based planning for both online control and value target generation. soft MPCritic instantiates MPC through model predictive path integral control (MPPI) and trains a terminal Q-function with fitted value iteration, aligning the learned value function with the planner and implicitly extending the effective planning horizon. We introduce an amortized warm-start strategy that recycles planned open-loop action sequences from online observations when computing batched MPPI-based value targets. This makes soft MPCritic computationally practical, while preserving solution quality. soft MPCritic plans in a scenario-based fashion with an ensemble of dynamic models trained for next-step prediction accuracy. Together, these ingredients enable soft MPCritic to learn effectively through robust, short-horizon planning on classic and complex control tasks. These results establish soft MPCritic as a practical and scalable blueprint for synthesizing MPC policies in settings where policy extraction and direct, long-horizon planning may fail.
[LG-46] Generative Profiling for Soft Real-Time Systems and its Applications to Resource Allocation
链接: https://arxiv.org/abs/2604.01441
作者: Georgiy A. Bondar,Abigail Eisenklam,Yifan Cai,Robert Gifford,Tushar Sial,Linh Thi Xuan Phan,Abhishek Halder
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Operating Systems (cs.OS); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:
Abstract:Modern real-time systems require accurate characterization of task timing behavior to ensure predictable performance, particularly on complex hardware architectures. Existing methods, such as worst-case execution time analysis, often fail to capture the fine-grained timing behaviors of a task under varying resource contexts (e.g., an allocation of cache, memory bandwidth, and CPU frequency), which is necessary to achieve efficient resource utilization. In this paper, we introduce a novel generative profiling approach that synthesizes context-dependent, fine-grained timing profiles for real-time tasks, including those for unmeasured resource allocations. Our approach leverages a nonparametric, conditional multi-marginal Schrödinger Bridge (MSB) formulation to generate accurate execution profiles for unseen resource contexts, with maximum likelihood guarantees. We demonstrate the efficiency and effectiveness of our approach through real-world benchmarks, and showcase its practical utility in a representative case study of adaptive multicore resource allocation for real-time systems.
[LG-47] Know Your Streams: On the Conceptualization Characterization and Generation of Intentional Event Streams
链接: https://arxiv.org/abs/2604.01440
作者: Andrea Maldonado,Christian Imenkamp,Hendrik Reiter,Thomas Seidl,Wilhelm Hasselbring,Martin Werner,Agnes Koschmider
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
Abstract:The shift toward IoT-enabled, sensor-driven systems has transformed how operational data is generated, favoring continuous, real-time event streams (ES) over static event logs. This evolution presents new challenges for Streaming Process Mining (SPM), which must cope with out-of-order events, concurrent activities, incomplete cases, and concept drifts. Yet, the evaluation of SPM algorithms remains rooted in outdated practices, relying on static logs or artificially streamified data that fail to reflect the complexities of real-world streams. To address this gap, we first perform a comprehensive review of data stream literature to identify stream characteristics currently not reflected in the SPM community. Next, we use this information to extend the conceptual foundation for ES. Finally, we propose Stream of Intent, a prototype generator to produce ES with specific features. Our evaluation shows excellence in producing reproducible, intentional ES for targeted benchmarking and adaptive algorithm development in SPM.
[LG-48] Improving Latent Generalization Using Test-time Compute
链接: https://arxiv.org/abs/2604.01430
作者: Arslan Chaudhry,Sridhar Thiagarajan,Andrew Lampinen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to facilitate deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization capabilities. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or ‘thinking’, specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) to improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failures on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge for which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion, but the generate-and-verify ability of thinking models enables them to get well above chance performance. The brittleness of factual self-verification means thinking models still remain well below the performance of in-context learning for this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.
[LG-49] Causal Optimal Coupling for Gaussian Input-Output Distributional Data
链接: https://arxiv.org/abs/2604.01406
作者: Daran Xu,Amirhossein Taghvaei
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注:
Abstract:We study the problem of identifying an optimal coupling between input-output distributional data generated by a causal dynamical system. The coupling is required to satisfy prescribed marginal distributions and a causality constraint reflecting the temporal structure of the system. We formulate this problem as a Schrödinger Bridge, which seeks the coupling closest - in Kullback-Leibler divergence - to a given prior while enforcing both marginal and causality constraints. For the case of Gaussian marginals and general time-dependent quadratic cost functions, we derive a fully tractable characterization of the Sinkhorn iterations that converges to the optimal solution. Beyond its theoretical contribution, the proposed framework provides a principled foundation for applying causal optimal transport methods to system identification from distributional data.
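The Sinkhorn iterations the paper characterizes in closed form for the Gaussian case have a familiar discrete counterpart; a minimal entropic optimal-transport sketch (parameter names are illustrative, and the causality constraint is not modeled here):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iter=500):
    """Discrete Sinkhorn iterations for entropic optimal transport: alternately
    rescale the Gibbs kernel K = exp(-C/eps) so that the coupling
    P = diag(u) K diag(v) matches the prescribed marginals mu and nu."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)   # fit column marginal
        u = mu / (K @ v)     # fit row marginal
    return u[:, None] * K * v[None, :]
```

Each half-iteration projects onto one marginal constraint; the paper's contribution is that, with Gaussian marginals and quadratic costs, these projections stay within a parametric Gaussian family and can be written out exactly.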
[LG-50] Benchmark Problems and Benchmark Datasets for the evaluation of Machine and Deep Learning methods on Photoplethysmography signals: the D4 report from the QUMPHY project
链接: https://arxiv.org/abs/2604.01398
作者: Urs Hackstein,Jordi Alastruey,Philip Aston,Ciaran Bench,Peter H. Charlton,Loic Coquelin,Nando Hegemann,Vaidotas Marozas,Mohammad Moulaeifard,Manasi Nandi,Andrius Petrenas,Oskar Pfeffer,Mantas Rinkevicius,Andrius Solosenko,Nils Strodthoff,Sara Vardanega
类目: Machine Learning (cs.LG)
*备注: 28 pages
Abstract:This report is part of the Qumphy project (22HLT01 Qumphy) that is funded by the European Union and is dedicated to the development of measures to quantify the uncertainties associated with Machine Learning algorithms applied to medical problems, in particular the analysis and processing of Photoplethysmography (PPG) signals. In this report, a list of six medical problems that are related to PPG signals and serve as Benchmark Problems is given. Suitable Benchmark datasets and their usage are also described.
[LG-51] Residuals-based Offline Reinforcement Learning
链接: https://arxiv.org/abs/2604.01378
作者: Qing Zhu,Xian Yu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Offline reinforcement learning (RL) has received increasing attention for learning policies from previously collected data without interaction with the real environment, which is particularly important in high-stakes applications. While a growing body of work has developed offline RL algorithms, these methods often rely on restrictive assumptions about data coverage and suffer from distribution shift. In this paper, we propose a residuals-based offline RL framework for general state and action spaces. Specifically, we define a residuals-based Bellman optimality operator that explicitly incorporates estimation error in learning transition dynamics into policy optimization by leveraging empirical residuals. We show that this Bellman operator is a contraction mapping and identify conditions under which its fixed point is asymptotically optimal and possesses finite-sample guarantees. We further develop a residuals-based offline deep Q-learning (DQN) algorithm. Using a stochastic CartPole environment, we demonstrate the effectiveness of our residuals-based offline DQN algorithm.
[LG-52] PI-JEPA: Label-Free Surrogate Pretraining for Coupled Multiphysics Simulation via Operator-Split Latent Prediction
链接: https://arxiv.org/abs/2604.01349
作者: Brandon Yee,Pairie Koh
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
*备注:
Abstract:Reservoir simulation workflows face a fundamental data asymmetry: input parameter fields (geostatistical permeability realizations, porosity distributions) are free to generate in arbitrary quantities, yet existing neural operator surrogates require large corpora of expensive labeled simulation trajectories and cannot exploit this unlabeled structure. We introduce PI-JEPA (Physics-Informed Joint Embedding Predictive Architecture), a surrogate pretraining framework that trains without any completed PDE solves, using masked latent prediction on unlabeled parameter fields under per-sub-operator PDE residual regularization. The predictor bank is structurally aligned with the Lie–Trotter operator-splitting decomposition of the governing equations, dedicating a separate physics-constrained latent module to each sub-process (pressure, saturation transport, reaction), enabling fine-tuning with as few as 100 labeled simulation runs. On single-phase Darcy flow, PI-JEPA achieves 1.9x lower error than FNO and 2.4x lower error than DeepONet at N_\ell=100, with 24% improvement over supervised-only training at N_\ell=500, demonstrating that label-free surrogate pretraining substantially reduces the simulation budget required for multiphysics surrogate deployment.
[LG-53] Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning
链接: https://arxiv.org/abs/2604.01345
作者: Vikram Krishnamurthy,Luke Snow
类目: Machine Learning (cs.LG)
*备注:
Abstract:Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses; adaptive IRL aims to do so by passively observing the learner's gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.
[LG-54] Massively Parallel Exact Inference for Hawkes Processes
链接: https://arxiv.org/abs/2604.01342
作者: Ahmer Raza,Hudson Smith
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multivariate Hawkes processes are a widely used class of self-exciting point processes, but maximum likelihood estimation naively scales as O(N^2) in the number of events. The canonical linear exponential Hawkes process admits a faster O(N) recurrence, but prior work evaluates this recurrence sequentially, without exploiting parallelization on modern GPUs. We show that the Hawkes process intensity can be expressed as a product of sparse transition matrices admitting a linear-time associative multiply, enabling computation via a parallel prefix scan. This yields a simple yet massively parallelizable algorithm for maximum likelihood estimation of linear exponential Hawkes processes. Our method reduces the computational complexity to approximately O(N/P) with P parallel processors, and naturally yields a batching scheme to maintain constant memory usage, avoiding GPU memory constraints. Importantly, it computes the exact likelihood without any additional assumptions or approximations, preserving the simplicity and interpretability of the model. We demonstrate orders-of-magnitude speedups on simulated and real datasets, scaling to thousands of nodes and tens of millions of events, substantially beyond scales reported in prior work. We provide an open-source PyTorch library implementing our optimizations.
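The O(N) recurrence referred to, before its parallel-scan reformulation, can be written in a few lines for the univariate exponential kernel phi(t) = alpha * exp(-beta * t) (a standard identity; function and variable names here are illustrative):

```python
import numpy as np

def exp_hawkes_intensity(times, mu, alpha, beta):
    """O(N) evaluation of the exponential-kernel Hawkes intensity at each event,
    via the recurrence R_i = exp(-beta*(t_i - t_{i-1})) * (R_{i-1} + 1),
    so that lambda(t_i) = mu + alpha * R_i (sum over strictly earlier events)."""
    R = 0.0
    lam = np.empty(len(times))
    for i, t in enumerate(times):
        if i > 0:
            R = np.exp(-beta * (t - times[i - 1])) * (R + 1.0)
        lam[i] = mu + alpha * R
    return lam
```

The paper's observation is that this sequential recurrence is an associative operation on small transition matrices, so it can be evaluated with a parallel prefix scan in roughly O(N/P) time on P processors instead of the loop above.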
[LG-55] Bias Inheritance in Neural-Symbolic Discovery of Constitutive Closures Under Function-Class Mismatch
链接: https://arxiv.org/abs/2604.01335
作者: Hanbing Liang,Ze Tao,Fujun Liu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:We investigate the data-driven discovery of constitutive closures in nonlinear reaction-diffusion systems with known governing PDE structures. Our objective is to robustly recover diffusion and reaction laws from spatiotemporal observations while avoiding the common pitfall where low residuals or short-horizon predictions are conflated with physical recovery. We propose a three-stage neural-symbolic framework: (1) learning numerical surrogates under physical constraints using a noise-robust weak-form-driven objective; (2) compressing these surrogates into restricted interpretable symbolic families (e.g., polynomial, rational, and saturation forms); and (3) validating the symbolic closures through explicit forward re-simulation on unseen initial conditions. Extensive numerical experiments reveal two distinct regimes. Under matched-library settings, weak polynomial baselines behave as correctly specified reference estimators, showing that neural surrogates do not uniformly outperform classical bases. Conversely, under function-class mismatch, neural surrogates provide necessary flexibility and can be compressed into compact symbolic laws with minimal rollout degradation. However, we identify a critical “bias inheritance” mechanism where symbolic compression does not automatically repair constitutive bias. Across various observation regimes, the true error of the symbolic closure closely tracks that of the neural surrogate, yielding a bias inheritance ratio near one. These findings demonstrate that the primary bottleneck in neural-symbolic modeling lies in the initial numerical inverse problem rather than the subsequent symbolic compression. We underscore that constitutive claims must be rigorously supported by forward validation rather than residual minimization alone.
[LG-56] Model Merging via Data-Free Covariance Estimation
链接: https://arxiv.org/abs/2604.01329
作者: Marawan Gamal Abdel Hameed,Derek Tam,Pascal Jr Tikeng Notsawo,Colin Raffel,Guillaume Rabusseau
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model merging provides a way of cheaply combining individual models to produce a model that inherits each individual's capabilities. While some merging methods can approach the performance of multitask training, they are often heuristically motivated and lack theoretical justification. A principled alternative is to pose model merging as a layer-wise optimization problem that directly minimizes interference between tasks. However, this formulation requires estimating per-layer covariance matrices from data, which may not be available when performing merging. In contrast, many of the heuristically-motivated methods do not require auxiliary data, making them practically advantageous. In this work, we revisit the interference minimization framework and show that, under certain conditions, covariance matrices can be estimated directly from difference matrices, eliminating the need for data while also reducing computational costs. We validate our approach across vision and language benchmarks on models ranging from 86M parameters to 7B parameters, outperforming previous data-free state-of-the-art merging methods.
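The difference matrices the method estimates covariances from are the per-task "task vectors" (fine-tuned weights minus base weights). The simplest data-free merge built on them, task arithmetic, is a useful reference point (a baseline sketch, not the paper's covariance-based method):

```python
import numpy as np

def merge_task_vectors(base, finetuned, scale=1.0):
    """Task-arithmetic merge: average the per-task difference matrices
    (finetuned minus base) and add the scaled mean back onto the base weights."""
    diffs = [np.asarray(ft, float) - np.asarray(base, float) for ft in finetuned]
    return np.asarray(base, float) + scale * np.mean(diffs, axis=0)
```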
[LG-57] Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial
链接: https://arxiv.org/abs/2604.01328
作者: Zhongwei Yu,Rasul Tutunov,Alexandre Max Maraval,Zikai Xie,Zhenzhi Tan,Jiankang Wang,Zijing Li,Liangliang Xu,Qi Yang,Jun Jiang,Sanzhong Luo,Zhenxiao Guo,Haitham Bou-Ammar,Jun Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Traditional scientific discovery relies on an iterative hypothesise-experiment-refine cycle that has driven progress for centuries, but its intuitive, ad-hoc implementation often wastes resources, yields inefficient designs, and misses critical insights. This tutorial presents Bayesian Optimisation (BO), a principled probability-driven framework that formalises and automates this core scientific cycle. BO uses surrogate models (e.g., Gaussian processes) to model empirical observations as evolving hypotheses, and acquisition functions to guide experiment selection, balancing exploitation of known knowledge and exploration of uncharted domains to eliminate guesswork and manual trial-and-error. We first frame scientific discovery as an optimisation problem, then unpack BO’s core components, end-to-end workflows, and real-world efficacy via case studies in catalysis, materials science, organic synthesis, and molecule discovery. We also cover critical technical extensions for scientific applications, including batched experimentation, heteroscedasticity, contextual optimisation, and human-in-the-loop integration. Tailored for a broad audience, this tutorial bridges AI advances in BO with practical natural science applications, offering tiered content to empower cross-disciplinary researchers to design more efficient experiments and accelerate principled scientific discovery.
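The hypothesise-experiment-refine loop the tutorial formalizes reduces to a few lines with a Gaussian-process surrogate and an expected-improvement acquisition. A self-contained 1D minimization sketch (the kernel lengthscale, grid acquisition, and budgets are illustrative choices, not the tutorial's prescriptions):

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel between 1D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_tr, y_tr, x_te, jitter=1e-6):
    """GP posterior mean and standard deviation on test points (zero prior mean)."""
    K = rbf(x_tr, x_tr) + jitter * np.eye(len(x_tr))
    K_s = rbf(x_tr, x_te)
    sol = np.linalg.solve(K, K_s)
    mean = sol.T @ y_tr
    var = np.clip(1.0 - np.sum(K_s * sol, axis=0), 1e-12, None)
    return mean, np.sqrt(var)

def expected_improvement(mean, std, best):
    """EI for minimization under a Gaussian posterior."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.array([erf(v / sqrt(2)) for v in z]))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best - mean) * cdf + std * pdf

def bayes_opt(f, n_init=4, n_iter=15, seed=0):
    """Minimal BO loop on [0, 1]: fit GP, maximize EI on a grid, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    X = list(rng.uniform(0.0, 1.0, n_init))
    y = [f(x) for x in X]
    grid = np.linspace(0.0, 1.0, 200)
    for _ in range(n_iter):
        mean, std = gp_posterior(np.array(X), np.array(y), grid)
        x_next = float(grid[np.argmax(expected_improvement(mean, std, min(y)))])
        X.append(x_next)
        y.append(f(x_next))
    i = int(np.argmin(y))
    return X[i], y[i]
```

The acquisition step is where the exploitation-exploration balance described in the abstract lives: EI is large both where the posterior mean is low and where the posterior uncertainty is high.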
[LG-58] Macroscopic transport patterns of UAV traffic in 3D anisotropic wind fields: A constraint-preserving hybrid PINN-FVM approach
链接: https://arxiv.org/abs/2604.01327
作者: Hanbing Liang,Fujun Liu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Macroscopic unmanned aerial vehicle (UAV) traffic organization in three-dimensional airspace faces significant challenges from static wind fields and complex obstacles. A critical difficulty lies in simultaneously capturing the strong anisotropy induced by wind while strictly preserving transport consistency and boundary semantics, which are often compromised in standard physics-informed learning approaches. To resolve this, we propose a constraint-preserving hybrid solver that integrates a physics-informed neural network for the anisotropic Eikonal value problem with a conservative finite-volume method for steady density transport. These components are coupled through an outer Picard iteration with under-relaxation, where the target condition is hard-encoded and strictly conservative no-flux boundaries are enforced during the transport step. We evaluate the framework on reproducible homing and point-to-point scenarios, effectively capturing value slices, induced-motion patterns, and steady density structures such as bands and bottlenecks. Ultimately, our perspective emphasizes the value of a reproducible computational framework supported by transparent empirical diagnostics to enable the traceable assessment of macroscopic traffic phenomena.
[LG-59] Detecting Complex Money Laundering Patterns with Incremental and Distributed Graph Modeling
链接: https://arxiv.org/abs/2604.01315
作者: Haseeb Tariq,Alen Kaja,Marwan Hassani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Money launderers take advantage of limitations in existing detection approaches by hiding their financial footprints in a deceitful manner. They manage this by replicating transaction patterns that the monitoring systems cannot easily distinguish. As a result, criminally gained assets are pushed into legitimate financial channels without drawing attention. Algorithms developed to monitor money flows often struggle with scale and complexity. The difficulty of identifying such activities is further intensified by the (persistent) inability of current solutions to control the excessive number of false positive signals produced by rigid, risk-based rules systems. We propose a framework called ReDiRect (REduce, DIstribute, and RECTify), specifically designed to overcome these challenges. The primary contribution of our work is a novel framing of this problem in an unsupervised setting; where a large transaction graph is fuzzily partitioned into smaller, manageable components to enable fast processing in a distributed manner. In addition, we define a refined evaluation metric that better captures the effectiveness of exposed money laundering patterns. Through comprehensive experimentation, we demonstrate that our framework achieves superior performance compared to existing and state-of-the-art techniques, particularly in terms of efficiency and real-world applicability. For validation, we used the real (open source) Libra dataset and the recently released synthetic datasets by IBM Watson. Our code and datasets are available at this https URL.
[LG-60] JetPrism: diagnosing convergence for generative simulation and inverse problems in nuclear physics
链接: https://arxiv.org/abs/2604.01313
作者: Zeyu Xia,Tyler Kim,Trevor Reed,Judy Fox,Geoffrey Fox,Adam Szczepaniak
类目: Machine Learning (cs.LG); Nuclear Experiment (nucl-ex); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)
*备注: Submitted to AI4EIC 2025. 21 pages, 17 figures
Abstract:High-fidelity Monte Carlo simulations and complex inverse problems, such as mapping smeared experimental observations to ground-truth states, are computationally intensive yet essential for robust data analysis. Conditional Flow Matching (CFM) offers a mathematically robust approach to accelerating these tasks, but we demonstrate its standard training loss is fundamentally misleading. In rigorous physics applications, CFM loss plateaus prematurely, serving as an unreliable indicator of true convergence and physical fidelity. To investigate this disconnect, we designed JetPrism, a configurable CFM framework acting as an efficient generative surrogate for evaluating unconditional generation and conditional detector unfolding. Using synthetic stress tests and a Jefferson Lab kinematic dataset ( \gamma p \to \rho^0 p \to \pi^+\pi^- p ) relevant to the forthcoming Electron-Ion Collider (EIC), we establish that physics-informed metrics continue to improve significantly long after the standard loss converges. Consequently, we propose a multi-metric evaluation protocol incorporating marginal and pairwise \chi^2 statistics, W_1 distances, correlation matrix distances ( D_{\mathrm{corr}} ), and nearest-neighbor distance ratios ( R_{\mathrm{NN}} ). By demonstrating that domain-specific evaluations must supersede generic loss metrics, this work establishes JetPrism as a dependable generative surrogate that ensures precise statistical agreement with ground-truth data without memorizing the training set. While demonstrated in nuclear physics, this diagnostic framework is readily extensible to parameter generation and complex inverse problems across broad domains. Potential applications span medical imaging, astrophysics, semiconductor discovery, and quantitative finance, where high-fidelity simulation, rigorous inversion, and generative reliability are critical.
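Among the proposed physics-informed metrics, the 1D Wasserstein-1 distance between generated and ground-truth marginals has a one-line empirical estimator for equal-size samples (a standard sketch; the paper's exact estimator may differ):

```python
import numpy as np

def wasserstein1_1d(x, y):
    """Empirical W1 between two equal-size 1D samples: in one dimension the
    optimal coupling matches sorted order, so W1 is the mean absolute
    difference of the sorted values."""
    x = np.sort(np.asarray(x, float))
    y = np.sort(np.asarray(y, float))
    assert len(x) == len(y), "equal sample sizes assumed in this sketch"
    return float(np.mean(np.abs(x - y)))
```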
[LG-61] An Online Machine Learning Multi-resolution Optimization Framework for Energy System Design Limit of Performance Analysis
链接: https://arxiv.org/abs/2604.01308
作者: Oluwamayowa O. Amusat,Luka Grbcic,Remi Patureau,M. Jibran S. Zuberi,Dan Gunter,Michael Wetter
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Optimization and Control (math.OC)
*备注:
Abstract:Designing reliable integrated energy systems for industrial processes requires optimization and verification models across multiple fidelities, from architecture-level sizing to high-fidelity dynamic operation. However, model mismatch across fidelities obscures the sources of performance loss and complicates the quantification of architecture-to-operation performance gaps. We propose an online, machine-learning-accelerated multi-resolution optimization framework that estimates an architecture-specific upper bound on achievable performance while minimizing expensive high-fidelity model evaluations. We demonstrate the approach on a pilot energy system supplying a 1 MW industrial heat load. First, we solve a multi-objective architecture optimization to select the system configuration and component capacities. We then develop a machine learning (ML)-accelerated multi-resolution, receding-horizon optimal control strategy that approaches the achievable-performance bound for the specified architecture, given the additional controls and dynamics not captured by the architectural optimization model. The ML-guided controller adaptively schedules the optimization resolution based on predictive uncertainty and warm-starts high-fidelity solves using elite low-fidelity solutions. Our results on the pilot case study show that the proposed multi-resolution strategy reduces the architecture-to-operation performance gap by up to 42% relative to a rule-based controller, while reducing required high-fidelity model evaluations by 34% relative to the same multi-fidelity approach without ML guidance, enabling faster and more reliable design verification. Together, these gains make high-fidelity verification tractable, providing a practical upper bound on achievable operational performance.
[LG-62] UQ-SHRED: uncertainty quantification of shallow recurrent decoder networks for sparse sensing via engression
链接: https://arxiv.org/abs/2604.01305
作者: Mars Liyao Gao,Yuxuan Bao,Amy S. Rude,Xinwei Shen,J. Nathan Kutz
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Reconstructing high-dimensional spatiotemporal fields from sparse sensor measurements is critical in a wide range of scientific applications. The SHallow REcurrent Decoder (SHRED) architecture is a recent state-of-the-art architecture that reconstructs high-quality spatial domain from hyper-sparse sensor measurement streams. An important limitation of SHRED is that in complex, data-scarce, high-frequency, or stochastic systems, portions of the spatiotemporal field must be modeled with valid uncertainty estimation. We introduce UQ-SHRED, a distributional learning framework for sparse sensing problems that provides uncertainty quantification through a neural network-based distributional regression called engression. UQ-SHRED models the uncertainty by learning the predictive distribution of the spatial state conditioned on the sensor history. By injecting stochastic noise into sensor inputs and training with an energy score loss, UQ-SHRED produces predictive distributions with minimal computational overhead, requiring only noise injection at the input and resampling through a single architecture without retraining or additional network structures. On complicated synthetic and real-life datasets including turbulent flow, atmospheric dynamics, neuroscience and astrophysics, UQ-SHRED provides a distributional approximation with well-calibrated confidence intervals. We further conduct ablation studies to understand how each model setting affects the quality of the UQ-SHRED performance, and its validity on uncertainty quantification over a set of different experimental setups.
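The noise-injection-and-resampling recipe described in the abstract can be sketched with an empirical energy score; the `decoder` interface, noise level, and draw count below are illustrative assumptions, not the UQ-SHRED implementation:

```python
import numpy as np

def energy_score(samples, y):
    """Empirical energy score of an ensemble of predictive samples.

    samples: (m, d) array of m draws from the predictive distribution
    y:       (d,) observed target
    Score = E||X - y|| - 0.5 * E||X - X'||; it is minimized in expectation
    when the draws come from the true conditional distribution.
    """
    term1 = np.mean(np.linalg.norm(samples - y, axis=1))
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = 0.5 * np.mean(np.linalg.norm(diffs, axis=2))
    return term1 - term2

def predict_with_noise(decoder, sensor_history, n_draws=64, noise_std=0.1, rng=None):
    """Resample a predictive distribution by perturbing the sensor input.
    `decoder` is a hypothetical map from a sensor-history vector to a state."""
    rng = np.random.default_rng(rng)
    draws = [decoder(sensor_history
                     + noise_std * rng.standard_normal(sensor_history.shape))
             for _ in range(n_draws)]
    return np.stack(draws)
```

Training would minimize `energy_score` over such resampled draws; at test time the spread of the `predict_with_noise` ensemble yields the confidence intervals, with no retraining or extra network structure.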
[LG-63] Forecasting Supply Chain Disruptions with Foresight Learning
链接: https://arxiv.org/abs/2604.01298
作者: Benjamin Turtel,Paul Wilczewski,Kris Skotheim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Anticipating supply chain disruptions before they materialize is a core challenge for firms and policymakers alike. A key difficulty is learning to reason reliably about infrequent, high-impact events from noisy and unstructured inputs - a setting where general-purpose models struggle without task-specific adaptation. We introduce an end-to-end framework that trains LLMs to produce calibrated probabilistic forecasts using realized disruption outcomes as supervision. The resulting model substantially outperforms strong baselines - including GPT-5 - on accuracy, calibration, and precision. We also show that training induces more structured and reliable probabilistic reasoning without explicit prompting. These results suggest a general pathway for training domain-specific forecasting models that produce decision-ready signals. To support transparency we open-source the evaluation dataset used in this study. Dataset: this https URL
[LG-64] Topological Effects in Neural Network Field Theory
链接: https://arxiv.org/abs/2604.02313
作者: Christian Ferko,James Halverson,Vishnu Jejjala,Brandon Robinson
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注: 55 pages, 8 figures
Abstract:Neural network field theory formulates field theory as a statistical ensemble of fields defined by a network architecture and a density on its parameters. We extend the construction to topological settings via the inclusion of discrete parameters that label the topological quantum number. We recover the Berezinskii–Kosterlitz–Thouless transition, including the spin-wave critical line and the proliferation of vortices at high temperatures. We also verify the T-duality of the bosonic string, showing invariance under the exchange of momentum and winding on S^1 , the transformation of the sigma model couplings according to the Buscher rules on constant toroidal backgrounds, the enhancement of the current algebra at self-dual radius, and non-geometric T-fold transition functions.
[LG-65] BVFLMSP: Bayesian Vertical Federated Learning for Multimodal Survival with Privacy
链接: https://arxiv.org/abs/2604.02248
作者: Abhilash Kar,Basisth Saha,Tanmay Sen,Biswabrata Pradhan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Multimodal time-to-event prediction often requires integrating sensitive data distributed across multiple parties, making centralized model training impractical due to privacy constraints. At the same time, most existing multimodal survival models produce single deterministic predictions without indicating how confident the model is in its estimates, which can limit their reliability in real-world decision making. To address these challenges, we propose BVFLMSP, a Bayesian Vertical Federated Learning (VFL) framework for multimodal time-to-event analysis based on a Split Neural Network architecture. In BVFLMSP, each client independently models a specific data modality using a Bayesian neural network, while a central server aggregates intermediate representations to perform survival risk prediction. To enhance privacy, we integrate differential privacy mechanisms by perturbing client side representations before transmission, providing formal privacy guarantees against information leakage during federated training. We first evaluate our Bayesian multimodal survival model against widely used single modality survival baselines and the centralized multimodal baseline MultiSurv. Across multimodal settings, the proposed method shows consistent improvements in discrimination performance, with up to 0.02 higher C-index compared to MultiSurv. We then compare federated and centralized learning under varying privacy budgets across different modality combinations, highlighting the tradeoff between predictive performance and privacy. Experimental results show that BVFLMSP effectively includes multimodal data, improves survival prediction over existing baselines, and remains robust under strict privacy constraints while providing uncertainty estimates. 
[LG-66] Gradient estimators for parameter inference in discrete stochastic kinetic models
链接: https://arxiv.org/abs/2604.02121
作者: Ludwig Burger,Annalena Kofler,Lukas Heinrich,Ulrich Gerland
类目: Computational Physics (physics.comp-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
*备注: 13 pages, 6 figures
Abstract:Stochastic kinetic models are ubiquitous in physics, yet inferring their parameters from experimental data remains challenging. In deterministic models, parameter inference often relies on gradients, as they can be obtained efficiently through automatic differentiation. However, these tools cannot be directly applied to stochastic simulation algorithms (SSA) such as the Gillespie algorithm, since sampling from a discrete set of reactions introduces non-differentiable operations. In this work, we adopt three gradient estimators from machine learning for the Gillespie SSA: the Gumbel-Softmax Straight-Through (GS-ST) estimator, the Score Function estimator, and the Alternative Path estimator. We compare the properties of all estimators in two representative systems exhibiting relaxation or oscillatory dynamics, where the latter requires gradient estimation of time-dependent objective functions. We find that the GS-ST estimator mostly yields well-behaved gradient estimates, but exhibits diverging variance in challenging parameter regimes, resulting in unsuccessful parameter inference. In these cases, the other estimators provide more robust, lower variance gradients. Our results demonstrate that gradient-based parameter inference can be integrated effectively with the Gillespie SSA, with different estimators offering complementary advantages.
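The Gumbel-Softmax Straight-Through (GS-ST) estimator mentioned in the abstract can be sketched in a forward pass; the reaction propensities below are made-up values, and in an actual autodiff framework the straight-through trick would write the output as `hard + (soft - stop_gradient(soft))` so gradients flow through the relaxed sample:

```python
import numpy as np

def gumbel_softmax_st(logits, tau=0.5, rng=None):
    """Forward pass of the Gumbel-Softmax Straight-Through estimator.

    Returns (hard_onehot, soft_probs): the hard one-hot is the discrete
    reaction chosen in the SSA step; the soft probabilities are the
    differentiable relaxation used in the backward pass.
    """
    rng = np.random.default_rng(rng)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    z = (logits + g) / tau
    z = z - z.max()                       # numerical stability
    soft = np.exp(z) / np.exp(z).sum()    # relaxed (differentiable) sample
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0           # discrete reaction index
    return hard, soft

# Example: three reactions with propensities (a1, a2, a3) -> logits
propensities = np.array([2.0, 0.5, 1.0])
hard, soft = gumbel_softmax_st(np.log(propensities), tau=0.5, rng=0)
```

Lowering `tau` makes the soft sample closer to one-hot at the cost of higher gradient variance, which is the trade-off behind the diverging-variance regimes the abstract reports.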
[LG-67] Reinforcement Learning for Speculative Trading under Exploratory Framework
链接: https://arxiv.org/abs/2604.02035
作者: Yun Zhao,Alex S.L. Tse,Harry Zheng
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
*备注: 37 pages, 14 figures
Abstract:We study a speculative trading problem within the exploratory reinforcement learning (RL) framework of Wang et al. [2020]. The problem is formulated as a sequential optimal stopping problem over entry and exit times under general utility function and price process. We first consider a relaxed version of the problem in which the stopping times are modeled by the jump times of Cox processes driven by bounded, non-randomized intensity controls. Under the exploratory formulation, the agent’s randomized control is characterized via the probability measure over the jump intensities, and their objective function is regularized by Shannon’s differential entropy. This yields a system of the exploratory HJB equations and Gibbs distributions in closed-form as the optimal policy. Error estimates and convergence of the RL objective to the value function of the original problem are established. Finally, an RL algorithm is designed, and its implementation is showcased in a pairs-trading application.
[LG-68] Demographic Parity Tails for Regression
链接: https://arxiv.org/abs/2604.02017
作者: Naht Sinh Le(LAMA),Christophe Denis(SAMM),Mohamed Hebiri(LAMA)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Demographic parity (DP) is a widely studied fairness criterion in regression, enforcing independence between the predictions and sensitive attributes. However, constraining the entire distribution can degrade predictive accuracy and may be unnecessary for many applications, where fairness concerns are localized to specific regions of the distribution. To overcome this issue, we propose a new framework for regression under DP that focuses on the tails of the target distribution across sensitive groups. Our methodology builds on optimal transport theory. By enforcing fairness constraints only over targeted regions of the distribution, our approach enables more nuanced and context-sensitive interventions. Leveraging recent advances, we develop an interpretable and flexible algorithm that exploits the geometric structure of optimal transport. We provide theoretical guarantees, including risk bounds and fairness properties, and validate the method through experiments in regression settings.
[LG-69] Homogenized Transformers
链接: https://arxiv.org/abs/2604.01978
作者: Hugo Koubbi,Borjan Geshkovski,Philippe Rigollet
类目: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker–Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
[LG-70] A Novel Theoretical Analysis for Clustering Heteroscedastic Gaussian Data without Knowledge of the Number of Clusters
链接: https://arxiv.org/abs/2604.01943
作者: Dominique Pastor,Elsa Dupraz,Ismail Hbilou,Guillaume Ansel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 76 pages, submitted to JMLR
Abstract:This paper addresses the problem of clustering measurement vectors that are heteroscedastic in that they can have different covariance matrices. From the assumption that the measurement vectors within a given cluster are Gaussian distributed with possibly different and unknown covariance matrices around the cluster centroid, we introduce a novel cost function to estimate the centroids. The zeros of the gradient of this cost function turn out to be the fixed-points of a certain function. As such, the approach generalizes the methodology employed to derive the existing Mean-Shift algorithm. But as a main and novel theoretical result compared to Mean-Shift, this paper shows that the sole fixed-points of the identified function tend to be the cluster centroids if both the number of measurements per cluster and the distances between centroids are large enough. As a second contribution, this paper introduces the Wald kernel for clustering. This kernel is defined as the p-value of the Wald hypothesis test for testing the mean of a Gaussian. As such, the Wald kernel measures the plausibility that a measurement vector belongs to a given cluster and it scales better with the dimension of the measurement vectors than the usual Gaussian kernel. Finally, the proposed theoretical framework allows us to derive a new clustering algorithm called CENTRE-X that works by estimating the fixed-points of the identified function. Like Mean-Shift, CENTRE-X requires no prior knowledge of the number of clusters. It relies on a Wald hypothesis test to significantly reduce the number of fixed points to calculate compared to the Mean-Shift algorithm, thus resulting in a clear gain in complexity. Simulation results on synthetic and real data sets show that CENTRE-X has comparable or better performance than standard clustering algorithms K-means and Mean-Shift, even when the covariance matrices are not perfectly known.
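For context, the classical Mean-Shift fixed-point iteration that the abstract generalizes looks like the following (a standard Gaussian-kernel version; CENTRE-X's Wald-kernel variant is not reproduced here):

```python
import numpy as np

def mean_shift_mode(x0, data, bandwidth=1.0, tol=1e-6, max_iter=500):
    """Classical Mean-Shift fixed-point iteration with a Gaussian kernel.

    Iterates x <- sum_i w_i(x) * data_i / sum_i w_i(x), with weights
    w_i proportional to exp(-||x - data_i||^2 / (2 h^2)). The fixed points
    are the modes of the kernel density estimate; the paper's CENTRE-X
    replaces this kernel by a Wald-test p-value.
    """
    x = np.asarray(x0, dtype=float)
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        d2 = np.sum((data - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

Each cluster centroid attracts nearby starting points, which is why the number of clusters never needs to be specified in advance.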
[LG-71] Learning in Prophet Inequalities with Noisy Observations ICLR2026
链接: https://arxiv.org/abs/2604.01789
作者: Jung-hun Kim,Vianney Perchet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: ICLR 2026
Abstract:We study the prophet inequality, a fundamental problem in online decision-making and optimal stopping, in a practical setting where rewards are observed only through noisy realizations and reward distributions are unknown. At each stage, the decision-maker receives a noisy reward whose true value follows a linear model with an unknown latent parameter, and observes a feature vector drawn from a distribution. To address this challenge, we propose algorithms that integrate learning and decision-making via lower-confidence-bound (LCB) thresholding. In the i.i.d. setting, we establish that both an Explore-then-Decide strategy and an \varepsilon -Greedy variant achieve the sharp competitive ratio of 1 - 1/e , under a mild condition on the optimal value. For non-identical distributions, we show that a competitive ratio of 1/2 can be guaranteed against a relaxed benchmark. Moreover, with limited window access to past rewards, the tight ratio of 1/2 against the optimal benchmark is achieved.
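A toy schematic of the Explore-then-Decide idea with an LCB-style stopping rule; the threshold and confidence width below are illustrative choices, not the paper's exact construction:

```python
import numpy as np

def explore_then_decide(rewards, explore_frac=0.3, delta=0.1):
    """Toy Explore-then-Decide stopping rule with an LCB threshold.

    Phase 1 uses the first rounds only to estimate a benchmark threshold
    and a noise scale; phase 2 stops at the first noisy reward whose
    lower confidence bound clears the threshold, falling back to the
    last reward if none qualifies.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.size
    k = max(1, int(explore_frac * n))
    explore = rewards[:k]
    sigma_hat = explore.std(ddof=1) if k > 1 else 0.0
    threshold = explore.max()                        # benchmark from exploration
    width = sigma_hat * np.sqrt(2 * np.log(1 / delta) / k)
    for r in rewards[k:]:
        if r - width >= threshold:                   # LCB exceeds threshold
            return r
    return rewards[-1]
```

The point of the LCB correction is that a noisy reward must clear the threshold by a margin before the decision-maker commits, which guards against stopping on an overestimated observation.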
[LG-72] Random Coordinate Descent on the Wasserstein Space of Probability Measures
链接: https://arxiv.org/abs/2604.01606
作者: Yewei Xu,Qin Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Optimization over the space of probability measures endowed with the Wasserstein-2 geometry is central to modern machine learning and mean-field modeling. However, traditional methods relying on full Wasserstein gradients often suffer from high computational overhead in high-dimensional or ill-conditioned settings. We propose a randomized coordinate descent framework specifically designed for the Wasserstein manifold, introducing both Random Wasserstein Coordinate Descent (RWCD) and Random Wasserstein Coordinate Proximal-Gradient (RWCP) for composite objectives. By exploiting coordinate-wise structures, our methods adapt to anisotropic objective landscapes where full-gradient approaches typically struggle. We provide a rigorous convergence analysis across various landscape geometries, establishing guarantees under non-convex, Polyak-Łojasiewicz, and geodesically convex conditions. Our theoretical results mirror the classic convergence properties found in Euclidean space, revealing a compelling symmetry between coordinate descent on vectors and on probability measures. The developed techniques are inherently adaptive to the Wasserstein geometry and offer a robust analytical template that can be extended to other optimization solvers within the space of measures. Numerical experiments on ill-conditioned energies demonstrate that our framework offers significant speedups over conventional full-gradient methods.
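A particle-based sketch of coordinate descent over an empirical measure: each step picks one random coordinate and moves all particles along the potential gradient in that coordinate only. The quadratic potential below is an illustrative linear-energy case; the paper treats general composite objectives on Wasserstein space:

```python
import numpy as np

def rwcd(particles, grad_V, n_iters=1000, step=0.1, rng=None):
    """Random coordinate descent on the empirical measure of `particles`.

    Minimizes the energy E[V(X)] by updating a single randomly chosen
    coordinate of every particle per iteration, which is the coordinate-wise
    analogue of a full Wasserstein gradient step.
    """
    x = np.array(particles, dtype=float)   # (n_particles, d)
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    for _ in range(n_iters):
        j = rng.integers(d)                # pick a random coordinate
        g = grad_V(x)                      # (n_particles, d) gradient field
        x[:, j] -= step * g[:, j]          # move particles in coordinate j only
    return x

# Ill-conditioned quadratic V(x) = 0.5 * sum_k a_k x_k^2 (minimum at 0)
a = np.array([1.0, 100.0])
grad_V = lambda x: x * a
out = rwcd(np.ones((50, 2)), grad_V, n_iters=3000, step=0.009, rng=0)
```

The anisotropy (a = [1, 100]) is exactly the setting where a full-gradient step size is limited by the stiffest coordinate, which is the motivation the abstract gives for coordinate-wise updates.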
[LG-73] A Determinantal Approach to a Sharp ℓ1-ℓ∞-ℓ2 Norm Inequality
链接: https://arxiv.org/abs/2604.01525
作者: Jose Antonio Lara Benitez
类目: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG); History and Overview (math.HO); Optimization and Control (math.OC)
*备注:
Abstract:We give a short linear-algebraic proof of the inequality \|x\|_1 \|x\|_\infty \le \frac{1+\sqrt{p}}{2} \|x\|_2^2 , valid for every x \in \mathbb{R}^p . This inequality relates three fundamental norms on finite-dimensional spaces and has applications in optimization and numerical analysis. Our proof exploits the determinantal structure of a parametrized family of quadratic forms, and we show the constant (1+\sqrt{p})/2 is optimal.
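The inequality is easy to check numerically; a minimal verification over random vectors (the sharpness argument itself is not reproduced here):

```python
import numpy as np

def check_norm_inequality(x):
    """Return (lhs, rhs) of ||x||_1 * ||x||_inf <= (1 + sqrt(p))/2 * ||x||_2^2."""
    x = np.asarray(x, dtype=float)
    p = x.size
    lhs = np.abs(x).sum() * np.abs(x).max()
    rhs = 0.5 * (1.0 + np.sqrt(p)) * np.dot(x, x)
    return lhs, rhs

# Spot-check the inequality over random Gaussian vectors in several dimensions
rng = np.random.default_rng(0)
for p in (1, 2, 10, 100):
    for _ in range(100):
        lhs, rhs = check_norm_inequality(rng.standard_normal(p))
        assert lhs <= rhs + 1e-12
```

For p = 1 the constant is (1+1)/2 = 1 and the inequality becomes the identity |x| * |x| = x^2, so the bound is tight in the simplest case.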
[LG-74] Non-monotonicity in Conformal Risk Control
链接: https://arxiv.org/abs/2604.01502
作者: Tareq Aldirawi,Yun Li,Wenge Guo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 38 pages, 6 figures, 3 tables
Abstract:Conformal risk control (CRC) provides distribution-free guarantees for controlling the expected loss at a user-specified level. Existing theory typically assumes that the loss decreases monotonically with a tuning parameter that governs the size of the prediction set. This assumption is often violated in practice, where losses may behave non-monotonically due to competing objectives such as coverage and efficiency. We study CRC under non-monotone loss functions when the tuning parameter is selected from a finite grid, a common scenario in thresholding or discretized decision rules. Revisiting a known counterexample, we show that the validity of CRC without monotonicity depends on the relationship between the calibration sample size and the grid resolution. In particular, risk control can still be achieved when the calibration sample is sufficiently large relative to the grid. We provide a finite-sample guarantee for bounded losses over a grid of size m , showing that the excess risk above the target level \alpha is of order \sqrt{\log(m)/n} , where n is the calibration sample size. A matching lower bound shows that this rate is minimax optimal. We also derive refined guarantees under additional structural conditions, including Lipschitz continuity and monotonicity, and extend the analysis to settings with distribution shift via importance weighting. Numerical experiments on synthetic multilabel classification and real object detection data illustrate the practical impact of non-monotonicity. Methods that account for finite-sample deviations achieve more stable risk control than approaches based on monotonicity transformations, while maintaining competitive prediction-set sizes.
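The grid-based selection that CRC-style methods use can be sketched as follows; the sqrt(log(m)/n) slack is an illustration of the paper's finite-sample correction, with the constant chosen arbitrarily:

```python
import numpy as np

def select_lambda(losses, grid, alpha, B=1.0, correction=True):
    """Pick a tuning parameter from a finite grid under conformal risk control.

    losses: (n, m) array, losses[i, j] = loss of calibration point i at grid[j]
    Returns the first grid value whose inflated empirical risk
    (n * Rhat + B) / (n + 1), plus an optional sqrt(log(m)/n)-style slack
    for non-monotone losses, stays below alpha; None if no value qualifies.
    """
    n, m = losses.shape
    rhat = losses.mean(axis=0)              # empirical risk per grid point
    bound = (n * rhat + B) / (n + 1)        # standard CRC inflation (B = loss bound)
    if correction:
        bound = bound + np.sqrt(np.log(max(m, 2)) / n)
    ok = np.flatnonzero(bound <= alpha)
    return grid[ok[0]] if ok.size else None
```

The slack term shrinks as the calibration set grows relative to the grid, mirroring the abstract's message that risk control survives non-monotonicity when n is large relative to m.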
[LG-75] The topological gap at criticality: scaling exponent d+η, universality and scope
链接: https://arxiv.org/abs/2604.01484
作者: Matthew Loftus
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures, 4 tables
Abstract:The topological gap \Delta = TP_{H_1}^{\text{real}} - TP_{H_1}^{\text{shuf}} – the excess H_1 total persistence of the majority-spin alpha complex over a density-matched null – encodes critical correlations in spin models. We establish finite-size scaling: \Delta(L,T) = A L^{d+\eta} G_-(L|t/T_c|) , with G_-(x) \sim (1+x/x_0)^{-(1+\beta/\nu)} . For 2D Ising, \alpha = 2.249 \pm 0.038 , matching d+\eta = 9/4 to 0.03\sigma ; the G_- exponent \gamma = 1.089 \pm 0.077 is consistent with 1+\beta/\nu = 9/8 ( \Delta R^2 < 10^{-5} ). For 2D Potts q=3 with L up to 1024, \alpha = 2.272 \pm 0.024 ( 0.2\sigma from d+\eta = 2.267 ), with two-term corrections to scaling ( R^2 = 0.9999 ). The G_- exponent \gamma = 1.114 (68% CI [1.053, 1.173] ) matches 1+\beta/\nu = 17/15 . Scope boundaries: the law fails for 2D Potts q=4 ( \alpha = 2.347 \pm 0.017 , 9.3\sigma from d+\eta = 5/2 ) where logarithmic corrections prevent convergence, and for raw 3D Ising ( 4\sigma from d+\eta ), but density normalization \Delta/|M|^{1/2} recovers \alpha = 3.06 \pm 0.04 ( 0.6\sigma ). The framework fails for first-order, BKT, and percolation transitions. The criterion: \alpha = d+\eta holds when corrections to scaling are algebraic ( \omega > 0 ) but fails when logarithmic ( \omega \to 0 ).
[LG-76] VIANA: character Value-enhanced Intensity Assessment via domain-informed Neural Architecture
链接: https://arxiv.org/abs/2604.01365
作者: Luana P. Queiroz,Icaro S. C. Bernardes,Ana M. Ribeiro,Bernardo M. Aguilera-Mercado,Idelfonso B. R. Nogueira
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:Predicting the perceived intensity of odorants remains a fundamental challenge in sensory science due to the complex, non-linear behavior of their response, as well as the difficulty in correlating molecular structure with human perception. While traditional deep learning models, such as Graph Convolutional Networks (GCNs), excel at capturing molecular topology, they often fail to account for the biological and perceptual context of olfaction. This study introduces VIANA, a novel “tri-pillar” framework that integrates structural graph theory, character value embeddings, and phenomenological behavior. This methodology systematically evaluates knowledge transfer across three distinct domains: molecular structure via GCNs, semantic odor character values via Principal Odor Map (POM) embeddings, and biological dose-response logic via Hill’s law. We demonstrate that knowledge transfer is not inherently positive; rather, a balance must be maintained in the volume of information provided to the model. While raw semantic data led to “information overload” in domain-informed models, applying Principal Component Analysis (PCA) to distill the 95% most impactful semantic variance yielded a superior “signal distillation” effect. Results indicate that the synthesis of these three knowledge transfer pillars significantly outperforms baseline structural models, with VIANA achieving a peak R^2 of 0.996 and a test Mean Squared Error (MSE) of 0.19. In this context, VIANA successfully captures the physical ceiling of saturation, the sensitivity of detection thresholds, and the nuance of odor character value expression, providing a domain grounded simulation of the human olfactory experience. This research provides a robust framework for digital olfaction, effectively bridging the gap between molecular informatics and sensory perception.
[LG-77] Descending into the Modular Bootstrap
链接: https://arxiv.org/abs/2604.01275
作者: Nathan Benjamin,A. Liam Fitzpatrick,Wei Li,Jesse Thaler
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 57 pages, 23 figures, 4 tables; code available at this http URL
Abstract:In this paper, we attempt to explore the landscape of two-dimensional conformal field theories (2d CFTs) by efficiently searching for numerical solutions to the modular bootstrap equation using machine-learning-style optimization. The torus partition function of a 2d CFT is fixed by the spectrum of its primary operators and its chiral algebra, which we take to be the Virasoro algebra with c > 1 . We translate the requirement that this partition function is modular invariant into a loss function, which we then minimize to identify possible primary spectra. Our approach involves two technical innovations that facilitate finding reliable candidate CFTs. The first is a strategy to estimate the uncertainty associated with truncating the spectrum to the lowest dimension operators. The second is the use of a new singular-value-based optimizer (Sven) that is more effective than gradient descent at navigating the hierarchical structure of the loss landscape. We numerically construct candidate truncated CFT partition functions with central charges between 1 and \frac{8}{7} , a range devoid of known examples, and argue that these candidates likely come from a continuous space of modular bootstrap solutions. We also provide evidence for a more stringent constraint on the spectral gap near c = 1 than the existing bound of \Delta_{\rm gap} \le \frac{c}{6} + \frac{1}{3} .
[LG-78] Experimental Design for Missing Physics
链接: https://arxiv.org/abs/2604.01231
作者: Arno Strouwen,Sebastián Micluţa-Câmpeanu
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:
Abstract:For most process systems, knowledge of the model structure is incomplete. This missing physics must then be learned from experimental data. Recently, a combination of universal differential equations and symbolic regression has become a popular tool to discover these missing physics. Universal differential equations employ neural networks to represent missing parts of the model structure, and symbolic regression aims to make these neural networks interpretable. These machine learning techniques require high-quality data to successfully recover the true model structure. To gather such informative data, a sequential experimental design technique is developed which is based on optimally discriminating between the plausible model structures suggested by symbolic regression. This technique is then applied to discovering the missing physics of a bioreactor.
[LG-79] Interpretable Battery Aging without Extra Tests via Neural-Assisted Physics-based Modelling IJCNN
链接: https://arxiv.org/abs/2604.01229
作者: Yuan Qiu,Wei Li,Wei Zhang,Yi Zhou,Fang Liu,Jianbiao Wang,Zhi Wei Seh
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted to IEEE WCCI 2026 (IJCNN Special Session SS30: Computational Intelligence and AI Applications for Sustainable Energy Management in Smart Grids and Energy Communities, 2nd ed.). 8 pages, 4 figures, 2 tables
Abstract:State of health (SoH) is widely used for battery management, but it is a single scalar and offers limited interpretability. Two batteries with similar SoH can exhibit very different degradation behaviors and the lack of interpretability hinders optimal battery operation. In this paper, we propose IBAM for interpretable battery aging modelling with a neural-assisted physics-based framework. IBAM outputs a 2-D aging fingerprint without extra diagnostic tests and uses only routine logs from the battery management system. The fingerprint offers great interpretability by capturing a battery’s curve-wide polarization voltage loss and the tail loss near the end-of-discharge. IBAM first creates a physics-based battery model based on a fractional-order equivalent circuit model, and then extracts per-cycle fingerprints from the model using a two-stage least-squares method. IBAM further anchors fingerprints on the SoH axis with physics-guided regression, where the per-cycle SoH is estimated via a bidirectional gated recurrent unit with customized multi-channel voltage features. Across batteries with short-, medium-, and long-lifespans, IBAM consistently yields the best physics model fidelity at different aging stages, and provides clear interpretations of degradation mechanisms and fingerprint patterns about batteries of different lifespans. The resulting fingerprints support interpretable battery health assessment and can inform battery control choices.