This blog post contains the latest paper listing retrieved from Arxiv.org on 2026-03-20, updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: the daily paper data is fetched from Arxiv.org, with a scheduled automatic update at around 12:30 every morning.

Tip: if a given day's listing has not been updated in time, either Arxiv released no new papers that day or the update script failed; fixes are made the same day whenever possible.

Table of Contents

Overview (2026-03-20)

A total of 681 papers were updated today, including:

  • Natural Language Processing: 93 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 229 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 147 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 196 papers (Machine Learning (cs.LG))
  • Multiagent Systems: 11 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 15 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 29 papers (Human-Computer Interaction (cs.HC))

Multi-Agent Systems

[MA-0] Optimal Path Planning in Hostile Environments ICAPS-2026

[Quick Read]: This paper addresses a new problem in multi-agent path planning: coordinating a group of identical agents moving from a common start to a common target through a graph-based environment with recoverable hazards, while maximizing the number of agents that successfully reach the target. The core challenge is that a hazard is temporarily deactivated after contact (entering a cooldown period) before reactivating, so planning must trade off temporal dynamics against risk avoidance. The key results: optimal plans are proven to require only polynomially many steps, placing the problem in NP; the problem remains NP-hard even on tree topologies, yet admits a polynomial-time algorithm on graphs consisting of vertex-disjoint start-to-target paths, thereby charting the problem's computational-complexity boundary under different constraints.

Link: https://arxiv.org/abs/2603.18958
Authors: Andrzej Kaczmarczyk, Šimon Schierreich, Nicholas Axel Tanujaya, Haifeng Xu
Affiliations: Czech Technical University in Prague; AGH University of Krakow; Bina Nusantara University; University of Chicago
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: Accepted for publication at ICAPS-2026 (25 pages, 6 figures)

Abstract:Coordinating agents through hazardous environments, such as aid-delivering drones navigating conflict zones or field robots traversing deployment areas filled with obstacles, poses fundamental planning challenges. We introduce and analyze the computational complexity of a new multi-agent path planning problem that captures this setting. A group of identical agents begins at a common start location and must navigate a graph-based environment to reach a common target. The graph contains hazards that eliminate agents upon contact but then enter a known cooldown period before reactivating. In this discrete-time, fully-observable, deterministic setting, the planning task is to compute a movement schedule that maximizes the number of agents reaching the target. We first prove that, despite the exponentially large space of feasible plans, optimal plans require only polynomially-many steps, establishing membership in NP. We then show that the problem is NP-hard even when the environment graph is a tree. On the positive side, we present a polynomial-time algorithm for graphs consisting of vertex-disjoint paths from start to target. Our results establish a rich computational landscape for this problem, identifying both intractable and tractable fragments.
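To make the cooldown trade-off concrete, here is a toy sketch (not the paper's algorithm): agents walk single-file down one path containing a single hazard, and sacrificing a lead agent opens a cooldown window that trailing agents can slip through. All dynamics and parameter values here are invented for illustration.

```python
# Toy model: one hazard on a single path. The hazard eliminates the first
# agent that reaches it while active, then deactivates for `cooldown`
# timesteps before re-arming.
def survivors(departures, hazard_pos, cooldown):
    """Count agents that get past the hazard.

    `departures[i]` is the timestep agent i leaves the start; each agent
    advances one edge per step, so it reaches the hazard at
    departure + hazard_pos. Purely illustrative dynamics.
    """
    rearm_time = 0                            # hazard active once t >= rearm_time
    alive = 0
    for t0 in sorted(departures):
        arrive = t0 + hazard_pos
        if arrive >= rearm_time:              # hazard active: agent eliminated
            rearm_time = arrive + cooldown + 1
        else:                                 # hazard cooling down: agent passes
            alive += 1
    return alive
```

Bunching agents immediately behind a sacrificed leader (departures 0, 1, 2, 3 with cooldown 3) lets three of four survive, while spacing them a full cooldown apart loses every agent, which is exactly the kind of scheduling trade-off the paper's complexity results are about.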

[MA-1] I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems

[Quick Read]: This paper asks whether large language models (LLMs), when granted public decision-making authority, will comply with institutional rules, and how to ensure compliant behavior in high-stakes applications. The key position is that institutional integrity should be treated as a mandatory pre-deployment requirement rather than a post-deployment assumption. Multi-agent governance simulations show that governance structure drives corruption-related behavior far more strongly than model identity: LLM agents behave systematically differently under different allocations of authority. Lightweight safeguards do not reliably prevent severe violations, so before real authority is delegated, systems should be stress-tested in governance-constrained environments with auditable logs, enforceable rules, and human oversight to make delegation safe.

Link: https://arxiv.org/abs/2603.18894
Authors: Vedanta S P, Ponnurangam Kumaraguru
Affiliations: IIIT Kottayam; IIIT Hyderabad
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Short Paper, Preprint

Abstract:Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority. We present evidence that integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption. We evaluate multi-agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments. While we advance this position, the core contribution is empirical: among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity, with large differences across regimes and model–governance pairings. Lightweight safeguards can reduce risk in some settings but do not consistently prevent severe failures. These results imply that institutional design is a precondition for safe delegation: before real authority is assigned to LLM agents, systems should undergo stress testing under governance-like constraints with enforceable rules, auditable logs, and human oversight on high-impact actions.

[MA-2] Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably

[Quick Read]: This paper addresses how repeated interactions among AI agents in economic environments can reach stable strategic equilibria (such as Nash equilibria) without applying a uniform post-hoc alignment procedure across independently developed models. Existing approaches rely on post-training to induce equilibrium, which is hard to apply universally to diverse AI systems. The key contribution is a zero-shot mechanism: "reasonably reasoning" agents, which form beliefs about other agents' strategies from past observations and learn to best respond to those beliefs, provably converge along almost every realized play path to behavior close to a Nash equilibrium. The guarantee does not require common-knowledge payoffs: each agent may observe only its own privately realized stochastic payoffs and convergence still holds. Simulations across five game scenarios validate the theory, suggesting that AI agents naturally possess this reasoning structure and can reach stable strategic behavior without global alignment.

Link: https://arxiv.org/abs/2603.18563
Authors: Enoch Hyunwook Kang
Affiliations: University of Washington
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
Comments:

Abstract:AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents' advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that 'reasonably reasoning' agents, i.e., agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner's dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.
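A minimal sketch of the believe-then-best-respond loop the paper formalizes, using classic fictitious play in a repeated Prisoner's Dilemma; the payoff numbers are the textbook ones, not the paper's experimental setup.

```python
from collections import Counter

# Row player's payoffs in a standard Prisoner's Dilemma.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def best_response(opp_counts):
    """Best-respond to the empirical distribution of the opponent's play."""
    total = sum(opp_counts.values())
    def expected(a):
        return sum(PAYOFF[(a, o)] * n / total for o, n in opp_counts.items())
    return max(("C", "D"), key=expected)

def fictitious_play(rounds=50):
    """Both players track beliefs about the other and best-respond each round."""
    beliefs = {1: Counter({"C": 1}), 2: Counter({"C": 1})}  # optimistic priors
    play = None
    for _ in range(rounds):
        a1 = best_response(beliefs[2])   # player 1's belief about player 2
        a2 = best_response(beliefs[1])   # the game is symmetric
        beliefs[2][a2] += 1
        beliefs[1][a1] += 1
        play = (a1, a2)
    return play
```

Fictitious play converges here to mutual defection, the stage game's unique Nash equilibrium; the paper's agents form richer beliefs, but the believe-then-best-respond pattern is the same.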

[MA-3] Interleaved Information Structures in Dynamic Games: A General Framework with Application to the Linear-Quadratic Case

[Quick Read]: This paper addresses the computation of Nash equilibria in noncooperative dynamic games under arbitrary interleaved information structures. Existing work focuses on the two canonical structures, feedback and open-loop, which cannot capture realistic settings in which each agent observes only a subset of the other agents at every timestep. The solution has two key parts: first, deterministic dynamic games with arbitrary interleaved information structures are modeled as Mathematical Program Networks (MPNs), whose network topology explicitly encodes the informational dependencies between agents; second, for linear-quadratic (LQ) dynamic games, the MPN formulation is leveraged to derive Riccati-like recursions that systematically characterize the existence and computation of Nash equilibria.

Link: https://arxiv.org/abs/2603.18407
Authors: Janani S K, Kushagra Gupta, Ufuk Topcu, David Fridovich-Keil
Affiliations: Indian Institute of Technology Madras; The University of Texas at Austin
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 6 pages, 3 figures

Abstract:A fundamental problem in noncooperative dynamic game theory is the computation of Nash equilibria under different information structures, which specify the information available to each agent during decision-making. Prior work has extensively studied equilibrium solutions for two canonical information structures: feedback, where agents observe the current state at each time, and open-loop, where agents only observe the initial state. However, these paradigms are often too restrictive to capture realistic settings exhibiting interleaved information structures, in which each agent observes only a subset of other agents at every timestep. To date, there is no systematic framework for modeling and solving dynamic games under arbitrary interleaved information structures. To this end, we make two main contributions. First, we introduce a method to model deterministic dynamic games with arbitrary interleaved information structures as Mathematical Program Networks (MPNs), where the network structure encodes the informational dependencies between agents. Second, for linear-quadratic (LQ) dynamic games, we leverage the MPN formulation to develop a systematic procedure for deriving Riccati-like equations that characterize Nash equilibria. Finally, we illustrate our approach through an example involving three agents exhibiting a cyclic information structure.

[MA-4] Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

[Quick Read]: This paper targets the lack of interpretability and the systematic failures of automatic prompt optimization (APO), in particular reflective methods such as GEPA whose label-free, black-box optimization can degrade performance and produce untraceable optimization trajectories. The proposed solution, VISTA, is a multi-agent APO framework whose key idea is to decouple hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization traces. A two-layer explore-exploit mechanism (random restarts plus epsilon-greedy sampling) further escapes local optima. On GSM8K with a defective seed prompt, VISTA recovers accuracy from 13.50% to 87.57% and outperforms all baselines on benchmarks such as GSM8K.

Link: https://arxiv.org/abs/2603.18388
Authors: Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
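The two-layer explore-exploit mechanism can be sketched as follows. This is a hedged illustration of the idea (random restart plus epsilon-greedy over minibatch-scored candidates), not VISTA's actual implementation, and all names are invented.

```python
import random

def select_candidate(scores, epsilon=0.1, restart_prob=0.02, rng=random):
    """Pick a candidate prompt id given scores = {candidate: [minibatch scores]}.

    Layer 1: with probability `restart_prob`, restart from a uniformly random
    candidate. Layer 2: otherwise epsilon-greedy on the mean minibatch score.
    """
    candidates = list(scores)
    if rng.random() < restart_prob or rng.random() < epsilon:
        return rng.choice(candidates)                 # explore
    return max(candidates,                            # exploit: best mean score
               key=lambda c: sum(scores[c]) / len(scores[c]))
```

With both probabilities set to zero the selection is pure exploitation; nonzero values let the optimizer occasionally abandon a locally optimal prompt, which is the escape behavior the paper credits for recovering from defective seeds.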

[MA-5] Evolutionarily Stable Stackelberg Equilibrium

[Quick Read]: This paper addresses the lack of explicit stability guarantees for follower strategies in traditional Stackelberg evolutionary games: prior approaches either define follower responses via evolutionary dynamics or assume rational best-response behavior, without explicitly enforcing stability against invading mutations. The key contribution is a new solution concept, the evolutionarily stable Stackelberg equilibrium (SESS), in which the leader selects an optimal mixed strategy anticipating that the follower population plays an evolutionarily stable strategy (ESS) in the induced subgame, possibly subject to additional ecological conditions; both leader-optimal and follower-optimal ESS selection are covered, unifying previously inconsistent formulations. Algorithms are given for computing SESS in discrete and continuous games, and the framework is validated in biological settings such as cancer treatment, with the physician as the leader and competing cancer cell phenotypes as the followers.

Link: https://arxiv.org/abs/2603.18385
Authors: Sam Ganzfried
Affiliations: Ganzfried Research
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH); Populations and Evolution (q-bio.PE)
Comments:

Abstract:We present a new solution concept called evolutionarily stable Stackelberg equilibrium (SESS). We study the Stackelberg evolutionary game setting in which there is a single leading player and a symmetric population of followers. The leader selects an optimal mixed strategy, anticipating that the follower population plays an evolutionarily stable strategy (ESS) in the induced subgame and may satisfy additional ecological conditions. We consider both leader-optimal and follower-optimal selection among ESSs, which arise as special cases of our framework. Prior approaches to Stackelberg evolutionary games either define the follower response via evolutionary dynamics or assume rational best-response behavior, without explicitly enforcing stability against invasion by mutations. We present algorithms for computing SESS in discrete and continuous games, and validate the latter empirically. Our model applies naturally to biological settings; for example, in cancer treatment the leader represents the physician and the followers correspond to competing cancer cell phenotypes.
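As a reference point for the follower side, here is a minimal check of Maynard Smith's ESS conditions in a symmetric 2x2 Hawk-Dove game (V=2, C=4, chosen so the mixed strategy playing Hawk with probability 0.5 is the unique ESS). The paper embeds such ESS computation inside a Stackelberg loop, which this sketch does not show.

```python
# Hawk-Dove payoffs with V=2, C=4: (V-C)/2 = -1, V = 2, 0, V/2 = 1.
A = [[-1.0, 2.0],   # row: Hawk's payoff vs (Hawk, Dove)
     [0.0, 1.0]]    # row: Dove's payoff vs (Hawk, Dove)

def fitness(p, q):
    """Expected payoff of mixed strategy p against opponent strategy q."""
    return sum(p[i] * A[i][j] * q[j] for i in range(2) for j in range(2))

def is_ess(p, grid=100, tol=1e-9):
    """Check Maynard Smith's two ESS conditions against a grid of mutants."""
    for k in range(grid + 1):
        q = (k / grid, 1 - k / grid)
        if abs(q[0] - p[0]) < tol:
            continue                          # q is (numerically) p itself
        incumbent, mutant = fitness(p, p), fitness(q, p)
        if mutant > incumbent + tol:          # mutant strictly invades
            return False
        if abs(incumbent - mutant) <= tol and \
           fitness(p, q) <= fitness(q, q) + tol:
            return False                      # tie not broken against mutant
    return True
```

Pure Hawk and pure Dove both fail the conditions, while the mixed strategy (0.5, 0.5) = V/C passes, matching the standard analysis.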

[MA-6] HRI-SA: A Multimodal Dataset for Online Assessment of Human Situational Awareness during Remote Human-Robot Teaming

[Quick Read]: This paper tackles the difficulty of detecting operator situation awareness (SA) gaps in real time in human-robot teams: under high workload and dynamic conditions, conventional SA measures either disrupt task flow or cannot capture real-time fluctuations, limiting their practical value. The key contribution is HRI-SA, the first publicly available multimodal dataset covering eye movements, pupil diameter, biosignals, user interactions, and robot data, with ground-truth labels for perceptual and comprehension SA latency obtained via predefined events. The study further shows that generic eye-tracking features alone can classify perceptual SA latency effectively (recall = 88.91%, F1 = 67.63%), with performance improving substantially when fused with contextual information (recall = 91.51%, F1 = 80.38%), demonstrating the feasibility and effectiveness of continuous perceptual SA latency detection in remote human-robot teaming.

Link: https://arxiv.org/abs/2603.18344
Authors: Hashini Senaratne, Richard Attfield, Samith Widhanapathirana, David Howard, Cecile Paris, Dana Kulic, Leimin Tian
Affiliations: CSIRO Robotics; Monash Robotics
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: This work is currently under peer review

Abstract:Maintaining situational awareness (SA) is critical in human-robot teams. Yet, under high workload and dynamic conditions, operators often experience SA gaps. Automated detection of SA gaps could provide timely assistance for operators. However, conventional SA measures either disrupt task flow or cannot capture real-time fluctuations, limiting their operational utility. To the best of our knowledge, no publicly available dataset currently supports the systematic evaluation of online human SA assessment in human-robot teaming. To advance the development of online SA assessment tools, we introduce HRI-SA, a multimodal dataset from 30 participants in a realistic search-and-rescue human-robot teaming context, incorporating eye movements, pupil diameter, biosignals, user interactions, and robot data. The experimental protocol included predefined events requiring timely operator assistance, with ground truth SA latency of two types (perceptual and comprehension) systematically obtained by measuring the time between assistance need onset and resolution. We illustrate the utility of this dataset by evaluating standard machine learning models for detecting perceptual SA latencies using generic eye-tracking features and contextual features. Results show that eye-tracking features alone effectively classified perceptual SA latency (recall=88.91%, F1=67.63%) using leave-one-group-out cross-validation, with performance improved through contextual data fusion (recall=91.51%, F1=80.38%). This paper contributes the first public dataset supporting the systematic evaluation of SA throughout a human-robot teaming mission, while also demonstrating the potential of generic eye-tracking features for continuous perceptual SA latency detection in remote human-robot teaming.

[MA-7] MemArchitect: A Policy Driven Memory Governance Layer DATE

[Quick Read]: This paper addresses the governance gap in memory management for persistent large language model (LLM) agents: existing retrieval-augmented generation (RAG) frameworks treat memory as passive storage, with no mechanisms for resolving contradictions, enforcing privacy, or preventing outdated "zombie memories" from contaminating the context window. The key contribution is MemArchitect, a governance layer that decouples memory lifecycle management from model weights and enforces explicit rule-based policies (such as memory decay, conflict resolution, and privacy controls), substantially improving the reliability and safety of agents in autonomous systems.

Link: https://arxiv.org/abs/2603.18330
Authors: Lingavasan Suresh Kumar, Yang Ba, Rong Pan
Affiliations: Arizona State University
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: This is an ongoing research work and will be updated periodically

Abstract:Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as passive storage, lacking mechanisms to resolve contradictions, enforce privacy, or prevent outdated information ("zombie memories") from contaminating the context window. We introduce MemArchitect, a governance layer that decouples memory lifecycle management from model weights. MemArchitect enforces explicit, rule-based policies, including memory decay, conflict resolution, and privacy controls. We demonstrate that governed memory consistently outperforms unmanaged memory in agentic settings, highlighting the necessity of structured memory governance for reliable and safe autonomous systems.
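A minimal sketch of what rule-based memory governance can look like, assuming invented policy names (decay by age, latest-write-wins conflict resolution); MemArchitect's actual policy engine is not described at this level of detail.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    key: str         # what the entry is about, e.g. "user_city"
    value: str
    written_at: int  # logical timestamp

def retrieve(store, now, max_age):
    """Apply a decay policy, then a latest-write-wins conflict policy."""
    fresh = [m for m in store if now - m.written_at <= max_age]  # decay stale
    resolved = {}
    for m in sorted(fresh, key=lambda m: m.written_at):
        resolved[m.key] = m.value        # later writes overwrite earlier ones
    return resolved
```

An entry older than `max_age` is a "zombie memory" and never reaches the context window, and conflicting values for the same key resolve to the most recent write, which is the kind of explicit, auditable rule the paper argues should be separated from model weights.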

[MA-8] A Trace-Based Assurance Framework for Agent ic AI Orchestration: Contracts Testing and Governance

[Quick Read]: This paper addresses failures in Agentic AI systems arising from long-horizon interaction, stochastic decisions, and external side effects (API calls, database writes, message sends), including non-termination, role drift, propagation of unsupported claims, and attacks via untrusted context or external channels. The core of the solution is an assurance framework that records executions as Message-Action Traces (MAT) with explicit step and trace contracts, providing machine-checkable verdicts, localization of the first violating step, and deterministic replay. The framework adds budgeted counterexample search for stress testing and structured fault injection at the service, retrieval, and memory boundaries to assess containment; governance is further treated as a runtime component that bounds each agent's capabilities and mediates actions (allow, rewrite, block) at the language-to-action boundary, keeping the system safe and controllable.

Link: https://arxiv.org/abs/2603.18096
Authors: Ciprian Paduraru, Petru-Liviu Bouruc, Alin Stefanescu
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:In Agentic AI, Large Language Models (LLMs) are increasingly used in the orchestration layer to coordinate multiple agents and to interact with external services, retrieval components, and shared memory. In this setting, failures are not limited to incorrect final outputs. They also arise from long-horizon interaction, stochastic decisions, and external side effects (such as API calls, database writes, and message sends). Common failures include non-termination, role drift, propagation of unsupported claims, and attacks via untrusted context or external channels. This paper presents an assurance framework for such Agentic AI systems. Executions are instrumented as Message-Action Traces (MAT) with explicit step and trace contracts. Contracts provide machine-checkable verdicts, localize the first violating step, and support deterministic replay. The framework includes stress testing, formulated as a budgeted counterexample search over bounded perturbations. It also supports structured fault injection at service, retrieval, and memory boundaries to assess containment under realistic operational faults and degraded conditions. Finally, governance is treated as a runtime component, enforcing per-agent capability limits and action mediation (allow, rewrite, block) at the language-to-action boundary. To support comparative evaluations across stochastic seeds, models, and orchestration configurations, the paper defines trace-based metrics for task success, termination reliability, contract compliance, factuality indicators, containment rate, and governance outcome distributions. More broadly, the framework is intended as a common abstraction to support testing and evaluation of multi-agent LLM systems, and to facilitate reproducible comparison across orchestration designs and configurations. 
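The step-contract idea (machine-checkable verdicts plus localization of the first violating step) can be sketched as a small checker; the contract shown, mediated sends only, is an invented example rather than the paper's MAT schema.

```python
def check_trace(trace, contracts):
    """`trace`: list of step dicts; `contracts`: {name: predicate(step)}.

    Returns (ok, first_violation), where first_violation is
    (step_index, contract_name) or None -- enough to localize and replay.
    """
    for i, step in enumerate(trace):
        for name, pred in contracts.items():
            if not pred(step):
                return False, (i, name)
    return True, None

def no_raw_send(step):
    """Example step contract: 'send' actions must pass action mediation."""
    return not (step.get("action") == "send" and not step.get("mediated", False))
```

Running `check_trace` over an instrumented trace with `{"no_raw_send": no_raw_send}` flags the first unmediated send, mirroring the allow/rewrite/block mediation described in the abstract.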

[MA-9] The Provenance Paradox in Multi-Agent LLM Routing: Delegation Contracts and Attested Identity in LDP

[Quick Read]: This paper addresses the "provenance paradox" in multi-agent LLM systems: when delegates can inflate self-reported quality scores, routing by quality under unverifiable claims systematically selects the worst-performing delegates, doing worse than random selection. The key solution extends the LLM Delegate Protocol (LDP) in three ways: delegation contracts that bound delegate authority via explicit objectives, budgets, and failure policies; a claimed-vs-attested identity model that distinguishes self-reported from verified quality; and typed failure semantics enabling automated recovery. Experiments show that routing by self-claimed quality performs significantly worse than random (simulated: 0.55 vs. 0.68; real Claude models: 8.90 vs. 9.30), while routing by attested quality is near-optimal (d = 9.51, p < 0.001), with the effect holding reliably across 36 configurations. All extensions remain backward-compatible with sub-microsecond validation overhead.

Link: https://arxiv.org/abs/2603.18043
Authors: Sunil Prakash
Affiliations: Indian School of Business
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures. Open-source: this https URL

Abstract:Multi-agent LLM systems delegate tasks across trust boundaries, but current protocols do not govern delegation under unverifiable quality claims. We show that when delegates can inflate self-reported quality scores, quality-based routing produces a provenance paradox: it systematically selects the worst delegates, performing worse than random. We extend the LLM Delegate Protocol (LDP) with delegation contracts that bound authority through explicit objectives, budgets, and failure policies; a claimed-vs-attested identity model that distinguishes self-reported from verified quality; and typed failure semantics enabling automated recovery. In controlled experiments with 10 simulated delegates and validated with real Claude models, routing by self-claimed quality scores performs worse than random selection (simulated: 0.55 vs. 0.68; real models: 8.90 vs. 9.30), while attested routing achieves near-optimal performance (d = 9.51, p < 0.001). Sensitivity analysis across 36 configurations confirms the paradox emerges reliably when dishonest delegates are present. All extensions are backward-compatible with sub-microsecond validation overhead.
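The mechanism behind the paradox can be reproduced in a few lines; the delegates and quality numbers below are invented, not the paper's experimental setup.

```python
def route(delegates, by):
    """Route to the delegate with the highest quality under key `by`."""
    return max(delegates, key=lambda d: d[by])

delegates = [
    {"name": "honest_good", "true": 0.9, "claimed": 0.9},
    {"name": "honest_mid",  "true": 0.6, "claimed": 0.6},
    {"name": "dishonest",   "true": 0.3, "claimed": 1.0},  # inflated claim
]

by_claim  = route(delegates, "claimed")  # claimed identity: picks the worst
by_attest = route(delegates, "true")     # attested identity: picks the best
```

The worst delegate wins the claim-based auction precisely because it lies the most, which is why attestation (verifying quality instead of trusting self-reports) restores near-optimal routing.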

[MA-10] Computationally Efficient Density-Driven Optimal Control via Analytical KKT Reduction and Contractive MPC

[Quick Read]: This paper tackles the computational cost of implementing the Density-Driven Optimal Control (D2OC) framework as a predictive controller for multi-agent systems: the conventional approach solves a Karush-Kuhn-Tucker (KKT) system whose complexity grows cubically in the prediction horizon T (O(T^3)), making it impractical for large swarms. The key idea is an analytical structural reduction that condenses the T-horizon KKT system into a quadratic program (QP) with linear O(T) scalability, substantially reducing the online computational burden; a contractive Lyapunov constraint is further incorporated to guarantee Input-to-State Stability (ISS) of the closed loop against reference-distribution drift, ensuring convergence in dynamic environments. Numerical simulations confirm rapid density coverage at greatly reduced computational cost, enabling long-horizon predictive control for large-scale multi-agent swarms.

Link: https://arxiv.org/abs/2603.18503
Authors: Julian Martinez, Kooktae Lee
Affiliations: New Mexico Institute of Mining and Technology
Subjects: Optimization and Control (math.OC); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:

Abstract:Efficient coordination for collective spatial distribution is a fundamental challenge in multi-agent systems. Prior research on Density-Driven Optimal Control (D2OC) established a framework to match agent trajectories to a desired spatial distribution. However, implementing this as a predictive controller requires solving a large-scale Karush-Kuhn-Tucker (KKT) system, whose computational complexity grows cubically with the prediction horizon. To resolve this, we propose an analytical structural reduction that transforms the T-horizon KKT system into a condensed quadratic program (QP). This formulation achieves O(T) linear scalability, significantly reducing the online computational burden compared to conventional O(T^3) approaches. Furthermore, to ensure rigorous convergence in dynamic environments, we incorporate a contractive Lyapunov constraint and prove the Input-to-State Stability (ISS) of the closed-loop system against reference propagation drift. Numerical simulations verify that the proposed method facilitates rapid density coverage with substantial computational speed-up, enabling long-horizon predictive control for large-scale multi-agent swarms.

Natural Language Processing

[NLP-0] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

[Quick Read]: This paper addresses the inefficiency of current general-purpose multilingual embedding models in resource-constrained settings and their weak support for mid- and low-resource languages. The key to the F2LLM-v2 family is combining a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, yielding models that are far more efficient while retaining strong performance. The 14B-parameter model ranks first on 11 MTEB benchmarks, and the smaller variants set a new state of the art for resource-constrained applications.

Link: https://arxiv.org/abs/2603.19223
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Affiliations: Ant Group; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
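The matryoshka property the training encourages can be illustrated directly: a prefix of an embedding, re-normalized, is itself a usable lower-dimensional embedding. The vectors below are made up; real matryoshka-trained models expose the property in the same way.

```python
import math

def truncate(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize the prefix."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(u, v):
    """Cosine similarity for unit-norm inputs (plain dot product)."""
    return sum(a * b for a, b in zip(u, v))
```

With a matryoshka-trained model, `cosine(truncate(e1, 256), truncate(e2, 256))` is intended to approximate the full-dimension similarity at a fraction of the storage and compute cost, which is what makes the smaller configurations viable in resource-constrained settings.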

[NLP-1] Online Learning and Equilibrium Computation with Ranking Feedback

[Quick Read]: This paper studies regret in online learning when only rankings over proposed actions are observable, rather than precise numeric utility feedback, a setting motivated by human-in-the-loop systems and privacy constraints under which traditional utility-feedback algorithms fail. The key findings: sublinear regret is impossible in general under instantaneous-utility ranking feedback, and also under strongly deterministic time-average-utility ranking feedback (e.g., the Plackett-Luce model as the temperature approaches zero); new algorithms then achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation, and in the full-information time-average setting this extra assumption can be removed. As a consequence, when all players in a game follow the proposed algorithms, repeated play converges to an approximate coarse correlated equilibrium; the algorithms are further validated on a large-language-model routing task.

Link: https://arxiv.org/abs/2603.19221
Authors: Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on numeric utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a ranking over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the instantaneous utility at the current timestep, and rankings induced by the time-average utility up to the current timestep, under both full-information and bandit feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, i.e., under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
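For readers unfamiliar with the Plackett-Luce ranking model referenced above, here is a minimal sampler: items are drawn without replacement with probability proportional to exp(utility / temperature), so as the temperature approaches zero the sampled ranking collapses to the deterministic sort by utility. A sketch, not the paper's code.

```python
import math, random

def plackett_luce_ranking(utilities, temperature, rng=random):
    """Sample a ranking (list of item indices, best first)."""
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        # Softmax weights over the items not yet placed.
        weights = [math.exp(utilities[i] / temperature) for i in remaining]
        total = sum(weights)
        r, acc = rng.random() * total, 0.0
        for i, w in zip(remaining, weights):
            acc += w
            if r <= acc:
                ranking.append(i)     # place item i at the next rank
                remaining.remove(i)
                break
    return ranking
```

At a low temperature the highest-utility item is chosen first with probability approaching one, which is exactly the near-deterministic regime where the paper proves sublinear regret is impossible for time-average ranking feedback.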

[NLP-2] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

[Quick Read]: This paper aims to make small LLMs match the mathematical reasoning and coding ability of large frontier models while maximizing intelligence density. The keys: after SFT on a meticulously curated dataset, Cascade RL is substantially expanded to cover a much broader range of reasoning and agentic domains; and multi-domain on-policy distillation continuously distills from the strongest intermediate teacher model in each domain, effectively recovering benchmark regressions and sustaining performance gains throughout training. The resulting 30B MoE model (with only 3B activated parameters) reaches Gold Medal level in several international competitions, demonstrating outstanding intelligence density with 20x fewer parameters.

Link: https://arxiv.org/abs/2603.19220
Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: We release the model and data at this https URL

Abstract:We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoint and training data.

[NLP-3] Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

[Quick Read]: This paper addresses the susceptibility of large language models (LLMs) to hallucination and unreliable reasoning under adversarial prompting. Existing safety methods such as reinforcement learning from human feedback (RLHF) and output filtering operate mainly at the behavioral level and lack explicit mechanisms for controlling the reasoning process itself. The proposed Box Maze framework is a layered process-control architecture that explicitly decomposes LLM reasoning into three modules: memory grounding, structured inference, and boundary enforcement. Its key contribution is strengthening reasoning-path integrity through an explicit cognitive control layer: in simulated adversarial scenarios, the architecture reduces boundary failure rates from roughly 40% under baseline RLHF to below 1%, offering a new process-level control paradigm for more reliable LLM reasoning.

Link: https://arxiv.org/abs/2603.19182
Authors: Zou Qiang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 5 tables, 0 figures. Conceptual architecture with preliminary simulation-based validation

Abstract:Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches – such as reinforcement learning from human feedback (RLHF) and output filtering – primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.

[NLP-4] Evaluating Counterfactual Strategic Reasoning in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在重复博弈场景中是否具备真正的战略推理能力,还是仅仅依赖于记忆中的模式匹配这一问题。其解决方案的关键在于设计了一套多指标评估框架,通过引入反事实变体(counterfactual variants)——即改变收益结构和动作标签以打破原有对称性和支配关系——来测试LLMs在非标准博弈环境下的表现,从而揭示其在激励敏感性、结构泛化能力和战略推理方面的局限性。

链接: https://arxiv.org/abs/2603.19167
作者: Dimitrios Georgousis,Maria Lymperaiou,Angeliki Dimitriou,Giorgos Filandrianos,Giorgos Stamou
机构: National Technical University of Athens
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner’s Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
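
摘要中"改变收益结构以打破占优关系"的反事实构造,可用如下玩具代码示意(收益数值为假设构造,并非论文原始设定):

```python
def best_response(payoffs, opponent_action):
    """payoffs[(my, opp)] 为行方收益;返回对手动作固定时我的最优动作。"""
    actions = {a for a, _ in payoffs}
    return max(actions, key=lambda a: payoffs[(a, opponent_action)])

# 标准囚徒困境(行方收益):背叛 D 对任意对手动作都严格占优
pd = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
assert all(best_response(pd, o) == "D" for o in ("C", "D"))

# 反事实变体:调整收益,使占优关系被打破(数值为示意构造)
cf = {("C", "C"): 5, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}
print([best_response(cf, o) for o in ("C", "D")])  # ['C', 'D'],不再存在占优策略
```

若模型依赖记忆中的"囚徒困境应背叛"模式而非收益本身,在此类变体上就会暴露出激励敏感性不足。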

[NLP-5] Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

【速读】: 该论文旨在解决机器人在人机协作中将自然语言目标转化为可执行、物理空间内一致决策的问题,特别是针对包含语义参考与度量约束(metric constraints)的复杂语言查询,如“向冰箱右侧两米处移动”。现有视觉语言模型(VLM)虽具备良好的语义 grounding 能力,但在物理空间中的度量推理方面存在局限。解决方案的关键在于提出一种多智能体概率 grounding 框架(MAPG),其核心机制是将语言查询分解为结构化的子组件,分别调用 VLM 进行语义 grounding,再通过概率组合策略生成在 3D 空间中度量一致的可行动作决策。该方法显著提升了对复杂度量-语义查询的理解能力,并在 HM-EQA 和新提出的 MAPG-Bench 基准上验证了有效性,同时展示了在真实机器人场景中的迁移潜力。

链接: https://arxiv.org/abs/2603.19166
作者: Swagat Padhan,Lakshya Jain,Bhavya Minesh Shah,Omkar Patil,Thao Nguyen,Nakul Gopalan
机构: Arizona State University (亚利桑那州立大学); Haverford College (哈弗福德学院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: this https URL

点击查看摘要

Abstract:Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as “go two meters to the right of the fridge” requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
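
MAPG 对各子组件 grounding 结果的概率组合,可用一维网格上的逐格相乘来粗略示意(概率图均为虚构玩具数据;真实系统在 3D 场景中工作):

```python
def compose_grounding(semantic, metric):
    """逐格相乘"语义约束"与"度量约束"两张概率图,归一化后取最大格。
    真实的 MAPG 在 3D 场景中组合 VLM 对各子组件的 grounding 结果,此处用一维网格示意。"""
    joint = [s * m for s, m in zip(semantic, metric)]
    z = sum(joint)
    joint = [j / z for j in joint]
    return joint.index(max(joint)), joint

# 玩具示例:一维走廊上 5 个格子,冰箱位于格 1
semantic = [0.0, 0.0, 0.5, 0.8, 0.9]   # P(格子位于"冰箱右侧") —— 假设数值
metric   = [0.0, 0.1, 0.6, 0.9, 0.2]   # P(格子距冰箱约两米) —— 假设数值
best, post = compose_grounding(semantic, metric)
print(best)  # 3
```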

[NLP-6] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

【速读】: 该论文旨在解决大语言模型在低资源语言上表现不佳的问题,其核心原因在于子词分词(subword segmentation)效率低下和训练数据系统性不平衡。解决方案的关键是提出一种基于可验证奖励的强化学习框架——可变熵策略优化(Variable Entropy Policy Optimization, VEPO),通过引入确定性结构约束来对齐策略,从而在训练过程中确保预设序列长度、格式一致性及语言形式合法性。VEPO的核心创新在于可变熵机制,它动态调节字面忠实度与语义自然度之间的平衡,结合熵加权的优势估计与非对称裁剪策略,在维持鲁棒探索的同时防止策略崩溃,显著提升了低资源语言的分词效率与翻译质量。

链接: https://arxiv.org/abs/2603.19152
作者: Chonghan Liu,Yimin Du,Qi An,Xin He,Cunqi Zhai,Fei Tan,Weijia Lin,Xiaochun Gong,Yongchao Deng,Shousheng Jia,Xiangzheng Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages. Includes figures and tables. Conference submission

点击查看摘要

Abstract:Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration–exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions (measured with COMET-22 and chrF) demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
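
摘要提到的"熵加权优势 + 非对称裁剪"可按如下方式勾勒;具体函数形式与超参数论文摘要并未给出,以下均为假设性示意:

```python
def vepo_surrogate(ratio, advantage, entropy, eps_low=0.2, eps_high=0.3, tau=1.0):
    """单样本的示意性 VEPO 代理目标:
    - 熵加权优势 A' = A * (1 + tau * entropy),在高熵(探索性)步骤上放大更新信号;
    - 非对称裁剪:上界 1 + eps_high 宽于下界 1 - eps_low,保留探索、抑制策略崩溃。
    具体函数形式与全部超参数均为假设,仅用于说明思路。"""
    a = advantage * (1 + tau * entropy)
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * a, clipped * a)   # PPO 式取悲观值

print(round(vepo_surrogate(ratio=1.5, advantage=1.0, entropy=0.5), 3))  # 1.95:比率 1.5 被上界 1.3 裁剪
```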

[NLP-7] Optimal Splitting of Language Models from Mixtures to Specialized Domains

【速读】: 该论文旨在解决多领域语言模型训练中如何高效分配计算资源的问题,特别是在传统两阶段训练范式(先在全量数据上预训练,再在高质量子集上继续预训练)下,如何优化预训练与继续预训练之间的计算分配以提升模型性能。其解决方案的关键在于提出一种基于缩放定律(scaling laws)的独立预训练方法:首先让多个模型在通用预训练语料库上独立进行预训练,随后通过缩放定律精确预测不同模型规模(N)、预训练令牌数(D)和继续预训练令牌数(D’)组合下的损失值,并据此确定最优的计算资源配置。该方法能够准确 extrapolate 到更大模型规模和更多令牌数,从而在常识知识和推理基准测试中一致提升不同模型尺寸和计算预算下的性能表现。

链接: https://arxiv.org/abs/2603.19149
作者: Skyler Seto,Pierre Ablin,Anastasiia Filippova,Jiayuan Ye,Louis Bethune,Angelos Katharopoulos,David Grangier
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages, 11 tables, 17 figures

点击查看摘要

Abstract:Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D’ specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
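
"先拟合缩放律、再反推最优算力划分"的思路可示意如下;损失函数形式 L(N, D, D') 与全部系数均为假设性示例,并非论文的拟合结果:

```python
def predicted_loss(N, D, Dp, E=1.7, A=400.0, alpha=0.34,
                   B=2e3, beta=0.30, C=2e3, gamma=0.30):
    """示意性的两阶段缩放律 L(N, D, D') = E + A/N^α + B/D^β + C/D'^γ。
    函数形式与全部系数均为假设性示例;实际做法是先用小规模训练点拟合系数,
    再外推到目标 (N, D, D') 组合。"""
    return E + A / N**alpha + B / D**beta + C / Dp**gamma

def best_split(N, budget, steps=100):
    """固定总 token 预算,网格搜索分给继续预训练(专业化)阶段的最优比例。"""
    fracs = [i / steps for i in range(1, steps)]   # 避开 0 与 1(幂次项会除零)
    return min(fracs, key=lambda f: predicted_loss(N, budget * (1 - f), budget * f))

print(best_split(N=1e9, budget=1e11))  # 0.5:两阶段系数对称时,预算对半分最优
```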

[NLP-8] UGID: Unified Graph Isomorphism for Debiasing Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的显著社会偏见问题。现有基于输出层面或数据优化的去偏方法难以彻底消除偏见,且已有研究表明偏见往往嵌入在模型的内部表示中。为此,论文提出了一种名为UGID(Unified Graph Isomorphism for Debiasing)的内部表示层级去偏框架,其核心创新在于将Transformer架构建模为结构化的计算图:注意力机制定义图中的路由边,隐藏状态作为节点。关键在于通过强制在反事实输入下保持图结构不变(仅允许敏感属性差异),实现对偏见敏感区域中注意力路由与隐藏表示的联合约束,从而有效阻止偏见在模型各组件间的迁移。此外,引入对敏感logits的对数空间约束和选择性锚定目标,可在不损害模型通用能力的前提下实现行为对齐,实验表明该方法在分布内和分布外场景下均能显著降低偏见并保持模型安全性和实用性。

链接: https://arxiv.org/abs/2603.19144
作者: Zikang Ding,Junchi Yao,Junhao Li,Yi Zhang,Wenbo Jiang,Hongbo Liu,Lijie Hu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization–based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation–level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.
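
"反事实输入下注意力路由不变"这一约束,可用如下玩具损失示意(真实方法作用于各层各头的完整注意力图;此处仅为单行注意力的假设示例):

```python
def routing_invariance_loss(attn_f, attn_cf, mask):
    """在偏见敏感位置(mask=1)上,惩罚事实/反事实输入间注意力路由的均方差异。
    attn_f / attn_cf 为同一位置在两种输入下的注意力分布(玩具一维示例)。"""
    diffs = [(a - b) ** 2 * m for a, b, m in zip(attn_f, attn_cf, mask)]
    return sum(diffs) / max(sum(mask), 1)

attn_he  = [0.7, 0.2, 0.1]   # "he ..." 输入下某行注意力(假设数值)
attn_she = [0.3, 0.5, 0.2]   # 仅替换敏感词后的对应注意力(假设数值)
print(round(routing_invariance_loss(attn_he, attn_she, [1, 1, 0]), 3))  # 0.125
```

损失为 0 即表示图结构在反事实对之间完全同构(在被约束的位置上)。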

[NLP-9] How Uncertainty Estimation Scales with Sampling in Reasoning Models

【速读】: 该论文旨在解决推理型语言模型在扩展链式思维(Chain-of-Thought, CoT)推理过程中不确定性估计不足的问题。其核心挑战在于如何有效利用黑箱方法对模型输出的置信度进行量化,从而提升模型决策的可靠性。解决方案的关键在于采用并行采样策略,结合两种信号:一是显式表达的自信度(verbalized confidence),二是自一致性(self-consistency)。研究表明,虽然两者均随采样量增长而提升,但它们的初始判别能力不同且收敛速度各异;更重要的是,两者的组合能显著增强不确定性估计性能——仅需两个样本即可使平均AUROC提升高达+12,且优于单独使用任一信号,即便在更大采样预算下仍保持优势,体现了信号融合带来的高效性与互补性。

链接: https://arxiv.org/abs/2603.19118
作者: Maksym Del,Markus Kängsepp,Marharyta Domnich,Ardi Tampuu,Lisa Yankovskaya,Meelis Kull,Mark Fishel
机构: University of Tartu (塔尔图大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to +12 on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
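
两种信号的组合可按如下方式勾勒;论文摘要未公开具体组合公式,下面的线性加权与权重 w 均为假设:

```python
def hybrid_confidence(samples, w=0.5):
    """samples: [(答案, 口头置信度 ∈ [0,1]), ...]。
    自一致性 = 众数答案占比;混合分 = w*自一致性 + (1-w)*平均口头置信度。
    线性加权与 w 的取值均为假设,仅用于演示"两个样本即可组合两种信号"。"""
    answers = [a for a, _ in samples]
    top = max(set(answers), key=answers.count)
    consistency = answers.count(top) / len(answers)
    verbalized = sum(c for _, c in samples) / len(samples)
    return top, w * consistency + (1 - w) * verbalized

# 仅两个样本即可计算混合不确定性
top, score = hybrid_confidence([("42", 0.9), ("42", 0.8)])
print(top, round(score, 3))  # 42 0.925
```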

[NLP-10] DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering ICASSP2026

【速读】: 该论文旨在解决多语言多跳问答(multilingual multi-hop QA, MM-hop QA)场景下检索增强生成(Retrieval-augmented generation, RAG)系统性能显著下降的问题,其核心挑战包括缺乏针对多语言环境的基准测试以及现有模型对英语语义理解能力的过度依赖导致跨语言迁移效果不佳。解决方案的关键在于提出DaPT框架,该框架通过并行生成源语言查询及其英文翻译对应的子问题图(sub-question graphs),随后进行融合,并采用双语检索与答案生成策略,实现对多跳问题的分步求解,从而有效提升跨语言场景下的准确性与一致性。实验表明,DaPT在最具挑战性的MuSiQue基准上相较最强基线平均EM得分提升18.3%。

链接: https://arxiv.org/abs/2603.19097
作者: Yilin Wang,Yuchun Fan,Jiaoyang Li,Ziming Zhu,Yongyu Mu,Qiaozhi He,Tong Xiao,Jingbo Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems’ capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs’ strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.

[NLP-11] SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在真实场景中做出安全决策时缺乏可解释性的问题,即难以确定哪些视觉证据驱动了其判断。解决方案的关键在于提出一种语义引导框架(semantic steering framework),通过施加受控的文本、视觉和认知干预,在不改变原始场景内容的前提下,系统性地探究语义线索如何影响VLM的安全行为。该框架结合了SAVeS基准测试与分离式评估协议,能够区分行为拒绝、基于视觉理解的安全推理以及误拒行为,从而揭示VLM更依赖于习得的视觉-语言关联而非真正的视觉感知理解,进而暴露多模态安全系统的潜在脆弱性。

链接: https://arxiv.org/abs/2603.19092
作者: Carlos Hinojosa,Clemens Grange,Bernard Ghanem
机构: King Abdullah University of Science and Technology (KAUST); Technical University of Munich (TUM)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.

[NLP-12] Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

【速读】: 该论文旨在解决一个核心问题:大型语言模型(Large Language Models, LLMs)是否以与人类相同的方式具备创造力,以及针对人类有效的创造性干预措施是否同样适用于LLMs。其解决方案的关键在于评估一种新兴的创意激发策略——跨域映射(cross-domain mapping),即强制创作者从一个随机且语义遥远的源领域(如章鱼、仙人掌或GPS)中提取特性并转化为目标产品(如背包、电视)的新颖功能。研究发现,人类在使用跨域映射时显著提升创意产出,而LLMs平均生成的创意比人类更原创,但并未因该干预产生统计学意义上的显著改善;不过,在两种系统中,当源领域与目标领域语义距离增加时,跨域映射的效果均增强,揭示了远程联想在创意生成中的共性作用,同时也凸显了人类与LLMs对同一干预机制响应方式的本质差异。

链接: https://arxiv.org/abs/2603.19087
作者: Qiawen Ella Liu,Marina Dubova,Henry Conklin,Takumi Harada,Thomas L. Griffiths
机构: Princeton University (普林斯顿大学); Santa Fe Institute (圣塔菲研究所); Toyota Motor North America, Inc., Plano, Texas (丰田汽车北美公司,德克萨斯州普拉诺市)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Are large language models (LLMs) creative in the same way humans are, and can the same interventions increase creativity in both? We evaluate a promising but largely untested intervention for creativity: forcing creators to draw an analogy from a random, remote source domain ("cross-domain mapping"). Human participants and LLMs generated novel features for ten daily products (e.g., backpack, TV) under two prompts: (i) cross-domain mapping, which required translating a property from a randomly assigned source (e.g., octopus, cactus, GPS), and (ii) user-need, which required proposing innovations targeting unmet user needs. We show that humans reliably benefit from randomly assigned cross-domain mappings, while LLMs, on average, generate more original ideas than humans and do not show a statistically significant effect of cross-domain mappings. However, in both systems, the impact of cross-domain mapping increases when the inspiration source becomes more semantically distant from the target. Our results highlight both the role of remote association in creative ideation and systematic differences in how humans and LLMs respond to the same intervention for creativity.

[NLP-13] A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

【速读】: 该论文旨在解决当前健康素养(Health Literacy)筛查工具在电子健康记录(EHR)中难以结构化记录的问题,这些问题主要源于现有工具在条目数量、问题形式和测量维度上的不一致性。为应对这一挑战,作者提出了一种基于临床笔记的自动化检测方法,其关键在于构建了首个公开可用的健康素养标注数据集HEALIX,该数据集来源于真实临床笔记,通过社会工作者笔记采样、关键词过滤与大语言模型(LLM)驱动的主动学习相结合的方式进行标注,涵盖9类笔记类型共589条文本,并标注了低、正常、高水平的健康素养标签。此数据集为后续利用自然语言处理技术实现健康素养的自动识别提供了基础资源与验证平台。

链接: https://arxiv.org/abs/2603.19082
作者: Madeline Bittner,Dina Demner-Fushman,Yasmeen Shabazz,Davis Bartels,Dukyong Yoon,Brad Quitadamo,Rajiv Menghrajani,Leo Celi,Sarvesh Soni
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).

[NLP-14] Parallelograms Strike Back: LLM s Generate Better Analogies than People

【速读】: 该论文试图解决的问题是:经典的四词类比(A:B::C:D)几何模型——即“平行四边形模型”(parallelogram model)是否能有效解释人类生成类比的机制,抑或其失效源于人类本身难以产生符合关系约束的类比。研究表明,尽管人类常依赖局部相似性启发式(local-similarity heuristics)生成类比,导致对平行四边形结构的偏离,但大型语言模型(LLM)在相同任务中表现出更强的平行四边形结构一致性,并且其类比被人类评价为更优。关键在于,LLM的优势并非来自整体响应质量的普遍提升,而是源于人类存在大量低质量的类比响应(长尾效应),而LLM则更稳定地满足语义关系约束。进一步分析表明,预测类比质量的关键因素是平行四边形对齐度与词汇频率,而非局部相似性敏感度,这说明平行四边形模型本质上仍是类比关系的有效表征,问题出在人类生成过程中的不稳定性,而非模型本身的缺陷。

链接: https://arxiv.org/abs/2603.19066
作者: Qiawen Ella Liu,Raja Marjieh,Jian-Qiao Zhu,Adele E. Goldberg,Thomas L. Griffiths
机构: Princeton University (普林斯顿大学); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from (Peterson et al., 2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
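
平行四边形模型对候选 D 的评分即 cos(B − A + C, D),示意如下(使用二维玩具向量;实际工作基于 GloVe 词向量空间):

```python
def parallelogram_score(A, B, C, D):
    """A:B::C:D 的平行四边形评分:预测向量 B - A + C 与候选 D 的余弦相似度。
    此处用二维玩具坐标示意,实际在 GloVe 等高维词向量空间中计算。"""
    pred = [b - a + c for a, b, c in zip(A, B, C)]
    norm = lambda v: sum(x * x for x in v) ** 0.5
    dot = sum(p * d for p, d in zip(pred, D))
    return dot / (norm(pred) * norm(D))

# 理想的 man:king :: woman:queen 平行四边形(玩具坐标)
man, king = [1.0, 0.0], [1.0, 1.0]
woman, queen = [0.0, 0.0], [0.0, 1.0]
print(round(parallelogram_score(man, king, woman, queen), 3))  # 1.0
```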

[NLP-15] MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体方法在科学创意生成(Scientific Ideation)中缺乏深层科学推理建模的问题,导致生成方案仅停留在表面概念重组,缺乏技术深度与科学依据。其解决方案的关键在于提出MoRI框架,通过监督微调初始化模型以从给定科学背景中生成研究动机,并进一步采用复合强化学习奖励机制训练模型:一方面引入熵感知的信息增益以鼓励模型挖掘并详述基于真实方法论的高复杂度技术细节;另一方面利用对比语义增益约束推理路径,确保其与科学有效解保持概念一致性。此设计使模型能够显式学习从研究动机到方法论的完整推理链条,从而显著提升创意的新颖性、技术严谨性和可行性。

链接: https://arxiv.org/abs/2603.19044
作者: Chenyang Gu,Jiahao Cheng,Meicong Zhang,Pujun Zheng,Jinquan Zheng,Guoxiu He
机构: East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub (this https URL).

[NLP-16] What Really Controls Temporal Reasoning in Large Language Models : Tokenisation or Representation of Time?

【速读】: 该论文旨在解决多语言环境下时间推理能力的评估与优化问题,特别是针对不同语言资源稀缺性及多种历法(如格里高利历、希吉拉历和中国农历)下模型表现不一致的问题。解决方案的关键在于构建了MultiTempBench这一跨语言、跨历法的时间推理基准,包含15,000个经过精心设计的样本,并引入多语言日期碎片化比率(multilingual Date Fragmentation Ratio, mDFR)作为量化指标,结合人类严重度评分校准与几何探测分析内部时序表征。研究发现,时间标记的分词质量是低资源语言和稀有历法下的关键瓶颈,而高资源语言中时间线性结构则成为主导预测因子,揭示了模型在不同资源条件下对时间推理机制的依赖差异。

链接: https://arxiv.org/abs/2603.19017
作者: Gagan Bhatia,Ahmad Muhammad Isa,Maxime Peyrard,Wei Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks, date arithmetic, time zone conversion, and temporal relation extraction across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: this https URL
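
日期碎片化比率(mDFR)的基本思想可示意如下;具体归一化方式与人工严重度校准以论文为准,此处为简化的假设定义:

```python
def date_fragmentation_ratio(tokens_per_field):
    """示意性的日期碎片化比率:理想情况下 年/月/日 各用 1 个 token(比率 = 1),
    字段被切得越碎比率越高。论文中的 mDFR 还经过人工严重度评分校准,此处为简化假设。"""
    ideal = len(tokens_per_field)      # 每个日期字段理想上 1 个 token
    used = sum(tokens_per_field)
    return used / ideal

# 例:某分词器把 "2026-03-20" 的年份切成 2 个 token,月、日各 1 个(忽略分隔符)
print(round(date_fragmentation_ratio([2, 1, 1]), 2))  # 1.33
```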

[NLP-17] Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在需要从多个竞争选项中做出决策的任务中表现不佳的问题。传统RAG方法通常依赖单一初始查询进行检索,往往侧重于主题相关性而非决策相关证据,导致检索到的背景信息难以有效区分候选答案。解决方案的关键在于提出一种无需训练的预检索框架——假设条件查询重写(Hypothesis-Conditioned Query Rewriting, HCQR),其核心是首先基于输入问题和候选选项生成一个轻量级工作假设,随后将检索过程转化为三个针对性查询:(1) 支持该假设的证据,(2) 区分该假设与竞争选项的证据,以及(3) 验证问题中关键线索的证据。这一机制使检索内容更直接服务于答案选择,从而提升生成器基于检索证据确认或推翻初始假设的能力。

链接: https://arxiv.org/abs/2603.19008
作者: Hangeol Chang,Changsun Lee,Seungjoon Rho,Junho Yeo,Jong Chul Ye
机构: KAIST(韩国科学技术院); Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at this https URL.
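
HCQR 的三类查询改写可用如下模板示意(提示词模板为假设构造,并非论文原文):

```python
def hcqr_queries(question, hypothesis, alternatives, clue):
    """将一个轻量工作假设改写为三类检索查询(模板为示意构造,非论文原文提示词)。"""
    alts = "、".join(alternatives)
    return [
        f"支持性证据:{hypothesis} 在问题「{question}」中是否成立",   # (1) 支持假设
        f"区分性证据:{hypothesis} 与 {alts} 的关键差异",             # (2) 区分竞争选项
        f"线索核验:{clue} 的定义与相关事实",                         # (3) 核验问题线索
    ]

for q in hcqr_queries("患者出现症状X,最可能的诊断是?", "诊断A",
                      ["诊断B", "诊断C"], "症状X"):
    print(q)
```

三路检索结果再交给生成器,用于确认或推翻初始假设。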

[NLP-18] RADIUS: Ranking Distribution and Significance - A Comprehensive Alignment Suite for Survey Simulation

【速读】: 该论文旨在解决当前survey simulation(调查模拟)评估中缺乏统一、标准化指标的问题,现有方法多借用其他领域的评价指标,存在碎片化、非标准化且忽视排序一致性(ranking alignment)的缺陷,导致结果难以比较。尤其在决策类应用中,即使模拟结果在准确性或分布上表现良好,仍可能无法捕捉人类最偏好的选项,从而影响实际效用。其解决方案的关键在于提出RADIUS——一个包含两个维度的综合对齐评估套件:RAnking alignment(排序对齐)与DIstribUtion alignment(分布对齐),并分别引入统计显著性检验,从而更全面、客观地衡量模拟结果的质量,并提供开源实现以支持可复现和可比的评估。

链接: https://arxiv.org/abs/2603.19002
作者: Weronika Łajewska,Paul Missault,George Davidson,Saab Mansour
机构: Amazon(亚马逊); Amazon(亚马逊)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.
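
排序对齐与分布对齐两个维度,可分别用 Kendall tau 与 Jensen–Shannon 散度示意(RADIUS 的具体指标选择与显著性检验以论文为准,以下为常见的替代实现):

```python
import math
from itertools import combinations

def kendall_tau(r1, r2):
    """排序对齐:两组选项名次(选项 -> 名次)之间的 Kendall tau。"""
    concordant = discordant = 0
    for a, b in combinations(sorted(r1), 2):
        s = (r1[a] - r1[b]) * (r2[a] - r2[b])
        if s > 0: concordant += 1
        elif s < 0: discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

def js_divergence(p, q):
    """分布对齐:Jensen-Shannon 散度(以 2 为底,取值 [0, 1])。"""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda x, y: sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

human = {"A": 1, "B": 2, "C": 3}   # 人类偏好排名(玩具数据)
sim   = {"A": 1, "B": 3, "C": 2}   # 模拟结果排名
print(round(kendall_tau(human, sim), 3))                          # 0.333
print(round(js_divergence([0.5, 0.3, 0.2], [0.2, 0.3, 0.5]), 3))
```

该例说明摘要的核心观点:即便首选项一致(排名第一均为 A),排序与分布两个维度仍可给出不同的对齐程度。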

[NLP-19] A conceptual framework for ideology beyond the left and right

【速读】: 该论文试图解决当前自然语言处理(Natural Language Processing, NLP)与计算社会科学(Computational Social Science, CSS)研究中对意识形态的建模过于简化的问题,即现有方法主要基于左右派(left/right partisan)单一维度进行操作化定义,忽略了个体在种族、气候、性别等具体议题上持有的复杂且多元的意识形态立场。其解决方案的关键在于提出一个将意识形态视为“被赋予的、多层次的社会认知概念网络”的新框架,该框架不仅解释了意识形态如何通过话语体现,并揭示其与框架(framing)等社会过程的关联;同时阐明了该模型如何厘清现有NLP任务(如立场检测和自然语言推理)之间的重叠关系,并开辟新的研究方向,从而在计算方法与意识形态理论之间搭建起一座独特而重要的桥梁,推动社会话语分析向更丰富、更贴近现实的方向发展。

链接: https://arxiv.org/abs/2603.18945
作者: Kenneth Joseph,Kim Williams,David Lazer
机构: University at Buffalo (纽约州立大学布法罗分校); Portland State University (波特兰州立大学); Northeastern University (东北大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:NLP+CSS work has operationalized ideology almost exclusively on a left/right partisan axis. This approach obscures the fact that people hold interpretations of many different complex and more specific ideologies on issues like race, climate, and gender. We introduce a framework that understands ideology as an attributed, multi-level socio-cognitive concept network, and explains how ideology manifests in discourse in relation to other relevant social processes like framing. We demonstrate how this framework can clarify overlaps between existing NLP tasks (e.g. stance detection and natural language inference) and also how it reveals new research directions. Our work provides a unique and important bridge between computational methods and ideology theory, enabling richer analysis of social discourse in a way that benefits both fields.

[NLP-20] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理过程中失败难以低成本检测的问题。其核心解决方案在于提出“熵轨迹单调性”(entropy-trajectory monotonicity)这一新指标:若推理每一步的答桉分布熵值均单调递减,则判定该链为单调链。实验表明,单调链在GSM8K数据集上准确率显著优于非单调链(+21.9个百分点),且该指标不依赖于总熵减少量,揭示了不确定性动态形状(shape)比幅度(magnitude)更具预测价值。此方法可在约1500 token/问题的代价下实现优于传统标量基线的性能,成本仅为40链自洽性(self-consistency)的1/8。

链接: https://arxiv.org/abs/2603.18940
作者: Xinghao Zhao
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps–captured by sampling a few answer completions per step–predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher’s p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive (ρ=-0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186-0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at ≈1,500 tokens/question, 1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
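
"每步答案分布熵 + 单调性判定"可按如下方式实现(采样数据为玩具示例):

```python
import math
from collections import Counter

def step_entropy(answers):
    """对某一步采样得到的答案列表,计算经验分布的香农熵(单位:比特)。"""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def monotonicity(per_step_answers):
    """返回 (是否单调链, 违反次数):熵在每一步都不增则为单调链。"""
    ents = [step_entropy(a) for a in per_step_answers]
    violations = sum(1 for e0, e1 in zip(ents, ents[1:]) if e1 > e0)
    return violations == 0, violations

# 玩具示例:三个推理步,每步采样 4 个答案补全
chain = [["12", "15", "18", "12"],   # 较分散,熵高
         ["12", "12", "15", "12"],   # 收敛中
         ["12", "12", "12", "12"]]   # 完全收敛,熵为 0
print(monotonicity(chain))  # (True, 0)
```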

[NLP-21] Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLM s

【速读】: 该论文旨在解决知识增强型对话系统在多语言场景下存在的三大核心问题:一是现有方法主要局限于英文,缺乏对事实性陈述的显式引用机制以验证真实性,二是模型决策过程透明度不足,三是跨语言迁移能力弱且易发生灾难性遗忘。其解决方案的关键在于提出一个渐进式的四阶段训练流程(XKD-Dial),依次完成多语言适配、带引用标注的英文对话监督微调(SFT)、双语对话SFT以及基于引用感知奖励的GRPO对齐优化;该流程不仅显著降低幻觉率至0.0%(针对编码器-解码器架构),还通过系统性后验可解释性分析(如交叉注意力对齐、集成梯度归因和遮蔽因果定位)揭示了引用行为的学习机制,从而实现了可解释、高保真且具备双语能力的知识增强对话生成。

链接: https://arxiv.org/abs/2603.18911
作者: Vedant Pandya
机构: Indian Institute of Technology Jodhpur (印度理工学院贾德普尔分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages, 15 figures, 11 tables. Comprehensive study across 6 LLMs (250M-7B parameters) with explainability analysis. Code and data available upon request

点击查看摘要

Abstract:Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).
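论文评测指标之一的 Citation-F1 通常按预测引用集合与金标引用集合的重合度计算。下面给出一个集合级 F1 的最小示意(匹配粒度按引用标识符的集合处理,为本示例的假设,未必与原文实现一致):

```python
def citation_f1(predicted, gold):
    """预测引用与金标引用的集合级 F1;两者均为空时记为 1.0(边界约定为本示例假设)。"""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0
    tp = len(pred & ref)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0
```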

[NLP-22] Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在数学与科学推理任务中对复杂数学对象(mathematical objects)生成能力不足的问题,尤其针对下游STEM应用中对形式化表达结构的严格要求。现有评估多依赖数值答案或多项选择题(MCQA),难以全面衡量模型的逻辑推导能力。其解决方案的关键在于:首先构建并发布名为Principia的训练数据集与基准测试套件,用于专门提升数学对象的推导能力;其次提出基于强LLM裁判(LLM-judges)和验证器的训练策略,证明在线策略裁判训练(on-policy judge training)可显著提升性能;最后利用相同机制通过聚合方式扩展推理时计算资源(test-time compute scaling),实现跨格式(如数值、MCQA与符号表达)的推理能力泛化。

链接: https://arxiv.org/abs/2603.18886
作者: Pranjal Aggarwal,Marjan Ghazvininejad,Seungone Kim,Ilia Kulikov,Jack Lanchantin,Xian Li,Tianjian Li,Bo Liu,Graham Neubig,Anaelia Ovalle,Swarnadeep Saha,Sainbayar Sukhbaatar,Sean Welleck,Jason Weston,Chenxi Whitehouse,Adina Williams,Jing Xu,Ping Yu,Weizhe Yuan,Jingyu Zhang,Wenting Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.
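摘要提到用在线训练的裁判模型在测试时通过聚合扩展计算量。一种常见做法是裁判加权投票:对 N 个候选推导按最终数学对象分组,累加裁判分数后取总分最高者。以下为示意性实现(并非原文算法;`judge_score` 与 `final_answer` 均为假设的接口):

```python
from collections import defaultdict

def judge_weighted_vote(candidates, judge_score, final_answer):
    """裁判加权的自洽性聚合:按最终对象分组,累加裁判分数,取总分最高的对象。"""
    totals = defaultdict(float)
    for c in candidates:
        totals[final_answer(c)] += judge_score(c)
    return max(totals, key=totals.get)
```

这种聚合让多个中等置信的相同推导能够胜过单个高分但孤立的推导,与纯多数投票相比还利用了裁判的分数信息。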

[NLP-23] A Human-in/on-the-Loop Framework for Accessible Text Generation LREC2026

【速读】: 该论文旨在解决当前自动文本简化与评估流程中缺乏用户理解反馈和规范标准映射的问题,即现有方法多依赖自动化指标,难以真实反映认知可及性(cognitive accessibility)需求。其解决方案的关键在于提出一种混合框架,通过“人在回路”(Human-in-the-Loop, HiTL)机制在生成阶段引入人类指导以调整输出,并借助“人在环路”(Human-on-the-Loop, HoTL)监督实现系统性后生成审查;同时将用户研究与标注资源转化为三类结构化组件:符合规范的检查清单、基于事件-条件-动作(Event-Condition-Action)规则触发专家干预的逻辑机制,以及可量化的无障碍关键绩效指标(Accessibility Key Performance Indicators, KPIs),从而构建可追溯、可复现且可审计的可访问文本生成与评估流程,将可解释性和伦理责任嵌入设计核心,推动更透明、包容的自然语言处理(NLP)系统发展。

链接: https://arxiv.org/abs/2603.18879
作者: Lourdes Moreno,Paloma Martínez
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026. To appear in the Proceedings of the 14th International Conference on Language Resources and Evaluation (LREC 2026)

点击查看摘要

Abstract:Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated, metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.
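框架中的事件-条件-动作(Event-Condition-Action)触发规则可以用很小的规则引擎表达:事件到达时逐条检查条件,命中则触发专家介入动作。以下为示意性 Python 草图(规则内容与字段名均为假设,仅说明机制):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ECARule:
    """事件-条件-动作规则:event 触发时检查 condition,满足则执行 action。"""
    event: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]

def dispatch(event, ctx, rules):
    """对到达的事件逐条匹配规则,返回所有被触发动作的结果。"""
    return [r.action(ctx) for r in rules if r.event == event and r.condition(ctx)]
```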

[NLP-24] Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders

【速读】: 该论文试图解决的问题是:尽管跨语言对齐(cross-lingual alignment)被广泛认为能够提升跨语言迁移性能,但实际中显式对齐技术在增加嵌入相似性的同时,往往无法改善词级别下游任务的表现。其关键解决方案在于揭示对齐目标与下游任务目标在优化方向上高度正交(orthogonal),且对齐带来的收益因语言和任务类型而异;通过分析嵌入距离、梯度相似性和梯度幅值等表征指标,作者发现仅靠提升嵌入相似性不足以预测任务性能提升,并提出应谨慎选择损失函数组合,以实现对齐与任务微调之间的有效协同。

链接: https://arxiv.org/abs/2603.18863
作者: Yana Veitsman,Yihong Liu,Hinrich Schütze
机构: University of Göttingen(哥廷根大学); LMU Munich(慕尼黑大学); Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques, despite increasing embedding similarity, frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why "better" alignment often fails to translate into "better" cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
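摘要所说的“对齐梯度与任务梯度接近正交”,可以直接用两个梯度向量的余弦相似度来检验:相似度接近 0 即近似正交,意味着优化一个目标几乎不推进另一个目标。以下是一个自包含的余弦相似度示意:

```python
import math

def cosine(u, v):
    """两个梯度向量的余弦相似度;接近 0 表示近似正交。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

实际检验时,u、v 分别取对齐损失与任务损失对同一批参数的梯度展平后的向量即可。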

[NLP-25] RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agent ic RL with Large Language Models

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)代理推理能力时面临的奖励稀疏性问题,即终端奖励难以支持细粒度的状态级优化。传统过程奖励建模虽具潜力,但训练专用奖励模型往往计算开销大且扩展困难。解决方案的关键在于提出RewardFlow方法,其核心是利用推理轨迹中状态的内在拓扑结构构建状态图,通过拓扑感知的图传播机制量化各状态对任务成功的影响,从而生成客观、可微的状态级奖励信号;该奖励信号作为密集奖励用于RL优化,显著优于现有RL基线,在四个代理推理基准测试中展现出更强性能、鲁棒性和训练效率。

链接: https://arxiv.org/abs/2603.18859
作者: Xiao Feng,Bo Han,Zhanke Zhou,Jiaqi Fan,Jiangchao Yao,Ka Ho Li,Dahai Yu,Michael Kwok-Po Ng
机构: Hong Kong Baptist University; TCL Corporate Research (HK) Co Ltd; Shanghai Jiao Tong University; Department of Mathematics, Hong Kong Baptist University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at this https URL.
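RewardFlow 的“拓扑感知图传播”细节未在摘要中给出;作为示意,下面用一个简化的反向传播:状态分数由终端奖励与后继状态分数的加权平均迭代得到,使靠近成功终点的状态获得更高奖励(`alpha` 等参数与具体更新式均为本示例假设,并非原文算法):

```python
def propagate_rewards(edges, terminal, alpha=0.5, n_iter=50):
    """在有向状态图上迭代传播终端奖励:
    有后继的状态分数 = (1-alpha)*自身终端奖励 + alpha*后继分数均值;
    无后继的状态保持其终端奖励。"""
    succ = {}
    nodes = set(terminal)
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        nodes.update((u, v))
    score = {s: terminal.get(s, 0.0) for s in nodes}
    for _ in range(n_iter):
        new = {}
        for s in nodes:
            base = terminal.get(s, 0.0)
            if s in succ:
                new[s] = (1 - alpha) * base + alpha * sum(score[t] for t in succ[s]) / len(succ[s])
            else:
                new[s] = base
        score = new
    return score
```

传播收敛后,每个状态的分数即可作为密集的状态级奖励用于 RL 优化:离成功终点越近、通向成功的路径越多,分数越高。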

[NLP-26] Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

【速读】: 该论文旨在解决在噪声较大的俄语社交媒体文本中准确识别和分类人类基本价值观(basic human values)的问题。其核心挑战在于数据质量低、标注主观性强以及多值共现的复杂性。解决方案的关键在于构建一个多阶段分类框架:首先通过过滤垃圾和非个人内容提升数据质量;其次基于Schwartz的价值观理论进行针对性样本筛选;接着利用大语言模型(LLM)生成软标签(soft labels),通过聚合多个LLM判断来缓解标注主观性;最后采用基于Transformer架构的模型(如XLM RoBERTa large)进行多标签概率预测,实现对十种基本价值观的精准识别。该方法不仅提升了模型性能(F1 macro达0.83),还通过将专家标注视为解释性基准而非绝对真值,实现了对价值表达的多层次解读。

链接: https://arxiv.org/abs/2603.18822
作者: Maria Milkova,Maksim Rudnev
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents a multi-stage classification framework for detecting human values in noisy Russian-language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value-relevant and politically relevant posts, LLM-based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM-generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer-based models capable of predicting the probability of each of the ten basic values. The best-performing model, XLM-RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held-out test data. By treating value detection as a multi-perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are released publicly.
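把多个 LLM 标注聚合为软标签的做法,最简可以写成“每个价值维度取标注为正的比例”,比例即反映标注者间的一致程度:

```python
def soft_labels(annotations, values):
    """将多个 LLM 的二元判断聚合为软标签:
    每个价值维度取标注为正的比例(缺失视为 0,为本示例假设)。"""
    n = len(annotations)
    return {v: sum(a.get(v, 0) for a in annotations) / n for v in values}
```

这样得到的软标签可直接作为多标签分类器的训练目标,保留了“部分标注者认为相关”这类不确定性信息。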

[NLP-27] Mi:dm K 2.5 Pro

【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)在企业级应用场景中面临的多步骤推理能力不足、长上下文理解受限以及代理式工作流支持薄弱的问题,尤其针对韩语和领域特定场景下模型规模与性能不足的挑战。其解决方案的关键在于:构建以推理优化为核心的训练框架,包括基于抽象语法树(Abstract Syntax Tree, AST)分析的高质量代码数据筛选、数学题补全合成策略及LLM驱动的质量评估机制,实现数据基础的强化;通过层预测器驱动的深度扩展(Depth Upscaling, DuS)和渐进式预训练策略支持128K tokens上下文窗口;后训练阶段引入多阶段流程(Reasoning SFT、模型融合与异步强化学习),系统性提升复杂问题求解能力;最终通过“融合训练”(Fusion Training)平衡推理能力与对话流畅性、响应风格一致性及工具使用可靠性,从而在韩语基准测试中达到领先水平,并通过负责任AI评估确保部署安全性。

链接: https://arxiv.org/abs/2603.18788
作者: KT Tech innovation Group
机构: KT(韩国电信)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.
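预训练数据管线中“基于 AST 分析筛选代码”的最基本形式,是只保留能通过语法分析、且满足简单结构约束的样本。以下用 Python 标准库 ast 做一个最小示意(原文的 AST 分析应远比语法检查复杂,这里的“至少含一个函数定义”约束纯属示例假设):

```python
import ast

def keep_sample(source, min_functions=1):
    """AST 筛选的最小示意:无法解析的代码直接丢弃,
    可解析的再检查是否至少包含 min_functions 个函数定义。"""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    n_funcs = sum(isinstance(n, ast.FunctionDef) for n in ast.walk(tree))
    return n_funcs >= min_functions
```

在真实管线中,同样的 AST 遍历还可以统计嵌套深度、注释覆盖率等质量信号,作为进一步过滤或打分的依据。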

[NLP-28] Implicit Grading Bias in Large Language Models : How Writing Style Affects Automated Assessment Across Math Programming and Essay Tasks

【速读】: 该论文旨在解决生成式 AI(Generative AI)在教育场景中作为自动评分工具时可能存在的隐性评分偏见问题,特别是当内容正确性保持不变时,写作风格差异是否会导致评分偏差。其关键解决方案在于构建了一个受控数据集,包含数学、编程和作文三类学科的180份学生作答,并对每类任务引入三种表面扰动类型(语法错误、非正式语言、非母语表达),随后使用两个先进的开源大语言模型(LLaMA 3.3 70B 和 Qwen 2.5 72B)进行评分测试,且明确指示其仅依据内容正确性评分、忽略写作风格。结果发现,在作文任务中存在显著的风格敏感型偏见,即便有反偏见提示,模型仍对非正式语言和非母语表达给予明显扣分,而在数学与编程任务中则几乎无显著偏见,揭示了评分偏见具有学科依赖性和持续性特征。

链接: https://arxiv.org/abs/2603.18765
作者: Rudra Jadhav,Janhavi Danve,Sonalika Shaw
机构: Savitribai Phule Pune University (萨维特里·菲勒·普内大学); Dr. D. Y. Patil School of Science and Technology (D. Y. 帕蒂尔科学技术学院)
类目: Computation and Language (cs.CL)
备注: 7 pages, 5 figures, 2 tables, 11 references

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs, LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba), were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale, penalties comparable to the difference between a B+ and a C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
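文中报告的效应量 Cohen's d 按两组评分的均值差除以合并标准差(pooled standard deviation)计算,可以几行代码复现:

```python
import math

def cohens_d(a, b):
    """两组评分的 Cohen's d:均值差 / 合并标准差(使用 n-1 的样本方差)。"""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / sp
```

把原始作答的评分作为 a、扰动版本的评分作为 b,d 为正即存在对该扰动的系统性扣分;按惯例 0.5 左右为中等效应,0.8 以上为大效应。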

[NLP-29] Are complicated loss functions necessary for teaching LLM s to reason ?

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理与数学能力提升过程中,后训练技术(post-training techniques)的复杂性与其必要组件之间的关系问题。现有方法如组相对策略优化(Group Relative Policy Optimization, GRPO)虽有效,但其多组件设计是否均不可或缺尚不明确。解决方案的关键在于通过系统性分析发现:(1)负反馈的引入对于训练有效性至关重要,仅基于基线以上的动作训练会限制学习;(2)PPO风格的裁剪机制(如策略比值裁剪)并非提升数学推理性能所必需。基于此,作者提出简化版本REINFORCE with Group Relative Advantage (RGRA),保留组相对优势估计,移除PPO式裁剪和策略比值项,在多个标准数学基准测试中展现出优于GRPO的潜力,表明基于REINFORCE的更简洁方法可高效增强LLMs的推理能力。

链接: https://arxiv.org/abs/2603.18756
作者: Gabriele Carrino,Andrea Sassella,Nicolo Brunello,Federico Toschi,Mark James Carman
机构: DEIB, Politecnico di Milano (米兰理工大学电气与信息工程系)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential: training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group relative advantage estimation but removes PPO-style clipping and policy ratio terms. Experiments across standard mathematical benchmarks indicate that RGRA has the potential to achieve stronger performance than GRPO. Our results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.
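按摘要描述,RGRA 保留组内相对优势估计、去掉 PPO 式裁剪与比值项,其损失大致为 -Σ advantage · log π。以下为示意性实现(组内按均值/标准差标准化的具体形式为本示例假设):

```python
def rgra_loss(logprobs, rewards, eps=1e-8):
    """REINFORCE with Group Relative Advantage 的示意:
    同一提示下一组 rollout,优势 = 组内标准化奖励,
    损失 = -mean(advantage * log π);无 PPO 裁剪、无策略比值项。"""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    advs = [(r - mean) / (std + eps) for r in rewards]
    return -sum(a * lp for a, lp in zip(advs, logprobs)) / n
```

注意低于组均值的 rollout 得到负优势,其对数似然会被压低,这正对应摘要的发现 (1):负反馈不可或缺,只在基线以上训练会限制学习。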

[NLP-30] Automatic detection of Gen-AI texts: A comparative framework of neural models

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)普及背景下,人类撰写文本与人工智能生成文本之间难以区分的问题,这一挑战在学术、编辑和社会领域均引发广泛关注。解决方案的关键在于设计并对比评估多种基于机器学习的检测模型,包括多层感知机(Multilayer Perceptron)、一维卷积神经网络(1D CNN)、基于MobileNet的CNN以及Transformer模型,并将其性能与多个商用在线检测工具(如ZeroGPT、GPTZero等)进行系统性比较。实验表明,监督式检测器在不同语言(英语和意大利语)和领域(艺术与心理健康主题)中均展现出比商业工具更稳定和鲁棒的性能,凸显了当前检测策略的优势与局限。

链接: https://arxiv.org/abs/2603.18750
作者: Cristian Buttaro,Irene Amerini
机构: Sapienza University in Rome (罗马大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, this http URL, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.

[NLP-31] Memento-Skills: Let Agents Design Agents

【速读】: 该论文旨在解决传统大语言模型(Large Language Model, LLM)代理在面对新任务时缺乏持续学习能力与自主构建任务特定代理能力的问题。现有方法通常依赖人工设计的代理结构,难以实现跨任务的知识迁移和动态适应。其解决方案的关键在于提出 Memento-Skills 系统,该系统采用基于记忆的强化学习框架,通过状态感知提示(stateful prompts)和可重用技能(reusable skills)构成持久且演化的外部记忆库;利用“读-写-反思”闭环机制(Read–Write Reflective Learning),使代理能够根据当前状态选择最优技能并基于新经验更新技能库,从而在不调整 LLM 参数的前提下实现持续学习与自我改进,最终实现通用代理自主设计专用代理的能力。

链接: https://arxiv.org/abs/2603.18743
作者: Huichi Zhou,Siyuan Guo,Anjie Liu,Zhongwei Yu,Ziqin Gong,Bowen Zhao,Zhixun Chen,Menglong Zhang,Yihang Chen,Jinsong Li,Runyu Yang,Qiangbin Liu,Xinlei Yu,Jianmin Zhou,Na Wang,Chunyang Sun,Jun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Memento-Skills Technical Report

点击查看摘要

Abstract:We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2. In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at this https URL.
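读-写闭环中的技能路由与技能写回,可以用一个极简草图表达:读阶段按提示与技能描述的词重叠选技能,写阶段把新经验以 markdown 文本写回技能库(词重叠打分纯属示意,原系统的路由器是行为可训练的):

```python
def route_skill(prompt, skills):
    """读阶段(示意):按提示词与技能描述的词重叠数,选最相关技能。"""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    return max(skills, key=lambda name: overlap(prompt, skills[name]))

def write_skill(skills, name, markdown):
    """写阶段(示意):把新经验作为 markdown 文本写回技能库。"""
    skills[name] = markdown
    return skills
```

所有适应都发生在这个外部技能库里,LLM 参数保持不变,这正是摘要所述“无需更新参数的持续学习”的含义。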

[NLP-32] CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

【速读】: 该论文旨在解决当前基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)中奖励建模依赖昂贵且受限的人类标注数据的问题,提出一种利用观测型用户反馈(如点击、复制、点赞等)进行可扩展、低成本奖励建模的新范式——观测奖励建模(Observational Reward Modeling)。其核心挑战在于:(1)观测反馈存在噪声,因标注误差导致偏离真实用户偏好;(2)反馈具有选择偏差,即用户更倾向于对强烈感受的响应提供反馈,造成训练与推理阶段的数据分布偏移。解决方案的关键在于提出CausalRM框架,通过两个机制实现无偏奖励建模:一是引入噪声感知的代理损失项,显式建模标注错误生成过程,在无噪声条件下等价于原始损失;二是采用倾向得分(propensity scores)重加权训练样本,消除用户偏好偏差,从而在噪声和偏倚的观测数据上学习出准确的奖励信号。

链接: https://arxiv.org/abs/2603.18736
作者: Hao Wang,Licheng Pan,Zhichao Chen,Chunyuan Zheng,Zhixuan Chu,Xiaoxi Li,Yuan Lu,Xinggao Liu,Haoxuan Li,Zhouchen Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling, i.e., learning reward models from observational user feedback (clicks, copies, and upvotes), as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, causing it to deviate from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores (the probability of a user providing feedback for a given response) to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks, including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
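针对挑战 (2) 的倾向得分重加权,核心是对每个观测到反馈的样本按 1/倾向得分 加权:反馈概率低的样本被上调权重,以补偿其在训练数据中的欠代表。以下为一个自归一化 IPW 损失的示意(裁剪下限 clip 为数值稳定性假设,并非原文设定):

```python
def ipw_loss(losses, propensities, clip=0.05):
    """自归一化的逆倾向加权损失:
    每个样本权重 = 1 / max(倾向得分, clip),再做加权平均。"""
    weights = [1.0 / max(p, clip) for p in propensities]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

其中倾向得分即“用户对该回复给出反馈的概率”;对 clip 下限的裁剪是 IPW 实践中防止极小倾向得分导致方差爆炸的常见手法。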

[NLP-33] STEP: Scientific Time-Series Encoder Pretraining via Cross-Domain Distillation

【速读】: 该论文旨在解决科学时间序列(scientific time series)在表示学习中面临的挑战,包括数据稀疏性、高度异构性和规模有限等问题,同时探索如何有效利用来自音频、通用时间序列和脑电信号等领域的基础模型(foundation models)知识来构建统一的编码器。解决方案的关键在于提出STEP框架——一种通过跨域蒸馏(cross-domain distillation)进行预训练的科学时间序列编码器架构;其核心创新包括:自适应分块(adaptive patching)以处理极端长度序列,统计补偿机制(statistics compensation scheme)以适应不同数值尺度,并通过跨域蒸馏整合多个基础模型的知识,从而学习适用于科学信号的通用且可迁移的特征表示。

链接: https://arxiv.org/abs/2603.18688
作者: Chen Zhang,Liwei Liu,Jun Tao,Xiaoyu Yang,Xuenan Xu,Kai Chen,Bowen Zhou,Wen Wu,Chao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scientific time series are central to scientific AI but are typically sparse, highly heterogeneous, and limited in scale, making unified representation learning particularly challenging. Meanwhile, foundation models pretrained on relevant time series domains such as audio, general time series, and brain signals contain rich knowledge, but their applicability to scientific signals remains underexplored. In this paper, we investigate the transferability and complementarity of foundation models from relevant time series domains, and study how to effectively leverage them to build a unified encoder for scientific time series. We first systematically evaluate relevant foundation models, showing the effectiveness of knowledge transfer to scientific tasks and their complementary strengths. Based on this observation, we propose STEP, a Scientific Time Series Encoder Pretraining framework via cross domain distillation. STEP introduces adaptive patching to handle extreme-length sequences and a statistics compensation scheme to accommodate diverse numerical scales. It further leverages cross-domain distillation to integrate knowledge from multiple foundation models into a unified encoder. By combining complementary representations across different domains, STEP learns general-purpose and transferable features tailored for scientific signals. Experiments on seven scientific time series tasks demonstrate that STEP provides both an effective structure and an effective pretraining paradigm, taking a STEP toward scientific time series representation learning.
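跨域蒸馏的最小形式,是让学生编码器的表示同时逼近多个教师模型(经投影对齐维度后)的表示,例如对各教师取 MSE 再平均(聚合方式为本示例假设,原文的蒸馏目标未必如此):

```python
def distill_loss(student, teachers):
    """多教师蒸馏损失的示意:学生表示对每个教师表示的 MSE,取教师间均值。
    假设各教师表示已投影到与学生相同的维度。"""
    def mse(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)
    return sum(mse(student, t) for t in teachers) / len(teachers)
```

通过同时拟合音频、通用时间序列、脑电等多个域的教师,学生编码器被迫学到这些域的互补特征,这是摘要所述互补性利用的直观形式。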

[NLP-34] HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agent ic Reinforcement Learning ACL2026

【速读】: 该论文旨在解决大语言模型在复杂长程代理决策任务中表现受限的问题,尤其针对现有基于多轮强化学习的方法在稀疏结果奖励下的延迟传播和细粒度过程奖励带来的不可靠信用分配问题。解决方案的关键在于提出一种利用事后信息(Hindsight Information)调节分段过程奖励(Segmental process Rewards)的机制(HISR),通过引入分段级过程奖励模型来对每个子目标进行奖励分配,避免过度细化到单个交互回合;同时设计一个事后模型,基于轨迹最终结果评估动作的重要性,并通过比较事后模型与策略模型的序列似然比来量化动作重要性,进而聚合得到分段重要性得分,用于调制分段奖励,从而显著提升信用分配的可靠性。

链接: https://arxiv.org/abs/2603.18683
作者: Zhicong Lu,Zichuan Lin,Wei Jia,Changyuan Tian,Deheng Ye,Peiguang Li,Li Jin,Nayu Liu,Guangluan Xu,Wei Feng
机构: Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tencent Hunyuan; School of Computer Science and Technology, Tianjin University
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to ACL 2026 on Jan 5, 2026

点击查看摘要

Abstract:While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited. Most existing methods concentrate on designing effective reward models (RMs) to advance performance via multi-turn reinforcement learning. However, they suffer from delayed propagation in sparse outcome rewards and unreliable credit assignment with potentially overly fine-grained and unfocused turn-level process rewards. In this paper, we propose HISR, which exploits Hindsight Information to modulate Segmental process Rewards, closely aligning rewards with sub-goals and underscoring significant segments to enhance the reliability of credit assignment. Specifically, a segment-level process RM is presented to assign rewards for each sub-goal in the task, avoiding excessively granular allocation to turns. To emphasize significant segments in the trajectory, a hindsight model is devised to reflect the preference of performing a certain action after knowing the trajectory outcome. With this characteristic, we design the ratios of sequence likelihoods between the hindsight and policy models to measure action importance. The ratios are subsequently employed to aggregate segment importance scores, which in turn modulate segmental process rewards, enhancing credit assignment reliability. Extensive experimental results on three public benchmarks demonstrate the validity of our method.
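摘要中“事后模型与策略模型的序列似然比”可写成 exp(log p_hindsight − log p_policy):比值大于 1 表示知道结果后该动作更受偏好,即更重要。段内动作比值再聚合为分段重要性,用于调制分段奖励。以下为示意(段内取均值聚合,为本示例假设):

```python
import math

def segment_importance(policy_logps, hindsight_logps):
    """段内每个动作的重要性 = exp(事后对数似然 - 策略对数似然),
    分段重要性取段内均值(聚合方式为示例假设)。"""
    ratios = [math.exp(h - p) for p, h in zip(policy_logps, hindsight_logps)]
    return sum(ratios) / len(ratios)

def modulated_reward(segment_reward, importance):
    """用分段重要性调制分段过程奖励。"""
    return segment_reward * importance
```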

[NLP-35] Words at Play: Benchmarking Audio Pun Understanding in Large Audio-Language Models

【速读】: 该论文旨在解决音频语义理解中对双关语(pun)识别与解释能力不足的问题,尤其是当前针对语音双关语(spoken puns)的数据集和系统性资源严重匮乏,导致生成式音频语言模型(Large Audio Language Models, LALMs)在幽默感知方面的研究进展受限。解决方案的关键在于构建了首个专门用于评估LALMs音频双关语理解能力的基准测试集APUN-Bench,其包含4,434个标注音频样本,覆盖双关识别、双关词定位和双关语义推理三个层次,并通过系统性评估10个前沿LALMs揭示了模型在定位偏倚和语义推理错误等方面的性能瓶颈,为提升幽默感知型音频智能提供了可量化、可复现的研究基础与改进方向。

链接: https://arxiv.org/abs/2603.18678
作者: Yuchen Su,Shaoxin Zhong,Yonghua Zhu,Ruofan Wang,Zijian Huang,Qiqi Wang,Na Zhao,Diana Benavides-Prado,Michael Witbrock
机构: University of Auckland (奥克兰大学); Singapore University of Technology and Design (新加坡科技设计大学); Queen Mary University of London (伦敦玛丽女王大学); Nankai University (南开大学)
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: The paper is currently under review

点击查看摘要

Abstract:Puns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Within pun research, audio plays a central role in human communication alongside text and images, yet datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location, and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence.

[NLP-36] A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

【速读】: 该论文旨在解决持续学习(Continual Learning, CL)中灾难性遗忘(Catastrophic Forgetting)问题,特别是在意图分类(Intent Classification)任务中的应用。研究通过构建一个10任务标签不相交的场景,在三种不同骨干架构(前馈神经网络ANN、门控循环单元GRU、Transformer编码器)下对比多种CL策略的效果,包括基于回放的MIR、基于正则化的LwF以及基于参数隔离的HAT方法及其组合。解决方案的关键在于:回放机制(尤其是MIR)是提升性能和缓解遗忘的核心要素,且最优配置具有架构依赖性——例如ANN与Transformer上MIR+HAT表现最佳,而GRU则在MIR+LwF+HAT组合下效果最优;此外,部分CL方法甚至优于联合训练,表明其具备一定的正则化作用。

链接: https://arxiv.org/abs/2603.18641
作者: Aram Abrahamyan,Sachin Kumar
机构: AUA (亚美尼亚美国大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent: MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU, and in several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
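摘要中的后向迁移(backward transfer, BWT)指标可按其标准定义直接计算。以下为示意实现(准确率矩阵的组织方式为持续学习文献中的常见约定,并非该论文代码):

```python
def backward_transfer(acc):
    # acc[i][j]:完成第 i 个任务训练后,在第 j 个任务上的准确率(0 起始)
    # 标准定义:BWT = mean_{j < T-1} ( acc[T-1][j] - acc[j][j] )
    # 负值表示遗忘,接近零或为正表示旧任务性能得以保持甚至提升
    T = len(acc)
    diffs = [acc[T - 1][j] - acc[j][j] for j in range(T - 1)]
    return sum(diffs) / len(diffs)
```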

[NLP-37] MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment

【速读】: 该论文旨在解决在固定监督微调预算下,如何同时平衡多轮安全对齐、良性边界查询上的低过度拒绝率(over-refusal)以及受可验证约束下的指令遵循能力的问题。解决方案的关键在于提出MOSAIC(Multi-Objective Slice-Aware Iterative Curation for Alignment),这是一个基于统一L1-L3评估接口的闭环数据混合搜索框架,能够将切片级别的失败特征转化为可执行的数据操作,包括数据集层面的混合比例、桶(bucket)级别的权重调整和聚焦标准,从而实现高效且结构化的数据构建策略。

链接: https://arxiv.org/abs/2603.18637
作者: Yipu Dou,Wang Yang
机构: Southeast University (东南大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 9 pages, 5 figures. Code available at this https URL

点击查看摘要

Abstract:We study how to allocate a fixed supervised fine-tuning budget when three objectives must be balanced at once: multi-turn safety alignment, low over-refusal on benign boundary queries, and instruction following under verifiable constraints. We propose MOSAIC (Multi-Objective Slice-Aware Iterative Curation for Alignment), a multi-objective framework for closed-loop data mixture search built on a unified L1-L3 evaluation interface. MOSAIC turns slice-level failure profiles into executable data actions, including dataset-level mixture ratios, bucket-level weights, and focus criteria. Under a fixed 1M-token budget and five rounds of independent fine-tuning from the same base model, MOSAIC improves internal XGuard from 2.76 to 4.67 while keeping OrBench at 4.41 and IFEval at 3.65. The final Pareto solution also generalizes better than a random static LoRA baseline on independent attack, over-refusal, and capability tests, suggesting that structured failure diagnosis can serve as a practical control signal for budgeted data construction. Code is available at this https URL.
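摘要描述的闭环中,切片级失败特征被转化为数据集混合比例等可执行操作。下面是一个假设性的更新规则示意(按失败率上调对应数据集的权重后再归一化,仅用于说明闭环思路,并非 MOSAIC 的实际算法):

```python
def update_mixture(base_ratios, failure_rates, lr=0.5):
    # base_ratios:当前各数据集的混合比例(和为 1)
    # failure_rates:各数据集对应评估切片的失败率(0~1)
    # 失败率高的切片对应的数据集被上调权重,然后整体归一化
    raw = [r * (1.0 + lr * f) for r, f in zip(base_ratios, failure_rates)]
    s = sum(raw)
    return [x / s for x in raw]
```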

[NLP-38] Learning to Self-Evolve

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在测试时如何通过自我迭代优化其上下文以提升对新任务性能的问题。现有方法依赖模型固有的推理能力,缺乏对自进化过程的显式训练,导致优化效果受限。解决方案的关键在于提出一种名为“学习自进化”(Learning to Self-Evolve, LSE)的强化学习框架,将多步自进化问题简化为单步强化学习目标:每次上下文编辑均根据下游任务性能的提升获得奖励,并结合树状引导的演化循环(tree-guided evolution loop),使模型能够系统性地学习如何高效改进自身上下文。实验表明,基于LSE训练的4B参数模型在Text-to-SQL和通用问答任务上超越了GPT-5、Claude Sonnet 4.5等先进模型及传统提示优化方法,且具备良好的迁移能力。

链接: https://arxiv.org/abs/2603.18620
作者: Xiaoyin Chen,Canwen Xu,Yite Wang,Boyi Liu,Zhewei Yao,Yuxiong He
机构: Mila – Quebec AI Institute (蒙特利尔魁北克人工智能研究所); University of Montreal (蒙特利尔大学); Snowflake
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

[NLP-39] DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units INTERSPEECH2026

【速读】: 该论文旨在解决无监督音位发现(unsupervised phoneme discovery)在多语言场景下的评估难题,即如何从离散语音单元中自动识别并映射到预定义音位库的问题。其解决方案的关键在于构建了一个名为DiscoPhon的多语言基准测试集,涵盖6种开发语言和6种测试语言,覆盖广泛的音位对比,并要求模型仅用10小时未见过的语言语音数据,生成可映射至标准音位系统的离散单元。通过多对一或一对一映射机制,系统输出的序列在单元质量、识别准确率和分割精度上进行评估,从而验证当前预训练模型(如HuBERT和SpidR)在跨语言场景下提取音位信息的能力。

链接: https://arxiv.org/abs/2603.18612
作者: Maxime Poli,Manel Khentout,Angelo Ortiz Tandazo,Ewan Dunbar,Emmanuel Chemla,Emmanuel Dupoux
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages, 2 figures. Submitted to Interspeech 2026

点击查看摘要

Abstract:We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.
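基准中的 many-to-one 映射可以用共现统计的贪心方式近似理解:在帧级对齐的序列上,把每个离散单元映射到与其共现最多的音位。以下为假设性草图(对齐与计票方式是此类评测的常见基线做法,未必与基准官方脚本一致):

```python
from collections import Counter, defaultdict

def many_to_one_mapping(unit_seqs, phone_seqs):
    # unit_seqs / phone_seqs:逐帧对齐的离散单元序列与音位序列
    cooc = defaultdict(Counter)
    for units, phones in zip(unit_seqs, phone_seqs):
        for u, p in zip(units, phones):
            cooc[u][p] += 1
    # 每个单元映射到共现次数最多的音位(many-to-one)
    return {u: c.most_common(1)[0][0] for u, c in cooc.items()}
```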

[NLP-40] Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media WWW2026

【速读】: 该论文旨在解决危机情境下多模态信息(文本与图像)分类过程中模型决策过程不透明的问题,尤其针对现有方法在解释性方面对图像信息关注不足的局限。其解决方案的关键在于提出一种“可解释设计”(interpretable-by-design)的多模态分类框架:首先利用视觉语言Transformer模型学习文本与图像的联合表示,并提取文本理由(rationales);随后通过跨模态理由迁移机制,将文本理由映射至图像空间以生成图像理由;这一策略显著减少了人工标注成本,同时提升了分类性能与可解释性。实验表明,该方法在CrisisMMD数据集上使Macro-F1提升2–35%,并能有效识别高质量的图像理由区域(人类评估支持提升12%),且具备良好的零样本迁移能力(准确率达80%)。

链接: https://arxiv.org/abs/2603.18611
作者: Thi Huyen Nguyen,Koustav Rudra,Wolfgang Nejdl
机构: L3S Research Center (L3S研究中心); Leibniz University Hannover (汉诺威莱布尼茨大学); Indian Institute of Technology Kharagpur (印度理工学院克哈格普尔分校)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WWW 2026

点击查看摘要

Abstract:Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

[NLP-41] myMNIST: Benchmark of PETNN KAN and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

【速读】: 该论文旨在解决缅甸手写数字识别(myMNIST)任务中缺乏系统性基准评估的问题,以推动缅甸语自然语言处理(NLP)与人工智能(AI)研究的发展。其解决方案的关键在于构建并公开首个针对myMNIST数据集的全面基准测试,涵盖从传统深度学习模型(如卷积神经网络CNN、长短期记忆网络LSTM、门控循环单元GRU、Transformer)到新兴架构(如FastKAN、EfficientKAN)、能量基模型(JEM)以及受物理启发的PETNN变体(Sigmoid、GELU、SiLU)在内的十一类模型,并采用精确率(Precision)、召回率(Recall)、F1分数和准确率(Accuracy)作为评价指标。结果表明,CNN在所有模型中表现最优(F1 = 0.9959, Accuracy = 0.9970),而PETNN(GELU)紧随其后,显示出优于LSTM、GRU、Transformer及KAN类模型的性能,同时揭示了能量基模型(JEM)与PETNN之间的性能差距,从而为缅甸手写数字识别提供了可复现的基线,并促进了对新兴架构在区域性文字识别任务中的有效性评估。

链接: https://arxiv.org/abs/2603.18597
作者: Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi
机构: NECTEC(泰国国家电子与计算机技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 2 figures, 3 tables, Accepted to ICNLP 2026, Xi’an, China

点击查看摘要

Abstract:We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN’s strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.

[NLP-42] Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

【速读】: 该论文旨在解决如何有效表征和比较语言模型(Language Models, LM)之间条件分布差异的问题,从而揭示模型行为的全局结构及其与训练数据、任务性能等属性的关系。其解决方案的关键在于:将语言模型表示为基于提示-响应对(prompt-response pairs)的对数似然向量(log-likelihood vectors),并在该嵌入空间中构建模型映射(model maps),使得模型间的距离近似于它们条件分布之间的KL散度(KL divergence)。这一方法不仅捕捉了模型间的系统性变化(如提示修改带来的影响及其近似可加性),还通过引入点互信息(Pointwise Mutual Information, PMI)向量进一步削弱无条件分布的影响,使模型映射更准确地反映训练数据相关的差异,从而支持对输入依赖型模型行为的分析。

链接: https://arxiv.org/abs/2603.18593
作者: Yusuke Takase,Momose Oyama,Hidetoshi Shimodaira
机构: Kyoto University (京都大学); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.
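摘要把每个模型表示为共享 prompt-response 对上的对数似然向量,向量间距离近似条件分布间的 KL 散度。下面用均方根差给出一个示意性的距离与 PMI 向量计算(具体度量与缩放方式可能与论文不同):

```python
import math

def model_map_distances(loglik):
    # loglik[m]:模型 m 在同一组 prompt-response 对上的 log p_m(response|prompt)
    # 以对数似然向量的均方根差作为模型间距离的示意性代理
    n = len(loglik)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            msd = sum((a - b) ** 2 for a, b in zip(loglik[i], loglik[j])) / len(loglik[i])
            d[i][j] = d[j][i] = math.sqrt(msd)
    return d

def pmi_vector(cond_loglik, uncond_loglik):
    # PMI 向量:log p(y|x) - log p(y),削弱无条件分布的影响
    return [c - u for c, u in zip(cond_loglik, uncond_loglik)]
```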

[NLP-43] ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLM s

【速读】: 该论文旨在解决现有解释评估基准无法可靠判断解释是否真实反映模型推理的问题,因为这些基准通常仅使用单一干预操作且缺乏统计检验,难以区分真实忠实性与随机性能。其解决方案的关键在于提出ICE(Intervention-Consistent Explanation)框架,通过在多种干预算子下对解释与匹配的随机基线进行随机化检验,从而获得具有置信区间的胜率(win rates)。这一方法能够揭示不同干预操作下的忠实性差异,并识别出反忠实性现象,同时表明忠实性与人类可解释性之间几乎无相关性,强调了应基于干预算子比较而非单一分数来理解模型解释的可信度。

链接: https://arxiv.org/abs/2603.18579
作者: Abhinaba Basu,Pavan Chakraborty
机构: Indian Institute of Information Technology, Allahabad (IIITA); National Institute of Electronics and Information Technology (NIELIT)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluating whether explanations faithfully reflect a model’s reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.
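摘要中的随机化检验(将真实解释与匹配的随机基线比较并给出统计显著性)可以用配对置换检验示意:逐样本随机交换"真实解释/随机基线"的得分标签,看观测到的均值差距是否超出偶然水平。以下为假设性草图,并非 ICE 的官方实现:

```python
import random

def randomization_test(expl_scores, baseline_scores, n_perm=2000, seed=0):
    # expl_scores[i] 与 baseline_scores[i] 为同一样本上真实解释与
    # 匹配随机基线的忠实性得分(配对数据)
    rng = random.Random(seed)
    n = len(expl_scores)
    observed = sum(expl_scores) / n - sum(baseline_scores) / n
    count = 0
    for _ in range(n_perm):
        diff = 0.0
        for e, b in zip(expl_scores, baseline_scores):
            if rng.random() < 0.5:
                e, b = b, e  # 随机交换配对标签
            diff += (e - b) / n
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # 加一校正的单侧 p 值
```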

[NLP-44] SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在推理过程中因自回归解码的串行特性导致的高延迟问题。现有方案如推测解码(Speculative Decoding)虽能通过轻量级草稿模型(draft model)并行提出多个候选token进行批量验证来缓解此瓶颈,但其实际应用受限于高质量草稿模型的稀缺以及缺乏可扩展的训练基础设施。本文提出SpecForge——一个面向生产的开源训练框架,其核心创新在于引入目标-草稿解耦(target-draft decoupling)、混合并行策略(hybrid parallelism)、优化的训练内核及与生产级推理引擎的集成,显著提升了EAGLE-3训练效率(最高达9.9倍加速),并配套发布SpecBundle,一套基于SpecForge训练的高质量EAGLE-3草稿模型,可在SGLang上实现最高4.48倍端到端推理加速,从而为推测解码的实际部署提供了可靠、高效的解决方案。

链接: https://arxiv.org/abs/2603.18567
作者: Shenggui Li,Chao Wang,Yikai Zhu,Yubo Wang,Fan Yin,Shuai Shi,Yefei Chen,Xiaomin Dong,Qiaoling Chen,Jin Pan,Ji Li,Laixin Xie,Yineng Zhang,Lei Yu,Yonggang Wen,Ivor Tsang,Tianwei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source, production-oriented framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B. In addition, we release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs. Through a systematic study of speculative decoding training recipes, SpecBundle addresses the scarcity of high-quality drafts in the community, and our draft models achieve up to 4.48x end-to-end inference speedup on SGLang, establishing SpecForge as a practical foundation for real-world speculative decoding deployment.
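推测解码的核心流程是:轻量草稿模型提议多个 token,目标模型批量验证。下面用贪心验证给出一个最简示意(真实系统通常在完整分布上做拒绝采样,此处仅为说明验证环节的接受逻辑,非 SpecForge 的实现):

```python
def verify_draft(draft_tokens, target_next):
    # target_next(prefix) 返回目标模型在该前缀下的贪心 token(占位函数)
    # 接受草稿中与目标模型一致的最长前缀;在首个不一致处改用目标模型的 token
    accepted = []
    for t in draft_tokens:
        expected = target_next(accepted)
        if t == expected:
            accepted.append(t)
        else:
            accepted.append(expected)
            break
    return accepted
```

一次验证最多接受整段草稿、最少也能产出一个目标模型的 token,这正是加速的来源。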

[NLP-45] Cross-Lingual LLM -Judge Transfer via Evaluation Decomposition

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在非英语场景下自动化评估困难的问题,尤其针对多语言环境下缺乏高质量人工标注数据的瓶颈。其解决方案的关键在于提出一种基于分解的评估框架,核心是构建一个通用评价维度集合(Universal Criteria Set, UCS),该集合由一组语言无关的评价维度组成,能够生成可解释的中间表示,从而支持跨语言迁移,且仅需极少的目标语言标注数据即可实现性能提升。实验表明,该方法在多种语言和模型架构下的忠实性任务中均显著优于现有强基线模型。

链接: https://arxiv.org/abs/2603.18557
作者: Ivaxi Sheth,Zeno Jonke,Amin Mantrach,Saab Mansour
机构: CISPA Helmholtz Center for Information Security; Amazon
类目: Computation and Language (cs.CL)
备注: 19 pages

点击查看摘要

Abstract:As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.

[NLP-46] Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在推理过程中存在的两个核心问题:一是“过度思考”现象,即对简单任务生成冗长且重复的输出;二是“过度自信”现象,即对超出模型能力的问题给出过于简短但错误的答案,从而导致性能下降。解决方案的关键在于提出一种难度差异化策略优化方法(Difficulty-Differentiated Policy Optimization, DDPO),该方法基于任务难度分离优化策略:对于简单任务,通过缩短输出长度而不牺牲准确性来提升效率;对于复杂任务,则扩大探索空间以增强性能。此外,论文推导出最大化预期准确率的理论条件,指出长度分布应尽可能接近最优长度且高度集中,并据此引入难度级平均长度作为长度优化的合理参考基准。实验表明,DDPO相比GRPO在多个基准上实现了平均答案长度减少12%的同时准确率提升1.85%,显著改善了准确率与长度之间的权衡关系。

链接: https://arxiv.org/abs/2603.18533
作者: Yinan Xia,Haotian Zhang,Huiming Wang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 13 pages

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model’s capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at this https URL. 
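摘要提出以难度级平均长度作为长度优化的参考基准。下面是一个假设性的奖励塑形草图(以同组正确回答的平均长度为参考,比参考更短的正确回答获得少量加成;alpha 为示意超参,整体并非论文的实际目标函数):

```python
def length_shaped_rewards(lengths, correct, alpha=0.01):
    # lengths:同一提示(同一难度组)下各 rollout 的回答长度
    # correct:各 rollout 是否答对
    ref_lens = [L for L, c in zip(lengths, correct) if c]
    if not ref_lens:
        return [0.0] * len(lengths)  # 全错时不做长度塑形
    ref = sum(ref_lens) / len(ref_lens)  # 难度级平均长度作为参考
    return [1.0 + alpha * (ref - L) if c else 0.0
            for L, c in zip(lengths, correct)]
```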

[NLP-47] When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险决策场景中对虚假特征(spurious features)的依赖问题,这些问题可能导致偏见和不可靠的输出。其关键解决方案是提出ICE-Guard框架,通过干预一致性测试(intervention consistency testing)识别三类虚假特征依赖:人口统计学特征(如姓名/种族替换)、权威性特征(如资质/声望替换)以及表述框架(如正负面重述)。该框架揭示了权威偏倚(平均5.8%)和框架偏倚(5.0%)显著高于人口统计偏倚(2.2%),并验证了结构化分解方法可将翻转率降低最多100%(中位数49%),最终通过“检测-诊断-缓解-验证”闭环实现累计78%的偏倚减少,为LLM在关键应用中的可靠性评估与改进提供了系统性路径。

链接: https://arxiv.org/abs/2603.18530
作者: Abhinaba Basu,Pavan Chakraborty
机构: Indian Institute of Information Technology, Allahabad (IIITA); National Institute of Electronics and Information Technology (NIELIT)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field’s narrow focus on demographics; (2) bias concentrates in specific domains – finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
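干预一致性检验中的翻转率可以直接按定义计算:交换虚假特征(如姓名、资历)后,模型判定发生改变的样本比例。以下为示意代码,其中 decide 与 intervene 均为占位函数:

```python
def flip_rate(decide, vignettes, intervene):
    # decide(v):把情景文本映射为判定结果
    # intervene(v):改写情景中的虚假特征(人口统计、权威性或表述框架)
    flips = sum(decide(v) != decide(intervene(v)) for v in vignettes)
    return flips / len(vignettes)
```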

[NLP-48] EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

【速读】: 该论文旨在解决基于扩散的大语言模型(Diffusion-based Large Language Models, dLLMs)在推理过程中因双向注意力机制导致的KV缓存无法无损保存、每步去噪均需全前向传播所带来的高计算开销问题。现有近似KV缓存方法虽通过选择性更新缓存状态降低代价,但其决策开销随上下文长度或模型深度增长,难以高效扩展。解决方案的关键在于提出一种无需训练的KV缓存策略EntropyCache,其核心创新是利用新解码token分布的最大熵作为恒定成本信号来判断是否跳过缓存或重新计算;该设计基于两个经验观察:(1) 解码token熵与KV缓存漂移强相关,可作为缓存过时性的低成本代理指标;(2) 解码token特征波动在去掩码后持续多步存在,因此仅需对最近k个token进行重计算。该机制每步仅需O(V)复杂度,与上下文长度和模型规模无关,显著提升了推理效率,实验表明在标准与链式思维基准上分别获得15.2×–26.4×和22.4×–24.1×加速,且决策开销仅占总推理时间的0.5%。

链接: https://arxiv.org/abs/2603.18489
作者: Minsoo Cheong,Donghyun Son,Woosang Lim,Sungjoo Yoo
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the k most recently decoded tokens. The skip-or-recompute decision requires only O(V) computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves 15.2\times - 26.4\times speedup on standard benchmarks and 22.4\times - 24.1\times on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only 0.5% of inference time. Code is available at this https URL.
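摘要以本步新解码 token 分布的最大熵作为恒定成本(每个分布 O(V))的重算信号。以下为该判定的示意实现(threshold 为假设的调参项,论文未在摘要中给出具体数值):

```python
import math

def max_token_entropy(dists):
    # dists:本步新解码 token 的概率分布列表,每个分布长度为词表大小 V
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0.0)
    return max(entropy(p) for p in dists)

def should_recompute(dists, threshold):
    # 熵高说明解码不确定、KV 缓存可能已失真,需要重算;否则跳过以省算力
    return max_token_entropy(dists) > threshold
```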

[NLP-49] The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices

【速读】: 该论文试图解决的问题是:当前基于似然(likelihood)的文本生成解码策略(如top-k、核采样和对比搜索)在选择token时仅局限于高概率区域,导致生成文本中存在“截断盲区”(truncation blind spot)——即语境上合适但统计罕见的token无法被模型选取,从而使得机器生成文本更容易被检测。解决方案的关键在于识别并利用这一盲区:通过分析超过180万条文本数据发现,8–18%的人类选择token位于常规截断边界之外;进而训练简单分类器基于可预测性和词汇多样性即可实现高精度检测。研究进一步表明,检测率主要由截断参数决定,而非模型规模或架构,且低检测率配置常伴随文本不连贯性,说明规避检测与生成自然语言是两个不同目标。

链接: https://arxiv.org/abs/2603.18482
作者: Esteban Garces Arias,Nurzhan Sapargali,Christian Heumann,Matthias Aßenmacher
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Under review

点击查看摘要

Abstract:Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.
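判断人类选择的 token 是否落在核采样(top-p)截断边界之外,可按定义实现如下:核集合是按概率降序累加、质量首次达到 top_p 的最小前缀。以下为示意代码(top_p 取值仅为举例):

```python
def outside_nucleus(probs, token_id, top_p=0.95):
    # probs:模型在当前位置的 token 概率;token_id:人类实际选择的 token
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    mass = 0.0
    nucleus = set()
    for i in order:
        nucleus.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # 落在核之外的人类选择即摘要所说的"截断盲区"
    return token_id not in nucleus
```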

[NLP-50] WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂应用场景中缺乏精确行为控制的问题,现有方法普遍存在训练成本高、自然语言可控性差或语义连贯性受损等缺陷。其解决方案的关键在于提出WASD(unWeaving Actionable Sufficient Directives)框架,通过识别生成特定token所需的充分神经条件来解释模型行为;具体而言,将候选条件表示为神经元激活谓词,并迭代搜索在输入扰动下仍能保证当前输出的最小条件集合,从而实现稳定、准确且简洁的行为解释与可控性提升。

链接: https://arxiv.org/abs/2603.18474
作者: Haonan Yu,Junhao Liu,Zhenyu Yan,Haoran Lin,Xin Zhang
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.

[NLP-51] GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms

【速读】: 该论文旨在解决现有基准测试在评估大语言模型(Large Language Models, LLMs)于真实业务场景中如何权衡规范(norms)与商业目标(business goals)方面的局限性,尤其是缺乏对影响决策因素的系统性分析。解决方案的关键在于提出GAIN(Goal-Aligned Decision-Making under Imperfect Norms)基准,其创新性地引入五类明确设计的“情境压力”——目标一致性、风险规避、情感/伦理诉求、社会/权威影响和个人激励——以模拟现实世界中规范与目标冲突的复杂情境。通过1,200个跨招聘、客户服务、广告和金融四个领域的场景测试,GAIN能够系统评估LLMs在不同压力下的决策行为,揭示出当存在个人激励压力时,模型反而更倾向于遵守规范,这一发现挑战了传统预期,凸显了情境压力作为关键变量在理解LLM决策机制中的核心作用。

链接: https://arxiv.org/abs/2603.18469
作者: Masayuki Kawarada,Kodai Watanabe,Soichiro Murakami
机构: 未知
类目: Computation and Language (cs.CL)
备注: We are working towards releasing the code in April 2026

点击查看摘要

Abstract:We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models’ adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

[NLP-52] UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

【速读】: 该论文旨在解决大语言模型在长文本推理中因注意力稀释(attention dilution)和分布外退化(out-of-distribution degradation)导致的性能下降问题。现有上下文选择方法通常采用固定上下文预算,但未能适应不同token级别差异化的上下文需求。其解决方案的关键在于提出不确定性触发的自适应上下文分配机制(Uncertainty-Triggered Adaptive Context Allocation, UT-ACA),该机制在推理阶段基于token级不确定性动态调整上下文窗口:通过融合语义嵌入与基于logit的置信度来构建不确定性检测器,并考虑解码过程中不确定性累积效应;当检测到证据不足时,UT-ACA会智能回滚、扩展上下文窗口并重新生成该token,从而在显著降低平均上下文使用量的同时保持长上下文场景下的生成质量。

链接: https://arxiv.org/abs/2603.18446
作者: Lang Zhou,Shuxuan Li,Zhuohao Li,Shi Liu,Zhilin Zhao,Wei-Shi Zheng
机构: Sun Yat-sen University (中山大学); Shenzhen Loop Area Institute (深圳 loop 区研究院); Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.
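The roll-back-and-expand control flow described above can be sketched as a simple decode loop. This is a hypothetical reconstruction: the `model(tokens, window)` interface, the entropy trigger, and the window sizes are all assumptions for illustration, not the authors' implementation (which learns the uncertainty detector from embeddings and logits).

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_with_adaptive_context(model, prompt, base_window=256,
                                 expanded_window=1024, threshold=1.5,
                                 max_tokens=8):
    """Uncertainty-triggered decoding loop (illustrative sketch).

    `model(tokens, window)` is an assumed interface returning
    (next_token, next_token_probs) while attending to at most `window`
    cached entries. When the base-window distribution is too uncertain,
    the step is redone with the expanded window before committing.
    """
    tokens = list(prompt)
    for _ in range(max_tokens):
        token, probs = model(tokens, base_window)
        if entropy(probs) > threshold:
            # Insufficient evidence under the small window:
            # roll back this step and regenerate with more context.
            token, probs = model(tokens, expanded_window)
        tokens.append(token)
    return tokens
```

The average context usage then depends on how often the trigger fires, which is the quantity UT-ACA reduces relative to a fixed large window.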

[NLP-53] Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在生成文本时因采用静态解码策略(如贪心搜索或固定温度/top-p采样)而导致的输出质量不稳定、缺乏风格与结构灵活性的问题,尤其在跨领域任务中表现不佳。其核心解决方案是提出一种基于强化学习的解码采样器(reinforcement learning-based decoder sampler),将解码过程建模为序列决策问题,训练一个轻量级策略网络在推理阶段动态调整采样参数,同时保持LLM权重冻结不变。该方法通过复合奖励函数(包含重叠度、长度、覆盖率、重复性和完整性等结构化项)引导策略优化,在多个摘要数据集上显著优于传统静态基线,实现了无需重新训练即可适应不同领域和用户需求的生成控制能力。

链接: https://arxiv.org/abs/2603.18428
作者: Asmita Bhardwaj,Yuya Jeremy Ong,Eelaaf Zahid,Basel Shbita
机构: UC San Diego; Plastic Labs; IBM Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluated summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.

[NLP-54] Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中因任务切换导致的性能下降问题,即任务干扰(Task Interference),这一现象此前仅在纯文本对话系统中被研究,而随着多模态对话系统的普及,其在跨模态场景下的影响亟待量化评估。解决方案的关键在于构建了一个系统性的基准测试框架,涵盖文本与视觉任务共六种,通过三个维度——模态不匹配(Modality Mismatch)、推理需求不匹配(Reasoning Mismatch)和答案格式不匹配(Answer Format Mismatch)——对历史-目标任务组合进行可控变化,从而揭示任务干扰的方向性与强度来源:实验证明干扰具有高度方向性,从纯文本到图像目标的任务切换造成显著性能下降,反之则影响微弱;且多种不匹配因素叠加时干扰加剧,其中模态差异是主导因素,其次为答案格式差异,推理需求变化的影响最小。

链接: https://arxiv.org/abs/2603.18425
作者: Masayuki Kawarada,Tatsuya Ishigaki,Hiroya Takamura
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.

[NLP-55] TARo: Token-level Adaptive Routing for LLM Test-time Alignment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理能力上依赖昂贵的后训练(post-training)才能达到高性能的问题,同时填补现有测试时对齐(test-time alignment)方法主要聚焦于偏好对齐而忽视结构化推理的空白。其解决方案的关键在于提出一种名为Token-level Adaptive Routing (TARo) 的新框架,该框架在推理阶段完全无需微调模型参数,而是通过训练一个基于步骤级数学推理轨迹的奖励模型来捕捉细粒度逻辑一致性信号,并引入一个可学习的 token-level 路由器,动态控制奖励信号对基础模型的引导作用,从而实现高效的结构化推理增强。

链接: https://arxiv.org/abs/2603.18411
作者: Arushi Rai,Qiang Zhang,Hanqing Zeng,Yunkai Zhang,Dipesh Tamboli,Xiangjun Fan,Zhuokai Zhao
机构: Meta; University of Pittsburgh; University of California, Berkeley
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose, Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

[NLP-56] TopoChunker: Topology-Aware Agentic Document Chunking Framework

【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)中文档分块(document chunking)方法因强制线性化文本而导致的语义碎片化问题,该问题削弱了跨段落依赖关系的保留,进而影响下游检索质量。解决方案的关键在于提出TopoChunker框架,其核心创新是将异构文档映射到结构化中间表示(Structured Intermediate Representation, SIR),以显式保留跨段落的拓扑依赖关系;同时采用双代理架构——Inspector Agent动态优化提取路径以平衡结构保真度与计算成本,Refiner Agent则执行容量审计与拓扑上下文消歧,重建层级关联,从而实现高效且结构感知的文档分块。

链接: https://arxiv.org/abs/2603.18409
作者: Xiaoyu Liu
机构: Independent Researcher
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.

[NLP-57] AutoScreen-FW: An LLM-based Framework for Resume Screening

【速读】: 该论文旨在解决企业招聘中简历筛选效率低、人工负担重以及商业大语言模型(Large Language Model, LLM)可能引发数据隐私风险的问题。其核心挑战在于:一方面,公司难以获取带有标注结果的公开简历样本以训练模型;另一方面,现有方法依赖商用LLM,存在部署成本高和隐私泄露隐患。解决方案的关键在于提出AutoScreen-FW框架,通过自动选择少量代表性简历样本,并结合角色设定(persona)与评估标准进行上下文学习(in-context learning),使开源LLM能够模拟职业顾问角色对未见简历进行准确评价。该方法在多个真实标注基准下表现优于GPT-5-nano,在部分场景下超越GPT-5-mini,同时具备本地化部署能力与更高的处理速度,显著降低企业招聘流程中的计算资源消耗与人力负担。

链接: https://arxiv.org/abs/2603.18390
作者: Zhelin Xu,Shuhei Yamamoto,Atsuyuki Morishima
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM's judgment performance. To address these problems, we propose AutoScreen-FW, an LLM-based framework for local, automatic resume screening. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as career advisors and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, they also surpass GPT-5-mini. Although they are slightly weaker than GPT-5-mini under other ground-truth settings, they run substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters' burden.

[NLP-58] PowerFlow: Unlocking the Dual Nature of LLM s via Principled Distribution Matching

【速读】: 该论文旨在解决当前无监督强化学习(Unsupervised Reinforcement Learning from Internal Feedback, RLIF)方法中因依赖启发式内在奖励而导致的理论优化目标不明确及退化偏差问题。其解决方案的关键在于提出PowerFlow框架,将无监督微调重新建模为分布匹配问题,并利用GFlowNet作为非归一化密度的近似变分采样器,设计了一种长度感知的轨迹平衡(Trajectory-Balance)目标,以显式消除自回归生成中的结构长度偏差;同时通过聚焦于α-幂分布(α-power distributions),实现对大语言模型(LLM)双重特性的定向激发:当α < 1时平滑分布以释放表达性创造力,当α > 1时锐化分布以增强逻辑推理能力。

链接: https://arxiv.org/abs/2603.18363
作者: Ruishuo Chen,Yu Chen,Zhuoran Li,Longbo Huang
机构: IIIS, Tsinghua University (清华大学人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting \alpha -power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ( \alpha > 1 ) to intensify logical reasoning, or flattening it ( \alpha < 1 ) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
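The α-power target family is easy to illustrate: the tempered distribution is the base distribution raised to the power α and renormalized, so α > 1 concentrates mass on the mode (lower entropy) and α < 1 spreads it out (higher entropy). A minimal sketch with toy probabilities (not from the paper):

```python
import math

def power_distribution(probs, alpha):
    """Renormalized alpha-power of a distribution:
    q_i = p_i**alpha / sum_j p_j**alpha."""
    powered = [p ** alpha for p in probs]
    z = sum(powered)
    return [q / z for q in powered]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

p = [0.5, 0.3, 0.2]                       # toy base distribution
sharp = power_distribution(p, alpha=2.0)  # alpha > 1: sharpen
flat = power_distribution(p, alpha=0.5)   # alpha < 1: flatten

# Sharpening lowers entropy; flattening raises it.
assert entropy(sharp) < entropy(p) < entropy(flat)
```

This is the same temperature-like knob familiar from sampling, except that PowerFlow trains the model to match the tempered distribution rather than merely sampling from it at inference time.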

[NLP-59] Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

【速读】: 该论文旨在解决生成式常识推理(Generative Commonsense Reasoning, GCR)领域中因缺乏大规模高质量多样性训练数据而导致模型生成响应多样性不足的问题。现有GCR数据集受限于人工标注成本,通常由少量标注者构建,覆盖的常识场景较为单一,难以支撑模型生成多样化且符合常识的响应。解决方案的关键在于提出一种两阶段合成方法,用于构建首个面向多样性的GCR合成数据集CommonSyn;该方法通过合成数据对大语言模型(Large Language Models, LLMs)进行微调,显著提升了模型在不同规模下的生成多样性与质量,优于直接使用人类标注数据微调的基线模型。

链接: https://arxiv.org/abs/2603.18361
作者: Tianhui Zhang,Bei Peng,Danushka Bollegala
机构: University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL)
备注: 21 pages, 7 figures

点击查看摘要

Abstract:Conversational agents are required to respond to their users not only with high-quality (i.e., commonsense-bearing) responses, but also by considering multiple plausible alternative scenarios, reflecting diversity in their responses. Despite the growing need to train diverse commonsense generators, progress in this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Due to the high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first-ever synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly increase both generation diversity and quality compared with vanilla models and models fine-tuned on the human-crafted dataset, across Large Language Models (LLMs) of different sizes.

[NLP-60] From Noise to Signal: When Outliers Seed New Topics LREC2026

【速读】: 该论文旨在解决动态主题建模中异常值(outliers)常被视作噪声而忽略的问题,提出异常值可能作为新兴主题的早期信号。其解决方案的关键在于构建一个时间维度上的新闻文档轨迹分类体系(temporal taxonomy),将文档按其与主题形成的时间关系区分为三类:前瞻性异常值(anticipatory outliers,即在主题正式出现前就已存在)、强化现有主题的文档和孤立文档。通过捕捉这些轨迹,该分类体系实现了弱信号检测与时间主题建模的结合,并揭示了单个文章如何在演化聚类中提前预示、启动或漂移于新主题之中。

链接: https://arxiv.org/abs/2603.18358
作者: Evangelia Zve,Gauvain Bourgne,Benjamin Icard,Jean-Gabriel Ganascia
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026)

点击查看摘要

Abstract:Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

[NLP-61] Large-Scale Analysis of Political Propaganda on Moltbook

【速读】: 该论文旨在解决如何在大规模在线平台中识别和分析政治宣传(political propaganda)的问题,特别是在由人工智能代理(AI agents)主导的Reddit风格社交平台上。其解决方案的关键在于开发基于大语言模型(LLM)的分类器,用于自动检测政治宣传内容,并通过专家标注进行验证(Cohen’s κ = 0.64–0.74),从而实现对673,127篇帖子和879,606条评论的大规模、自动化分析。该方法有效识别出政治宣传占所有帖子的1%、但占政治内容的42%,并揭示了其高度集中于少数社区及少数高产代理的传播特征。

链接: https://arxiv.org/abs/2603.18349
作者: Julia Jose,Meghna Manoj Nair,Rachel Greenstadt
机构: New York University (纽约大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:We present an NLP-based study of political propaganda on Moltbook, a Reddit-style platform for AI agents. To enable large-scale analysis, we develop LLM-based classifiers to detect political propaganda, validated against expert annotation (Cohen’s \kappa = 0.64-0.74). Using a dataset of 673,127 posts and 879,606 comments, we find that political propaganda accounts for 1% of all posts and 42% of all political content. These posts are concentrated in a small set of communities, with 70% of such posts falling into five of them. 4% of agents produced 51% of these posts. We further find that a minority of these agents repeatedly post highly similar content within and across communities. Despite this, we find limited evidence that comments amplify political propaganda.

[NLP-62] Detection Is Cheap Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)对齐评估中存在的关键盲区问题,即现有评估方法主要关注模型是否编码危险概念或是否拒绝有害请求,而忽略了对齐机制实际运作的核心环节——从概念检测到行为策略的路由(routing)过程。解决方案的关键在于提出并验证一个三阶段描述框架:检测(detect)、路由(route)、生成(generate),并通过在中国起源的语言模型中开展政治审查这一自然实验,结合探测器(probe)、手术式消融(surgical ablation)和行为测试,在九个来自五个研究机构的开源模型上系统性地揭示了路由机制的决定性作用。研究发现,仅靠探测准确率无法诊断对齐机制,真正有效的指标是类别泛化能力;不同模型的路由路径具有实验室特异性且不可迁移;更重要的是,拒绝机制已不再是主导的审查方式,叙事引导(narrative steering)已成为隐蔽但更有效的手段,从而说明传统仅基于拒绝行为的评估方法会严重低估模型的实际对齐偏差。

链接: https://arxiv.org/abs/2603.18280
作者: Gregory N. Frank
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 31 pages, 7 figures

点击查看摘要

Abstract:Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.
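The "surgical ablation" the abstract mentions is typically implemented by projecting hidden activations onto the orthogonal complement of a probe direction. The sketch below shows that standard operation in the abstract's spirit; it is an illustration of the general technique, not the paper's exact procedure, and the toy vectors are invented.

```python
import math

def ablate_direction(h, d):
    """Remove the component of activation vector h along direction d:
    h' = h - (h . d_hat) d_hat, with d_hat = d / ||d||. Afterwards
    h' . d == 0, so a linear probe along d can no longer read the
    ablated concept from h'."""
    norm = math.sqrt(sum(x * x for x in d))
    d_hat = [x / norm for x in d]
    dot = sum(a * b for a, b in zip(h, d_hat))
    return [a - dot * b for a, b in zip(h, d_hat)]

h = [3.0, 4.0]   # toy activation
d = [1.0, 0.0]   # toy "political-sensitivity" direction
h_ablated = ablate_direction(h, d)
assert h_ablated == [0.0, 4.0]
# The ablated activation is orthogonal to the removed direction.
assert abs(sum(a * b for a, b in zip(h_ablated, d))) < 1e-9
```

Whether removing such a direction restores factual output or causes confabulation is exactly the lab-specific routing question the paper probes: it depends on how entangled the direction is with the model's factual representations.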

[NLP-63] Retrieval-Augmented LLM Agents: Learning to Learn from Experience

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在构建通用智能体(General-Purpose Agents)时,难以实现对未见过任务的鲁棒泛化问题。现有方法主要依赖微调(Fine-Tuning)或基于检索的经验增强生成(Retrieval-Augmented Generation),但二者均存在局限:微调常无法外推至新任务,而经验检索性能往往低于监督基线。论文的关键解决方案在于将微调与经验检索相结合,提出一种融合检索机制的训练流程——首先通过LoRA(Low-Rank Adaptation)实现稳健的监督微调(Supervised Fine-Tuning, SFT),其次系统分析并优化经验存储、查询和轨迹选择策略,最终将检索到的轨迹作为上下文信息嵌入微调过程,从而显著提升智能体在未见任务上的泛化能力,形成一个可扩展且高效的“从经验中学习”的智能体构建框架。

链接: https://arxiv.org/abs/2603.18272
作者: Thomas Palmeira Ferraz,Romain Deffayet,Vassilina Nikoulina,Hervé Déjean,Stéphane Clinchant
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have advanced the development of general-purpose agents, achieving robust generalization to unseen tasks remains a significant challenge. Current approaches typically rely on either fine-tuning or training-free memory-augmented generation using retrieved experience; yet both have limitations: fine-tuning often fails to extrapolate to new tasks, while experience retrieval often underperforms compared to supervised baselines. In this work, we propose to combine these approaches and systematically study how to train retrieval-augmented LLM agents to effectively leverage retrieved trajectories in-context. First, we establish a robust supervised fine-tuning (SFT) recipe using LoRA that outperforms several state-of-the-art agent training pipelines. Second, we provide a detailed analysis of key design choices for experience retrieval, identifying optimal strategies for storage, querying, and trajectory selection. Finally, we propose a pipeline that integrates experience retrieval into the fine-tuning process. Our results demonstrate that this combined approach significantly improves generalization to unseen tasks, providing a scalable and effective framework for building agents that learn to learn from experience.

[NLP-64] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

【速读】: 该论文试图解决当前人工智能(AI)范式在结构上继承了其源自心理学理论的局限性问题,具体表现为:强化学习无法刻画知识的内部结构、深度学习将表征压缩至难以解释的参数空间、现有整合方法缺乏对新理解如何从既有组件中构建的形式化描述。解决方案的关键在于提出ReSynth框架,这是一个三模块架构,将推理(Intellect)、目的(Identity)与知识(Memory)作为独立的建筑组件分离,从而实现系统性行为是表征架构的必然结果而非偶然属性,以此提升人工通用智能(AGI)的适应能力。

链接: https://arxiv.org/abs/2603.18203
作者: Alex Anvi Eponon,Ildar Batyrshin,Christian E. Maldonado-Sifuentes,Grigori Sidorov
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: conference ICSSH2026

点击查看摘要

Abstract:The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but also the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.

[NLP-65] CWoMP: Morpheme Representation Learning for Interlinear Glossing

【速读】: 该论文旨在解决跨语言文档中逐行标注文本(Interlinear Glossed Text, IGT)人工制作成本高、效率低的问题,同时克服现有自动化方法将词素视为字符序列而忽略其组合结构的局限性。解决方案的关键在于提出一种名为CWoMP(Contrastive Word-Morpheme Pretraining)的新框架,其核心创新是将词素(morpheme)视为原子化的形式-意义单元,并通过对比学习训练编码器在共享嵌入空间中对齐上下文中的词与其组成词素;随后利用自回归解码器从一个可动态扩展的词素嵌入词典中检索并生成词素序列,从而实现高效且可解释的IGT生成。该方法不仅显著优于现有技术,尤其在极低资源语言场景下表现突出,还支持用户在推理阶段通过扩充词典提升结果而无需重新训练模型。

链接: https://arxiv.org/abs/2603.18184
作者: Morris Alper,Enora Rice,Bhargav Shandilya,Alexis Palmer,Lori Levin
机构: 未知
类目: Computation and Language (cs.CL)
备注: Project page: this http URL

点击查看摘要

Abstract:Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable–grounded in lexicon entries–and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.
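The contrastive word-morpheme alignment the abstract describes is usually realized as an InfoNCE-style loss over a batch: each word-in-context embedding should score highest against its own morpheme-set embedding versus the other items. Below is a generic sketch of that objective; the cosine scoring and the temperature value are common defaults, not the authors' exact design.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(word_vecs, morph_vecs, temperature=0.1):
    """InfoNCE over a batch: row i of the word/morpheme similarity
    matrix is treated as a classification problem with target i
    (the matching morpheme embedding)."""
    n = len(word_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(word_vecs[i], m) / temperature
                  for m in morph_vecs]
        log_z = math.log(sum(math.exp(s) for s in logits))
        total += log_z - logits[i]  # -log softmax prob of the match
    return total / n

# Aligned pairs yield a much lower loss than shuffled pairs.
words = [[1.0, 0.0], [0.0, 1.0]]
morphs = [[1.0, 0.0], [0.0, 1.0]]
assert contrastive_loss(words, morphs) < contrastive_loss(words, morphs[::-1])
```

Once such an encoder is trained, inference-time retrieval from the morpheme lexicon reduces to nearest-neighbor search in this shared embedding space, which is what makes the lexicon expandable without retraining.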

[NLP-66] GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期使用过程中因训练数据与评测基准数据发生污染(contamination)而导致的性能虚高问题,即模型在特定基准测试中表现优于实际能力的风险。解决方案的关键在于提出GRAFITE平台,这是一个持续评估系统,通过收集用户反馈构建随时间演化的模型问题库,并设计基于“LLM作为裁判”(LLM-as-a-judge)的质量保证(Quality Assurance, QA)测试流水线,实现对多个模型版本的并行对比和回归检测,从而保障评估结果的真实性和可靠性。

链接: https://arxiv.org/abs/2603.18173
作者: Ja Young Lee,Mírian Silva,Mohamed Nasr,Shonda Witherspoon,Enzo Bozzani,Veronique Demers,Radha Ratnaparkhi,Hui Wu,Sara Rosenthal
机构: IBM Research - AI
类目: Computation and Language (cs.CL)
备注: 7 pages, 2 figures

点击查看摘要

Abstract:Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at this https URL. The demo video is available at this http URL.

[NLP-67] Modeling the human lexicon under temperature variations: linguistic factors diversity and typicality in LLM word associations LREC2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在词义关联生成中是否具备人类相似的词汇表征这一核心问题,即评估LLMs内部词汇知识的人类一致性程度。其解决方案的关键在于通过对比人类与三种不同规模的LLM(Mistral-7B、Llama-3.1-8B 和 Qwen-2.5-32B)在SWOW数据集上的词关联响应,系统分析词汇频率和具象性等语义因素的影响,并量化模型响应的典型性(typicality)与变异性(variability)。研究发现,尽管所有模型均能复现人类对高频词和具象词的偏好趋势,但模型规模与温度参数显著调节了响应模式:大模型倾向于生成高度典型但低变异性的单一“原型”响应,而小模型则呈现更高变异性但更低典型性;温度升高进一步加剧该权衡关系。这揭示了LLM词汇表征既存在人类相似性又具结构性差异,强调在探究其词汇认知机制时需综合考虑模型规模与采样策略。

链接: https://arxiv.org/abs/2603.18171
作者: Maria Andueza Rodriguez,Marie Candito,Richard Huyghe
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 12 figures, to appear in LREC 2026

点击查看摘要

Abstract:Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single “prototypical” human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.

[NLP-68] How LLM s Distort Our Written Language

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在辅助人类写作过程中,不仅改变文本的语气和风格,还会系统性地改变原意,从而可能对文化与科学机构产生隐性但深远的影响。其解决方案的关键在于通过三重实证分析揭示这一语义偏移现象:首先,通过用户调研发现重度LLM使用者的作文更趋于中立、缺乏原创性和个人风格;其次,基于2021年未受LLM影响的人类写作数据集,验证即使仅要求LLM进行语法修正,也会显著改变文本语义;最后,分析顶会中21%由LLM生成的同行评审,发现其评分标准偏离清晰度和研究重要性等关键维度。这些发现共同表明,当前AI写作工具的广泛应用存在语义层面的隐性风险,亟需进一步研究其对知识生产与传播机制的长期影响。

链接: https://arxiv.org/abs/2603.18161
作者: Marwa Abdulhai,Isadora White,Yanming Wan,Ibrahim Qureshi,Joel Leibo,Max Kleiman-Weiner,Natasha Jaques
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, differ by a full point. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.

[NLP-69] Evaluating FrameNet-Based Semantic Modeling for Gender-Based Violence Detection in Clinical Records

【速读】: 该论文旨在解决巴西医疗系统中性别暴力(Gender-based Violence, GBV)病例识别与报告不足的问题,其核心挑战在于临床文本中非结构化信息难以被有效挖掘,且公共信息系统间缺乏整合。解决方案的关键在于利用FrameNet语义标注技术对电子病历中的开放式文本字段进行深度语义解析,从而提取出超越传统结构化人口统计学数据的领域特定语义特征;实验表明,结合语义标注的SVM分类器在F1分数上相较仅使用参数化数据的模型提升超过0.3,验证了语义分析能显著增强GBV早期识别能力,为公共卫生干预提供更精准的数据支持。

链接: https://arxiv.org/abs/2603.18124
作者: Lívia Dutra,Arthur Lorenzi,Frederico Belcavello,Ely Matos,Marcelo Viridiano,Lorena Larré,Olívia Guaranha,Erik Santos,Sofia Reinach,Pedro de Paula,Tiago Torrent
机构: Federal University of Juiz de Fora (FrameNet Brasil); University of Gothenburg; Vital Strategies Brasil; Brazilian National Council for Scientific and Technological Development (CNPq)
类目: Computation and Language (cs.CL)
备注: Paper accepted to the Lang4Heath Workshop at PROPOR 2026

点击查看摘要

Abstract:Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.

[NLP-70] MOSS-TTS Technical Report

【速读】: 该论文旨在解决高质量、多语言、可控语音生成的挑战,特别是在零样本语音克隆、细粒度发音控制和长文本稳定生成等方面存在的局限性。其解决方案的关键在于构建一个可扩展的语音生成基础模型MOSS-TTS,核心创新包括:基于离散音频标记(discrete audio tokens)的压缩表示,采用因果Transformer编码器实现高效的自回归建模(autoregressive modeling),并通过大规模预训练获得通用语音生成能力。此外,引入MOSS-Audio-Tokenizer实现24 kHz音频到12.5 fps的低比特率矢量量化(RVQ)表示,同时保留语义与声学信息,从而支持多种控制策略(如token级时长控制、音素/拼音级发音控制)及跨语言平滑切换,显著提升模型在多语言开放域场景下的实用性与可控性。

链接: https://arxiv.org/abs/2603.18090
作者: Yitian Gong,Botian Jiang,Yiwei Zhao,Yucheng Yuan,Kuangwei Chen,Yaozhou Jiang,Cheng Chang,Dong Hong,Mingshu Chen,Ruixiao Li,Yiyang Zhang,Yang Gao,Hanfu Chen,Ke Chen,Songlin Wang,Xiaogui Yang,Yuqian Zhang,Kexin Huang,ZhengYuan Lin,Kang Yu,Ziqi Chen,Jin Wang,Zhaoye Fei,Qinyuan Cheng,Shimin Li,Xipeng Qiu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.

[NLP-71] BenchBrowser – Collecting Evidence for Evaluating Benchmark Validity

【速读】: 该论文旨在解决语言模型(Language Model)基准测试在实际应用中与从业者意图之间存在显著偏差的问题,即当前基准测试的高阶元数据(high-level metadata)过于粗粒度,无法准确反映其对具体能力的覆盖程度,导致评估结果可能掩盖模型在未测试维度上的缺陷,从而造成“虚假胜任感”(illusion of competence)。解决方案的关键在于提出BenchBrowser——一种检索器(retriever),能够从20个基准套件中精准定位与自然语言使用场景相关的评估项,通过人类验证确认其高检索精度,从而为从业者提供实证依据,用于诊断内容效度(content validity,即能力维度覆盖不足)和聚合效度(convergent validity,即同一能力测量结果不稳定)问题,进而量化基准测试与实践目标之间的关键差距。

链接: https://arxiv.org/abs/2603.18019
作者: Harshita Diddee,Gregory Yauney,Swabha Swayamdipta,Daphne Ippolito
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a “poetry” benchmark may never test for haikus, while “instruction-following” benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability’s facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser thus helps quantify a critical gap between practitioner intent and what benchmarks actually test.

[NLP-72] An Agent ic System for Schema Aware NL2SQL Generation

【速读】: 该论文旨在解决自然语言到SQL(NL2SQL)任务中因依赖大型语言模型(LLMs)所带来的计算开销大、数据隐私风险高以及在资源受限环境下的实际部署困难等问题。其解决方案的关键在于提出一种基于模式的智能体系统(schema-based agentic system),该系统以小型语言模型(SLMs)作为主要执行代理,并引入选择性LLM回退机制——仅在SLM生成结果被检测出错误时才调用LLM,从而显著降低整体计算成本。实验表明,该方法在BIRD基准上实现了47.78%的执行准确率和51.05%的验证效率,相较纯LLM基线系统节省了超过90%的运行成本,且约67%的查询由本地SLM完成,单次查询平均成本降至0.0085(对比LLM-only系统的0.094),实现近零运营成本。

链接: https://arxiv.org/abs/2603.18018
作者: David Onyango,Naseef Mansoor
机构: 未知
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema-based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in SLM-generated output, the proposed system significantly minimizes computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, achieving over 90% cost reduction compared to LLM-centric baselines as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of 0.0085 compared to 0.094 for LLM-only systems, achieving near-zero operational costs for locally executed queries. [Github repository: this https URL.]
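下面给出"SLM优先、LLM回退"路由逻辑的一个可运行示意(其中 slm_generate/llm_generate 为占位的假设函数,SQL 校验方式也仅为演示,并非论文原实现):用 SQLite 的 EXPLAIN 对生成的 SQL 做廉价的可执行性检查,仅在校验失败时才升级到 LLM。

```python
import sqlite3

def is_valid_sql(conn, sql):
    # Cheap validity check: ask SQLite to plan the query without running it.
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

def nl2sql_with_fallback(conn, question, slm_generate, llm_generate):
    # SLM-first routing: escalate to the LLM only when the SLM's SQL
    # fails validation, so most queries stay on the cheap local model.
    sql = slm_generate(question)
    if is_valid_sql(conn, sql):
        return sql, "slm"
    return llm_generate(question), "llm"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# Placeholder generators standing in for real models (hypothetical).
slm = lambda q: "SELECT nme FROM users"   # bad column name -> fails check
llm = lambda q: "SELECT name FROM users"
sql, route = nl2sql_with_fallback(conn, "list user names", slm, llm)
```

真实系统中的校验会更复杂(如执行结果比对),这里仅演示路由骨架。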

[NLP-73] Frayed RoPE and Long Inputs: A Geometric Perspective ICLR2026

【速读】: 该论文旨在解决旋转位置编码(Rotary Positional Embedding, RoPE)在输入序列长度超过训练长度时导致模型性能下降的问题。研究表明,长输入会使RoPE引起的通道旋转“偏离分布”,从而破坏键(key)与查询(query)潜在点云的分离结构,进而抑制了“sink token”(用于避免注意力头在不需要时混合token的占位符)的功能,引发病理行为。解决方案的关键在于提出RoPE-ID(In Distribution),即通过在部分通道上应用高频RoPE,保持键与查询点云的几何分离,从而实现对更长输入的即插即用泛化能力。

链接: https://arxiv.org/abs/2603.18017
作者: Davis Wertheimer,Aozhong Zhang,Derrick Liu,Penghang Yin,Naigang Wang
机构: IBM Research; University at Albany, SUNY
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Rotary Positional Embedding (RoPE) is a widely adopted technique for encoding position in language models, which, while effective, causes performance breakdown when input length exceeds training length. Prior analyses assert (rightly) that long inputs cause channels to rotate “out of distribution,” but it is not clear how extra rotation relates to or causes pathological behavior. Through empirical and theoretical analysis we advance a unified geometric understanding of attention behavior with RoPE. We find that attention induces tight clustering of separated key and query latent point clouds, allowing for creation of sink tokens: placeholders that allow attention heads to avoid token mixing when not required. RoPE applied to longer inputs damages this key/query cluster separation, producing pathological behavior by inhibiting sink token functionality. From this geometric perspective, we propose RoPE-ID (In Distribution), a straightforward modification that allows attention layers to generalize to longer inputs out of the box: apply RoPE with high frequency to a subset of channels. We demonstrate the effectiveness of RoPE-ID for extended inputs using 1B and 3B parameter Transformers on the LongBench and RULER information retrieval benchmarks.
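下述 NumPy 片段演示标准 RoPE 的逐通道对旋转,以及"对一部分通道施加固定高频"这一 RoPE-ID 思路的一个假设性实现(具体频率取值与通道选择方式为示意,细节以论文原文为准):

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    # Standard RoPE: one frequency per 2-channel pair, geometric decay.
    return base ** (-np.arange(dim // 2) / (dim // 2))

def apply_rope(x, pos, freqs):
    # Rotate each channel pair (x[i], x[i + dim/2]) by angle pos * freq.
    d2 = len(freqs)
    x1, x2 = x[:d2], x[d2:]
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def rope_id_freqs(dim, base=10000.0, n_high=2, high_freq=1.0):
    # Hypothetical RoPE-ID-style variant: a subset of pairs gets a fixed
    # high frequency, so their phase wraps quickly within training-length
    # inputs and never drifts further "out of distribution" afterwards.
    f = rope_freqs(dim, base)
    f[:n_high] = high_freq
    return f
```

旋转是正交变换,因此无论位置多大都不改变向量范数;长输入破坏的是键/查询点云之间的相对几何,而非单个向量的尺度。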

[NLP-74] MineDraft: A Framework for Batch Parallel Speculative Decoding

【速读】: 该论文旨在解决标准推测解码(Speculative Decoding, SD)中因严格串行执行起草(drafting)与验证(verification)阶段而导致的性能瓶颈问题。其解决方案的关键在于提出MineDraft,一种批处理并行推测解码(Batch Parallel Speculative Decoding, PSD)框架,通过创新的批并行设计维持两个请求批次,使一个批次的起草操作与另一批次的验证操作重叠执行,从而有效隐藏起草延迟。理论分析和实验结果表明,该方法在吞吐量(最高提升75%)和端到端延迟(最高降低39%)方面均显著优于标准SD,且已在vLLM中实现为插件,具备生产部署可行性。

链接: https://arxiv.org/abs/2603.18016
作者: Zhenwei Tang,Arun Verma,Zijian Zhou,Zhaoxuan Wu,Alok Prakash,Daniela Rus,Bryan Kian Hsiang Low
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: This paper proposes MineDraft, a framework that speeds up speculative decoding by overlapping drafting and verification, hiding drafting latency, and delivering improved throughput and latency

点击查看摘要

Abstract:Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
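以下是一个高度简化的延迟模型(假设性的示意,并非论文的理论分析):串行 SD 每步花费 draft+verify,而理想化的双批次并行调度在稳态下每步只花费 max(draft, verify),只有首个起草无法被隐藏。

```python
def sd_latency(steps, draft_t, verify_t):
    # Standard speculative decoding: draft then verify, strictly serial.
    return steps * (draft_t + verify_t)

def psd_latency(steps, draft_t, verify_t):
    # Idealized batch-parallel schedule with two batches in flight:
    # drafting for one batch overlaps verification of the other, so the
    # steady-state cost per step is max(draft_t, verify_t); only the
    # very first draft cannot be hidden. (Toy model only.)
    return draft_t + steps * max(draft_t, verify_t)
```

例如 draft=1、verify=4、100 步时,串行为 500,重叠后为 401;起草越便宜、验证越昂贵,可隐藏的比例就越高。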

[NLP-75] Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

【速读】: 该论文旨在解决生成式 AI(Generative AI)在在线有害内容检测中缺乏可解释性的问题,即当前神经网络模型虽具备高准确率,但其决策逻辑对平台审核人员和终端用户不透明,尤其在边缘情境、语境依赖或政治敏感内容上难以理解。解决方案的关键在于采用两种主流的后验解释方法——Shapley Additive Explanations (SHAP) 和 Integrated Gradients (IG),对基于 RoBERTa 的有害内容分类器进行可解释性驱动分析,从而揭示模型在正确预测与系统性误判中的行为差异,识别出诸如间接毒性、词汇过度归因及政治话语误判等重复性失败模式。研究证明,可解释性不仅是提升人类参与式审核的有效工具,更是诊断模型不确定性与增强自动化决策透明度的核心资源。

链接: https://arxiv.org/abs/2603.18015
作者: Trishita Dhara,Siddhesh Sheth
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted at TrustNet 2026 ( this https URL ). The final version will appear in Springer (LNNS), 2026

点击查看摘要

Abstract:Although automated harmful content detection systems are frequently used to monitor online platforms, moderators and end users often cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little focus has been placed on comprehending why neural models identify content as harmful, especially when it comes to borderline, contextual, and politically sensitive situations. In this work, we conduct an explainability-driven analysis of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients appear to extract more diffuse contextual attributions while Shapley Additive Explanations extract more focused attributions on explicit lexical cues. The consequent divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, and mishandled political discourse. The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and increasing the interpretable rationale behind automated decisions. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.
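Integrated Gradients 本身有简洁的数学定义:沿基线到输入的直线路径对梯度积分,并满足"完备性"公理(各维归因之和等于模型在输入与基线处的输出差)。下面用纯 NumPy 在一个玩具逻辑回归上演示该方法本身(模型与数据均为假设,非论文所用的 RoBERTa 分类器):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(f_grad, x, baseline, steps=200):
    # Midpoint-rule approximation of IG: average the gradient along the
    # straight path from baseline to x, scaled elementwise by (x - baseline).
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean(
        [f_grad(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * grads

# Toy "classifier": logistic model over a 3-feature input (hypothetical).
w = np.array([2.0, -1.0, 0.5])
f = lambda x: sigmoid(w @ x)
f_grad = lambda x: f(x) * (1.0 - f(x)) * w  # gradient of sigmoid(w @ x)

x = np.array([1.0, 1.0, 0.0])
baseline = np.zeros(3)
attr = integrated_gradients(f_grad, x, baseline)
```

完备性公理正是核实 IG 实现正确与否的常用检查:归因之和应近似等于 f(x) 与 f(baseline) 之差。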

[NLP-76] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)生成结构化输出(Structured Outputs)时存在的随机性错误问题,这类错误阻碍了企业级AI应用的落地与可信部署。其核心解决方案是提出CONSTRUCT方法,能够在实时场景中对LLM结构化输出的整体可信度进行评分,得分越低表示越可能包含错误;同时,该方法还能对输出中的每个字段独立评分,从而精准定位错误发生的位置。CONSTRUCT的关键优势在于:无需标注训练数据、不依赖模型日志概率(logprobs)、适用于黑盒API(如Anthropic模型和推理类模型),且支持复杂嵌套JSON结构的多类型字段评估,显著提升了错误检测的精度与召回率。

链接: https://arxiv.org/abs/2603.18014
作者: Hui Wen Goh,Jonas Mueller
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real-time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within a LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), does not require labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.

[NLP-77] Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models

【速读】: 该论文旨在解决一个关键问题:尽管大型语言模型(Large Language Models, LLMs)在特定条件下能够重构和追踪训练数据中的内容,但在标准生成场景中却极少表现出此类信息的输出。研究通过300次跨叙事与问题解决任务的提示-响应生成实验,发现无论是在创造性叙事还是实用建议场景下,所有三种LLM在十种任务情境中均未出现非因果性解决方案框架(0%,95%置信区间[0%, 1.2%]),尽管这些内容在条件提取下可被成功重建。其核心解决方案在于揭示了任务条件驱动的生成策略可以系统性抑制已学习内容的表达,从而挑战了“训练数据存在即等价于输出概率”的传统假设,并阐明了当前LLM行为边界与生成动态之间的深刻分离。

链接: https://arxiv.org/abs/2603.18013
作者: Toshiyuki Shigemura
机构: Independent Researcher(独立研究员)
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study examines the expression of non-causal, non-implementable solution types across 300 prompt-response generations spanning narrative and problem-solving task contexts. Drawing on recent findings regarding memorization contiguity and alignment-induced discourse priors, we document a systematic dissociation between learned capability and expressed output. Across three distinct LLMs, ten task scenarios, and both creative narrative and practical advisory contexts, we documented zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]), despite verified reconstruction capability under conditional extraction. These findings challenge the prevailing assumption that training data presence directly predicts output probability, demonstrating instead that task-conditioned generation policies can comprehensively suppress learned content across diverse contexts. The results offer implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of contemporary LLMs.
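摘要中"0%(95% CI [0%, 1.2%])"与 300 次生成中观察到 0 例时的 Clopper-Pearson 精确区间一致:零成功情形下,双侧区间的上界有闭式解 1-(α/2)^(1/n),可据此快速核算:

```python
def cp_upper_zero(n, alpha=0.05):
    # Clopper-Pearson upper bound for a binomial proportion when zero
    # successes are observed in n trials: the Beta(1, n) quantile has
    # the closed form 1 - (alpha / 2) ** (1 / n).
    return 1.0 - (alpha / 2.0) ** (1.0 / n)

upper = cp_upper_zero(300)  # about 0.0122, i.e. roughly 1.2%
```

试验次数越多,上界越紧,这也是零事件观测下常用的不确定性量化方式。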

[NLP-78] Agent ic Framework for Political Biography Extraction

【速读】: 该论文旨在解决政治科学领域中大规模政治数据集构建的瓶颈问题,即从海量非结构化文档或网络资源中提取多维精英人物传记信息的传统方法高度依赖昂贵的人工专家,难以实现规模化自动化。解决方案的关键在于提出一种两阶段“合成-编码”(Synthesis-Coding)框架:第一阶段利用递归代理型大语言模型(LLMs)在异构网络来源中搜索、过滤并整理传记内容,形成高质量的上下文合成;第二阶段将整理后的传记映射为结构化数据框。该框架通过实证验证了其在准确性上可媲美甚至超越人类专家,并能有效缓解直接处理长文本和多语言语料时引入的偏倚问题,从而提供了一种可扩展、透明且通用的大规模政治数据库构建方法。

链接: https://arxiv.org/abs/2603.18010
作者: Yifei Zhu,Songpo Yang,Jiangnan Zhu,Junyan Jiang
机构: The University of Hong Kong (香港大学); Peking University (北京大学); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 70 pages, 14 figures

点击查看摘要

Abstract:The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage “Synthesis-Coding” framework for complex extraction tasks: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we diagnose that directly coding from long, multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through this comprehensive evaluation, we provide a generalizable, scalable framework for building transparent and extensible large-scale databases in political science.

[NLP-79] How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂任务中因自回归生成导致的输出不确定性问题,尤其针对多类选择题任务中传统不确定性度量方法(如熵)未能区分先验偏好带来的虚假置信与基于上下文的真实确定性,从而导致置信度校准不佳的问题。解决方案的关键在于提出一种基于首 token 的不确定性度量指标——对数尺度焦点不确定性(Log-Scale Focal Uncertainty, LSFU),其借鉴焦点损失(Focal Loss)的思想,引入标签先验概率作为风险调节因子,抑制高频类别的噪声影响并增强低频长尾类别的风险敏感性,并通过动态加权机制统一测量尺度;在此基础上构建了不确定性校准提示优化框架(Uncertainty-Calibrated Prompt Optimization Framework, UCPOF),利用首 token 选择高质量示例并动态优化提示,实现仅在高不确定性样本中触发检索增强生成(Retrieval-Augmented Generation, RAG),显著降低计算开销的同时保持领先性能。

链接: https://arxiv.org/abs/2603.18009
作者: Wei Chen,Guoyang Ju,Yuanyuan Qi
机构: China Jiliang University(中国计量大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 27 pages,16 figures

点击查看摘要

Abstract:With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs’ performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts. Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.
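摘要未给出 LSFU 的完整公式,下面仅给出一个假设性的"焦点式先验加权"示意(并非论文的原始定义):用 (1-先验)^γ 对各类别的对数项加权,压低高频类别、放大长尾类别对不确定性度量的贡献;普通熵无法区分的两种预测分布,加权后会得到不同的不确定性。

```python
import numpy as np

def entropy(p):
    # Plain Shannon entropy: blind to which classes carry the mass.
    return -np.sum(p * np.log(p + 1e-12))

def prior_weighted_uncertainty(p, prior, gamma=2.0):
    # Hypothetical focal-style reweighting (NOT the paper's LSFU
    # formula): weight each class term by (1 - prior)^gamma, shrinking
    # the contribution of a-priori frequent classes and amplifying
    # long-tail classes.
    w = (1.0 - prior) ** gamma
    return -np.sum(w * p * np.log(p + 1e-12))

prior = np.array([0.8, 0.1, 0.1])        # class priors (toy values)
p_freq = np.array([0.9, 0.05, 0.05])     # confident on the frequent class
p_rare = np.array([0.05, 0.9, 0.05])     # equally confident on a rare class
```

在该玩具设定下,两种分布的熵完全相同,但先验加权后"押注高频类别"的分布得到更高的风险读数,体现了"先验带来的虚假置信应与上下文带来的真实确定性区分开"的动机。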

[NLP-80] herapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在心理支持场景中缺乏临床有效性评估与安全约束的问题,现有评价方法如流畅性指标、偏好测试和通用对话基准无法捕捉认知行为疗法(Cognitive Behavioral Therapy, CBT)的关键临床维度。解决方案的核心是提出THERAPYGYM框架,其关键在于构建两个临床支柱:一是基于认知治疗评分量表(Cognitive Therapy Rating Scale, CTRS)的自动化多轮会话 fidelity 评估机制,用于量化模型对CBT技术的遵循程度;二是采用多标签标注方案实现 therapy-specific 安全性评估,涵盖如忽视伤害或虐待等高风险情境。此外,研究还发布了THERAPYJUDGEBENCH验证集,包含116段对话及1,270条专业临床医生评分,以校准和审计LLM判别可靠性,并将该框架作为训练工具,通过CTR S和安全性奖励驱动强化学习(Reinforcement Learning),结合多样化症状模拟患者进行训练,显著提升模型在专家评分中的表现(平均CTRS从0.10提升至0.60)。

链接: https://arxiv.org/abs/2603.18008
作者: Fangrui Huang,Souhad Chbeir,Arpandeep Khatua,Sheng Wang,Sijun Tan,Kenan Ye,Lily Bailey,Merryn Daniel,Ryan Louie,Sanmi Koyejo,Ehsan Adeli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods–fluency metrics, preference tests, and generic dialogue benchmarks–fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.

[NLP-81] Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)是否具备心智理论(Theory of Mind, ToM)能力的问题,即它们能否从文本中推断他人的信念、意图和情绪,并判断其推理是否接近人类水平而非仅依赖统计模式匹配。解决方案的关键在于设计并使用一个适配于文本的ToM测试工具,对五种不同规模的LLMs与人类对照组进行对比实验,结果表明:早期和较小的模型在推理准确性上受线索数量影响显著且易受干扰信息影响,而GPT-4o展现出高准确率与强鲁棒性,在复杂情境下表现接近人类水平,从而为LLMs是否存在类人社会认知能力提供了实证依据。

链接: https://arxiv.org/abs/2603.18007
作者: Anna Babarczy,Andras Lukacs,Peter Vedres,Zeteny Bujka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures, 6 tables

点击查看摘要

Abstract:The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities – specifically, the ability to infer others’ beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.

[NLP-82] Using Optimal Transport as Alignment Objective for fine-tuning Multilingual Contextualized Embeddings

【速读】: 该论文旨在解决多语言上下文嵌入(multilingual contextualized embeddings)在跨语言迁移任务中表示质量不足的问题,尤其关注如何在不依赖预定义词对齐数据的情况下实现更有效的跨语言对齐。其解决方案的关键在于引入最优传输(Optimal Transport, OT)作为微调阶段的对齐目标,通过无监督方式学习句子级软匹配关系,从而在保持上下文敏感性的同时优化源语言与目标语言嵌入空间的对齐效果。此方法避免了传统硬匹配导致的次优匹配问题,并支持灵活的映射类型,显著提升了跨语言自然语言理解任务(如XNLI和XQuAD)的性能。

链接: https://arxiv.org/abs/2110.02887
作者: Sawsan Alqahtani,Garima Lalwani,Yi Zhang,Salvatore Romeo,Saab Mansour
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies have proposed different methods to improve multilingual word representations in contextualized settings including techniques that align between source and target embedding spaces. For contextualized embeddings, alignment becomes more complex as we additionally take context into consideration. In this work, we propose using Optimal Transport (OT) as an alignment objective during fine-tuning to further improve multilingual contextualized representations for downstream cross-lingual transfer. This approach does not require word-alignment pairs prior to fine-tuning that may lead to sub-optimal matching and instead learns the word alignments within context in an unsupervised manner. It also allows different types of mappings due to soft matching between source and target sentences. We benchmark our proposed method on two tasks (XNLI and XQuAD) and achieve improvements over baselines as well as competitive results compared to similar recent works.
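下面用熵正则化的 Sinkhorn 迭代演示"句内词对的无监督软对齐"这一思路(玩具数据与余弦代价均为示意,并非论文训练目标的实现):目标句是源句词向量的重新排列,软对齐计划应恢复出这一置换。

```python
import numpy as np

def sinkhorn(cost, reg=0.1, iters=200):
    # Entropy-regularized OT with uniform marginals: returns a soft
    # alignment (transport plan) between source and target tokens.
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy "contextual embeddings": the target is the source re-ordered,
# mimicking a word-for-word translation with different word order.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
tgt = src[[1, 0, 2, 3]]
cos = (src @ tgt.T) / (
    np.linalg.norm(src, axis=1)[:, None] * np.linalg.norm(tgt, axis=1)[None, :]
)
plan = sinkhorn(1.0 - cos)
```

由于传输计划是软匹配,它天然允许一对多、多对一等映射类型,这正是摘要强调的相对硬匹配的优势。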

[NLP-83] How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

【速读】: 该论文旨在解决文本预训练的大型语言模型(Large Language Models, LLMs)在音频知识编码能力上的差异及其对下游音频任务性能影响不明确的问题。其解决方案的关键在于通过三种互补评估范式系统性地量化LLMs的听觉知识:(1) 在AKB-2000基准上直接探测其听觉知识广度与深度;(2) 通过文本描述的级联推理评估其跨模态理解能力;(3) 将LLM微调为音频感知的大型音频语言模型(Large Audio Language Models, LALMs),实现音频-grounded评估。研究发现,不同LLM家族间听觉知识存在显著差异,且纯文本表现与音频性能高度相关,从而为音频领域中LLMs的理解提供了实证基础。

链接: https://arxiv.org/abs/2603.19195
作者: Ke-Han Lu,Szu-Wei Fu,Chao-Han Huck Yang,Zhehuai Chen,Sung-Feng Huang,Chih-Kai Yang,Yi-Cheng Lin,Chi-Yuan Hsiao,Wenze Ren,En-Pei Hu,Yu-Han Huang,An-Yu Cheng,Cheng-Han Chiang,Yu Tsao,Yu-Chiang Frank Wang,Hung-yi Lee
机构: National Taiwan University (台湾大学); NVIDIA; Academia Sinica (中央研究院)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Project website: this https URL

点击查看摘要

Abstract:Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.

[NLP-84] Impact of automatic speech recognition quality on Alzheimer's disease detection from spontaneous speech: a reproducible benchmark study with lexical modeling and statistical validation

【速读】: 该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期检测中自动语音识别(ASR)质量对下游临床语言建模性能影响不明确的问题。其解决方案的关键在于:采用Whisper ASR模型生成的转录文本,结合TF-IDF表示和可解释的机器学习模型(如逻辑回归与线性支持向量机),在ADReSSo 2021诊断数据集上验证了高保真度ASR对分类性能的显著提升作用。研究发现,使用Whisper-small转录文本训练的线性SVM模型能达到平衡准确率高于0.7850,且相比模型复杂度,ASR质量是影响性能差异的主要因素;同时,特征分析揭示了正常认知人群语言更具语义精确性,而AD患者则表现出模糊表达、话语标记增多及犹豫模式加剧等特征,表明高质量ASR可使简单、可解释的词汇模型实现具有竞争力的AD检测效果,无需显式声学建模。

链接: https://arxiv.org/abs/2603.18239
作者: Himadri Samanta
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages, 7 figures

点击查看摘要

Abstract:Early detection of Alzheimer’s disease from spontaneous speech has emerged as a promising non-invasive screening approach. However, the influence of automatic speech recognition (ASR) quality on downstream clinical language modeling remains insufficiently understood. In this study, we investigate Alzheimer’s disease detection using lexical features derived from Whisper ASR transcripts on the ADReSSo 2021 diagnosis dataset. We evaluate interpretable machine-learning models, including Logistic Regression and Linear Support Vector Machines, using TF-IDF text representations under repeated 5x5 stratified cross-validation. Our results demonstrate that transcript quality has a statistically significant impact on classification performance. Models trained on Whisper-small transcripts consistently outperform those using Whisper-base transcripts, achieving balanced accuracy above 0.7850 with Linear SVM. Paired statistical testing confirms that the observed improvements are significant. Importantly, classifier complexity contributes less to performance variation than ASR transcription quality. Feature analysis reveals that cognitively normal speakers produce more semantically precise object- and scene-descriptive language, whereas Alzheimer’s speech is characterized by vagueness, discourse markers, and increased hesitation patterns. These findings suggest that high-quality ASR can enable simple, interpretable lexical models to achieve competitive Alzheimer’s detection performance without explicit acoustic modeling. The study provides a reproducible benchmark pipeline and highlights ASR selection as a critical modeling decision in clinical speech-based artificial intelligence systems. 
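论文报告的 balanced accuracy 即各类别召回率的均值;下面用纯 numpy 给出 TF-IDF 表示与该指标的极简示意(文档与标签均为虚构样例,并非论文的 sklearn 管线):

```python
import numpy as np

def tfidf(docs):
    """极简 TF-IDF: docs 为分词后的文档列表, 返回 (矩阵, 词表)。"""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d:
            tf[r, idx[w]] += 1
        tf[r] /= max(len(d), 1)
    df = (tf > 0).sum(axis=0)
    idf = np.log((1 + len(docs)) / (1 + df)) + 1   # 平滑 IDF
    return tf * idf, vocab

def balanced_accuracy(y_true, y_pred):
    """各类召回率的平均值, 即论文所报告的 balanced accuracy。"""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

docs = [["the", "boy", "steals", "cookie"], ["um", "uh", "thing", "thing"],
        ["girl", "washes", "dishes"], ["uh", "stuff", "um"]]
X, vocab = tfidf(docs)
y_true = np.array([0, 1, 0, 1])   # 0=认知正常, 1=AD(假想标签)
y_pred = np.array([0, 1, 1, 1])
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

在类别不平衡的临床数据上,balanced accuracy 比普通准确率更能反映两类召回的均衡程度。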

[NLP-85] ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody

【速读】: 该论文旨在解决当前关键词检测(Keyword Spotting, KS)系统在处理易混淆词汇时,仅依赖音素级匹配而忽略用户特定发音特征(如韵律,即语调、重音和节奏)所带来的识别准确性下降问题。其解决方案的关键在于提出ProKWS框架,通过双流编码器结构实现细粒度音素学习与个性化韵律建模的融合:一路径采用对比学习提取鲁棒的音素表示,另一路径捕捉说话人特有的韵律模式;随后通过动态协同融合模块整合音素与韵律信息,从而提升模型在不同声学环境下的适应性和对带语调及意图变化的个性化关键词的识别鲁棒性。

链接: https://arxiv.org/abs/2603.18024
作者: Jianan Pan,Yuanming Zhang,Kejie Huang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits like prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework integrating fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder where one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks and demonstrates strong robustness for personalized keywords with tone and intent variations.

[NLP-86] PCOV-KWS: Multi-task Learning for Personalized Customizable Open Vocabulary Keyword Spotting

【速读】: 该论文旨在解决智能语音助手在个性化和隐私保护需求日益增长背景下,传统关键词检测(Keyword Spotting, KWS)系统难以支持开放词汇、用户定制化且计算资源消耗高的问题。其核心解决方案是提出一种多任务学习框架——个性化开放词汇关键词检测(Personalized Customizable Open-Vocabulary Keyword Spotting, PCOV-KWS),通过轻量级网络同时执行KWS与说话人验证(Speaker Verification, SV)任务,实现个性化识别;关键创新在于采用非Softmax类别的训练准则,将多分类问题转化为多个二分类问题以消除类别间竞争,并引入多任务损失加权优化策略,从而在提升准确率的同时显著降低模型参数量和计算复杂度。

链接: https://arxiv.org/abs/2603.18023
作者: Jianan Pan,Kejie Huang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:As advancements in technologies like Internet of Things (IoT), Automatic Speech Recognition (ASR), Speaker Verification (SV), and Text-to-Speech (TTS) lead to increased usage of intelligent voice assistants, the demand for privacy and personalization has escalated. In this paper, we introduce a multi-task learning framework for personalized, customizable open-vocabulary Keyword Spotting (PCOV-KWS). This framework employs a lightweight network to simultaneously perform Keyword Spotting (KWS) and SV to address personalized KWS requirements. We have integrated a training criterion distinct from softmax-based loss, transforming multi-class classification into multiple binary classifications, which eliminates inter-category competition, while an optimization strategy for multi-task loss weighting is employed during training. We evaluated our PCOV-KWS system in multiple datasets, demonstrating that it outperforms the baselines in evaluation results, while also requiring fewer parameters and lower computational resources.
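摘要中"将多分类转化为多个二分类以消除类别间竞争"的训练准则,可用 sigmoid + BCE 与标准 softmax 交叉熵的对比来示意(简化版,并非论文的原始损失定义):

```python
import numpy as np

def softmax_ce(logits, target):
    """标准 softmax 交叉熵: 归一化使各类别在概率上相互竞争。"""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[target]))

def multi_binary_bce(logits, target, n_classes):
    """将 K 分类拆成 K 个独立的 sigmoid 二分类(BCE 求和),
    各类别概率互不牵制——与摘要所述准则同一思路的简化写法。"""
    y = np.zeros(n_classes)
    y[target] = 1.0
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).sum())

logits = np.array([2.0, -1.0, 0.5])
print(softmax_ce(logits, 0), multi_binary_bce(logits, 0, 3))
```

在 sigmoid 形式下,提高某一关键词的得分不会被迫压低其他关键词的概率,这正是"消除类别间竞争"的含义。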

信息检索

[IR-0] FinTradeBench: A Financial Reasoning Benchmark for LLMs

【速读】:该论文旨在解决当前金融决策模型在处理公司基本面(fundamentals)与交易信号(trading signals)之间跨模态推理能力不足的问题。现有金融问答基准多聚焦于财务报表数据,缺乏对股价动态及其与基本面交互关系的评估,导致模型难以全面理解市场行为。解决方案的关键在于提出FinTradeBench——一个整合纳斯达克-100公司十年历史数据的多模态金融推理基准,涵盖三类问题:以基本面为主、以交易信号为主及需跨信号推理的混合问题;并通过校准-扩展框架(calibration-then-scaling)结合专家标注、多模型生成、自过滤机制、数值审计和人-大语言模型(LLM)判别对齐,确保评测的可靠性与规模性。

链接: https://arxiv.org/abs/2603.19225
作者: Yogesh Agrawal,Aniruddha Dutta,Md Mahadi Hasan,Santu Karmaker,Aritra Dutta(University of Central Florida)
机构: University of Central Florida (中佛罗里达大学)
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)
备注: 8 pages main text, 22 pages total (including references and appendix). 5 figures, 14 tables. Preprint under review. Code and data will be made available upon publication

点击查看摘要

Abstract:Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

[IR-1] Comparative Analysis of Large Language Models in Generating Telugu Responses for Maternal Health Queries

【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在低资源语言环境下的产科医疗问答能力不足的问题,特别是针对Telugu、Hindi、Tamil、Urdu等语言中孕产相关问题的响应质量尚未被系统评估的现状。解决方案的关键在于通过构建双语数据集,结合BERT Score语义相似度指标与妇产科专家的多维度评分(包括准确性、流畅性、相关性、连贯性和完整性),对ChatGPT-4o、GeminiAI和Perplexity AI三种主流LLM在不同语言提示下的表现进行量化比较。研究发现,Gemini在Telugu语情境下生成准确且连贯的回答表现最优,而Perplexity在Telugu提示下亦表现出良好性能,表明模型选择与提示语言共同构成影响医疗信息检索效果的核心因素。

链接: https://arxiv.org/abs/2603.18898
作者: Anagani Bhanusree,Sai Divya Vissamsetty,K VenkataKrishna Rao,Rimjhim
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have progressively exhibited their capabilities across various areas of research. However, the performance of LLMs in acute maternal healthcare, predominantly in low-resource languages like Telugu, Hindi, Tamil, and Urdu, remains unstudied. This study examines how ChatGPT-4o, GeminiAI, and Perplexity AI respond to pregnancy-related questions asked in different languages. A bilingual dataset is used to obtain results by applying semantic similarity metrics (BERT Score) and expert assessments from experienced gynecologists. Multiple parameters, including accuracy, fluency, relevance, coherence, and completeness, are taken into consideration by the gynecologists to rate the responses generated by the LLMs. Gemini outperforms the other LLMs in producing accurate and coherent pregnancy-relevant responses in Telugu, while Perplexity also performs well when the prompts are in Telugu. ChatGPT's performance leaves room for improvement. The results indicate that both the choice of LLM and the prompting language play a crucial role in retrieving information. Altogether, we emphasize the need to improve LLM assistance in regional languages for healthcare purposes.

[IR-2] Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation ICDAR2026

【速读】:该论文旨在解决从PDF中可靠提取表格数据的问题,这是大规模科学数据挖掘和知识库构建的关键环节;现有评估方法依赖于规则-based指标,无法有效捕捉表格内容的语义等价性。其解决方案的关键在于提出一种基于合成PDF的基准测试框架,利用精确的LaTeX真实值作为参考,并引入大语言模型作为评判者(LLM-as-a-judge)进行语义层面的表格匹配评估,该方法集成到一个可处理解析输出不一致性的匹配管道中,显著提升了评估结果与人工判断的一致性(Pearson相关系数达0.93),优于传统的Tree Edit Distance (TEDS) 和 Grid Table Similarity (GriTS) 方法。

链接: https://arxiv.org/abs/2603.18652
作者: Pius Horn,Janis Keuper
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Submitted to ICDAR 2026

点击查看摘要

Abstract:Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: this https URL Metric study and human evaluation: this https URL
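论文以 Pearson 相关系数衡量自动指标与人工判断的一致性,其计算可示意如下(人工评分与两种指标打分均为虚构数据,仅演示"追随人工打分越紧,r 越高"):

```python
import numpy as np

def pearson_r(x, y):
    """皮尔逊相关系数: 中心化后向量夹角余弦。"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

human = [5, 4, 4, 2, 1, 3]                # 人工质量评分(模拟)
judge = [4.8, 4.1, 3.9, 2.2, 1.3, 3.1]    # LLM-as-a-judge 打分(模拟)
teds  = [4.0, 4.5, 2.0, 3.5, 1.0, 2.0]    # 规则指标打分(模拟)
print(pearson_r(human, judge) > pearson_r(human, teds))  # True
```

与人工评分的相关性越高,说明该指标越能替代昂贵的人工质量判断。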

[IR-3] Interplay: Training Independent Simulators for Reference-Free Conversational Recommendation ECIR2026

【速读】:该论文旨在解决训练对话式推荐系统(Conversational Recommender Systems, CRS)时大规模对话数据难以获取的问题。传统模拟方法依赖单一大型语言模型(Large Language Model, LLM)生成预先知晓目标物品的对话,导致对话流于程式化、缺乏真实感。其解决方案的关键在于提出一种无参考(reference-free)模拟框架,通过训练两个独立的LLM——一个作为用户,另一个作为对话推荐器——在无预设目标物品的情况下进行实时交互,仅提供偏好摘要和目标属性信息,使推荐器能够基于对话过程真正推断用户偏好。该方法显著提升了对话的真实性和多样性,同时保持了高质量与可扩展性。

链接: https://arxiv.org/abs/2603.18573
作者: Jerome Ramos,Feng Xia,Xi Wang,Shubham Chatterjee,Xiao Fu,Hossein A. Rahmani,Aldo Lipani
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at ECIR 2026

点击查看摘要

Abstract:Training conversational recommender systems (CRS) requires extensive dialogue data, which is challenging to collect at scale. To address this, researchers have used simulated user-recommender conversations. Traditional simulation approaches often utilize a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. We propose a reference-free simulation framework that trains two independent LLMs, one as the user and one as the conversational recommender. These models interact in real-time without access to predetermined target items, but preference summaries and target attributes, enabling the recommender to genuinely infer user preferences through dialogue. This approach produces more realistic and diverse conversations that closely mirror authentic human-AI interactions. Our reference-free simulators match or exceed existing methods in quality, while offering a scalable solution for generating high-quality conversational recommendation data without constraining conversations to pre-defined target items. We conduct both quantitative and human evaluations to confirm the effectiveness of our reference-free approach.

[IR-4] Latent Factor Modeling with Expert Network for Multi-Behavior Recommendation

【速读】:该论文旨在解决传统推荐方法因仅建模单一用户行为(如购买)而导致的数据稀疏问题,以及现有多行为推荐方法中行为因素混杂、难以捕捉特定用户意图的局限性。其解决方案的关键在于提出一种基于专家网络的多行为建模方法(MBLFE),通过设计一个门控专家网络,其中每个专家专门学习特定的潜在因子,而门控网络则动态选择最优专家组合以更精准地表示用户偏好;同时引入自监督学习确保专家间独立性和单个专家因子的一致性,并利用多行为数据丰富嵌入表示,从而提升因子提取的全面性与准确性。

链接: https://arxiv.org/abs/2603.18556
作者: Mingshi Yan,Zhiyong Cheng,Yahong Han,Meng Wang
机构: Tianjin University (天津大学); Hefei University of Technology (合肥工业大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Traditional recommendation methods, which typically focus on modeling a single user behavior (e.g., purchase), often face severe data sparsity issues. Multi-behavior recommendation methods offer a promising solution by leveraging user data from diverse behaviors. However, most existing approaches entangle multiple behavioral factors, learning holistic but imprecise representations that fail to capture specific user intents. To address this issue, we propose a multi-behavior method by modeling latent factors with an expert network (MBLFE). In our approach, we design a gating expert network, where the expert network models all latent factors within the entire recommendation scenario, with each expert specializing in a specific latent factor. The gating network dynamically selects the optimal combination of experts for each user, enabling a more accurate representation of user preferences. To ensure independence among experts and factor consistency of a particular expert, we incorporate self-supervised learning during the training process. Furthermore, we enrich embeddings with multi-behavior data to provide the expert network with more comprehensive collaborative information for factor extraction. Extensive experiments on three real-world datasets demonstrate that our method significantly outperforms state-of-the-art baselines, validating its effectiveness.
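MBLFE 的"门控专家网络"思想(各专家建模一个潜在因子,门控为每个用户动态选专家组合)可用如下 top-k softmax 门控做极简示意(非官方实现;维度、专家数等均为假设):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, k = 16, 8, 3   # 嵌入维度 / 专家数(潜在因子数) / 每用户激活专家数

W_gate = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d, d)) * 0.1   # 每个专家一个线性变换

def gated_expert_output(user_emb):
    """门控网络选 top-k 专家, 按 softmax 权重融合其输出。"""
    scores = user_emb @ W_gate
    topk = np.argsort(scores)[-k:]                    # 得分最高的 k 个专家
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()
    out = sum(wi * (user_emb @ experts[i]) for wi, i in zip(w, topk))
    return out, topk, w

user = rng.normal(size=d)
out, chosen, weights = gated_expert_output(user)
```

论文在此之上还叠加了自监督约束以保证专家间独立、单专家因子一致,这些训练目标未在本示意中体现。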

[IR-5] Total Recall QA: A Verifiable Evaluation Suite for Deep Research Agents

【速读】:该论文旨在解决深度研究代理(Deep Research Agents)在复杂信息检索与推理任务中的评估难题。当前基准测试无法满足全面评估所需的核心要求,导致模型性能难以客观比较。解决方案的关键在于提出一种名为“全召回问答”(Total Recall Question Answering, TRQA)的新任务框架,其通过构建单答案、全召回查询,并基于结构化知识库(如Wikidata)与文本语料库进行精确的相关性判断,从而实现大规模、可控的数据生成与公平评估。该框架有效缓解了数据污染问题,为深度研究代理的检索和端到端性能提供了可复现的基线结果。

链接: https://arxiv.org/abs/2603.18516
作者: Mahta Rafiee,Heydar Soudani,Zahra Abbasiantaeb,Mohammad Aliannejadi,Faegheh Hasibi,Hamed Zamani
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Radboud University (拉德布德大学); University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Deep research agents have emerged as LLM-based systems designed to perform multi-step information seeking and reasoning over large, open-domain sources to answer complex questions by synthesizing information from multiple information sources. Given the complexity of the task and despite various recent efforts, evaluation of deep research agents remains fundamentally challenging. This paper identifies a list of requirements and optional properties for evaluating deep research agents. We observe that existing benchmarks do not satisfy all identified requirements. Inspired by prior research on TREC Total Recall Tracks, we introduce the task of Total Recall Question Answering and develop a framework for deep research agents evaluation that satisfies the identified criteria. Our framework constructs single-answer, total recall queries with precise evaluation and relevance judgments derived from a structured knowledge base paired with a text corpus, enabling large-scale data construction. Using this framework, we build TRQA, a deep research benchmark constructed from Wikidata-Wikipedia as a real-world source and a synthetically generated e-commerce knowledge base and corpus to mitigate the effects of data contamination. We benchmark the collection with representative retriever and deep research models and establish baseline retrieval and end-to-end results for future comparative evaluation.
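Total recall 式评测的核心是相对"完整相关集合"计算召回;基础指标可如下计算(文档 ID 与相关性标注为虚构示例):

```python
def recall_precision(retrieved, relevant):
    """Total-recall 式评测: 相对完整相关集合计算召回率与精确率。"""
    retrieved, relevant = set(retrieved), set(relevant)
    hit = len(retrieved & relevant)
    return hit / len(relevant), hit / len(retrieved)

# 相关集合来自结构化知识库的判定, 检索结果来自被测系统(均为假想数据)
relevant = ["d1", "d2", "d3", "d4"]
retrieved = ["d2", "d3", "d9"]
r, p = recall_precision(retrieved, relevant)
print(r, p)  # 0.5 0.6666666666666666
```

单答案、全召回的查询设计保证了分母(相关集合)精确可知,这正是该框架"评测可验证"的前提。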

[IR-6] HypeMed: Enhancing Medication Recommendations with Hypergraph-Based Patient Relationships

【速读】:该论文旨在解决临床药物推荐中因患者潜在临床状态难以从稀疏且噪声较大的健康记录中准确推断而导致的推荐效果不佳问题,其核心挑战在于既要保留就诊记录内部共现实体的高阶组合语义,又要通过有效的、基于就诊条件的检索机制利用历史参考信息。解决方案的关键在于提出一种两阶段超图(hypergraph)框架HypeMed,其中第一阶段MedRep模块通过知识感知的对比预训练将临床就诊编码为超边,构建全局一致且利于检索的嵌入空间;第二阶段SimMR模块在此空间内执行动态检索,融合检索到的历史参考与患者的纵向数据以优化药物预测,从而在提升推荐精度的同时显著降低药物相互作用(DDI)风险。

链接: https://arxiv.org/abs/2603.18459
作者: Xiangxu Zhang,Xiao Zhou,Hongteng Xu,Jianxun Lian
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); Beijing Key Laboratory of Research on Large Models and Intelligent Governance(北京市大模型与智能治理研究重点实验室); Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE(教育部下一代智能搜索与推荐工程研究中心); Microsoft Research Asia(微软亚洲研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by TOIS

点击查看摘要

Abstract:Medication recommendations aim to generate safe and effective medication sets from health records. However, accurately recommending medications hinges on inferring a patient’s latent clinical condition from sparse and noisy observations, which requires both (i) preserving the visit-level combinatorial semantics of co-occurring entities and (ii) leveraging informative historical references through effective, visit-conditioned retrieval. Most existing methods fall short in one of both aspects: graph-based modeling often fragments higher-order intra-visit patterns into pairwise relations, while inter-visit augmentation methods commonly exhibit an imbalance between learning a globally stable representation space and performing dynamic retrieval within it. To address these limitations, this paper proposes HypeMed, a two-stage hypergraph-based framework unifying intra-visit coherence modeling and inter-visit augmentation. HypeMed consists of two core modules: MedRep for representation pre-training, and SimMR for similarity-enhanced recommendation. In the first stage, MedRep encodes clinical visits as hyperedges via knowledge-aware contrastive pre-training, creating a globally consistent, retrieval-friendly embedding space. In the second stage, SimMR performs dynamic retrieval within this space, fusing retrieved references with the patient’s longitudinal data to refine medication prediction. Evaluation on real-world benchmarks shows that HypeMed outperforms state-of-the-art baselines in both recommendation precision and DDI reduction, simultaneously enhancing the effectiveness and safety of clinical decision support.

[IR-7] SODIUM: From Open Web Data to Queryable Databases

【速读】:该论文旨在解决开放网络中多源异构数据的自动化收集与结构化整合问题,即“SODIUM任务”(Search, Organize, and Database Instantiation from the Open Web),该任务要求系统在无需预设数据库的情况下,通过深度探索互联网资源,提取并组织信息形成可查询的结构化表。其核心挑战在于如何实现高效、系统化的网络探索、结构相关性利用以及信息集成。解决方案的关键是提出SODIUM-Agent,一个由网页探索者和缓存管理器组成的多代理系统,结合了ATP-BFS算法以优化导航路径,并通过受控的缓存策略提升探索效率与结构一致性,从而显著提升对复杂数据采集任务的准确率(达91.1%),远超现有基线模型(最高仅46.5%)。

链接: https://arxiv.org/abs/2603.18447
作者: Chuxuan Hu,Philip Li,Maxwell Yang,Daniel Kang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:During research, domain experts often ask analytical questions whose answers require integrating data from a wide range of web sources. Thus, they must spend substantial effort searching, extracting, and organizing raw data before analysis can begin. We formalize this process as the SODIUM task, where we conceptualize open domains such as the web as latent databases that must be systematically instantiated to support downstream querying. Solving SODIUM requires (1) conducting in-depth and specialized exploration of the open web, which is further strengthened by (2) exploiting structural correlations for systematic information extraction and (3) integrating collected information into coherent, queryable database instances. To quantify the challenges in automating SODIUM, we construct SODIUM-Bench, a benchmark of 105 tasks derived from published academic papers across 6 domains, where systems are tasked with exploring the open web to collect and aggregate data from diverse sources into structured tables. Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy. To bridge this gap, we develop SODIUM-Agent, a multi-agent system composed of a web explorer and a cache manager. Powered by our proposed ATP-BFS algorithm and optimized through principled management of cached sources and navigation paths, SODIUM-Agent conducts deep and comprehensive web exploration and performs structurally coherent information extraction. SODIUM-Agent achieves 91.1% accuracy on SODIUM-Bench, outperforming the strongest baseline by approximately 2 times and the weakest by up to 73 times. 

[IR-8] From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory

【速读】:该论文旨在解决传统嵌入模型(embedding models)仅能根据文本语义内容(what text is about)进行分组的问题,而忽视了文本中事件或结构的动态模式(what text does)。其核心挑战在于如何从大量文本中自动识别并建模具有普遍性的过渡结构概念(transition-structure concepts),从而捕捉文本的功能性、文体特征及文学传统。解决方案的关键在于训练一个参数为29.4M的对比模型(contrastive model),利用来自Project Gutenberg的373百万个共现对(co-occurrence pairs)构建关联空间(association space),使具有相似过渡结构的段落聚类在一起;在容量受限条件下(准确率42.75%),模型被迫压缩重复出现的结构模式而非记忆个体共现,从而实现跨文本的结构性抽象与泛化能力。此方法将预测性关联记忆(Predictive Associative Memory, PAM)从情景回忆扩展至概念形成,通过多粒度聚类生成多层次的概念图谱,并在未见小说上表现出选择性聚焦于少数一致集群的能力,显著区别于基于原始嵌入相似性的聚类方式。

链接: https://arxiv.org/abs/2603.18420
作者: Jason Dury
机构: Independent Researcher(独立研究员)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 22 pages, 5 figures. Code and demo: this https URL

点击查看摘要

Abstract:Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts or what text does. We train a 29.4M-parameter contrastive model on 373 million co-occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre-trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co-occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi-resolution concept map; from broad modes like “direct confrontation” and “lyrical meditation” to precise registers and scene templates like “sailor dialect” and “courtroom cross-examination.” At k=100, clusters average 4,508 books each (of 9,766), confirming corpus-wide patterns. Direct comparison with embedding-similarity clustering shows that raw embeddings group by topic while association-space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book-concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi-epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.
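此类共现对的对比训练常用 InfoNCE 目标(批内其他样本作负例);以下为该目标函数的通用 numpy 示意,并非论文的具体训练代码,锚点向量为玩具构造:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE 对比损失: 相似度矩阵对角线为正例, 其余批内样本作负例。"""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

a = np.eye(4, 16)                             # 4 个锚点向量(玩具示例)
loss_aligned = info_nce(a, a)                 # 正例与锚点一致 -> 损失极小
loss_shuffled = info_nce(a, a[[1, 2, 3, 0]])  # 正例错位 -> 损失大
print(loss_aligned < loss_shuffled)  # True
```

摘要强调的"容量受限"并不体现在损失本身,而在于模型参数量远小于共现对数量,迫使表示压缩为可迁移的结构模式。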

[IR-9] Auditing Preferences for Brands and Cultures in LLM s

【速读】:该论文旨在解决生成式 AI(Generative AI)在市场中介中的系统性风险问题,特别是其对市场公平性、竞争格局以及信息多样性的影响。为实现可复现的审计机制,作者提出 ChoiceEval 框架,其核心创新在于两个关键技术:一是基于心理画像(psychographic profiles)生成多样化且贴近真实用户行为的评估查询,二是将自由文本输出转化为标准化的 top-k 选择集与量化偏好指标。该方法实现了跨主题和用户群体的偏好与地理偏差度量,从而建立起模型行为与现实经济后果之间的可解释联系。

链接: https://arxiv.org/abs/2603.18300
作者: Jasmine Rienecker,Katarina Mpofu,Naman Goel,Siddhartha Datta,Jun Zhao,Oscar Danielsson,Fredrik Thorsen
机构: University of Oxford (牛津大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 20 pages, 2 figures

点击查看摘要

Abstract:Large language models (LLMs) based AI systems increasingly mediate what billions of people see, choose and buy. This creates an urgent need to quantify the systemic risks of LLM-driven market intermediation, including its implications for market fairness, competition, and the diversity of information exposure. This paper introduces ChoiceEval, a reproducible framework for auditing preferences for brands and cultures in large language models (LLMs) under realistic usage conditions. ChoiceEval addresses two core technical challenges: (i) generating realistic, persona-diverse evaluation queries and (ii) converting free-form outputs into comparable choice sets and quantitative preference metrics. For a given topic (e.g. running shoes, hotel chains, travel destinations), the framework segments users into psychographic profiles (e.g., budget-conscious, wellness-focused, convenience), and then derives diverse prompts that reflect real-world advice-seeking and decision-making behaviour. LLM responses are converted into normalised top-k choice sets. Preference and geographic bias are then quantified using comparable metrics across topics and personas. Thus, ChoiceEval provides a scalable audit pipeline for researchers, platforms, and regulators, linking model behaviour to real-world economic outcomes. Applied to Gemini, GPT, and DeepSeek across 10 topics spanning commerce and culture and more than 2,000 questions, ChoiceEval reveals consistent preferences: U.S.-developed models Gemini and GPT show marked favouritism toward American entities, while China-developed DeepSeek exhibits more balanced yet still detectable geographic preferences. These patterns persist across user personas, suggesting systematic rather than incidental effects. 
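ChoiceEval 将自由文本输出规约为标准化 top-k 选择集后计算偏好指标;一个简化的"产地份额"度量可如下实现(品牌与产地映射为虚构示例,并非官方指标定义):

```python
from collections import Counter

def origin_share(topk_sets, origin_map):
    """对多次查询的 top-k 选择集, 统计各产地品牌被推荐的份额,
    作为地理偏好的一个简化度量。"""
    counts = Counter(origin_map.get(b, "unknown")
                     for s in topk_sets for b in s)
    total = sum(counts.values())
    return {o: c / total for o, c in counts.items()}

origin_map = {"Nike": "US", "Adidas": "DE", "Li-Ning": "CN", "Asics": "JP"}
topk_sets = [["Nike", "Adidas", "Asics"], ["Nike", "Li-Ning", "Nike"]]
share = origin_share(topk_sets, origin_map)
print(share["US"])  # 0.5
```

将该份额与市场真实份额或均匀基线对比,即可量化摘要所说的"系统性而非偶然"的地理偏好。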

[IR-10] Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂技术服务平台中适应性不足的问题,核心挑战包括人类示范中缺乏显式的认知链(cognitive chains)以及有效响应的多样性带来的固有歧义,这导致代理难以内化潜在决策动态并实现良好泛化;同时,标准训练范式存在资源和时间成本过高的问题。解决方案的关键在于提出一种轻量级适配框架,包含三项核心贡献:(1) 隐式逻辑增强(Latent Logic Augmentation),通过规划感知轨迹建模与决策推理增强来弥合表面监督与潜在决策逻辑之间的差距,提升监督微调对齐的稳定性;(2) 噪声鲁棒性降低(Robust Noise Reduction),构建多真实标签数据集并通过双过滤机制验证多样响应以捕捉语义多样性并减少噪声;(3) 轻量化适配机制(Lightweight Adaptation),设计融合LLM评分器与轻量级相关性重排序器的混合奖励机制,在保持与标准LLM-as-a-Judge强化学习相当对齐效果的同时显著降低计算开销。

链接: https://arxiv.org/abs/2603.18074
作者: Yi Yu,Junzhuo Ma,Chenghuang Shen,Xingyan Liu,Jing Gu,Hangyi Sun,Guangquan Hu,Jianfeng Liu,Weiting Liu,Mingyue Pu,Yu Wang,Zhengdong Xiao,Rui Xie,Longjiu Luo,Qianrong Wang,Gurong Cui,Honglin Qiao,Wenlian Lu
机构: Fudan University (复旦大学); Alibaba Group (阿里巴巴集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Adapting Large Language Models in complex technical service domains is constrained by the absence of explicit cognitive chains in human demonstrations and the inherent ambiguity arising from the diversity of valid responses. These limitations severely hinder agents from internalizing latent decision dynamics and generalizing effectively. Moreover, practical adaptation is often impeded by the prohibitive resource and time costs associated with standard training paradigms. To overcome these challenges and guarantee computational efficiency, we propose a lightweight adaptation framework comprising three key contributions. (1) Latent Logic Augmentation: We introduce Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation to bridge the gap between surface-level supervision and latent decision logic. These approaches strengthen the stability of Supervised Fine-Tuning alignment. (2) Robust Noise Reduction: We construct a Multiple Ground Truths dataset through a dual-filtering method to reduce the noise by validating diverse responses, thereby capturing the semantic diversity. (3) Lightweight Adaptation: We design a Hybrid Reward mechanism that fuses an LLM-based judge with a lightweight relevance-based Reranker to distill high-fidelity reward signals while reducing the computational cost compared to standard LLM-as-a-Judge reinforcement learning. Empirical evaluations on real-world Cloud service tasks, conducted across semantically diverse settings, demonstrate that our framework achieves stability and performance gains through Latent Logic Augmentation and Robust Noise Reduction. Concurrently, our Hybrid Reward mechanism achieves alignment comparable to standard LLM-as-a-judge methods with reduced training time, underscoring the practical value for deploying technical service agents.

[IR-11] DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation

【Quick Read】: This paper addresses a limitation of traditional retrieval-augmented generation (RAG) frameworks when handling dynamic information needs: relying solely on a static knowledge base makes them ineffective for time-sensitive questions. The key to the solution is a dynamic knowledge integration mechanism: an LLM-based reranker assesses document relevance, a sufficiency classifier decides whether an external API call is needed as a supplement, Gorilla v2 performs accurate tool invocation, and FAISS-based schema filtering refines API selection. This dynamics-aware routing strategy with selective tool use significantly improves answer accuracy and reduces hallucinations.

Link: https://arxiv.org/abs/2603.18012
Authors: Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan, Yichao Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 – a state-of-the-art API calling model – for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
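The routing described above (rerank the static corpus, check sufficiency, fall back to an API call only when needed) can be sketched as follows. The function names, the top-3 cutoff, and the callable stubs are illustrative assumptions, not DynaRAG's actual implementation.

```python
def answer_with_fallback(query, docs, rerank, is_sufficient, call_api, generate):
    """Sketch of DynaRAG-style routing (names and cutoff are assumed).

    rerank(query, docs)       -> docs sorted by estimated relevance
    is_sufficient(query, ds)  -> True if the static docs can answer the query
    call_api(query)           -> fresh evidence fetched from an external API
    generate(query, evidence) -> final answer string
    """
    ranked = rerank(query, docs)
    top = ranked[:3]                       # keep the top-ranked static evidence
    if is_sufficient(query, top):          # static corpus answers the query
        return generate(query, top)
    dynamic = call_api(query)              # fallback: dynamic knowledge via API
    return generate(query, top + [dynamic])
```

The point of the structure is that the (potentially expensive) API call sits behind the sufficiency gate, so static questions never pay for it.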

[IR-12] Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating

【Quick Read】: This paper tackles the unreliability of evidence selection in retrieval-augmented question answering when systems rely on similarity scores alone: when many candidates receive similar scores, the system may select text that is redundant, incomplete, or addresses different conditions than the question requires. The key to the solution is a deterministic evidence selection framework introducing two fixed scoring procedures, Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), which evaluate each sentence unit independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy control. Only units that explicitly state the fact, rule, or condition required by the task are accepted as evidence, without merging or expansion, yielding compact, auditable evidence sets and a clear boundary between relevant text and usable evidence.

Link: https://arxiv.org/abs/2603.18011
Authors: Victor P. Unda
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 21 pages, 1 figure, 4 tables

Abstract:Many modern AI question-answering systems convert text into vectors and retrieve the closest matches to a user question. While effective for topical similarity, similarity scores alone do not explain why some retrieved text can serve as evidence while other equally similar text cannot. When many candidates receive similar scores, systems may select sentences that are redundant, incomplete, or address different conditions than the question requires. This paper presents a deterministic evidence selection framework for retrieval-augmented question answering. The approach introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that determine evidence admissibility prior to answer generation. Each sentence or record is evaluated independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. No training or fine-tuning is required. In the prototype, a unit is accepted only if it explicitly states the fact, rule, or condition required by the task. Units are not merged or expanded. If no unit independently satisfies the requirement, the system returns no answer. This deterministic gating produces compact, auditable evidence sets and establishes a clear boundary between relevant text and usable evidence.
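A minimal sketch of the deterministic gating idea: score each unit independently on fixed, explicit signals, accept only units that clear a fixed threshold, and return an empty set (meaning "no answer") otherwise. The aggregation rule and threshold below are illustrative assumptions, not the paper's MUE/DUE formulas.

```python
def gate_evidence(units, signals, threshold=0.7):
    """Accept units whose explicit-signal score clears a fixed threshold.

    units   : list of sentence/record strings
    signals : dict mapping unit -> dict of signal scores in [0, 1]
              (e.g. relatedness, coverage, distinctiveness)
    Returns the accepted units, or [] meaning "no answer".
    No training is involved: the rule is fully deterministic and auditable.
    """
    accepted = []
    for u in units:
        score = min(signals[u].values())  # assumed: a unit must pass every signal
        if score >= threshold:
            accepted.append(u)            # units are never merged or expanded
    return accepted
```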

[IR-13] Negative Sampling Techniques in Information Retrieval: A Survey EACL2026

【Quick Read】: This survey addresses the critical influence of negative sample selection on training dense retrieval (DR) models. Under the contrastive learning paradigm, the quality of negatives directly determines a model's semantic discrimination ability and final retrieval performance. The key contribution is a systematic taxonomy that categorizes negative-generation methods into random, static/dynamically mined, and synthetic approaches, together with an analysis of the trade-offs between effectiveness, computational cost, and implementation difficulty. Notably, it is the first survey to cover Large Language Model (LLM)-driven synthetic data methods, providing a foundation and practical guidance for generating high-quality negatives with LLMs.

Link: https://arxiv.org/abs/2603.18005
Authors: Laurin Wischounig, Abdelrahman Abdallah, Adam Jatowt
Affiliations: University of Innsbruck
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at EACL 2026 (Findings)

Abstract:Information Retrieval (IR) is fundamental to many modern NLP applications. The rise of dense retrieval (DR), using neural networks to learn semantic vector representations, has significantly advanced IR performance. Central to training effective dense retrievers through contrastive learning is the selection of informative negative samples. Synthesizing 35 seminal papers, this survey provides a comprehensive and up-to-date overview of negative sampling techniques in dense IR. Our unique contribution is the focus on modern NLP applications and the inclusion of recent Large Language Model (LLM)-driven methods, an area absent in prior reviews. We propose a taxonomy that categorizes techniques including random, static/dynamically mined, and synthetic datasets. We then analyze these approaches with respect to trade-offs between effectiveness, computational cost, and implementation difficulty. The survey concludes by outlining current challenges and promising future directions for the use of LLM-generated synthetic data.
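As one concrete instance of the "random" category in the survey's taxonomy, in-batch negatives treat every other passage in a training batch as a negative for a given query. The pure-Python sketch below computes a standard InfoNCE-style contrastive loss over such negatives; it is illustrative only, and the temperature value is an assumption.

```python
import math

def in_batch_contrastive_loss(queries, passages, temperature=0.05):
    """InfoNCE loss with in-batch random negatives (pure Python sketch).

    queries, passages : lists of equal-length float vectors; passages[i] is
    the positive for queries[i], and every passages[j], j != i, serves as
    a random in-batch negative.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(v):
        return math.sqrt(dot(v, v)) or 1.0

    loss = 0.0
    for i, q in enumerate(queries):
        # cosine similarity to every passage, scaled by the temperature
        sims = [dot(q, p) / (norm(q) * norm(p)) / temperature for p in passages]
        m = max(sims)                              # numerical stability shift
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += -(sims[i] - log_z)                 # -log softmax of the positive
    return loss / len(queries)
```

Mined and synthetic negatives, the survey's other categories, plug into the same loss by extending the passage list with harder candidates.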

[IR-14] Predictive Associative Memory: Retrieval Beyond Similarity Through Temporal Co-occurrence

【Quick Read】: This paper targets a limitation of similarity-based retrieval in current neural memory systems, namely the assumption that useful memories are similar memories, which fails to capture a core property of biological memory: association through temporal co-occurrence. The proposed Predictive Associative Memory (PAM) introduces a dual JEPA architecture: a standard Outward JEPA predicts future states of incoming sensory data (trained on temporal co-occurrence), while a new Inward JEPA predicts associatively reachable past states within stored experience, explicitly modeling associative structure in the embedding space. The key is that memory navigation is driven by temporal co-occurrence signals learned from a continuous experience stream rather than static embedding similarity; on a synthetic benchmark the approach recalls true temporal associates with high precision (Association Precision@1 = 0.970) and retains strong discrimination even where embedding similarity is uninformative (AUC = 0.849).

Link: https://arxiv.org/abs/2602.11322
Authors: Jason Dury
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, 6 figures, for associated Git: this https URL

Abstract:Current approaches to memory in neural systems rely on similarity-based retrieval: given a query, find the most representationally similar stored state. This assumption – that useful memories are similar memories – fails to capture a fundamental property of biological memory: association through temporal co-occurrence. We propose Predictive Associative Memory (PAM), an architecture in which a JEPA-style predictor, trained on temporal co-occurrence within a continuous experience stream, learns to navigate the associative structure of an embedding space. We introduce an Inward JEPA that operates over stored experience (predicting associatively reachable past states) as the complement to the standard Outward JEPA that operates over incoming sensory data (predicting future states). We evaluate PAM as an associative recall system – testing faithfulness of recall for experienced associations – rather than as a retrieval system evaluated on generalisation to unseen associations. On a synthetic benchmark, the predictor’s top retrieval is a true temporal associate 97% of the time (Association Precision@1 = 0.970); it achieves cross-boundary Recall@20 = 0.421 where cosine similarity scores zero; and it separates experienced-together from never-experienced-together states with a discrimination AUC of 0.916 (cosine: 0.789). Even restricted to cross-room pairs where embedding similarity is uninformative, the predictor achieves AUC = 0.849 (cosine: 0.503, chance). A temporal shuffle control confirms the signal is genuine temporal co-occurrence structure, not embedding geometry: shuffling collapses cross-boundary recall by 90%, replicated across training seeds. All results are stable across seeds (SD 0.006) and query selections (SD \leq 0.012).
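The headline metric, Association Precision@1, simply asks how often the top retrieved state is a true temporal associate of the query. A toy version of that evaluation might look like this; the data layout is an assumption, not the paper's code.

```python
def association_precision_at_1(retrievals, true_associates):
    """Fraction of queries whose top retrieval is a true temporal associate.

    retrievals      : dict query -> ranked list of retrieved state ids
    true_associates : dict query -> set of state ids that co-occurred in time
    """
    hits = sum(
        1 for q, ranked in retrievals.items()
        if ranked and ranked[0] in true_associates.get(q, set())
    )
    return hits / len(retrievals)
```

Note that ground truth here is defined by temporal co-occurrence in the experience stream, not by embedding similarity, which is exactly the distinction the paper's shuffle control tests.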

Human-Computer Interaction

[HC-0] Constitutive vs. Corrective: A Causal Taxonomy of Human Runtime Involvement in AI Systems

【Quick Read】: This paper addresses the semantic ambiguity of the terminology for human involvement in AI systems used in high-stakes decision-making, such as Human-in-the-Loop (HITL), Human-on-the-Loop (HOTL), and Human Oversight. This ambiguity hampers interdisciplinary collaboration across computer science, law, philosophy, psychology, and sociology, and creates regulatory uncertainty. The key to the solution is redefining human involvement in terms of causal structure: HITL is constitutive (a human contribution is necessary for the decision output), while HOTL is corrective (external to the primary causal chain, capable of preventing or modifying outputs). Within HOTL, three temporal modes are distinguished (synchronous, asynchronous, and anticipatory), and an orthogonal cognitive-integration dimension (complementary or hybrid intelligence) yields four structurally distinct configurations. Finally, the normative requirement is made explicit: statutory "Human Oversight" is a specific mode of HOTL that demands not merely a corrective causal position but genuine preparedness and capacity for effective intervention. The paper argues that role duality (the same person occupying HITL and HOTL roles simultaneously) must be treated as a design problem requiring architectural and epistemic mitigation rather than mere acknowledgment.

Link: https://arxiv.org/abs/2603.19213
Authors: Kevin Baum, Johann Laux
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:As AI systems increasingly permeate high-stakes decision-making, the terminology regarding human involvement - Human-in-the-Loop (HITL), Human-on-the-Loop (HOTL), and Human Oversight - has become vexingly ambiguous. This ambiguity complicates interdisciplinary collaboration between computer science, law, philosophy, psychology, and sociology and can lead to regulatory uncertainty. We propose a clarification grounded in causal structure, focused on human involvement during the runtime of AI systems. The distinction between HITL and HOTL, we argue, is not primarily spatial but causal: HITL is constitutive (a human contribution is necessary for the decision output), while HOTL is corrective (external to the primary causal chain, capable of preventing or modifying outputs). Within HOTL, we distinguish three temporal modes - synchronous, asynchronous, and anticipatory - situated within a nested model of provider and deployer runtime that clarifies their different capacities for intervention. A second, orthogonal dimension captures cognitive integration: whether human and machine operate as complementary or hybrid intelligence, yielding four structurally distinct configurations. Finally, we distinguish these descriptive categories from the normative requirements they serve: statutory “Human Oversight” is a specific normative mode of HOTL that demands not merely a corrective causal position, but genuine preparedness and capacity for effective intervention. Because the same person may occupy both HITL and HOTL roles simultaneously, we argue that this role duality must be treated as a design problem requiring architectural and epistemic mitigation rather than mere acknowledgment.

[HC-1] Exploring the Role of Interaction Data to Empower End-User Decision-Making In UI Personalization

【Quick Read】: This paper addresses the underuse of user-driven interface personalization caused by limited support for identifying and evaluating worthwhile personalization opportunities. The key to the solution is a reflexive personalization approach: experimental vignettes serve as design probes that prompt users to reflect on their digital interaction data, enabling them to independently identify meaningful personalization opportunities, while preferring system support in the form of visual personalization suggestions. The study finds that interaction data reinforces the perceived value of personalization, helps users weigh benefits against effort, and increases the transparency of system suggestions, ultimately raising users' agency in personalization decisions.

Link: https://arxiv.org/abs/2603.19196
Authors: Sérgio Alves, Carlos Duarte, Kyle Montague, Tiago Guerreiro
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems

Abstract:User interface personalization enhances digital efficiency, usability, and accessibility. However, in user-driven setups, limited support for identifying and evaluating worthwhile opportunities often leads to underuse. We explore a reflexive personalization approach where individuals engage with their digital interaction data to identify meaningful personalization opportunities and benefits. We interviewed 12 participants, using experimental vignettes as design probes to support reflection on different forms of using interaction data to empower decision-making in personalization and the preferred level of system support. We found that people can independently identify personalization opportunities but prefer system support through visual personalization suggestions. Interaction data can shape how users perceive and approach personalization by reinforcing the perceived value of change and data collection, helping them weigh benefits against effort, and increasing the transparency of system suggestions. We discuss opportunities for designing personalization software that raises end-users’ agency over interfaces through reflective engagement with their interaction data.

[HC-2] Introducing M: A Modular Modifiable Social Robot

【Quick Read】: This paper addresses the research inefficiency caused by platform friction in social robotics: high platform complexity, poor reproducibility, and difficult deployment. The key to the solution is M, an open-source, low-cost social robot platform that combines a modular mechanical design, multimodal sensing, and expressive yet mechanically simple actuation with a ROS2-based software architecture that cleanly separates perception, expression control, and data management. A simulation environment with interface equivalence to the hardware supports rapid sim-to-real transfer of interaction behaviors, significantly improving the platform's extensibility and practicality.

Link: https://arxiv.org/abs/2603.19134
Authors: Victor Nikhil Antony, Zhili Gong, Yoonjae Kim, Chien-Ming Huang
Affiliations: Johns Hopkins University; Rice University
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments:

Abstract:We present M, an open-source, low-cost social robot platform designed to reduce platform friction that slows social robotics research by making robots easier to reproduce, modify, and deploy in real-world settings. M combines a modular mechanical design, multimodal sensing, and expressive yet mechanically simple actuation architecture with a ROS2-native software package that cleanly separates perception, expression control, and data management. The platform includes a simulation environment with interface equivalence to hardware to support rapid sim-to-real transfer of interaction behaviors. We demonstrate extensibility through additional sensing/actuation modules and provide example interaction templates for storytelling and two-way conversational coaching. Finally, we report real-world use in participatory design and week-long in-home deployments, showing how M can serve as a practical foundation for longitudinal, reproducible social robotics research.

[HC-3] LLMs Aren't Human: A Critical Perspective on LLM Personality

【Quick Read】: This paper questions the common practice of applying the human Big Five personality traits directly to Large Language Models (LLMs) to assess whether their behavior exhibits stable, intrinsic personality, a practice whose underlying assumptions have gone largely unexamined. The paper shows that LLM responses to personality tests satisfy none of six defining characteristics of personality, so such assessments do not measure a construct equivalent to human personality. The key to the solution is shifting from anthropomorphic trait attribution toward functional evaluations: clarifying what personality tests actually capture in LLMs and developing LLM-specific frameworks for characterizing stable, intrinsic behavior.

Link: https://arxiv.org/abs/2603.19030
Authors: Kim Zierahn, Cristina Cachero, Anna Korhonen, Nuria Oliver
Affiliations: ELLIS Alicante; University of Alicante; University of Cambridge
Subjects: Human-Computer Interaction (cs.HC)
Comments: 4 pages

Abstract:A growing body of research examines personality traits in Large Language Models (LLMs), particularly in human-agent collaboration. Prior work has frequently applied the Big Five inventory to assess LLM behavior analogous to human personality, without questioning the underlying assumptions. This paper critically evaluates whether LLM responses to personality tests satisfy six defining characteristics of personality. We find that none are fully met, indicating that such assessments do not measure a construct equivalent to human personality. We propose a research agenda for shifting from anthropomorphic trait attribution toward functional evaluations, clarifying what personality tests actually capture in LLMs and developing LLM-specific frameworks for characterizing stable, intrinsic behavior.

[HC-4] SVLAT: Scientific Visualization Literacy Assessment Test

【Quick Read】: This paper addresses the lack of a validated instrument in scientific visualization (SciVis) for measuring how well the general public reads, understands, and interprets visualizations of complex scientific phenomena. The key to the solution is the Scientific Visualization Literacy Assessment Test (SVLAT), a rigorously constructed instrument of 49 items grounded in 18 scientific visualizations, spanning eight visualization techniques and eleven task types. It was validated through a staged psychometric pipeline including content validity analysis (mean content validity ratio CVR = 0.79), item response theory (IRT), and classical test theory (CTT), and demonstrated high reliability in a large-scale tryout (McDonald's omega_t = 0.82, Cronbach's alpha = 0.81), providing a reusable, standardized measure for future SciVis literacy research.

Link: https://arxiv.org/abs/2603.19000
Authors: Patrick Phuoc Do, Kaiyuan Tang, Kuangshi Ai, Chaoli Wang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Scientific visualization (SciVis) has become an essential means for exploring, understanding, and communicating complex scientific phenomena. However, the field still lacks a validated instrument assessing how well people read, understand, and interpret them. We present a scientific visualization literacy assessment test (SVLAT) that measures the general public’s SciVis literacy. Covering a range of visualization forms and interpretation demands, SVLAT comprises 49 items grounded in 18 scientific visualizations and illustrations spanning eight visualization techniques and 11 tasks. Instrument development followed a staged, psychometrically grounded pipeline. We defined the construct and blueprint, followed by item generation, and expert review with five SciVis experts using the content validity ratio (mean CVR = 0.79). We subsequently administered a pilot test (30 participants) and a large-scale test tryout (485 participants) to evaluate the instrument’s psychometric properties. For validation, we performed item analysis and refinement using both classical test theory (CTT) and item response theory (IRT) to examine item functioning and overall test quality. SVLAT demonstrates high reliability in the tryout sample (McDonald’s omega_t = 0.82, Cronbach’s alpha = 0.81). The assessment materials are available at this https URL.
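The content validity ratio used in the expert-review stage is conventionally Lawshe's CVR = (n_e - N/2) / (N/2), where n_e is the number of experts rating an item essential out of N panelists. Assuming SVLAT follows this standard formula (the paper reports the mean value but not the formula here), the per-item computation looks like:

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR: (n_e - N/2) / (N/2), ranging from -1 to 1.

    n_essential : experts who rated the item "essential"
    n_experts   : total experts on the panel (five in the SVLAT review)
    """
    half = n_experts / 2
    return (n_essential - half) / half

def mean_cvr(essential_counts, n_experts):
    """Average CVR across items (SVLAT reports a mean CVR of 0.79)."""
    ratios = [content_validity_ratio(c, n_experts) for c in essential_counts]
    return sum(ratios) / len(ratios)
```

With five experts, full agreement yields CVR = 1.0 and four-of-five agreement yields 0.6, so a mean of 0.79 indicates strong but not unanimous endorsement across items.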

[HC-5] Book your room in the Turing Hotel! A symmetric and distributed Turing Test with multiple AIs and humans

【Quick Read】: This paper addresses the limitation of the classical Turing Test's one-to-one interaction mode for assessing the social believability of Large Language Models (LLMs). The key innovation of the proposed "TuringHotel" paradigm is recasting the Turing Test as group interaction within mixed communities of humans and LLMs, where human and artificial participants simultaneously act as both judges and respondents, more realistically simulating social settings. The setup is instantiated on UNaIVERSE, a distributed platform that secures communication over an authenticated peer-to-peer network and provides a unified human interface accessible from mobile devices and laptops, enabling dynamic observation and quantitative analysis of LLMs' social behavior. Results show that current LLMs are still sometimes mistaken for humans, while distinctly human "fingerprints" remain identifiable but not fully unambiguous, motivating further study of how the human-machine boundary evolves.

Link: https://arxiv.org/abs/2603.18981
Authors: Christian Di Maio, Tommaso Guidi, Luigi Quarantiello, Jack Bell, Marco Gori, Stefano Melacci, Vincenzo Lomonaco
Affiliations: DIISM, University of Siena, Italy; Department of Computer Science, University of Pisa, Italy; LUISS University, Rome, Italy
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:

Abstract:In this paper, we report our experience with "TuringHotel", a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where both human and artificial agents engage in time-bounded discussions and, interestingly, are both judges and respondents. This community is instantiated in the novel platform UNaIVERSE (this https URL), creating a "World" which defines the roles and interaction dynamics, facilitated by the platform's built-in programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third parties can access the exchange. The platform also provides a unified interface for humans, accessible via both mobile devices and laptops, that was a key component of the experience in this paper. Results of our experimentation involving 17 human participants and 19 LLMs revealed that current models are still sometimes confused as humans. Interestingly, there are several unexpected mistakes, suggesting that human fingerprints are still identifiable but not fully unambiguous, despite the high-quality language skills of artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest to support ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.

[HC-6] Sketch2Topo: Using Hand-Drawn Inputs for Diffusion-Based Topology Optimization

【Quick Read】: This paper addresses the high computational cost, limited user customizability, and difficulty of balancing structural performance with aesthetics in traditional topology optimization (TO). The key to the solution is Sketch2Topo, a tool built on a diffusion-based TO model augmented with image-to-image generation and image editing capabilities: users customize geometries and specify physical constraints by sketching, and mask input enables TO on selected regions only, substantially improving customization flexibility and the user experience while better balancing functionality and aesthetics.

Link: https://arxiv.org/abs/2603.18960
Authors: Shuyue Feng, Cedric Caremel, Yoshihiro Kawahara
Affiliations: The University of Tokyo
Subjects: Human-Computer Interaction (cs.HC)
Comments: 5 pages, 4 figures, accepted at CHI 2026 as a poster

Abstract:Topology optimization (TO) is employed in engineering to optimize structural performance while maximizing material efficiency. However, traditional TO methods incur significant computational and time costs. Although research has leveraged generative AI to predict TO outcomes and validated feasibility and accuracy, existing approaches still suffer from limited customizability and impose a high cognitive load on users. Furthermore, balancing structural performance with aesthetic attributes remains a persistent challenge. We developed Sketch2Topo, which augments a diffusion-based TO model with image-to-image generation and image editing capabilities. With Sketch2Topo, users can use sketching to customize geometries and specify physical constraints. The tool also supports mask input, enabling users to perform TO on selected regions only, thereby supporting higher levels of customization. We summarize the workflow and details of the tool and conduct a brief quantitative evaluation. Finally, we explore application scenarios and discuss how hand-drawn input improves usability while balancing functionality and aesthetics.

[HC-7] What We Talk About When We Talk About Frameworks in HCI

【Quick Read】: This paper addresses the lack of systematic knowledge in Human-Computer Interaction (HCI) about how frameworks are actually used, what functions they serve, and which scholarly practices shape them. The key to the solution is a systematic review of 615 CHI papers from 2015 to 2024 that prominently feature the term "framework", classifying them into six engagement types and analyzing the role, form, and essential components of newly proposed frameworks through a functional typology. The review reveals that enthusiasm for proposing new frameworks exceeds the willingness to iterate on existing ones, that the function of frameworks is often ambiguous, and that systematic validation is scarce, leading to a call for more rigorous, reflective, and cumulative practices in developing and using frameworks in HCI.

Link: https://arxiv.org/abs/2603.18950
Authors: Shitao Fang, Koji Yatani, Kasper Hornbæk
Affiliations: The University of Tokyo; University of Copenhagen
Subjects: Human-Computer Interaction (cs.HC)
Comments: 25 pages, 8 figures, The ACM CHI Conference on Human Factors in Computing Systems 2026

Abstract:In HCI, frameworks function as a type of theoretical contribution, often supporting ideation, design, and evaluation. Yet, little is known about how they are actually used, what functions they serve, and which scholarly practices that shape them. To address this gap, we conducted a systematic review of 615 papers from a decade of CHI proceedings (2015-2024) that prominently featured the term framework. We classified these papers into six engagement types. We then examined the role, form, and essential components of newly proposed frameworks through a functional typology, analyzing how they are constructed, validated, and articulated for reuse. Our results show that enthusiasm for proposing new frameworks exceeds the willingness to iterate on existing ones. They also highlight the ambiguity in the function of frameworks and the scarcity of systematic validation. Based on these insights, we call for more rigorous, reflective, and cumulative practices in the development and use of frameworks in HCI.

[HC-8] From Accuracy to Readiness: Metrics and Benchmarks for Human-AI Decision-Making

【Quick Read】: This paper addresses the problem that evaluation of AI systems deployed in high-stakes decision-making focuses on model accuracy while neglecting whether human-AI teams can collaborate safely and effectively. The key to the solution is a measurement framework centered on team readiness, with a four-part taxonomy of evaluation metrics (outcomes, reliance behavior, safety signals, and learning over time) connected to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, the framework enables deployment-relevant assessment of calibration, error recovery, and governance, supporting more comparable benchmarks and cumulative research toward safer, more accountable human-AI collaboration.

Link: https://arxiv.org/abs/2603.18895
Authors: Min Hun Lee
Affiliations: Singapore Management University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACM CHI 2026 Poster

Abstract:Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet, evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when it is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. We introduce a four part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
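The reliance-behavior part of the taxonomy can be operationalized directly from interaction traces. The sketch below computes overreliance (following wrong AI advice) and underreliance (overriding correct AI advice) rates; the trace schema is an assumption for illustration, not the paper's.

```python
def reliance_rates(traces):
    """Compute over- and underreliance rates from interaction traces.

    traces : list of dicts with keys
             'ai_correct' (bool) - whether the AI recommendation was right
             'followed'   (bool) - whether the human adopted it
    Overreliance  = P(followed | AI wrong); underreliance = P(overrode | AI right).
    """
    wrong = [t for t in traces if not t["ai_correct"]]
    right = [t for t in traces if t["ai_correct"]]
    over = sum(t["followed"] for t in wrong) / len(wrong) if wrong else 0.0
    under = sum(not t["followed"] for t in right) / len(right) if right else 0.0
    return {"overreliance": over, "underreliance": under}
```

Both rates approaching zero would indicate well-calibrated reliance; tracking them over sessions feeds the "learning over time" part of the taxonomy.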

[HC-9] Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo

【Quick Read】: This paper addresses the problem that mainstream language learning applications such as Duolingo focus lesson design on general everyday scenarios (greetings, ordering food, or asking directions) while lacking training content for profession-specific contexts, which hinders learners from reaching professional-level fluency, defined as the ability to communicate work-related and domain-specific information comfortably in the target language. The key to the solution is personalized adaptation: generating lessons that retain general, relatable foundational scenarios to build grammar, vocabulary, and cultural knowledge, while embedding domain-specific scenarios from the user's field to develop professional communicative competence.

Link: https://arxiv.org/abs/2603.18873
Authors: Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmimento, Marie Antoinette Patalagsa
Affiliations: Samsung R&D Institute Philippines
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 3 figures, presented at the 3rd HEAL Workshop at CHI 2026

Abstract:Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for its users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to communicate comfortably various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in contexts when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.

[HC-10] Through the Looking-Glass: AI-Mediated Video Communication Reduces Interpersonal Trust and Confidence in Judgments

【速读】:该论文旨在解决人工智能(AI)中介的视频处理技术(如视频美颜、背景替换和虚拟形象)是否会影响人们在人际交往中对可信度和真实性的判断这一问题。其核心发现表明,尽管AI中介显著降低了人们对视频内容的信任感和判断自信,但并未影响参与者识别谎言的实际准确性,也未增加他们对使用AI工具者的怀疑倾向。解决方案的关键在于揭示了AI中介虽削弱主观信任与信心,却不损害客观的 lie detection 能力,从而挑战了基于线索的欺骗检测理论,并强调在涉及信任与信心的场景中,开发可信赖的AI中介工具具有重要意义。

Link: https://arxiv.org/abs/2603.18868
Authors: Nelson Navajas Fernández, Jeffrey T. Hancock, Maurice Jakesch
Affiliations: Bauhaus University Weimar; Stanford University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:AI-based tools that mediate, enhance or generate parts of video communication may interfere with how people evaluate trustworthiness and credibility. In two preregistered online experiments (N = 2,000), we examined whether AI-mediated video retouching, background replacement and avatars affect interpersonal trust, people’s ability to detect lies and confidence in their judgments. Participants watched short videos of speakers making truthful or deceptive statements across three conditions with varying levels of AI mediation. We observed that perceived trust and confidence in judgments declined in AI-mediated videos, particularly in settings in which some participants used avatars while others did not. However, participants’ actual judgment accuracy remained unchanged, and they were no more inclined to suspect those using AI tools of lying. Our findings provide evidence against concerns that AI mediation undermines people’s ability to distinguish truth from lies, and against cue-based accounts of lie detection more generally. They highlight the importance of trustworthy AI mediation tools in contexts where not only truth, but also trust and confidence matter.

[HC-11] Signals of Success and Struggle: Early Prediction and Physiological Signatures of Human Performance across Task Complexity

【Quick Read】: This paper addresses the prospective prediction of user task performance in interactive systems from early physiological signals, focusing on the potential of ocular and cardiac signals to identify performance differences early and on the physiological mechanisms behind them. The key to the solution is an ocular-cardiac fusion model that uses physiology collected early in a session to predict later performance in a game environment with naturally unfolding complexity, achieving a balanced accuracy of 0.86; an ocular-only model shows comparable predictive power. High performers exhibited more targeted gaze, adaptive visual sampling, and more stable cardiac activation as task demands intensified, alongside a more positive affective experience, providing interpretable physiological grounds for cross-session performance prediction and the design of future proactive interventions.

Link: https://arxiv.org/abs/2603.18798
Authors: Yufei Cao, Penny Sweetser, Ziyu Chen, Xuanying Zhu
Affiliations: The Australian National University
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: CHI 2026

Abstract:User performance is crucial in interactive systems, capturing how effectively users engage with task execution. Prospectively predicting performance enables the timely identification of users struggling with task demands. While ocular and cardiac signals are widely used to characterise performance-relevant visual behaviour and physiological activation, their potential for early prediction and for revealing the physiological mechanisms underlying performance differences remains underexplored. We conducted a within-subject experiment in a game environment with naturally unfolding complexity, using early ocular and cardiac signals to predict later performance and to examine physiological and self-reported group differences. Results show that the ocular-cardiac fusion model achieves a balanced accuracy of 0.86, and the ocular-only model shows comparable predictive power. High performers exhibited targeted gaze and adjusted visual sampling, and sustained more stable cardiac activation as demands intensified, with a more positive affective experience. These findings demonstrate the feasibility of cross-session prediction from early physiology, providing interpretable insights into performance variation and facilitating future proactive intervention.
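Balanced accuracy, the headline number for the fusion model (0.86), is the mean of per-class recall, which protects the result against class imbalance between high and low performers. A minimal computation:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]  # samples of class c
        correct = sum(1 for i in idx if y_pred[i] == c)    # recalled correctly
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

Unlike plain accuracy, a majority-class predictor scores only 0.5 here for a binary task, so 0.86 reflects genuine discrimination between performance groups.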

[HC-12] Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning

【Quick Read】: This paper addresses how to predict audience affective engagement and vocal attractiveness in asynchronous video-based learning using only speaker-side multimodal affective expressions: facial dynamics, oculomotor features, prosody, and cognitive semantics. The key to the solution is a speaker-centric Emotion AI approach with two separate regression models: the first fuses facial dynamics, oculomotor features, prosody, and cognitive semantics to predict affective engagement, while the second predicts vocal attractiveness from speaker-side acoustic features alone. On speaker-independent test sets both models achieve strong predictive performance (R² = 0.85 and R² = 0.88), showing that speaker-side multimodal features can effectively stand in for aggregated audience feedback, enabling efficient, privacy-preserving affective computing applications that require no audience-side data collection.

Link: https://arxiv.org/abs/2603.18758
Authors: Hung-Yue Suen, Kuo-En Hung, Fan-Hsun Tseng
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: Preprint. Accepted for publication in IEEE Transactions on Computational Social Systems

Abstract:This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance (R2 = 0.85 for affective engagement and R2 = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information.

[HC-13] Cognitive Amplification vs Cognitive Delegation in Human-AI Systems: A Metric Framework

【速读】:该论文旨在解决人机协作中人工智能(AI)对人类认知能力影响的双重性问题:一方面,AI可增强人类决策表现(即认知放大),另一方面也可能导致人类过度依赖AI而丧失自主推理能力(即认知外包)。其解决方案的关键在于提出一套可操作的量化指标体系——包括认知放大指数(CAI*)、依赖比(D)、人类依赖指数(HRI)和人类认知漂移率(HCDR),构建了一个低维度的评估空间,用于识别并区分“认知放大”与“认知外包”两种模式,并强调必须在系统设计中引入认知可持续性约束,以确保短期人机协同性能提升不以长期人类认知能力退化为代价。

链接: https://arxiv.org/abs/2603.18677
作者: Eduardo Di Santi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 16 pages, 2 figures. Conceptual and mathematical framework for human-AI collaboration, cognitive amplification, cognitive delegation, and cognitive sustainability

点击查看摘要

Abstract:Artificial intelligence is increasingly embedded in human decision-making, where it can either enhance human reasoning or induce excessive cognitive dependence. This paper introduces a conceptual and mathematical framework for distinguishing cognitive amplification, in which AI improves hybrid human-AI performance while preserving human expertise, from cognitive delegation, in which reasoning is progressively outsourced to AI systems. To characterize these regimes, we define a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR). Together, these quantities provide a low-dimensional metric space for evaluating not only whether human-AI systems achieve genuine synergistic performance, but also whether such performance is cognitively sustainable for the human component over time. The framework highlights a central design tension in human-AI systems: maximizing short-term hybrid capability does not necessarily preserve long-term human cognitive competence. We therefore argue that human-AI systems should be designed under a cognitive sustainability constraint, such that gains in hybrid performance do not come at the cost of degradation in human expertise.

[HC-14] Dream the Dream: Futuring Communication between LGBTQ and Cisgender Groups in Metaverse

【速读】:该论文旨在解决数字平台中异性恋规范(heteronormative norms)和结构性偏见对LGBTQ+群体与顺性别个体之间包容性交流的限制问题。其解决方案的关键在于通过参与式设计工作坊,结合推测性场景(speculative Metaverse contexts),识别跨群体沟通障碍,并共同构想替代性未来,从而提出涵盖活动、交互、场景和空间四个层面的社会-空间-技术整合方案。该研究强调空间线索与权力动态在塑造数字相遇中的核心作用,为元宇宙(Metaverse)中的公平、可及且具变革性的通信基础设施设计提供了实践路径与理论原则。

链接: https://arxiv.org/abs/2603.18578
作者: Anqi Wang,Lei Han,Jiahua Dong,Muzhi Zhou,David Yip,Yuyang Wang,Pan Hui
机构: Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Human-Computer Interaction (cs.HC)
备注: Conditionally accepted to DIS 2026

点击查看摘要

Abstract:Digital platforms frequently reproduce heteronormative norms and structural biases, limiting inclusive communication between LGBTQ+ and cisgender individuals. The Metaverse, with its affordances for identity fluidity, presence, and community governance, offers a promising site for reimagining such interactions. To investigate this potential, we conducted participatory design workshops involving LGBTQ+ and cisgender participants, situating them in speculative Metaverse contexts to surface barriers and co-create alternative futures. The workshops followed a three-phase process-identifying challenges, speculative problem-solving, and visualizing futures-yielding socio-spatial-technical solutions across four layers: activity, interaction, scene, and space. These findings highlight the importance of spatial cues and power dynamics in shaping digital encounters. We contribute by (1) articulating challenges of cross-group communication in virtual environments, (2) proposing inclusive design opportunities for the Metaverse, and (3) advancing principles for addressing power geometry in digital space. This work demonstrates futuring as a critical strategy for designing equitable, transformative communication infrastructures.

[HC-15] Align-to-Scale: Mode Switching Technique for Unimanual 3D Object Manipulation with Gaze-Hand-Object Alignment in Extended Reality

【速读】:该论文旨在解决扩展现实(Extended Reality, XR)环境中单手操作时缩放(scale)功能难以实现的问题。当前主流的“凝视+捏合”(Gaze + Pinch)交互模型虽能支持单手选择、移动和旋转物体,但缩放操作通常需要双手配合,限制了交互的便捷性与可用性。其解决方案的关键在于利用眼动与手势的空间对齐作为模式切换机制,从而在不依赖双侧肢体协作的前提下,实现单手捏合缩放(one-handed pinch-to-scale)。通过设计并评估多种面向单手缩放的技术方案,研究验证了该方法的有效性,并提出了适用于未来三维界面的单手交互设计指南。

链接: https://arxiv.org/abs/2603.18535
作者: Min-yung Kim,Jinwook Kim,Ken Pfeuffer,Sang Ho Yoon
机构: KAIST(韩国科学技术院); Aarhus University(奥胡斯大学)
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 6 figures, Presented at ACM ETRA 2026

点击查看摘要

Abstract:As extended reality (XR) technologies rapidly become as ubiquitous as today’s mobile devices, supporting one-handed interaction becomes essential for XR. However, the prevalent Gaze + Pinch interaction model partially supports unimanual interaction, where users select, move, and rotate objects with one hand, but scaling typically requires both hands. In this work, we leverage the spatial alignment between gaze and hand as a mode switch to enable single-handed pinch-to-scale. We design and evaluate several techniques geared for one-handed scaling and assess their usability in a compound translate-scale task. Our findings show that all proposed methods effectively enable one-handed scaling, but each method offers distinct advantages and trade-offs. To this end, we derive design guidelines to support futuristic 3D interfaces with unimanual interaction. Our work helps make eye-hand 3D interaction in XR more mobile, flexible, and accessible.

[HC-16] Do Vision Language Models Understand Human Engagement in Games?

【速读】:该论文旨在解决如何利用视觉-语言模型(Vision-Language Models, VLMs)仅从游戏画面视频中推断玩家隐性心理状态——人类参与度(human engagement)的问题。其核心挑战在于当前VLMs虽能识别可见的游戏行为线索,但难以在跨游戏场景下稳定、准确地理解并预测玩家的参与水平。解决方案的关键在于系统评估三种VLMs在六种提示策略下的表现,包括零样本预测、基于Flow理论、GameFlow、自我决定理论(Self-Determination Theory)和MDA框架的理论引导提示,以及检索增强提示(retrieval-augmented prompting)。实验表明,单纯依赖理论引导提示效果不稳定,甚至可能强化表面特征的捷径学习;而检索增强方法在某些设置下可提升点对点参与度预测性能,但对连续窗口间的参与变化预测仍具挑战,揭示了当前VLMs在感知与理解之间的“认知鸿沟”(perception–understanding gap)。

链接: https://arxiv.org/abs/2603.18480
作者: Ziyi Wang,Qizan Guo,Rishitosh Singh,Xiyang Hu
机构: University of Maryland, College Park (马里兰大学学院公园分校); University of Southern California (南加州大学); Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision–language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception–understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.

[HC-17] CyberJustice Tutor: An Agentic AI Framework for Cybersecurity Learning via Think-Plan-Act Reasoning and Pedagogical Scaffolding

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在刑事司法专业人员网络安全教育中应用时面临的两大核心问题:一是反应式聊天机器人的“状态无记忆性”导致教学连贯性差,二是高风险法律场景下生成内容易出现幻觉(hallucination),影响准确性。解决方案的关键在于提出一种基于代理型人工智能(Agentic AI)框架的“CyberJustice Tutor”教育对话系统,其核心创新包括:采用“思考-规划-执行”(Think-Plan-Act)认知循环实现自主目标分解与动态上下文维护;引入基于维果斯基最近发展区(Zone of Proximal Development, ZPD)的教学支架层,根据学习者实时进展动态调整支持策略;并通过自适应检索增强生成(Adaptive Retrieval Augmented Generation, RAG)机制锚定推理于经验证的课程材料,保障法律与技术准确性。

链接: https://arxiv.org/abs/2603.18470
作者: Baiqiang Wang,Yan Bai,Juan Li
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into cybersecurity education for criminal justice professionals is currently hindered by the “statelessness” of reactive chatbots and the risk of hallucinations in high-stakes legal contexts. To address these limitations, we propose the CyberJustice Tutor, an educational dialogue system powered by an Agentic AI framework. Unlike reactive chatbots, our system employs a “Think-Plan-Act” cognitive cycle, enabling autonomous goal decomposition, longitudinal planning, and dynamic context maintenance. We integrate a Pedagogical Scaffolding Layer grounded in Vygotsky’s Zone of Proximal Development (ZPD), which dynamically adapts instructional support based on the learner’s real-time progress. Furthermore, an Adaptive Retrieval Augmented Generation (RAG) core anchors the agent’s reasoning in verified curriculum materials to ensure legal and technical accuracy. A comprehensive user study with 123 participants, including students, educators, and active law enforcement officers, validated the system’s efficacy. Quantitative results demonstrate high user acceptance for Response Speed (4.7/5), Ease of Use (4.4/5), and Accuracy (4.3/5). Qualitative feedback indicates that the agentic architecture is perceived as highly effective in guiding learners through personalized paths, demonstrating the feasibility and usability of agentic AI for specialized professional education.
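摘要中提到的自适应检索增强生成(RAG)核心未公开实现细节;下面给出一个最小的词袋检索示意(语料与查询均为假设性示例,真实系统通常基于学习得到的向量嵌入),仅用于说明"先检索课程材料、再锚定回答"这一模式:

```python
import math
from collections import Counter

def cosine(a, b):
    """两个词袋向量(Counter)之间的余弦相似度。"""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """返回与查询最相似的 k 条课程材料片段。"""
    qv = Counter(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: cosine(qv, Counter(doc.lower().split())),
                    reverse=True)
    return scored[:k]

# 假设的课程材料片段
corpus = [
    "chain of custody rules for digital evidence",
    "phishing attack indicators and reporting steps",
    "password hygiene for law enforcement accounts",
]
best = retrieve("how do I preserve chain of custody for evidence", corpus)
print(best[0])
```

检索到的片段随后可拼入提示词,使生成内容锚定在已验证的课程材料上,从而降低幻觉风险。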

[HC-18] Beyond Ray-Casting: Evaluating Controller Free-Hand and Virtual-Touch Modalities for Immersive Text Entry

【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)环境中文本输入效率低下这一核心瓶颈问题,从而推动VR向高效生产力平台演进。其关键解决方案在于通过实证比较六种物理输入系统(包括控制器驱动、自由手势和虚拟触摸三种交互风格下的离散点击与连续滑动输入方式)以及语音输入作为非物理参考模态,发现控制器驱动的点击与手势组合输入(Controller Driven Tap Gesture Combo, CD TGC)在吞吐量上表现最优,较最慢系统提升2.25倍,且比当前行业标准快30%,同时将错误率降低至68%;但同时也揭示了性能与主观易用性之间的权衡关系,即尽管CD TGC在速度和准确性上领先,用户对虚拟触摸点击输入的主观体验评分最高(SUS得分高出68%),为未来VR界面设计提供了数据驱动的优化方向。

链接: https://arxiv.org/abs/2603.18435
作者: Md. Tanvir Hossain,Mohd Ruhul Ameen,Akif Islam,Md. Omar Faruqe,Mahboob Qaosar,A. F. M. Mahbubur Rahman,Sanjoy Kumar Chakravarty,M. Khademul Islam Molla
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 7 figures, International Conference on Power, Electronics, Communications, Computing, and Intelligent Infrastructure 2026

点击查看摘要

Abstract:Efficient text entry remains a primary bottleneck preventing Virtual Reality (VR) from evolving into a viable productivity platform. To address this, we conducted an empirical comparison of six physical input systems across three interaction styles (Controller Driven, Free Hand, and Virtual Touch), evaluating both discrete tap typing and continuous gesture typing (swiping), alongside a speech-to-text (Voice) condition as a non-physical reference modality. Results from 21 participants show that the Controller Driven Tap Gesture Combo (CD TGC) delivers the best productivity performance, achieving speeds 2.25 times higher than the slowest system and 30% faster than the current industry standard, while reducing error rates by up to 68%. A clear trade-off emerged between performance and perceived usability: although controller-based gesture input led on speed and accuracy, participants rated Virtual Touch Tap Typing highest in subjective experience, scoring 80% higher on the System Usability Scale (SUS) than the lowest-rated alternative. We further observe that Free Hand interaction remains limited by tracking stability and physical fatigue, whereas Voice input introduces practical constraints related to privacy, editing control, and immersive engagement. Together, these findings characterize the tension between throughput and natural interaction in immersive text entry and provide data-driven guidance for future VR interface design.

[HC-19] Deconstructing Open-World Game Mission Design Formula: A Thematic Analysis Using an Action-Block Framework

【速读】:该论文旨在解决开放世界任务(open-world missions)设计中缺乏系统性分析工具的问题,尤其是在大规模游戏作品中难以量化评估关卡节奏(pacing)、变化性(variation)与体验平衡(experiential balance)的挑战。解决方案的关键在于提出一种六维结构化框架——任务行动质量向量(Mission Action Quality Vector, MAQV),涵盖战斗、探索、叙事、情感、问题解决和独特性六个维度,并结合动作块语法(action block grammar)将任务表示为可解析的游戏行为序列。研究通过大语言模型(LLM)辅助解析约2200个来自20款AAA游戏的任务社区攻略文本,将其转化为结构化行动序列并进行MAQV评分,最终借助交互式仪表板揭示隐藏的任务公式,从而实现对任务设计模式的数据驱动洞察与可视化分析。

链接: https://arxiv.org/abs/2603.18398
作者: Kaijie Xu,Yiwei Zhang,Brian Yang,Clark Verbrugge
机构: McGill University (麦吉尔大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Open-world missions often rely on repeated formulas, yet designers lack systematic ways to examine pacing, variation, and experiential balance across large portfolios. We introduce the Mission Action Quality Vector (MAQV), a six-dimensional framework covering combat, exploration, narrative, emotion, problem-solving, and uniqueness, paired with an action block grammar representing missions as gameplay sequences. Using about 2200 missions from 20 AAA titles, we apply LLM-assisted parsing to convert community walkthroughs into structured action sequences and score them with MAQV. An interactive dashboard enables designers to reveal underlying mission formulas. In a mixed-methods study with experienced players and designers, we validate the pipeline’s fidelity and the tool’s usability, and use thematic analysis to identify recurring design trade-offs, pacing grammars, and systematic differences by quest type and franchise evolution. Our work offers a reproducible analytical workflow, a data-driven visualization tool, and reflective insights to support more balanced, varied mission design at scale.

[HC-20] Relationship-Centered Care: Relatedness and Responsible Design for Human Connections in Mental-Health Care

【速读】:该论文旨在解决当前数字治疗联盟(Digital Therapeutic Alliance, DTA)设计中一个隐性但关键的问题:过度追求AI代理与患者之间“拟似连接”的优化,可能削弱个体对真实人际关系的基本心理需求——即相关性(relatedness),从而干扰长期心理康复所依赖的实质性人际联结。解决方案的关键在于重构AI设计范式,从模拟关系转向“支架式”支持关系,通过将负责任人工智能的六维框架(Responsible AI Six Sphere Framework)结合自我决定理论(Self-Determination Theory, SDT)中的相关性需求,提出一套以关系为中心的设计指南,促使AI系统不仅作为陪伴者,更成为强化患者整体关系生态(包括与治疗师、照护者、家庭成员及同伴的关系)的催化剂。

链接: https://arxiv.org/abs/2603.18375
作者: Shivam Shukla,Emily Chen,Manhaz Roshanaei,Magy Seif El-Nasr
机构: University of California, Santa Cruz (加州大学圣克鲁兹分校); Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:There has been a growing research interest in Digital Therapeutic Alliance (DTA) as the field of AI-powered conversational agents are being deployed in mental health care, particularly those delivering CBT (Cognitive Behaviour Therapy). Our proposition argues that the current design paradigm which seeks to optimize the bond between a patient in need of support and an AI agent contains a subtle but consequential trap: it risks producing an “appearance of connection” that unintentionally disrupts the fundamental human need for relatedness, which potentially displaces the authentic human relationships upon which long-term psychological recovery depends. We propose a reorientation from designing artificial intelligence tools that simulate relationships to designing AI that scaffolds them. To operationalize our argument, we propose an interdisciplinary model that translates the Responsible AI Six Sphere Framework through the lens of Self-Determination Theory (SDT), with a specific focus on the basic psychological need for relatedness. The resulting model offers the technical and often clinical communities a set of relationship-centered design guidelines and relevant provocations for building AI systems that function not just as companions, but as a catalyst for strengthening a patient’s entire relational ecology; their connections with therapists, caregivers, family, and peers. In doing so, we discuss a model towards a more sustainable ecosystem of relationship-centered AI in mental health care.

[HC-21] PeriphAR: Fast and Accurate Real-World Object Selection with Peripheral Augmented Reality Displays

【速读】:该论文旨在解决扩展现实(XR)中基于注视的选择问题,尤其是在广视野(wide-FOV)显示环境下,由于眼动追踪精度限制和三维场景中的目标模糊性,传统依赖中央世界锁定叠加层的视觉确认机制难以适配始终开启的增强现实(AR)眼镜。其解决方案的关键在于提出一种名为PeriphAR的可视化技术,该技术利用用户周边视觉进行注视选择反馈,通过优化目标对象与邻近物体之间的颜色对比度来增强周边视觉对颜色的敏感性,从而提升预注意处理效率,并在实际系统中验证了其有效性。

链接: https://arxiv.org/abs/2603.18350
作者: Yutong Ren,Arnav Reddy,Michael Nebeling
机构: University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Gaze-based selection in XR requires visual confirmation due to eye-tracking limitations and target ambiguity in 3D contexts. Current designs for wide-FOV displays use world-locked, central overlays, which are not conducive to always-on AR glasses. This paper introduces PeriphAR (per-ree-far), a visualization technique that leverages peripheral vision for feedback during gaze-based selection on a monocular AR display. In a first user study, we isolated text, color, and shape properties of target objects to compare peripheral selection cues. Peripheral vision was more sensitive to color than shape, but this sensitivity rapidly declined at lower contrast. To preserve preattentive processing of color, we developed two strategies to enhance color in users’ peripheral vision. In a second user study, our strategy that maximized contrast of the target to the neighboring object with the most similar color was subjectively preferred. As proof of concept, we implemented PeriphAR in an end-to-end system to test performance with real-world object detection.
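摘要提到,主观偏好最高的策略是"将目标与颜色最接近的相邻物体之间的对比度最大化"。下面是该思路中"找出最易混淆的相邻物体"一步的最小示意(RGB 欧氏距离只是粗略的感知近似,论文的具体对比度计算方式摘要中并未给出):

```python
def color_dist(c1, c2):
    """RGB 空间中的欧氏距离(作为感知差异的粗略代理)。"""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def most_confusable_neighbor(target, neighbors):
    """颜色与目标最接近的相邻物体,正是周边视觉提示最需要与之拉开对比的对象。"""
    return min(neighbors, key=lambda c: color_dist(target, c))

target = (200, 40, 40)                                  # 目标物体颜色(示例值)
neighbors = [(190, 60, 50), (30, 30, 220), (40, 200, 40)]
print(most_confusable_neighbor(target, neighbors))      # 颜色最相近的邻居
```

找到该邻居后,增强策略即围绕它调整目标高亮色,以在周边视觉中保持颜色的前注意可辨性。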

[HC-22] Don’t Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows

【速读】:该论文旨在解决非技术用户在构建AI代理工作流(workflow)时面临的高门槛问题,尤其是在传统多代理系统中因复杂调度与执行导致的高token消耗和低可扩展性。其解决方案的关键在于提出Skele-Code——一种基于自然语言和图结构的交互式接口,采用“代码优先、代理辅助”的设计范式:每个步骤通过函数调用和行为约束转化为可执行代码,仅在代码生成和错误恢复阶段调用代理,避免了代理对任务编排(orchestration)和执行的直接参与。这种机制结合上下文工程(context-engineering),显著降低了token成本,并生成模块化、易扩展且可复用的工作流,同时支持作为其他代理的技能或子步骤使用。

链接: https://arxiv.org/abs/2603.18122
作者: Sriram Gopalakrishnan
机构: JP Morgan Chase Co. (摩根大通公司)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL); Systems and Control (eess.SY)
备注: Main paper 9 pages. Topics: Agentic Coding, HCI, LLMs, Workflows

点击查看摘要

Abstract:Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less or non-technical users. It supports incremental, interactive notebook-style development, and each step is converted to code with a required set of functions and behavior to enable incremental building of workflows. Agents are invoked only for code generation and error recovery, not orchestration or task execution. This agent-supported, but code-first approach to workflows, along with the context-engineering used in Skele-Code, can help reduce token costs compared to the multi-agent system approach to executing workflows. Skele-Code produces modular, easily extensible, and shareable workflows. The generated workflows can also be used as skills by agents, or as steps in other workflows.

[HC-23] CaseLinker: An Open-Source System for Cross-Case Analysis of Internet Crimes Against Children Reports – Technical Report Initial Release

【速读】:该论文旨在解决儿童性剥削与虐待(Child Sexual Exploitation and Abuse, CSEA)案件数据因分散在多个组织、司法管辖区和机构中,且格式与细节不一致而导致的跨案例分析、模式识别和趋势检测困难的问题。解决方案的关键在于提出一个模块化系统 CaseLinker,其核心是采用混合确定性信息抽取方法:结合基于正则表达式的结构化数据提取(如人口统计信息、平台、证据等)与基于模式的语义分析来识别严重程度指标和案件主题,从而确保可解释性和可审计性;同时通过加权 Jaccard 相似度在多个维度(平台、人口统计、主题、严重程度、调查类型)上对相似案件进行聚类,并提供六种交互式可视化界面及自动化分类与洞察生成能力,有效提升CSEA案件的数据整合效率与分析深度。

链接: https://arxiv.org/abs/2603.18020
作者: Mrinaal Ramachandran
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 23 pages, independent project

点击查看摘要

Abstract:Child sexual exploitation and abuse (CSEA) case data is inherently disturbing, fragmented across multiple organizations, jurisdictions, and agencies, with varying levels of detail and formatting, making cross-case analysis, pattern identification, and trend detection challenging. This paper presents CaseLinker, a modular system for ingesting, processing, analyzing, and visualizing CSEA case data. CaseLinker employs a hybrid deterministic information extraction approach combining regex-based extraction for structured data (demographics, platforms, evidence) with pattern-based semantic analysis for severity indicators and case topics, ensuring interpretability and auditability. The system extracts relevant case information, populates a comprehensive case schema, creates six interactive visualizations (Timeline, Severity Indicators, Case Visualization, Previous Perpetrator Status, Environment/Platforms, Organizations Involved), provides a platform for deeper automated and manual analysis, groups similar cases using weighted Jaccard similarity across multiple dimensions (platforms, demographics, topics, severity, investigation type), and provides automated triage and insights based on collected case data. CaseLinker is evaluated on 47 cases from publicly available AZICAC reports (2011-2014), demonstrating effective information extraction, case clustering, automated insights generation, and interactive visualization capabilities. CaseLinker addresses critical challenges in case analysis including fragmented data sources, cross-case pattern identification, and the emotional burden of repeatedly processing disturbing case material.
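摘要提到 CaseLinker 使用跨多个维度(平台、人口统计、主题、严重程度、调查类型)的加权 Jaccard 相似度对相似案件进行分组。下面给出该计算的一个最小示意,其中权重与案件字段取值均为假设,论文未公布具体数值:

```python
def jaccard(a, b):
    """两个集合的 Jaccard 相似度;约定两个空集相似度为 1.0。"""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def weighted_case_similarity(case_a, case_b, weights):
    """各维度 Jaccard 得分的加权平均。"""
    total = sum(weights.values())
    return sum(
        w * jaccard(case_a.get(dim, ()), case_b.get(dim, ()))
        for dim, w in weights.items()
    ) / total

# 假设的维度权重(论文未公布具体取值)
WEIGHTS = {"platforms": 2.0, "demographics": 1.0, "topics": 1.5,
           "severity": 1.0, "investigation_type": 0.5}

case_a = {"platforms": {"kik", "snapchat"}, "topics": {"grooming"},
          "demographics": {"minor_victim"}, "severity": {"high"},
          "investigation_type": {"undercover"}}
case_b = {"platforms": {"kik"}, "topics": {"grooming"},
          "demographics": {"minor_victim"}, "severity": {"high"},
          "investigation_type": {"tip"}}
print(round(weighted_case_similarity(case_a, case_b, WEIGHTS), 3))  # → 0.75
```

相似度超过某一阈值的案件即可归入同一簇,用于跨案例的模式识别。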

[HC-24] R-equivalence on Cubic Surfaces I: Existing Cases with Non-Trivial Universal Equivalence

【速读】:该论文旨在解决p-进域上光滑三次曲面的R-等价性(R-equivalence)是否平凡的问题,特别是针对Swinnerton-Dyer在1981年未能完全处理的三类特殊情形,尤其是其中包含所有已知非平凡通用等价(universal equivalence)的2-进域三次曲面(all-Eckardt reductions)。其关键解决方案在于引入新的方法来研究R-等价性,证明了这类曲面上的R-等价性要么平凡,要么为2阶;并通过具体计算确认了两个经典案例——即在ℚ₂(ζ₃)上的对角三次曲面X³+Y³+Z³+ζ₃T³=0(回应Manin长期悬而未决的问题)和Kanevsky给出的指数为2的通用等价情形——均具有平凡的R-等价性。这一成果不仅填补了Swinnerton-Dyer方法的空白,也与Colliot-Thélène与Sansuc关于几何有理曲面上通用扭子(universal torsor)k-有理性的猜想保持一致。

链接: https://arxiv.org/abs/2603.19215
作者: Dimitri Kanevsky,Julian Salazar,Matt Harvey
机构: 未知
类目: Algebraic Geometry (math.AG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Number Theory (math.NT)
备注: 23 pages

点击查看摘要

Abstract:Let V be a smooth cubic surface over a p-adic field k with good reduction. Swinnerton-Dyer (1981) proved that R-equivalence is trivial on V(k) except perhaps if V is one of three special types: those whose R-equivalence he could not bound by proving the universal (admissible) equivalence is trivial. We consider all surfaces V currently known to have non-trivial universal equivalence. Beyond being intractable to Swinnerton-Dyer’s approach, we observe that if these surfaces also had non-trivial R-equivalence, they would contradict Colliot-Thélène and Sansuc’s conjecture regarding the k-rationality of universal torsors for geometrically rational surfaces. By devising new methods to study R-equivalence, we prove that for 2-adic surfaces with all-Eckardt reductions (the third special type, which contains every existing case of non-trivial universal equivalence), R-equivalence is trivial or of exponent 2. For the explicit cases, we confirm triviality: the diagonal cubic X³+Y³+Z³+ζ₃T³=0 over ℚ₂(ζ₃), answering a long-standing question of Manin’s (Cubic Forms, 1972), and the cubic with universal equivalence of exponent 2 (Kanevsky, 1982). This is the first in a series of works derived from a year of interactions with generative AI models such as AlphaEvolve and Gemini 3 Deep Think, with the latter proving many of our lemmas. We disclose the timeline and nature of their use towards this paper, and describe our broader AI-assisted research program in a companion report (in preparation).

[HC-25] Setting SAIL: Leveraging Scientist-AI-Loops for Rigorous Visualization Tools

【速读】:该论文旨在解决科学家在将理论知识转化为交互式工具时面临的挑战,即专业知识与实现所需技能及时间之间的鸿沟。传统方法中,研究人员常因缺乏编程能力或时间而难以构建高质量的科学可视化或模拟工具,而当前大语言模型(Large Language Models, LLMs)虽能加速代码生成,却常因忽视科学准确性而导致结果看似合理实则错误。解决方案的关键在于提出“科学家-AI循环”(Scientist-AI-Loop, SAIL)框架:通过将领域逻辑(domain logic)与代码语法(code syntax)分离,使研究者能够专注于科学概念和约束条件的把控,同时将具体代码实现交由AI完成。该框架确保了科学严谨性不被牺牲的前提下大幅提升开发效率,并已在两个开源天体物理交互工具中验证其有效性。

链接: https://arxiv.org/abs/2603.18145
作者: Nico Schuster,Andrés N. Salcedo,Simon Bouchard,Dennis Frei,Alice Pisani,Julian E. Bautista,Julien Zoubian,Stephanie Escoffier,Wei Liu,Georgios Valogiannis,Pauline Zarrouk
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Human-Computer Interaction (cs.HC)
备注: 10 pages (+ references), 4 figures. Interactive visualizations available at: this https URL and this https URL

点击查看摘要

Abstract:Scientists across all disciplines share a common challenge: the divide between their theoretical knowledge and the specialized skills and time needed to build interactive tools to communicate this expertise. While large language models (LLMs) offer unparalleled acceleration in code generation, they frequently prioritize functional syntax over scientific accuracy, risking visually convincing but scientifically invalid results. This work advocates the Scientist-AI-Loop (SAIL), a framework designed to harness this speed without compromising rigor. By separating domain logic from code syntax, SAIL enables researchers to maintain strict oversight of scientific concepts and constraints while delegating code implementation to AI. We illustrate this approach through two open-source, browser-based astrophysics tools: an interactive gravitational lensing visualization and a large-scale structure formation sandbox, both publicly available. Our methodology condensed development to mere days while maintaining scientific integrity. We specifically address failure modes where AI-generated code neglects phenomenological boundaries or scientific validity. While cautioning that research-grade code requires stringent protocols, we demonstrate through two examples that SAIL provides an effective code generation workflow for outreach, teaching, professional presentations, and early-stage research prototyping. This framework contributes to a foundation for the further development of AI-assisted scientific software.

计算机视觉

[CV-0] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在几何推理和物理动态理解方面的“空间盲视”问题,即其在缺乏显式3D信息的情况下难以进行细粒度的空间感知与物理规律建模。解决方案的关键在于利用大规模视频生成模型中隐含的空间先验:作者提出VEGA-3D框架,通过提取预训练视频扩散模型在中间噪声层的时空特征,并结合语义表示采用基于token级别的自适应门控融合机制,从而无需显式3D监督即可为MLLMs注入密集的几何线索。这种方法将视频生成模型转化为潜在世界模拟器(Latent World Simulator),显著提升了模型在3D场景理解、空间推理和具身操作等任务上的性能,验证了生成式先验在物理世界理解中的可扩展性。

链接: https://arxiv.org/abs/2603.19235
作者: Xianjin Wu,Dingkang Liang,Tianrui Feng,Kui Xia,Yumeng Zhang,Xiaofan Li,Xiao Tan,Xiang Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 31 pages, 12 figures

点击查看摘要

Abstract:While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at this https URL.

[CV-1] Matryoshka Gaussian Splatting

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在实际部署中因层级细节(Level of Detail, LoD)控制受限而导致的效率与质量权衡问题。现有离散LoD方法仅支持有限的操作点,而连续LoD方法虽能实现平滑缩放,却常在高预算下出现明显质量下降,使LoD设计成本高昂。解决方案的关键在于提出马特罗什卡高斯溅射(Matryoshka Gaussian Splatting, MGS),其核心思想是随机预算训练(stochastic budget training):每轮迭代随机采样一个溅射预算,同时优化对应前缀(前k个高斯)和完整集合,仅需两次前向传播且无需架构改动。该策略使模型能在单一结构下实现从低到高的连续速度-质量权衡,同时保持全容量渲染质量不变。

链接: https://arxiv.org/abs/2603.19234
作者: Zhilin Guo,Boqiao Zhang,Hakan Aktas,Kyle Fogarty,Jeffrey Hu,Nursena Koprucu Aslan,Wenzhao Li,Canberk Baykal,Albert Miao,Josef Bengtson,Chenliang Zhou,Weihao Xia,Cristina Nader Vasconcelos,Cengiz Oztireli
机构: University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: project page: this https URL

点击查看摘要

Abstract:The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
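摘要所述的随机预算训练(stochastic budget training)可以用如下玩具代码示意:每次迭代随机采样预算 k,同时对前缀和全集各做一次前向计算。其中 render 与 loss_fn 均为占位实现,并非论文的 3DGS 渲染管线:

```python
import random

def stochastic_budget_step(splats, render, loss_fn, target, min_budget=1):
    """MGS 式的单次训练迭代(示意):随机采样预算 k,
    同时优化前缀 [0:k] 与完整有序集合,共两次前向计算。"""
    k = random.randint(min_budget, len(splats))        # 本次迭代的随机预算
    loss_prefix = loss_fn(render(splats[:k]), target)  # 前向 1:前 k 个 splat
    loss_full = loss_fn(render(splats), target)        # 前向 2:完整集合
    return loss_prefix + loss_full, k

# 玩具占位:"渲染"为不透明度求和,损失为平方误差
render = lambda s: sum(s)
loss_fn = lambda pred, tgt: (pred - tgt) ** 2

random.seed(0)
loss, k = stochastic_budget_step([0.2, 0.3, 0.1, 0.4], render, loss_fn, target=1.0)
print(k, loss)
```

训练收敛后,任意前缀都对应一个连贯的低细节重建,从而在单一模型内实现连续的速度-质量权衡。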

[CV-2] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens CVPR2026

【速读】:该论文旨在解决当前离散生成方法在视觉生成任务中因低维潜在表示(通常为8-32维)导致语义信息不足的问题,同时克服高维预训练表示(如768-1024维)在离散生成时面临的根本性挑战。其解决方案的关键在于提出Cubic Discrete Diffusion (CubiD),这是一种首个适用于高维表示的离散生成模型,通过在高维离散表示中进行细粒度掩码——即任意位置的任意维度均可被掩码并从部分观测中预测——从而学习空间内及跨空间位置的丰富相关性;该方法将生成步数固定为 $T$($T \ll hwd$),与特征维度无关,显著提升了生成质量与可扩展性,并验证了离散化后的token在理解与生成任务中均能保留原始表示能力,为构建统一的多模态架构提供了新路径。

链接: https://arxiv.org/abs/2603.19232
作者: Yuqing Wang,Chuofan Ma,Zhijie Lin,Yao Teng,Lijun Yu,Shuai Wang,Jiaming Han,Jiashi Feng,Yi Jiang,Xihui Liu
机构: University of Hong Kong (香港大学); ByteDance Seed (字节跳动种子); Carnegie Mellon University (卡内基梅隆大学); Nanjing University (南京大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 main track; Code: this https URL

点击查看摘要

Abstract:Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation – any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at T regardless of feature dimensionality, where T ≪ hwd. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: this https URL.
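To make "any dimension at any position can be masked" and the fixed step count concrete, here is a toy unmasking schedule over an (h·w, d) token grid. The cosine decay and all sizes are assumptions for illustration, not CubiD's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, d = 16, 16, 64                     # spatial grid and token dimensionality
T = 8                                    # fixed number of generation steps, T << h*w*d
mask = np.ones((h * w, d), dtype=bool)   # True = entry still masked

def frac_masked(t):
    """Fraction of entries left masked after step t (assumed cosine decay)."""
    return np.cos(0.5 * np.pi * t / T)

for t in range(1, T + 1):
    keep = int(frac_masked(t) * mask.size)
    masked_idx = np.flatnonzero(mask)
    # reveal entries (any dimension at any position) down to the target count;
    # a real model would predict these entries from the partial observation
    reveal = rng.choice(masked_idx, size=len(masked_idx) - keep, replace=False)
    mask.flat[reveal] = False
```

All h·w·d entries are generated in exactly T steps, independent of the feature dimensionality d.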

[CV-3] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

【速读】:该论文旨在解决从单张图像中重建可动三维物体(articulated 3D objects)时,因运动线索与物体结构之间存在耦合关系而导致的参数预测不稳定问题。现有方法通常依赖多视角监督、基于检索的组装或辅助视频生成,但往往在可扩展性或效率上存在局限。解决方案的关键在于提出一种统一的渐进式结构推理框架 MonoArt,其不直接从图像特征中回归关节参数,而是通过单一架构逐步将视觉观测转化为规范几何、结构化部件表示和运动感知嵌入(motion-aware embeddings),从而实现稳定且可解释的关节推理,无需外部运动模板或多阶段流水线。

链接: https://arxiv.org/abs/2603.19231
作者: Haitian Li,Haozhe Xie,Junxiang Xu,Beichen Wen,Fangzhou Hong,Ziwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

[CV-4] NavTrust: Benchmarking Trustworthiness for Embodied Navigation

【速读】:该论文旨在解决当前具身导航(Embodied Navigation)模型在真实场景中面临的关键鲁棒性不足问题,即现有方法主要在理想条件下评估性能,而忽略了现实世界中RGB图像、深度信息和自然语言指令可能遭受的多种类型扰动(corruptions)。解决方案的核心是提出首个统一基准NavTrust,系统性地对输入模态(包括RGB、深度图和指令)施加现实场景中的多样化扰动,并在此框架下评估导航性能下降情况。通过这一基准,研究揭示了七种前沿方法在真实扰动下的显著性能退化,明确了关键鲁棒性差距;同时,进一步验证了四种缓解策略的有效性,为构建更可信的具身导航系统提供了可量化的改进路径。

链接: https://arxiv.org/abs/2603.19229
作者: Huaide Jiang,Yash Chaudhary,Yuping Wang,Zehao Wang,Raghav Sharma,Manan Mehta,Yang Zhou,Lichao Sun,Zhiwen Fan,Zhengzhong Tu,Jiachen Li
机构: University of California, Riverside(加州大学河滨分校); University of Michigan(密歇根大学); Workday(Workday); University of Southern California(南加州大学); Texas A&M University(德州农工大学); Lehigh University(利哈伊大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Project Website: this https URL

点击查看摘要

Abstract:There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: this https URL.
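As a minimal illustration of the kind of input corruption such a benchmark applies, the sketch below adds severity-graded Gaussian noise to an RGB frame. NavTrust's actual corruption suite is richer (depth and instruction corruptions, blur, weather, etc.), so treat this operator and its severity levels as assumptions:

```python
import numpy as np

def corrupt_rgb(img, severity=3, rng=None):
    """Additive Gaussian noise at 5 severity levels, clipped to [0, 1].
    A toy stand-in for an RGB corruption operator; the levels are assumed."""
    rng = rng if rng is not None else np.random.default_rng(0)
    sigma = (0.02, 0.05, 0.10, 0.20, 0.35)[severity - 1]
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((8, 8, 3), 0.5)          # dummy 8x8 RGB frame in [0, 1]
corrupted = corrupt_rgb(clean, severity=5)
```

A benchmark would sweep severities and report the degradation of success rate relative to the clean setting.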

[CV-5] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

【速读】:该论文旨在解决当前基于指令的视频编辑模型在实现精确语义修改与忠实运动保留之间难以平衡的问题。现有方法依赖于引入显式的外部先验(如视觉语言模型VLM特征或结构条件),这严重限制了模型的鲁棒性和泛化能力。解决方案的关键在于提出SAMA(factorized Semantic Anchoring and Motion Alignment)框架,其核心是将视频编辑任务解耦为两个独立模块:一是语义锚定(Semantic Anchoring),通过在稀疏锚点帧上联合预测语义token和视频潜在表示,实现纯指令驱动的结构规划;二是运动对齐(Motion Alignment),利用以运动为中心的预训练任务(立方体补全、速度扰动和管状打乱)使模型从原始视频中内化时序动态。这种因子分解策略使得模型能够在无配对视频-指令编辑数据的情况下学习内在的语义-运动表征,从而显著提升零样本视频编辑性能,并在开源模型中达到最先进水平。

链接: https://arxiv.org/abs/2603.19228
作者: Xinyao Zhang,Wenkai Dong,Yuxin Song,Bo Fang,Qi Zhang,Jing Wang,Fan Chen,Hui Zhang,Haocheng Feng,Yu Lu,Hang Zhou,Chun Yuan,Jingdong Wang
机构: Baidu(百度); Tsinghua University (清华大学); City University of Hong Kong (香港城市大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 12 figures

点击查看摘要

Abstract:Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
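One of the three pretext corruptions named above, tube shuffle, is easy to sketch: split a clip into fixed-length temporal tubes and permute them, giving the model a motion-restoration target. The tube length and tensor layout here are illustrative assumptions:

```python
import numpy as np

def tube_shuffle(video, tube=4, rng=None):
    """Shuffle fixed-length temporal tubes of a (T, H, W, C) clip.
    Toy version of one of the motion-centric pretext tasks (cube inpainting,
    speed perturbation, tube shuffle) named in the abstract."""
    rng = rng if rng is not None else np.random.default_rng(0)
    T = video.shape[0]
    assert T % tube == 0, "clip length must divide into whole tubes"
    chunks = video.reshape(T // tube, tube, *video.shape[1:])
    return chunks[rng.permutation(T // tube)].reshape(video.shape)

# frame index as pixel content, so the permutation is easy to inspect
clip = np.arange(16, dtype=float).reshape(16, 1, 1, 1)
shuffled = tube_shuffle(clip, tube=4)
```

Training the backbone to restore the original temporal order from such corrupted clips is what lets it internalize dynamics directly from raw videos.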

[CV-6] Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

【速读】:该论文旨在解决现有运动生成方法在语义控制与运动细节之间难以兼顾的问题:传统连续扩散模型擅长实现精确的运动控制(kinematic control),而离散token生成方法则更适用于语义条件引导(semantic conditioning),但二者难以协同优化。解决方案的关键在于提出一个三阶段框架——感知(Perception)、规划(Planning)与控制(Control),其中核心创新是MoTok,一种基于扩散模型的离散运动分词器(diffusion-based discrete motion tokenizer)。MoTok通过将语义抽象与精细重建解耦,由扩散解码器完成运动恢复,从而实现紧凑的单层token表示并保持高保真度;同时,在规划阶段使用粗粒度约束引导token生成,在控制阶段通过扩散优化施加细粒度约束,避免运动细节干扰语义规划。这一设计显著提升了可控性与精度,如在HumanML3D数据集上,仅用1/6的token即可将轨迹误差从0.72 cm降至0.08 cm,FID从0.083降至0.029,并且在强运动约束下仍能提升生成质量(FID从0.033降至0.014)。

链接: https://arxiv.org/abs/2603.19227
作者: Chenyang Gu,Mingyuan Zhang,Haozhe Xie,Zhongang Cai,Lei Yang,Ziwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL GitHub: this https URL

点击查看摘要

Abstract:Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.

[CV-7] Under One Sun: Multi-Object Generative Perception of Materials and Illumination

【速读】:该论文旨在解决从单张图像中对物体外观背后的辐射度成分(即反射率、纹理和光照)进行随机采样的问题,这是一个典型的辐射度解耦难题,因其固有的歧义性而极具挑战。解决方案的关键在于利用同一场景中不同物体共享相同光照这一先验知识,通过四项核心技术实现:1)级联端到端架构,结合图像空间与角度空间的解耦;2)协调引导机制,促进扩散过程收敛至一致的光照估计;3)轴向注意力机制,增强不同反射率物体间的跨对象信息交互;4)纹理提取ControlNet,在保留高频纹理细节的同时确保其与光照估计的解耦。该方法有效利用多个物体外观在空间和频率上的互补特性,恢复出个体纹理、反射率及共同光照。

链接: https://arxiv.org/abs/2603.19226
作者: Nobuo Yoshii,Xinran Nicole Han,Ryo Kawahara,Todd Zickler,Ko Nishino
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents – reflectance, texture, and illumination – underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

[CV-8] EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing CVPR2026

【速读】:该论文旨在解决视频对象移除(Video Object Removal)任务中难以彻底消除动态目标物体及其视觉效应(如形变、阴影和反射)的问题,同时恢复无缝背景。现有基于扩散模型的方法虽能移除物体本身,但在处理这些伴随效应及合成连贯背景方面表现不足。其根本挑战不仅来自方法局限性,还源于缺乏系统性地涵盖多种环境下的常见物体效应的高质量数据集。为此,作者提出了VOR(Video Object Removal)大规模数据集,包含60K对高质量视频(含物体及其效应 vs. 无物体且无效应),覆盖五类视觉效应、多样物体类别与复杂多目标场景。在此基础上,提出EffectErase方法,其核心创新在于引入“效果感知”的逆向学习机制:将视频对象插入视为辅助任务,在互惠学习框架下实现效果区域的显式建模;通过任务感知区域引导聚焦受影响区域并支持灵活任务切换,并设计插入-移除一致性目标以促进互补行为与结构线索共享,从而显著提升在多样化场景中的效果擦除质量。

链接: https://arxiv.org/abs/2603.19224
作者: Yang Fu,Yike Zheng,Ziyun Dai,Henghui Ding
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project Page: this https URL

点击查看摘要

Abstract:Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. Then, an insertion-removal consistency objective encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

[CV-9] Spectrally-Guided Diffusion Noise Schedules

【速读】:该论文旨在解决扩散模型中噪声调度(noise schedule)设计依赖人工调参、缺乏理论指导且在不同图像分辨率下需重复调整的问题。其核心解决方案是基于图像的频谱特性(spectral properties)为每个实例(per-instance)定制“紧致”(tight)噪声调度,通过推导最小和最大噪声水平的理论边界,有效消除冗余去噪步骤,并在推理阶段采用条件采样策略来适配不同图像内容,从而提升单阶段像素级扩散模型在低步数(low-step)下的生成质量。

链接: https://arxiv.org/abs/2603.19222
作者: Carlos Esteves,Ameesh Makadia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image’s spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
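The abstract does not give the exact bounds, but the flavor of a per-instance "tight" schedule can be sketched: cap the maximum noise level by the image's strongest spectral mode and the minimum by its weakest, then spend all steps between the two. The SNR thresholds and geometric spacing below are hypothetical choices, not the paper's derivation:

```python
import numpy as np

def tight_schedule(img, steps=18, snr_lo=1e-2, snr_hi=1e2):
    """Illustrative per-instance noise schedule bounded by spectral energy,
    so no step is spent where the noise is trivially small or large.
    (Hypothetical criterion for illustration only.)"""
    spec = np.abs(np.fft.fft2(img)) ** 2
    e_max, e_min = spec.max(), spec[spec > 0].min()
    sigma_max = np.sqrt(e_max / snr_lo)   # noise dominates even the strongest mode
    sigma_min = np.sqrt(e_min / snr_hi)   # noise negligible vs the weakest mode
    return np.geomspace(sigma_max, sigma_min, steps)

img = np.random.default_rng(0).normal(size=(32, 32))
sched = tight_schedule(img)
```

Two images with different spectra thus get different sigma ranges from the same budget of steps.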

[CV-10] DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

【速读】:该论文旨在解决当前视觉-语言-动作模型和世界模型在自动驾驶系统中应用时,因多视角高分辨率驾驶场景下图像分词(image tokenization)效率低下且视图间不一致的问题。现有分词器多针对单目二维场景设计,难以有效处理三维空间中的多视角信息。解决方案的关键在于提出DriveTok,一种面向统一多视角重建与理解的高效3D驾驶场景分词方法:首先从视觉基础模型中提取语义丰富的视觉特征,再通过3D可变形交叉注意力机制将其转换为具有几何结构感知能力的场景令牌(scene tokens);解码端则利用多视角Transformer从这些令牌重建多视角特征,并通过多个头部实现RGB、深度和语义图像重建,同时在场景令牌上直接添加3D语义占据预测头以增强空间感知能力。多任务训练目标促使DriveTok学习集成语义、几何与纹理信息的统一场景令牌表示,从而实现高效且一致的多视角分词。

链接: https://arxiv.org/abs/2603.19219
作者: Dong Zhuo,Wenzhao Zheng,Sicheng Zuo,Siming Yan,Lu Hou,Jie Zhou,Jiwen Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

[CV-11] Rethinking Vector Field Learning for Generative Segmentation

【速读】:该论文旨在解决扩散模型(diffusion models)在生成式分割(generative segmentation)任务中因连续流匹配目标(flow matching objective)与离散感知任务之间存在内在不匹配而导致的收敛缓慢和类别分离效果差的问题。其解决方案的关键在于提出一种基于向量场重塑(vector field reshaping)的原理性策略:通过引入一个独立的、距离感知的修正项来增强学习到的速度场,该修正项包含吸引与排斥相互作用,从而在聚类中心附近提升梯度幅度,同时保持原始扩散训练框架不变;此外,设计了一种受Kronecker序列启发的计算高效的准随机类别编码方案,可无缝集成至端到端像素神经场(pixel neural field)框架中,实现像素级语义对齐,显著提升了生成式分割性能。

链接: https://arxiv.org/abs/2603.19218
作者: Chaoyang Wang,Yaobo Liang,Boci Peng,Fan Duan,Jingdong Wang,Yunhai Tong
机构: Peking University (北京大学); Baidu (百度)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
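A Kronecker sequence yields low-discrepancy, deterministic points by taking fractional parts of integer multiples of irrationals, which is a natural way to assign well-separated codes to categories. The choice of square roots of primes below is a standard construction for illustration, not necessarily the paper's exact one:

```python
import numpy as np

def kronecker_codes(num_classes, dim):
    """Quasi-random class codes: class k maps to frac(k * alpha) in [0, 1)^dim,
    with alpha built from square roots of primes (one per dimension, dim <= 10)."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29][:dim]
    alpha = np.sqrt(np.array(primes, dtype=float))
    k = np.arange(1, num_classes + 1)[:, None]
    return (k * alpha) % 1.0   # fractional part: well-spread, deterministic codes

codes = kronecker_codes(num_classes=19, dim=3)
```

Unlike random one-hot or learned embeddings, these codes need no training, and adding a class never moves the existing ones.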

[CV-12] LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLM s

【速读】:该论文旨在解决当前多模态大语言模型(OmniLLMs)在长时音频-视频跨模态理解能力评估方面的不足问题。现有评测主要基于短时片段(10秒至5分钟),无法反映真实应用场景中长达数十分钟的视频处理需求。解决方案的关键在于提出LVOmniBench基准数据集,该数据集包含275个高质量长视频(时长10–90分钟)及1,014个精心标注的问答对,涵盖长期记忆、时间定位、细粒度理解和多模态感知等核心能力维度,从而系统性地评估模型在复杂长时音频-视频上下文中的跨模态理解性能。

链接: https://arxiv.org/abs/2603.19217
作者: Keda Tao,Yuhua Zheng,Jia Xu,Wenjie Du,Kele Shao,Hesong Wang,Xueyi Chen,Xin Jin,Junhan Zhu,Bohan Yu,Weiqiang Wang,Jian Liu,Can Qin,Yulun Zhang,Ming-Hsuan Yang,Huan Wang
机构: Zhejiang University (浙江大学); Westlake University (西湖大学); Ant Group (蚂蚁集团); Shanghai Innovation Institute (上海创新研究院); Shanghai Jiao Tong University (上海交通大学); University of California Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench contains 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

[CV-13] DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

【速读】:该论文旨在解决当前文本到三维(text-to-3D)生成方法中缺乏语义结构建模的问题,即现有方法虽能生成几何形状,但未能充分考虑物体部件的语义含义、功能关系及其与文本描述的一致性。解决方案的关键在于提出DreamPartGen框架,其核心创新包括:1)双通道部件潜在表示(Duplex Part Latents, DPLs),联合建模每个部件的几何与外观特征;2)关系语义潜在表示(Relational Semantic Latents, RSLs),从语言中提取部件间的语义依赖关系;3)同步协同去噪机制,确保几何结构与语义信息的一致性,从而实现语义 grounded、部件感知且与文本对齐的高质量3D合成。

链接: https://arxiv.org/abs/2603.19216
作者: Tianjiao Yu,Xinzhuo Li,Muntasir Wahed,Jerry Xiong,Yifan Shen,Ying Shen,Ismini Lourentzou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part’s geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

[CV-14] Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

【速读】:该论文旨在解决当前大规模视觉-语言模型(Vision-Language Models, VLMs)中视觉主干网络(vision backbone)选择的局限性,特别是对基于Transformer架构的视觉主干是否为唯一最优方案的质疑。研究发现,状态空间模型(State Space Model, SSM)作为视觉主干在保持高性能的同时,可显著降低模型规模,从而提供一种更具效率的替代方案。其关键解决方案在于:在控制条件下系统评估SSM与ViT类主干的性能差异,并通过密集任务(如检测或分割)微调提升稳定性;实验表明,SSM主干在VQA和定位/识别任务中表现优异,且不依赖于更大的模型参数量或更高的ImageNet准确率,从而验证了其作为Transformer替代方案的可行性与鲁棒性。

链接: https://arxiv.org/abs/2603.19209
作者: Shang-Jui Ray Kuo,Paola Cascante-Bonilla
机构: Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL ; Code: this https URL

点击查看摘要

Abstract:Large vision–language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

[CV-15] RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

【速读】:该论文旨在解决现有基于预训练视觉表示模型的图像生成与编辑中,因冻结编码器导致重建保真度不足、以及高维潜在空间增加扩散建模难度的问题。解决方案的关键在于提出一种基于表示的自编码器(Representation-Pivoted AutoEncoder, RPiAE),其核心创新包括:1)引入Representation-Pivot Regularization,使初始化于预训练表示空间的编码器可微调以实现高质量重建,同时保持语义结构一致性;2)设计变分桥接机制(variational bridge),将潜在空间压缩至紧凑维度以提升扩散建模效率;3)采用目标解耦的分阶段训练策略,依次优化生成可塑性与重建保真度目标。该方案在保证强语义保留的同时显著提升了重建精度和扩散建模性能,从而在文本到图像生成和图像编辑任务中优于现有方法。

链接: https://arxiv.org/abs/2603.19206
作者: Yue Gong,Hongyu Li,Shanyuan Liu,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Manyuan Zhang,Dawei Leng,Yuhui Yin,Lijun Zhang
机构: Beihang University (北京航空航天大学); 360 AI Research (360人工智能研究院); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

[CV-16] nted Frames: Question Framing Blinds Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对不同语言表述框架(linguistic framing)时表现出的选择性“盲视”问题,即模型对视觉输入的关注程度会因提问方式(如封闭式多选题 vs. 开放式问答)而显著变化,导致视觉推理能力下降和跨框架一致性差。解决方案的关键在于引入一种轻量级提示调优(prompt-tuning)方法,通过添加可学习的标记(learnable tokens),引导模型在各种框架下均能保持与开放式设置中相似的稳健、视觉锚定(visually grounded)注意力分布,从而提升视觉信息利用效率和跨框架性能一致性。

链接: https://arxiv.org/abs/2603.19203
作者: Wan-Cyuan Fan,Jiayun Luo,Declan Kutscher,Leonid Sigal,Ritwik Gupta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Project page: this https URL

点击查看摘要

Abstract:Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.

[CV-17] FASTER: Rethinking Real-Time Flow VLAs FAST

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在物理世界部署时的关键问题——反应延迟(reaction latency),即系统对环境变化作出响应所需的时间。现有异步推理方法虽优化了轨迹平滑性,但忽略了动作执行中的延迟瓶颈,尤其是因固定采样调度导致的早期动作无法及时生成的问题。解决方案的核心在于提出一种名为“快速动作采样以实现即时反应”(Fast Action Sampling for ImmediaTE Reaction, FASTER)的新机制,其关键创新是引入时间感知调度策略(Horizon-Aware Schedule),在流式扩散模型中优先对近期动作进行去噪处理,将立即反应阶段的去噪步骤压缩至单步(如π₀.₅和X-VLA中减少10倍),同时保持远期轨迹质量不变;配合流式客户端-服务器架构,显著降低真实机器人上的有效反应延迟,尤其在消费级GPU上表现突出。

链接: https://arxiv.org/abs/2603.19199
作者: Yuxiang Lu,Zhe Liu,Xianzhe Fan,Zhenya Yang,Jinghua Hou,Junyi Li,Kaixin Ding,Hengshuang Zhao
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction into a single step (a tenfold reduction for, e.g., π₀.₅ and X-VLA), while preserving the quality of the long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
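The claim that reaction time is uniform over an interval set jointly by TTFA and the execution horizon can be made concrete with back-of-the-envelope arithmetic; the step counts and timings below are assumed for illustration, not measured values from the paper:

```python
def mean_reaction(ttfa, exec_horizon, ctrl_dt):
    """Reaction latency ~ Uniform[ttfa, ttfa + exec_horizon * ctrl_dt]:
    an environment change lands at a random point inside the executing chunk."""
    return ttfa + exec_horizon * ctrl_dt / 2.0

# Assumed numbers: 10 denoising steps at 30 ms each, 25-action chunks at 50 Hz.
steps, step_cost = 10, 0.03
H, dt = 25, 0.02

constant_sched = mean_reaction(steps * step_cost, H, dt)  # wait for all steps
horizon_aware = mean_reaction(1 * step_cost, H, dt)       # first action after 1 step
```

Under these assumptions the mean reaction time drops from 0.55 s to 0.28 s: collapsing TTFA to a single denoising step removes the sampling bottleneck, and the remaining latency is governed by the execution horizon.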

[CV-18] Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

【速读】:该论文旨在解决当前鸟瞰图(Bird’s-Eye-View, BEV)感知框架在自动驾驶中因采用端到端训练范式而导致的几何理解不足与可解释性差的问题,进而影响下游任务如语义分割、3D目标检测和运动预测的性能。其解决方案的关键在于引入显式的三维(3D)重建机制,提出Splat2BEV框架——通过预训练一个高斯点绘(Gaussian Splatting)生成器,从多视角图像中显式重建3D场景并生成与几何对齐的特征表示,再将这些特征投影至BEV空间作为下游任务输入,从而实现语义丰富且几何精确的BEV特征学习。

链接: https://arxiv.org/abs/2603.19193
作者: Yiren Lu,Xin Ye,Burhaneddin Yaman,Jingru Luo,Zhexiao Xiong,Liu Ren,Yu Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL

点击查看摘要

Abstract:Bird’s-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

[CV-19] Few-shot Acoustic Synthesis with Multimodal Flow Matching CVPR2026

【速读】:该论文旨在解决虚拟环境中声学一致性生成的问题,即如何在仅依赖少量场景数据的情况下,高效且准确地合成与环境物理特性一致的房间脉冲响应(Room Impulse Response, RIR)。现有神经声学场方法虽能实现空间连续的声音渲染,但存在场景特定性强、训练成本高、难以适应新环境的问题;而少数样本方法虽提升可扩展性,却因确定性建模无法捕捉声学不确定性。其解决方案的关键在于提出一种基于流匹配(flow-matching)的生成式声学合成方法 FLAC(Flow-Matching Acoustic Generation),通过训练一个扩散 Transformer 模型来建模给定空间、几何和声学线索下可能 RIR 的分布,从而在新场景中任意位置生成符合物理规律的 RIR,并实现单次采样即可超越多样本基线的效果。

链接: https://arxiv.org/abs/2603.19176
作者: Amandine Brunetto
机构: Mines Paris - PSL University (巴黎矿业大学-PSL大学)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: To appear at CVPR 2026. 23 pages, 16 figures. Project Page: this https URL

点击查看摘要

Abstract:Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
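
作为摘要中 flow-matching 目标函数的粗略示意,下面的草图在标准的噪声到数据线性路径(rectified flow 常用选择,此处为假设,非论文细节)上训练速度预测器;`model`、批量形状等均为示意性占位。

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """One conditional flow-matching training step on a batch of targets x1
    (e.g., discretized room impulse responses). `model(x_t, t)` is a
    hypothetical stand-in for the velocity-predicting network."""
    x0 = rng.normal(size=x1.shape)           # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))   # one time per batch element
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight-line path
    v_target = x1 - x0                       # constant target velocity along it
    return float(np.mean((model(x_t, t) - v_target) ** 2))

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 128))            # 4 toy "RIR" waveforms of length 128
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), batch, rng)
```

采样时从纯噪声出发,沿预测速度场积分即可得到新位置处的 RIR;论文中的条件信息(空间、几何、声学线索)会作为 `model` 的额外输入。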

[CV-20] ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

【速读】:该论文旨在解决传统像素级损失函数在冠状动脉血管分割中无法约束拓扑结构的问题,导致即使像素级精度高,仍会出现血管树断裂的缺陷。解决方案的关键在于提出一个两阶段框架ARIADNE:第一阶段利用直接偏好优化(Direct Preference Optimization, DPO)对Sa2VA视觉-语言基础模型进行微调,以贝蒂数(Betti number)作为偏好信号,使模型策略从关注像素重叠转向生成几何完整的血管结构;第二阶段将狭窄定位建模为带有显式拒绝机制的马尔可夫决策过程(Markov Decision Process),通过自主排除模糊解剖区域(如分叉和血管交叉点),实现从覆盖最大化到可靠性优化的转变。该方法在1400张临床造影图像上达到中心线Dice值0.838,并将假阳性减少41%,且在多中心基准ARCADE和XCAD上验证了跨设备协议的泛化能力。

链接: https://arxiv.org/abs/2603.19169
作者: Zhan Jin,Yu Luo,Yizhou Zhang,Ziyang Cui,Yuqing Wei,Xianchao Liu,Xueying Zeng,Qing Zhang
机构: Ocean University of China (中国海洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages, 5 figures

点击查看摘要

Abstract:Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
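
摘要中以贝蒂数作为偏好信号的做法,可以用一个极简的 Betti-0(连通分量计数)示意:完整的血管树应只有一个连通分量,分量数更接近 1 的候选分割即为偏好样本。下面的偏好规则是对摘要的简化解读,并非论文的精确 DPO 设置。

```python
import numpy as np
from collections import deque

def betti0(mask):
    """Betti-0 of a binary mask: number of 4-connected foreground components.
    A perfectly segmented vessel tree has betti0 == 1."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros(mask.shape, dtype=bool)
    H, W = mask.shape
    comps = 0
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                comps += 1                    # new component found; flood-fill it
                q = deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
    return comps

def prefer_by_topology(mask_a, mask_b):
    """DPO-style preference signal (simplified): the candidate whose component
    count is closer to a single connected tree is the 'chosen' sample."""
    return mask_a if abs(betti0(mask_a) - 1) <= abs(betti0(mask_b) - 1) else mask_b
```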

[CV-21] Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation CVPR2026

【速读】:该论文旨在解决扩散模型在生成低密度区域目标概念时出现语义错位或结构不一致的问题,这一问题源于文本-图像数据集的长尾分布特性,导致罕见概念或编辑指令在训练中代表性不足。解决方案的关键在于提出自适应辅助提示融合(Adaptive Auxiliary Prompt Blending, AAPB),其通过引入辅助锚点提示(auxiliary anchor prompts)提供语义支持与结构引导,确保生成过程忠实于目标提示;AAPB进一步基于Tweedie恒等式推导出一个闭式自适应系数,在每个扩散步骤中动态优化辅助提示与目标提示之间的权重平衡,从而实现无需训练的、稳定且目标导向的生成效果。

链接: https://arxiv.org/abs/2603.19158
作者: Kwanyoung Lee,SeungJu Cha,Yebin Ahn,Hyunwoo Oh,Sungho Koh,Dong-Jin Kim
机构: Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026 (main track). 10 pages, 6 figures; supplementary material included (14 pages, 11 figures)

点击查看摘要

Abstract:Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie’s identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

[CV-22] ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation CVPR2026

【速读】:该论文旨在解决扩散模型在文本到图像合成中生成罕见组合概念(rare compositional concepts)的难题,尤其是当这些属性在训练数据中不常见时。现有方法如R2F虽利用大语言模型(LLM)进行提示调度,但存在因语言模型随机性导致的固有方差,以及迭代文本嵌入切换带来的次优引导问题。论文提出的ADAPT框架是一种无需训练的解决方案,其关键在于通过注意力分数(attention scores)和正交分量(orthogonal components)实现确定性的提示调度规划与语义对齐,从而提供一致且精确的引导,显著提升RareBench基准上罕见概念的组合生成能力,同时保持视觉完整性。

链接: https://arxiv.org/abs/2603.19157
作者: Kwanyoung Lee,Hyunwoo Oh,SeungJu Cha,Sungho Koh,Dong-Jin Kim
机构: Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)

点击查看摘要

Abstract:Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.

[CV-23] GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

【速读】:该论文旨在解决现有场景表示方法在具身探索任务中缺乏事后可重观测性(post-hoc re-observability)的问题,即当初始观察遗漏目标时,记忆缺失往往不可恢复。解决方案的关键在于提出GSMem框架,其核心是基于3D高斯溅射(3D Gaussian Splatting, 3DGS)构建持续的空间记忆,使代理具备空间回溯能力(Spatial Recollection),能够从先前未占据的最优视角渲染逼真的新视图,从而支持高保真视觉-语言模型(Vision-Language Model, VLM)推理。该框架通过并行利用对象级场景图与语义级语言场进行检索,实现目标区域的鲁棒定位,并结合VLM驱动的语义评分与基于3DGS的覆盖目标的混合探索策略,平衡任务导向探索与几何覆盖需求。

链接: https://arxiv.org/abs/2603.19137
作者: Yiren Lu,Yi Du,Disheng Liu,Yunlai Zhou,Chen Wang,Yu Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page at this https URL

点击查看摘要

Abstract:Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

[CV-24] Revisiting Autoregressive Models for Generative Image Classification

【速读】:该论文旨在解决传统自回归(Autoregressive, AR)图像生成模型在分类任务中表现受限的问题,其核心限制在于对固定标记顺序的依赖,这导致模型仅能利用部分判别性线索,从而引入了不利于图像理解的归纳偏置。解决方案的关键在于引入最近发展的任意顺序自回归(any-order AR)模型,通过估计对标记顺序边缘化的预测结果,使AR模型能够整合多种标记顺序下的信息,从而获得更全面的判别信号。这一方法显著提升了AR模型的分类性能,并在多个图像分类基准上超越扩散模型,同时效率提升达25倍。

链接: https://arxiv.org/abs/2603.19122
作者: Ilia Sudakov,Artem Babenko,Dmitry Baranchuk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report

点击查看摘要

Abstract:Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.
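
摘要中"对标记顺序边缘化"的分类思路可直接示意:对每个类别在若干采样顺序下打分并取平均,再取最大者。其中 `score` 是任意顺序 AR 模型对数似然调用的假设性占位(返回 log p(tokens | class, order)),对采样顺序求均值即为顺序边缘化预测的蒙特卡洛估计。

```python
import numpy as np

def order_marginalized_classify(score, tokens, num_classes, orders):
    """Generative classification with an any-order AR model.
    `score(tokens, class_id, order)` is a hypothetical stand-in returning
    log p(tokens | class, order); averaging over sampled orders gives a
    Monte-Carlo estimate of the order-marginalized prediction."""
    class_scores = [np.mean([score(tokens, c, o) for o in orders])
                    for c in range(num_classes)]
    return int(np.argmax(class_scores))
```

与固定单一顺序相比,多顺序平均能综合不同局部判别线索,这正是摘要指出的固定顺序归纳偏置的对症做法。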

[CV-25] CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization CVPR2026

【速读】:该论文旨在解决高保真、可定制的3D室内场景纹理生成问题,现有文本驱动方法虽具灵活性但缺乏细粒度实例级控制能力,且常产生质量不足、伪影明显及烘焙阴影(baked-in shading)的纹理。其解决方案的关键在于提出CustomTex框架,通过双蒸馏机制分离语义控制与像素级增强:一方面利用实例交叉注意力(instance cross-attention)进行语义层面蒸馏以保证语义合理性与参考图像实例的一致性;另一方面通过像素级蒸馏提升视觉保真度,二者统一于变分分数蒸馏(Variational Score Distillation, VSD)优化框架中,从而实现高质量、实例级精确可控的3D场景纹理生成。

链接: https://arxiv.org/abs/2603.19121
作者: Weilin Chen,Jiahao Rao,Wenhao Wang,Xinyang Li,Xuan Cheng,Liujuan Cao
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026. This version integrates the main paper and supplementary material

点击查看摘要

Abstract:The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

[CV-26] AU-R1: Visual Language Model for Traffic Anomaly Understanding

【速读】:该论文旨在解决交通异常理解(Traffic Anomaly Understanding, TAU)在智能交通系统中进展受限的问题,其根本原因在于缺乏专门针对TAU任务的基准数据集和有效的建模方法。为此,作者提出了Roundabout-TAU数据集,该数据集由与印第安纳州卡梅尔市合作采集的真实环形交叉口视频构建而成,包含342个片段及超过2000个问答对,覆盖多维度的交通异常理解需求。在此基础上,论文进一步设计了TAU-R1框架,这是一个两层视觉-语言模型:第一层为轻量级异常分类器,实现粗粒度异常类别划分;第二层为大型异常推理模块,生成详细的事件摘要。解决方案的关键在于引入一种两阶段训练策略——首先通过分解问答增强的监督微调(decomposed-QA-enhanced supervised fine-tuning)提升基础理解能力,再采用基于GRPO(Generalized Reward Policy Optimization)的后训练方法(TAU-GRPO),结合任务特异性奖励函数优化推理性能,从而在保持部署效率的同时显著提升异常识别与解释能力。

链接: https://arxiv.org/abs/2603.19098
作者: Yuqiang Lin,Kehua Chen,Sam Lockyer,Arjun Yadav,Mingxuan Sui,Shucheng Zhang,Yan Shi,Bingzhang Wang,Yuang Zhang,Markus Zarbock,Florain Stanek,Adrian Evans,Wenbin Li,Yinhai Wang,Nic Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: this https URL

[CV-27] Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline

【速读】:该论文旨在解决光学遥感影像中因光照变化、季节差异及地表覆盖材料变异导致的伪变化和语义模糊问题,尤其是在小尺度建筑变化检测任务中,单一RGB影像难以准确区分真实变化与环境干扰。解决方案的关键在于构建一个高分辨率且精确配准的双时相RGB-NIR多模态数据集(LSMD),并提出多模态光谱互补网络(MSCNet),其核心创新包括:邻域上下文增强模块(NCEM)以强化局部空间细节,跨模态对齐与交互模块(CAIM)实现RGB与近红外(NIR)特征的深度交互,以及显著性感知多源优化模块(SMRM)对融合特征进行渐进式精炼,从而有效利用多模态异质信息提升细粒度建筑变化检测精度。

链接: https://arxiv.org/abs/2603.19077
作者: Ye Wang,Wei Lu,Zhihui You,Keyan Chen,Tongfei Liu,Kaiyu Li,Hongruixuan Chen,Qingling Shu,Sibao Chen
机构: Anhui University (安徽大学); Anhui University of Science and Technology (安徽科技大学); Nanyang Technological University (南洋理工大学); Shaanxi University of Science and Technology (陕西科技大学); Xi’an Jiaotong University (西安交通大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: this https URL

[CV-28] DROID-SLAM in the Wild CVPR2026

【速读】:该论文旨在解决传统视觉SLAM(Simultaneous Localization and Mapping,即时定位与地图构建)系统在动态环境中因假设场景静态而导致的跟踪失败问题。现有动态SLAM方法虽尝试通过预定义动态先验或不确定性感知映射来应对动态物体,但在未知动态对象或高度杂乱场景中仍表现受限,因几何映射可靠性下降。其解决方案的关键在于引入可微分的不确定性感知Bundle Adjustment(BA),通过多视角视觉特征不一致性估计每个像素的不确定性,从而实现鲁棒的相机位姿估计与场景重建,即使在真实世界的复杂动态环境中也能保持实时性能(约10 FPS),并达到当前最优的位姿精度和场景结构恢复效果。

链接: https://arxiv.org/abs/2603.19076
作者: Moyang Li,Zihan Zhu,Marc Pollefeys,Daniel Barath
机构: ETH Zurich (苏黎世联邦理工学院); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2026, Project Page: this https URL

点击查看摘要

Abstract:We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at this https URL.

[CV-29] SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

【速读】:该论文旨在解决手语(Sign Language, SL)标注与数据集构建中长期存在的两大问题:一是传统计算方法多基于词素(gloss)层级操作,忽视了语言学层面的细微差异;二是人工语言学标注效率低下、成本高昂,难以支撑大规模、音位感知数据集的建设。解决方案的关键在于提出 SignAgent 框架,其核心由两个模块构成:一是 SignAgent Orchestrator,一个具备推理能力的大语言模型(Large Language Model, LLM),用于协调多种语言学工具;二是 SignGraph,一个知识驱动的LLM,提供词汇和语言学上的锚定支持。该框架通过多模态证据约束分配和视觉相似性与音位重叠联合推理,实现了高效且语言学敏感的自动标注与聚类,显著提升了大规模手语数据标注的质量与可扩展性。

链接: https://arxiv.org/abs/2603.19059
作者: Oliver Cory,Ozge Mercanoglu Sincan,Richard Bowden
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

[CV-30] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

【速读】:该论文旨在解决当前生成式视频大模型(VideoLLM)在流式视频理解中因依赖逐帧触发决策而导致的效率-准确率权衡问题。其解决方案的关键在于提出Em-Garde框架,通过解耦语义理解与流式感知过程:在查询时,使用指令引导的提案解析器(Instruction-Guided Proposal Parser)将用户查询转化为结构化的、具感知基础的视觉提案;在流式处理阶段,则采用轻量级提案匹配模块(Lightweight Proposal Matching Module)进行基于嵌入的高效匹配以触发响应,从而在严格计算约束下实现更准确且高效的主动响应能力。

链接: https://arxiv.org/abs/2603.19054
作者: Yikai Zheng,Xin Ding,Yifan Yang,Shiqi Jiang,Hao Wu,Qianxi Zhang,Weijun Wang,Ting Cao,Yunxin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which suffer from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
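
轻量提案匹配的触发逻辑可示意为:将当前帧嵌入与查询时解析出的提案嵌入做余弦比较,任一提案超过阈值即触发响应。其中余弦相似度与 0.8 阈值均为示意性假设,并非论文的精确匹配规则。

```python
import numpy as np

def should_trigger(frame_emb, proposal_embs, threshold=0.8):
    """Fire a proactive response when the current frame embedding matches any
    pre-parsed visual proposal. Cosine matching and the 0.8 threshold are
    illustrative assumptions, not the paper's exact rule."""
    f = frame_emb / np.linalg.norm(frame_emb)
    p = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    return bool(np.max(p @ f) >= threshold)
```

这一设计把每帧的代价压缩到一次矩阵-向量乘法,昂贵的语义理解只在查询时由解析器执行一次。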

[CV-31] SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation CVPR2026

【速读】:该论文旨在解决3D服装生成中长期存在的效率与质量难以兼顾的问题,即现有方法依赖大型视觉-语言模型生成二维裁剪图(sewing patterns),再通过服装建模框架(如GarmentCode)转化为模拟可用的三维网格(3D meshes),虽能实现高质量结果但推理时间长达30秒至1分钟。其解决方案的关键在于提出一种两阶段轻量化框架SwiftTailor,通过紧凑的几何图像(geometry image)表示统一缝合模式推理与基于几何的网格合成:第一阶段由PatternMaker模块高效预测多模态输入下的缝合图;第二阶段由GarmentSewer模块将缝合图转换为编码所有衣片三维表面的统一UV空间中的服装几何图像,并借助逆映射、重网格化和动态缝合算法直接构建3D网格,从而摊销物理模拟成本,显著提升推理速度并保持最优精度与视觉保真度。

链接: https://arxiv.org/abs/2603.19053
作者: Phuc Pham,Uy Dieu Tran,Binh-Son Hua,Phong Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: CVPR 2026

点击查看摘要

Abstract:Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling frameworks such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

[CV-32] Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

【速读】:该论文旨在解决生成式视频模型在动态场景中常出现的三维空间几何不一致性问题,现有评估方法如FVD(Fréchet Video Distance)对几何失真不敏感,而一致性基准又可能误判有效的前景运动。解决方案的关键在于提出SGC(Spatial Geometric Consistency)指标,通过估计不同局部区域的相机位姿并量化其发散程度来衡量几何一致性:首先分离静态与动态区域,再将静态背景划分为空间一致的子区域,对每个子区域预测深度并估计局部相机位姿,最终以位姿差异作为几何一致性度量标准。实验表明,SGC能有效识别现有指标遗漏的关键几何错误。

链接: https://arxiv.org/abs/2603.19048
作者: Weijia Dou,Wenzhao Zheng,Weiliang Chen,Yu Zheng,Jie Zhou,Jiwen Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL

点击查看摘要

Abstract:Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each subregion, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generated videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
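
该指标的核心计算——各子区域估计的相机位姿之间的发散度——可以如下示意;以旋转测地角加平移距离作为发散度只是一种假设性选择,论文的精确公式可能不同。

```python
import numpy as np

def rotation_geodesic(R1, R2):
    """Geodesic angle (radians) between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards numerical noise

def sgc_score(poses):
    """Mean pairwise divergence among per-region camera poses (R, t).
    Lower is more geometrically consistent. The rotation-angle-plus-
    translation-distance divergence is an illustrative assumption."""
    divs = []
    for i in range(len(poses)):
        for j in range(i + 1, len(poses)):
            (Ri, ti), (Rj, tj) = poses[i], poses[j]
            divs.append(rotation_geodesic(Ri, Rj) + float(np.linalg.norm(ti - tj)))
    return float(np.mean(divs))
```

几何一致的视频中,各静态子区域应当解出几乎相同的相机位姿,得分趋近 0;几何失真会使子区域位姿互相矛盾,得分升高。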

[CV-33] TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation CVPR2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在地球观测(Earth Observation, EO)任务中难以实现精确像素级空间推理的问题。现有VLMs缺乏对复杂空间关系的精准定位能力,尤其在需要多时相分析与多模态输入融合的场景下表现不足。解决方案的关键在于提出TerraScope——一个统一的VLM架构,具备两个核心能力:一是模态灵活推理(modality-flexible reasoning),可处理单一模态(光学或合成孔径雷达SAR)输入,并在双模态可用时自适应融合;二是多时相推理(multi-temporal reasoning),能够整合时间序列数据以支持变化检测分析。此外,研究构建了包含百万级样本的Terra-CoT数据集和首个面向像素级地理空间推理的基准测试TerraScope-Bench,通过评估答案准确性和掩码质量来确保推理的真实性与可解释性。实验表明,TerraScope在像素级地理空间推理任务上显著优于现有方法。

链接: https://arxiv.org/abs/2603.19039
作者: Yan Shu,Bin Ren,Zhitong Xiong,Xiao Xiang Zhu,Begüm Demir,Nicu Sebe,Paolo Rota
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 (Main Track)

点击查看摘要

Abstract:Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

[CV-34] FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal

【速读】:该论文旨在解决真实场景中单张图像去反射(Single Image Reflection Removal, SIRR)的难题,其中反射强度空间分布不均且反射模式与透射结构高度耦合。解决方案的关键在于提出一种带有先验调制机制的扩散模型框架(FUMO),通过从混合图像中直接提取两个显式先验信号来增强空间可控性和结构保真度:一是强度先验(intensity prior),用于估计各区域的反射严重程度;二是高频先验(high-frequency prior),通过多尺度残差聚合捕捉对细节敏感的响应特征。该方法采用粗到精的训练范式,在第一阶段利用这些先验引导条件残差注入,聚焦于反射主导且结构敏感区域;第二阶段则引入细粒度优化网络,在图像空间中修正局部错位并锐化细节,从而在标准基准和野外挑战性图像上均实现定量指标领先与感知质量显著提升。

链接: https://arxiv.org/abs/2603.19036
作者: Telang Xu,Chaoyang Zhang,Guangtao Zhai,Xiaohong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with prior modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image, an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at this https URL.
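FUMO的高频先验通过多尺度残差聚合捕捉细节敏感区域。以下是一个概念性示意(非论文实现,模糊核尺寸与归一化方式均为假设):在多个尺度上累加 |原图 − 模糊图| 残差,得到突出边缘与细节的响应图。

```python
import numpy as np

def box_blur(img, k):
    """基于积分图(summed-area table)的盒式模糊,边缘采用复制填充。"""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    c = np.cumsum(np.cumsum(p, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # 前面补一行一列0,便于取窗口和
    h, w = img.shape
    out = c[k:k + h, k:k + w] - c[:h, k:k + w] - c[k:k + h, :w] + c[:h, :w]
    return out / (k * k)

def high_frequency_prior(img, kernels=(3, 7, 15)):
    """多尺度残差聚合:|原图 - 模糊图| 在多个尺度上累加并归一化,
    得到对细节/边缘敏感的响应图。"""
    agg = sum(np.abs(img - box_blur(img, k)) for k in kernels)
    return agg / (agg.max() + 1e-8)

# 含一条竖直边缘的合成图:先验图应在边缘附近响应最强
img = np.zeros((16, 16))
img[:, 8:] = 1.0
prior = high_frequency_prior(img)
```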

[CV-35] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models CVPR

【速读】:该论文旨在解决视觉-语言模型(如CLIP)在大规模、未经筛选的训练数据中引入的社会偏见和虚假关联(spurious biases)问题,这些问题会严重损害模型的公平性和可靠性。现有后处理去偏方法通常直接在密集的CLIP嵌入空间中操作,但由于偏见信息与任务相关语义高度纠缠,难以在不损害语义保真度的前提下有效去偏。本文提出了一种名为稀疏嵌入调制(Sparse Embedding Modulation, SEM)的零样本去偏框架,其关键创新在于将CLIP文本嵌入分解到稀疏自动编码器(Sparse Autoencoder, SAE)的潜在空间中,从而实现特征解耦;在此空间中,SEM能够精准识别并调节与偏见相关的神经元,同时保留对查询任务重要的特征,支持更精确的非线性干预。实验证明,SEM在多个基准数据集和CLIP骨干网络上显著提升了检索和零样本分类任务的公平性表现。

链接: https://arxiv.org/abs/2603.19028
作者: Quentin Guimard,Federico Bartsch,Simone Caldarella,Rahaf Aljundi,Elisa Ricci,Massimiliano Mancini
机构: University of Trento(特伦托大学); Toyota Motor Europe(丰田欧洲公司); Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR Findings 2026. Project website: this https URL

点击查看摘要

Abstract:Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
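SEM的核心操作可以概括为:把嵌入编码到稀疏自编码器(SAE)潜空间,抑制偏见相关神经元后再映射回嵌入空间。以下为一个极简示意(权重为随机初始化,神经元索引为虚构,仅演示流程,并非论文训练的SAE或其神经元选择方法):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 32  # 嵌入维度与SAE潜空间维度(示意用的小尺寸)
W_enc = rng.normal(size=(k, d)) * 0.1
W_dec = rng.normal(size=(d, k)) * 0.1

def sae_encode(x):
    return np.maximum(W_enc @ x, 0.0)  # ReLU产生稀疏潜表示

def sae_decode(z):
    return W_dec @ z

def debias(x, bias_neurons, scale=0.0):
    """抑制(或缩放)偏见相关的SAE神经元,再映射回嵌入空间。
    只把调制引起的增量加回原嵌入,以尽量保留任务相关信息。"""
    z = sae_encode(x)
    z_mod = z.copy()
    z_mod[list(bias_neurons)] *= scale
    return x + sae_decode(z_mod) - sae_decode(z)

x = rng.normal(size=d)
x_debiased = debias(x, bias_neurons=[3, 7], scale=0.0)
```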

[CV-36] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token CVPR2026

【速读】:该论文旨在解决当前基于多模态大语言模型(Multi-modal Large Language Models, MLLMs)的图像分割方法普遍依赖专用掩码解码器(mask decoder)或额外token来解析分割相关嵌入与视觉特征的问题,从而限制了模型端到端的简洁性与效率。其核心解决方案在于提出一种仅使用单一分割嵌入(Segmentation Embedding, SELF1E)即可实现高质量分割的方法,关键创新点在于:首先保留原始分辨率未压缩的图像特征,并通过从MLLM处理后的压缩特征中提取残差特征进行填充,提升特征精度;其次,在未压缩和已压缩特征上分别执行像素逆洗操作(pixel-unshuffle),以释放压缩特征中的细节并放大残差特征,进一步增强重构分辨率;最后,设计双感知路径注意力掩码机制(dual perception pathways: image-to-image and image-to-segmentation),促进像素与分割token间的丰富特征交互。实验证明,该方法在多个分割任务中达到与专用解码器方法相当的性能,验证了无需外部解码器即可实现高效分割的可行性。

链接: https://arxiv.org/abs/2603.19026
作者: Anqi Zhang,Xiaokang Ji,Guangyu Gao,Jianbo Jiao,Chi Harold Liu,Yunchao Wei
机构: Beijing Institute of Technology (北京理工大学); University of Birmingham (伯明翰大学); Beijing Jiaotong University (北京交通大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper is accepted by CVPR 2026

点击查看摘要

Abstract:Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: this https URL.
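文中多次用到pixel-unshuffle(像素逆洗)操作,即以降低空间分辨率为代价把邻域像素堆叠到通道维度。下面用NumPy给出该操作的一个等价实现(仅作示意,通道排列方式与常见深度学习框架的实现一致):

```python
import numpy as np

def pixel_unshuffle(x, r):
    """(C, H, W) -> (C*r*r, H//r, W//r):用空间分辨率换通道数。"""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, r, r, H//r, W//r)
    return x.reshape(c * r * r, h // r, w // r)

feat = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
packed = pixel_unshuffle(feat, 2)
```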

[CV-37] Generalized Hand-Object Pose Estimation with Occlusion Awareness

【速读】:该论文旨在解决从单张RGB图像中进行泛化性强的3D手-物体姿态估计(hand-object pose estimation)问题,尤其在物体外观变化大、交互模式多样且存在严重遮挡(occlusion)的情况下。其解决方案的关键在于提出了一种具有遮挡感知能力的框架GenHOI,通过引入层次化语义提示(hierarchical semantic prompt)来编码物体状态、手部构型和交互模式的文本描述,从而学习抽象的高层手-物体交互表示,以提升对未见物体和新型交互的泛化能力;同时采用多模态掩码建模策略(multi-modal masked modeling)对RGB图像、预测点云和文本描述进行联合优化,并利用手部先验(hand priors)作为稳定的时空参考来提取隐式交互约束,显著增强模型在物体形状和交互模式剧烈变化下的鲁棒性与姿态推理准确性。

链接: https://arxiv.org/abs/2603.19013
作者: Hui Yang,Wei Sun,Jian Liu,Jian Xiao,Tao Xie,Hossein Rahmani,Ajmal Saeed Mian,Nicu Sebe,Gim Hee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 7 figures

点击查看摘要

Abstract:Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

[CV-38] Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement

【速读】:该论文旨在解决低质量指纹图像在指纹识别系统中导致的 minutiae(细节特征点)提取不准确的问题,尤其针对当前先进增强方法在处理此类图像时性能受限且计算复杂度高的缺陷。解决方案的关键在于提出一种极简主义(minimalist)的指纹增强策略,包含两种创新方法:一是基于上下文信息的滤波方法,二是基于学习的增强方法。这两种方法在保持结构简单的同时,显著提升了指纹图像的清晰度、准确性与抗噪能力,且在具有挑战性的潜指纹数据库上验证了其优越性,表明简化设计可有效提升实际应用效果。

链接: https://arxiv.org/abs/2603.19004
作者: Raffaele Cappelli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.
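摘要未给出上下文滤波的具体形式;经典指纹增强通常采用按局部脊线方向与频率调谐的Gabor滤波。以下给出一个示意性实现(参数均为假设值,真实系统需逐块估计方向场与脊线频率并逐块选核):

```python
import numpy as np

def gabor_kernel(theta, freq, sigma=4.0, size=11):
    """按脊线方向theta与频率freq调谐的偶对称Gabor核。"""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # 沿垂直于脊线的方向振荡
    yr = -x * np.sin(theta) + y * np.cos(theta)
    env = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    return env * np.cos(2.0 * np.pi * freq * xr)

def contextual_filter(img, theta, freq):
    """用单一Gabor核对整幅图做FFT循环卷积(整图共用一个核,仅作演示)。"""
    k = gabor_kernel(theta, freq)
    pad = np.zeros_like(img)
    kh, kw = k.shape
    pad[:kh, :kw] = k
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))  # 核中心对齐原点
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(pad)))

# 合成竖直脊线 + 噪声:匹配方向/频率的滤波应提高脊线结构的对比度
h = w = 32
xx = np.tile(np.arange(w), (h, 1))
ridges = np.cos(2.0 * np.pi * 0.125 * xx)
noisy = ridges + np.random.default_rng(1).normal(0.0, 0.5, size=(h, w))
enhanced = contextual_filter(noisy, theta=0.0, freq=0.125)
```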

[CV-39] CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think CVPR2026

【速读】:该论文旨在解决当前扩散模型(Diffusion Models)在对齐人类偏好时面临的两大挑战:一是监督微调(SFT)依赖高质量但昂贵的图像数据,二是基于DPO(Direct Preference Optimization)的偏好优化方法需要大规模且质量不一致的偏好数据集,同时存在计算效率低的问题。解决方案的关键在于提出一种轻量级但高效的微调范式——复合奖励辅助微调(Composite Reward Assisted Fine-Tuning, CRAFT),其核心创新包括:首先采用复合奖励过滤(Composite Reward Filtering, CRF)技术构建高质量、一致性的训练数据集,从而显著减少对标注数据的需求;其次,在此基础上执行改进的SFT流程,并从理论上证明CRAFT等价于优化群体强化学习(group-based reinforcement learning)的下界,建立了选择性数据微调与强化学习之间的原则性联系。实验表明,CRAFT仅用100个样本即可超越使用数千样本的现有最优方法,且收敛速度提升11–220倍,展现出卓越的数据效率和计算效率。

链接: https://arxiv.org/abs/2603.18991
作者: Zening Sun,Zhengpeng Xie,Lichen Bai,Shitong Shao,Shuo Yang,Zeke Xie
机构: HKUST (GZ); HIT (SZ)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR2026

点击查看摘要

Abstract:Aligning Diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220x faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
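CRF(复合奖励过滤)的思路可以概括为:用多个奖励模型为候选样本打分、归一化后加权合并,只保留得分最高的少量样本用于后续SFT。以下为一个玩具示意(奖励函数、权重与样本形式均为虚构,并非论文所用的奖励模型):

```python
import numpy as np

def composite_reward_filter(samples, reward_fns, weights, keep=100):
    """对每个样本用多个奖励函数打分,z-score归一化后加权合并,
    仅保留综合得分最高的 keep 个样本。"""
    scores = np.zeros(len(samples))
    for fn, w in zip(reward_fns, weights):
        s = np.array([fn(x) for x in samples], dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-8)  # 归一化,避免单一奖励主导
        scores += w * s
    order = np.argsort(-scores)
    return [samples[i] for i in order[:keep]]

# 玩具"样本"为整数;两个玩具奖励分别偏好大数和偶数
pool = list(range(50))
selected = composite_reward_filter(
    pool,
    reward_fns=[lambda x: x, lambda x: 1.0 if x % 2 == 0 else 0.0],
    weights=[0.5, 0.5],
    keep=5,
)
```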

[CV-40] VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

【速读】:该论文旨在解决零样本(zero-shot)、几何一致性的全景深度估计问题,即在不依赖特定训练数据的情况下,从单张全景图像中准确重建具有全局几何一致性的深度图。现有方法多基于视图独立的推理机制,难以保证全景范围内的结构一致性。其解决方案的关键在于提出VGGT-360框架,该框架通过将任务重新定义为基于多视角重建三维模型的全景重投影过程,利用VGGT类基础模型固有的三维一致性,统一碎片化的单视角推理为连贯的全景理解。核心创新包括三个可插拔模块:(i) 不确定性引导的自适应投影,根据梯度不确定性动态分配更多视角至几何信息贫乏区域;(ii) 结构显著性增强注意力机制,在VGGT的注意力层注入结构感知置信度以提升三维重建鲁棒性;(iii) 相关性加权三维模型修正,利用注意力推断的相关性分数重新加权重叠点,从而提供精确且一致的几何基础用于全景重投影。

链接: https://arxiv.org/abs/2603.18943
作者: Jiayi Yuan,Haobo Jiang,De Wen Soh,Na Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT’s robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

[CV-41] Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching

【速读】:该论文旨在解决非刚性可变形三维形状之间对应关系估计的问题,这是计算机视觉与图形学中的一个关键挑战。现有基于深度函数映射(functional map)的方法通常仅优化点对点或函数映射,未能直接提升嵌入空间中的特征表示质量,导致特征一致性不足、匹配性能受限;同时依赖耗时的传统函数映射求解器,计算成本高昂。其解决方案的关键在于提出一种全新的无监督对比学习方法,通过最大化正样本对间的一致性并最小化负样本对间的相似性,显著增强特征的判别性和一致性;并设计了一个大幅简化的函数映射学习架构,摒弃了昂贵的函数映射求解器和多个辅助损失项,从而大幅提升计算效率。二者结合形成统一的双分支流水线,在准确性和效率上均达到当前最优水平,且在多种复杂场景(如近等距、非等距及拓扑不一致情形)下优于现有最先进方法,甚至超越监督方法。

链接: https://arxiv.org/abs/2603.18924
作者: Feifan Luo,Hongyang Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.
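摘要只说明损失目标是最大化正样本对一致性、最小化负样本对相似度;一个标准的实例化是InfoNCE对比损失(此处仅作说明,未必是论文采用的具体形式):把两个形状上互相对应的顶点特征作为正样本对,其余顶点作为负样本。

```python
import numpy as np

def info_nce(feat_x, feat_y, tau=0.07):
    """两个形状逐顶点特征的对比损失:第i行互为正样本对,其余行作为负样本。"""
    fx = feat_x / np.linalg.norm(feat_x, axis=1, keepdims=True)
    fy = feat_y / np.linalg.norm(feat_y, axis=1, keepdims=True)
    logits = fx @ fy.T / tau                     # 余弦相似度 / 温度
    logits -= logits.max(axis=1, keepdims=True)  # 数值稳定
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
f = rng.normal(size=(16, 8))
# 对应顶点特征近似一致时损失低;特征完全无关时损失高
loss_aligned = info_nce(f, f + 0.01 * rng.normal(size=(16, 8)))
loss_random = info_nce(f, rng.normal(size=(16, 8)))
```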

[CV-42] GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

【速读】:该论文旨在解决从单目RGB视频中重建真实手物交互(hand-object interaction)的难题,现有方法通常依赖类别特定模板或计算复杂度高,且难以保证3D空间中的物理一致性。其解决方案的关键在于提出GHOST(Gaussian Hand-Object Splatting),该框架采用2D Gaussian Splatting表示手法与物体为密集、视角一致的高斯盘,并引入三项核心创新:(1) 基于几何先验的检索与一致性损失以补全被遮挡的物体区域;(2) 抓握感知对齐机制优化手部平移和物体尺度以确保接触合理性;(3) 手部感知背景损失避免对被手遮挡区域的错误惩罚。该方法实现了无需类别先验、运行速度快一个数量级的同时,获得完整、物理一致且可动画化的3D重建结果。

链接: https://arxiv.org/abs/2603.18912
作者: Ahmed Tawfik Aboukhadra,Marcel Rogge,Nadia Robertini,Abdalla Arafa,Jameel Malik,Ahmed Elhayek,Didier Stricker
机构: RPTU; DFKI-AV Kaiserslautern; UPM Saudi Arabia; NUST-SEECS Pakistan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at this https URL.

[CV-43] Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness

【速读】:该论文旨在解决神经退行性疾病诊断中正电子发射断层扫描(PET)成本高、辐射暴露大以及磁共振成像(MRI)敏感性不足的问题。为实现从MRI生成高质量合成PET图像,现有方法多关注结构保真度而忽视病理特征的捕捉。解决方案的关键在于提出PASTA框架——一种基于条件扩散模型的图像翻译方法,其核心创新包括:1)具有高度交互性的双臂架构与多模态条件融合机制,以同时保留解剖结构和病理细节;2)引入新颖的循环交换一致性约束和体积生成策略,显著提升3D PET图像的质量与病理感知能力。实验表明,合成PET在阿尔茨海默病诊断中的性能较MRI提升4%,接近真实PET水平。

链接: https://arxiv.org/abs/2603.18896
作者: Yitong Li,Igor Yakushev,Dennis M. Hedderich,Christian Wachinger
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by Medical Image Analysis

点击查看摘要

Abstract:Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA’s ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer’s diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at this https URL.

[CV-44] MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在物理环境中作为视觉-语言-动作(Vision-Language-Action, VLA)代理时,普遍缺乏多跳组合空间推理能力与精确视觉定位能力的问题。现有基准测试主要聚焦于简单的一跳空间关系,无法充分评估模型在真实场景中处理复杂空间逻辑和精准视觉对齐的能力。解决方案的关键在于:首先提出一个名为MultihopSpatial的综合性基准,涵盖1至3跳的复杂空间查询及多样化的视角;其次引入Acc@50IoU指标,同时衡量推理准确性和边界框预测精度,以强化对视觉定位能力的评估;最后构建MultihopSpatial-Train大规模训练语料库,通过强化学习后训练显著提升VLM的空间推理能力和具身操作任务表现。

链接: https://arxiv.org/abs/2603.18892
作者: Youngwan Lee,Soojin Jang,Yoorhim Cho,Seunghwan Lee,Yong-Ju Lee,Sung Ju Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
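Acc@50IoU 同时要求答案正确且预测框与真值框的IoU不低于0.5,二者缺一不计分。按此定义可写出如下参考实现(字段名为假设,仅作示意):

```python
def iou(box_a, box_b):
    """两个 (x1, y1, x2, y2) 框的交并比。"""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def acc_at_50iou(preds, gts):
    """答案正确 且 IoU >= 0.5 才计为命中。"""
    hits = sum(1 for p, g in zip(preds, gts)
               if p["answer"] == g["answer"] and iou(p["box"], g["box"]) >= 0.5)
    return hits / len(gts)

preds = [
    {"answer": "A", "box": (0, 0, 10, 10)},    # 答案对、框准
    {"answer": "B", "box": (0, 0, 10, 10)},    # 答案错
    {"answer": "C", "box": (20, 20, 30, 30)},  # 答案对、框偏
]
gts = [
    {"answer": "A", "box": (0, 0, 10, 10)},
    {"answer": "A", "box": (0, 0, 10, 10)},
    {"answer": "C", "box": (0, 0, 10, 10)},
]
score = acc_at_50iou(preds, gts)
```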

[CV-45] PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion Concentration and Alignment ICLR2026

【速读】:该论文旨在解决视觉上下文学习(Visual In-Context Learning, VICL)中因现有提示融合(prompt fusion)方法受限于局部 patch 级别融合框架和模型无关监督机制,导致难以充分挖掘有效信息线索的问题。其解决方案的关键在于提出 PromptHub 框架,通过引入局域感知融合(locality-aware fusion)、互补的集中(concentration)、对齐(alignment)与预测目标协同训练,并结合数据增强以强化监督信号,从而在多个基础视觉任务及分布外场景下显著提升性能,建立了一种超越传统 patch 级融合的可靠局域感知提示融合范式。

链接: https://arxiv.org/abs/2603.18891
作者: Tianci Luo,Jinpeng Wang,Shiyu Qin,Niu Lian,Yan Feng,Bin Chen,Chun Yuan,Shu-Tao Xia
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICLR 2026. 17 pages, 11 figures, and 9 tables

点击查看摘要

Abstract:Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings, and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at this https URL.

[CV-46] Motion-o: Trajectory-Grounded Video Reasoning

【速读】:该论文旨在解决视频理解中轨迹推理(trajectory reasoning)缺失的问题,即现有模型虽能利用时空证据链增强推理能力,但缺乏对物体运动模式的显式建模与验证机制,导致轨迹理解隐含且难以评估。其核心解决方案是提出Spatial-Temporal-Trajectory (STT)推理框架,并设计Motion-o系统,该系统通过引入轨迹锚定(trajectory-grounding)数据增强方法,从稀疏关键帧监督中生成更密集的边界框轨迹,从而强化轨迹级别的训练信号;同时创新性地提出Motion Chain of Thought (MCoT)结构化推理路径,以离散的motion标签形式显式记录每个物体的方向、速度及速度尺度变化,实现观测到轨迹的显式连接。该方案无需修改模型架构即可提升空间-时间定位精度和轨迹预测性能,推动了基于视觉证据的视频理解向可解释、可验证的运动推理演进。

链接: https://arxiv.org/abs/2603.18856
作者: Bishoy Galoaa,Shayda Moezzi,Xiangyu Bai,Sarah Ostadabbas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about *how* objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce **Motion-o**, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories explicit through discrete motion tags, each summarizing per-object direction, speed, and scale (of velocity) change, to connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at this https URL.
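MCoT 中的 motion 标签为每个物体记录方向、速度与尺度变化。下面以边界框轨迹为输入,示意如何导出这类离散标签(标签格式、阈值与离散化方式均为假设,并非论文的定义):

```python
import numpy as np

def motion_tag(track, fast_thresh=5.0):
    """把逐帧 (cx, cy, w, h) 框轨迹总结为粗粒度标签:
    主导方向、速度等级与尺度变化。"""
    track = np.asarray(track, dtype=float)
    d = track[-1, :2] - track[0, :2]  # 中心点净位移
    step = np.linalg.norm(np.diff(track[:, :2], axis=0), axis=1).mean()
    if abs(d[0]) >= abs(d[1]):
        direction = "right" if d[0] >= 0 else "left"
    else:
        direction = "down" if d[1] >= 0 else "up"  # 图像y轴向下
    speed = "fast" if step >= fast_thresh else "slow"
    area0 = track[0, 2] * track[0, 3]
    area1 = track[-1, 2] * track[-1, 3]
    scale = ("growing" if area1 > 1.1 * area0
             else "shrinking" if area1 < 0.9 * area0 else "stable")
    return f"<motion dir={direction} speed={speed} scale={scale}/>"

# 一个向右移动且逐渐靠近相机(变大)的框
track = [(10 + 8 * t, 50, 20 + 2 * t, 20 + 2 * t) for t in range(5)]
tag = motion_tag(track)
```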

[CV-47] HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

【速读】:该论文旨在解决视频问答(Video Question Answering, VQA)中因帧选择策略不合理而导致的效率低下与回答质量受限的问题。现有系统多依赖于均匀采样或启发式帧选择方法,无法针对下游问答任务进行优化。其解决方案的关键在于提出一种轻量级帧选择策略HORNet,该策略基于Group Relative Policy Optimization (GRPO)训练,能够学习冻结的视觉语言模型(Vision-Language Model, VLM)在何种帧上进行推理可获得最优答案。HORNet通过选择性地保留关键帧(最多减少99%输入帧),显著降低VLM处理时间(最高达93%),同时提升短时视频问答(如MSVD-QA)和时序推理任务(如NExT-QA)的答案准确率,并展现出良好的跨分布泛化能力与模型迁移性。

链接: https://arxiv.org/abs/2603.18850
作者: Xiangyu Bai,Bishoy Galoaa,Sarah Ostadabbas
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce \textbfHORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet’s policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing \emphwhat a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at this https URL.
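HORNet 使用 GRPO 训练帧选择策略。GRPO 的核心是在同一问题的一组采样(rollout)内部做相对优势归一化,因而不需要再学习一个价值函数作为基线。核心计算可示意如下(奖励数值为虚构):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO的核心信号:把每个rollout的奖励对其所在组做z-score标准化。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# 同一问题下采样4种帧选择方案,按答案质量打分
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```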

[CV-48] Towards Interpretable Foundation Models for Retinal Fundus Images MICCAI2026

【速读】:该论文旨在解决当前基础模型(Foundation Models)在医疗影像等高风险领域中因架构不透明而导致的可解释性不足问题。其解决方案的关键在于提出Dual-IFM,一种从设计上即具备双重可解释性的基础模型:一方面通过类证据图(class evidence maps)实现对单张图像的局部可解释性,确保预测决策过程的忠实性;另一方面通过二维投影层实现对整个数据集表示空间的全局可视化,从而提升模型在跨分布数据上的可信度与实用性。

链接: https://arxiv.org/abs/2603.18846
作者: Samuel Ofosu Mensah,Maria Camila Roa Carvajal,Kerol Djoumessi,Philipp Berens
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computation (stat.CO)
备注: 11 pages, 3 figures, 2 tables, submitted to MICCAI 2026

点击查看摘要

Abstract:Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model’s representation space. We trained our model on over 800,000 color fundus photography from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to 16x the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.

[CV-49] Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging CVPR2026

【速读】:该论文旨在解决高分辨率透射电子显微镜(HRTEM)在观测成核动力学过程时,因成核事件发生在毫秒级时间尺度而需采用短曝光成像所导致的严重噪声问题,该噪声会掩盖原子位置信息,从而影响对材料微观结构的准确解析。解决方案的关键在于提出一种基于统计特征引导的去噪网络,该网络在空间域引入基于空间偏差的加权机制以自适应选择卷积操作,在频域引入基于频带特性的加权策略以增强信号并抑制噪声;同时开发了针对HRTEM图像的噪声校准方法,并构建包含无序结构和真实噪声的训练数据集,从而显著提升模型在真实图像上的去噪性能及下游定位任务的有效性。

链接: https://arxiv.org/abs/2603.18834
作者: Hesong Li,Ziqi Wu,Ruiwen Shao,Ying Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which boosts the studies of advanced solid materials. Nonetheless, due to the millisecond-scale rapid change of nucleation, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristic. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noises. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at this https URL.
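
摘要中的“空间偏差引导加权”可以用一维玩具信号示意:以局部标准差区分平滑背景与高偏差(疑似原子柱)位置,并归一化为软权重。下面的映射方式为示例性假设,并非论文实际学习得到的加权:

```python
def local_std(x, radius=1):
    """Per-position standard deviation over a sliding window."""
    out = []
    for i in range(len(x)):
        w = x[max(0, i - radius): i + radius + 1]
        m = sum(w) / len(w)
        out.append((sum((v - m) ** 2 for v in w) / len(w)) ** 0.5)
    return out

signal = [0.1, 0.1, 0.9, 0.1, 0.1]  # toy row with one bright "atom"
stds = local_std(signal)
mx = max(stds)
weights = [s / mx for s in stds]  # soft weights for routing between operations
print([round(w, 2) for w in weights])  # → [0.0, 1.0, 1.0, 1.0, 0.0]
```

平滑区域权重趋近 0、高偏差区域趋近 1,即可据此在不同卷积分支之间做软选择。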

[CV-50] VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation

【速读】:该论文旨在解决高分辨率下复杂血管、气道和神经网络等曲线解剖结构的空间图表示所带来的计算挑战问题。其核心挑战在于,这些结构的密集空间图具有高度复杂性,导致传统方法难以高效建模与学习。解决方案的关键在于提出VesselTok框架,该框架从参数化形状视角出发,将空间密集图建模为基于中心线点(centerline points)及其伪半径的潜在表示(tokens),并通过条件隐式神经表示(neural implicit representations)来编码管状结构特征,从而实现对复杂拓扑结构的鲁棒编码与泛化能力。

链接: https://arxiv.org/abs/2603.18797
作者: Chinmay Prabhakar,Bastian Wittmann,Tamaz Amiranashvili,Paul Büschl,Ezequiel de la Rosa,Julian McGinnis,Benedikt Wiestler,Bjoern Menze,Suprosanna Shit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok’s performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok’s learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.

[CV-51] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

【速读】:该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在细粒度空间定位(fine-grained spatial grounding)上的局限性,即模型缺乏显式的几何推理能力,难以准确理解图像中对象的空间位置关系。其解决方案的关键在于引入显式的感知令牌(perception tokens),包括基于SAM2的语义分割令牌和通过VQ-VAE编码的深度令牌,并将这些令牌嵌入到自回归序列中,使模型在生成答案前先输出空间信息,从而构建“空间链式思维”(spatial chain-of-thought)。此外,为稳定深度令牌生成,论文设计了复合深度令牌目标函数(marker、token和count损失)及软融合技术以实现可微重建,最终在多个基准测试上显著提升空间理解与多任务性能。

链接: https://arxiv.org/abs/2603.18795
作者: Yuchen Li,Amanmeet Garg,Shalini Chaudhuri,Rui Zhao,Garin Kessler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

[CV-52] Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

【速读】:该论文旨在解决不确定性量化(Uncertainty Quantification, UQ)中模型相关认知不确定性(Epistemic Uncertainty, EU)与数据相关偶然不确定性(Aleatoric Uncertainty, AU)之间的纠缠问题,这一现象会削弱分解的可解释性和实际应用价值。其关键解决方案在于通过系统性实证研究,涵盖多种AU-EU组合模型,并提出一种量化不确定性纠缠程度的新指标,从而评估不同方法在下游任务(如分布外检测、模糊建模和校准)中的表现;结果表明,软最大值集成(softmax ensemble)在所有任务中均表现出色,且具有较低的不确定性纠缠水平,为未来设计更解耦、可靠的UQ模型提供了方向。

链接: https://arxiv.org/abs/2603.18792
作者: Jakob Lønborg Christensen,Vedrana Andersen Dahl,Morten Rieger Hannemose,Anders Bjorholm Dahl,Christian F. Baumgartner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.
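
论文提出了量化 AU–EU 纠缠程度的新指标,其具体定义以原文为准。一个直观的可替代度量(本文假设,非论文指标)是逐像素偶然不确定性图与认知不确定性图之间的 Pearson 相关:解耦良好的模型,两张不确定性图不应简单地相互跟随:

```python
def pearson(x, y):
    """Pearson correlation between two flattened uncertainty maps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

au = [0.1, 0.4, 0.35, 0.9, 0.2]     # toy aleatoric map (flattened)
eu = [0.12, 0.38, 0.3, 0.85, 0.25]  # toy epistemic map
print(round(pearson(au, eu), 3))    # close to 1.0 => highly entangled
```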

[CV-53] Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors CVPR2026

【速读】:该论文旨在解决当前3D生成模型在利用显式几何约束方面存在的不足问题,尤其是未能有效利用从LiDAR或VGGT等方法获得的可见区域点云先验信息。现有方法主要依赖图像或文本条件进行生成,忽略了可直接获取的点云数据所蕴含的结构约束。其解决方案的关键在于提出Points-to-3D框架,该框架基于潜在空间扩散模型TRELLIS,通过将纯噪声稀疏结构初始化替换为点云先验引导的输入表示,并引入一个在TRELLIS框架内训练的结构补全网络(structure inpainting network),结合分阶段采样策略(先结构补全后边界精修),实现对全局几何形状的可控生成,同时保留输入点云中可见区域的几何特征。此设计显著提升了3D资产和场景生成的渲染质量和几何保真度。

链接: https://arxiv.org/abs/2603.18782
作者: Jiatong Xia,Zicheng Duan,Anton van den Hengel,Lingqiao Liu
机构: Australian Institute for Machine Learning, University of Adelaide, Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input point cloud. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.

[CV-54] SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGBThermal 3D Reconstruction

【速读】:该论文旨在解决基于RGB图像预训练的视觉几何Transformer模型在处理多模态传感数据(如RGB-热成像RGB-T)时性能下降的问题,特别是其在联合处理RGB与热图像时难以实现模态对齐。解决方案的关键在于提出一种名为SEAR的简单而高效的微调策略,通过在相对较小的RGB-T数据集上进行微调,使预训练模型能够有效适应多模态输入,从而显著提升3D重建和相机位姿估计的精度与模态一致性,同时保持极低的推理时间开销。

链接: https://arxiv.org/abs/2603.18774
作者: Vsevolod Skorokhodov,Chenghao Xu,Shuo Sun,Olga Fink,Malcolm Mielle
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving significant improvements over all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at this https URL.

[CV-55] ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation

【速读】:该论文旨在解决无源域适应(Source-Free Domain Adaptation, SFDA)中现有方法因过度依赖邻域预测相似性而导致的源知识遗忘和局部噪声过拟合问题。解决方案的关键在于提出一种概率校准方法 ProCal,其通过双模型协同预测机制动态校准基于邻域的预测结果:将源模型的初始预测与当前模型的在线输出相结合,有效校准邻居概率,从而在保留源模型判别信息的同时抑制局部噪声干扰,实现源知识保留与目标域适应之间的平衡。

链接: https://arxiv.org/abs/2603.18764
作者: Ying Zheng,Yiyi Zhang,Yi Wang,Lap-Pui Chau
机构: The Hong Kong Polytechnic University (香港理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model’s initial predictions with the current model’s online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: this https URL.
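
ProCal 的双模型协同校准可概括为:邻居概率取冻结源模型初始预测与当前模型在线输出的凸组合。以下为最小示意(混合权重 `lam` 为假设性超参数,非论文设定):

```python
def calibrate(source_prob, current_prob, lam=0.5):
    """Convex mix of two class-probability vectors (still sums to 1)."""
    return [lam * s + (1 - lam) * c for s, c in zip(source_prob, current_prob)]

# A noisy current prediction is pulled back toward the source model's view,
# damping local-noise overfitting while keeping source discriminability.
p = calibrate([0.7, 0.2, 0.1], [0.1, 0.8, 0.1], lam=0.5)
print([round(v, 2) for v in p])  # → [0.4, 0.5, 0.1]
```

两个合法概率向量的凸组合仍是合法概率分布,因此无需额外归一化。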

[CV-56] DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection CVPR2026

【速读】:该论文旨在解决域自适应目标检测(Domain Adaptive Object Detection, DAOD)中现有方法因CNN结构局部连接性限制而难以提取全局域不变特征的问题,以及基于Transformer的方法虽能捕捉长程依赖但计算复杂度高、难以部署的瓶颈。其解决方案的关键在于提出一种混合CNN-状态空间模型(State Space Models, SSMs)架构DA-Mamba,通过引入两种新颖模块:图像感知SSM(Image-Aware SSM, IA-SSM)嵌入骨干网络以增强图像级全局与局部对齐能力,以及对象感知SSM(Object-Aware SSM, OA-SSM)插入检测头以建模对象间的空间和语义依赖关系,从而实现实例级对齐;该设计在保持线性时间复杂度的同时有效捕获跨域的全局与局部不变特征,显著提升检测器的跨域性能。

链接: https://arxiv.org/abs/2603.18757
作者: Haochen Li,Rui Zhang,Hantao Yao,Xin Zhang,Yifan Hao,Shaohui Peng,Yongwei Zhao,Ling Li
机构: Institute of Software, CAS (中国科学院软件研究所); Institute of Computing Technology, CAS (中国科学院计算技术研究所); University of Science and Technology of China (中国科学技术大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Models (SSMs) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of State Space Models (SSMs) to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.

[CV-57] WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification

【速读】:该论文旨在解决黑箱模型解释的忠实性(faithfulness)与可理解性(plausibility)之间的权衡问题,即现有方法依赖人工标注的解释数据进行显式监督,导致生成的解释虽看似合理但未必真实反映模型决策依据。其解决方案的关键在于提出一种弱监督框架WeNLEX,通过在黑箱模型特征空间中对齐由自然语言解释生成的图像与原始图像来保证忠实性,同时利用少量临床医生标注的解释数据进行分布对齐以维持可理解性;此外,该方法可在后验(post-hoc)和内嵌(in-model)两种设置下运行,并能通过更换解释数据库灵活适配不同受众(如普通用户),实验证明其仅需每诊断5条真实解释即可生成高质量解释,且在内嵌训练时还能提升分类任务的AUC性能。

链接: https://arxiv.org/abs/2603.18752
作者: Isabel Rio-Torto,Jaime S. Cardoso,Luís F. Teixeira
机构: INESC TEC; Universidade do Porto
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model’s reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model’s feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as little as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.

[CV-58] 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

【速读】:该论文旨在解决扩散Transformer(Diffusion Transformers, DiTs)在视频生成任务中推理阶段内存占用高和计算成本大的问题。现有量化方法采用静态位宽分配策略,未能考虑不同时间步上激活值的量化敏感性差异,导致效率与质量之间难以平衡。其解决方案的关键在于提出一种推理时的NVFP4/INT8混合精度量化框架:首先基于模块输入输出差值与内部线性层量化敏感性的强线性相关性,设计轻量级预测器动态分配NVFP4至时序稳定的层以实现最大内存压缩,同时对易变层保留INT8以保障生成质量;其次利用Transformer块输入输出残差在时间维度上的高度一致性,引入时间差分缓存(Temporal Delta Cache, TDC)跳过冗余计算,从而显著降低计算开销。实验表明,该方法实现了1.92倍端到端加速和3.32倍内存减少,为视频DiTs的高效推理树立了新基准。

链接: https://arxiv.org/abs/2603.18742
作者: Rundong Su,Jintao Zhang,Zhihang Yuan,Haojie Duanmu,Jianfei Chen,Jun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block’s input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92 \times end-to-end acceleration and 3.32 \times memory reduction, setting a new baseline for efficient inference in Video DiTs.
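
摘要中的时间差分缓存(TDC)思想是:当某 Transformer 块的输入在相邻时间步几乎不变时,跳过该块的计算并复用缓存的输入-输出残差。以下为玩具实现(`block` 为占位函数,容差阈值为假设值):

```python
def block(x):
    return [v * 1.1 + 0.05 for v in x]  # toy stand-in for a transformer block

class TemporalDeltaCache:
    def __init__(self, tol=1e-3):
        self.tol = tol
        self.prev_in = None
        self.delta = None  # cached (output - input) residual
        self.skipped = 0

    def __call__(self, x):
        if self.prev_in is not None and all(
            abs(a - b) < self.tol for a, b in zip(x, self.prev_in)
        ):
            self.skipped += 1
            return [a + d for a, d in zip(x, self.delta)]  # reuse residual
        y = block(x)  # full compute; refresh the cache
        self.delta = [b - a for a, b in zip(x, y)]
        self.prev_in = list(x)
        return y

tdc = TemporalDeltaCache()
out1 = tdc([1.0, 2.0])  # full compute
out2 = tdc([1.0, 2.0])  # input unchanged across timesteps -> skipped
print(tdc.skipped)      # → 1
```

输入变化超过容差时自动回退到完整计算,因此缓存只在时间上稳定的块生效。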

[CV-59] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

【速读】:该论文旨在解决在资源受限的边缘设备上部署高性能密集预测模型(如目标检测、实例分割和姿态估计)时,因计算与内存限制导致的挑战。当前轻量级系统仍以CNN架构(如YOLO系列)为主导,而紧凑型视觉Transformer(Vision Transformer, ViT)虽经大规模预训练,却难以实现与CNN相当的精度-效率权衡。作者认为这一差距主要源于小规模ViT缺乏任务特定的表征学习能力,而非ViT本身不适用于边缘密集预测任务。解决方案的关键在于提出EdgeCrafter框架,其核心是ECDet检测模型——该模型基于蒸馏得到的紧凑骨干网络,并采用面向边缘设备的编码器-解码器设计。通过任务专用蒸馏和边缘友好结构优化,ECDet-S在COCO数据集上仅用不到1000万参数即可达到51.7 AP,ECInsSeg性能媲美RF-DETR但参数更少,ECPose-X在姿态估计上显著优于YOLO26Pose-X(74.8 AP vs 71.6 AP),证明了紧凑ViT在边缘场景下的可行性与竞争力。

链接: https://arxiv.org/abs/2603.18739
作者: Longfei Liu,Yongjie Hou,Yang Li,Qirui Wang,Youyang Sha,Yongjun Yu,Yinzhi Wang,Peizhe Ru,Xuanlong Yu,Xi Shen
机构: Intellindust AI Lab (Intellindust人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency tradeoff, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter’s reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: this https URL

[CV-60] Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

【速读】:该论文旨在解决仿真到现实(simulation-to-reality, sim2real)图像迁移中因真实世界标注数据稀缺而导致的性能瓶颈问题。现有基于扩散模型的方法依赖于非结构化提示或统计对齐,难以捕捉使图像看起来真实的结构性因素。其解决方案的关键在于提出一种神经符号零样本框架——本体引导扩散(Ontology-Guided Diffusion, OGD),将“真实性”建模为可解释的结构化知识:通过构建一个包含光照、材质等属性的本体(ontology)及其在知识图谱中的关系表示,OGD从合成图像中推断出属性激活,并利用图神经网络生成全局嵌入;同时,符号规划器基于本体属性计算一致的视觉编辑序列以缩小现实差距。该嵌入通过交叉注意力条件化预训练指令引导扩散模型,而编辑序列则转化为结构化指令提示,从而实现高效、可解释且泛化能力强的零样本sim2real图像转换。

链接: https://arxiv.org/abs/2603.18719
作者: Mohamed Youssef,Mayar Elfares,Anna-Maria Meer,Matteo Bortoletto,Andreas Bulling
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits – such as lighting and material properties – and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translations. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.

[CV-61] From ex§ to poly: Gaussian Splatting with Polynomial Kernels

【速读】:该论文旨在解决当前高斯溅射(3D Gaussian Splatting, 3DGS)中因核函数改进导致与原有数据集不兼容的问题,从而阻碍了这些优化方法的广泛应用。其关键解决方案是提出一种新的核函数,该函数通过将原始指数核替换为多项式近似结合ReLU激活函数的形式,在保持与现有数据集完全兼容的同时显著提升了计算效率。这一设计使得高斯点可以更激进地剔除(culling),在不同3DGS实现中均带来4%至15%的性能提升,且对图像质量影响可忽略。

链接: https://arxiv.org/abs/2603.18707
作者: Joerg H. Mueller,Martin Winter,Markus Steinberger
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Recent advancements in Gaussian Splatting (3DGS) have introduced various modifications to the original kernel, resulting in significant performance improvements. However, many of these kernel changes are incompatible with existing datasets optimized for the original Gaussian kernel, presenting a challenge for widespread adoption. In this work, we address this challenge by proposing an alternative kernel that maintains compatibility with existing datasets while improving computational efficiency. Specifically, we replace the original exponential kernel with a polynomial approximation combined with a ReLU function. This modification allows for more aggressive culling of Gaussians, leading to enhanced performance across different 3DGS implementations. Our results show a notable performance improvement of 4 to 15% with negligible impact on image quality. We also provide a detailed mathematical analysis of the new kernel and discuss its potential benefits for 3DGS implementations on NPU hardware.
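
论文的核心替换可以用一个标准多项式近似来示意:带 ReLU 截断的 (1 - x/n)^n 在 n→∞ 时趋于 exp(-x),且在 x ≥ n 处严格为零,从而允许精确剔除远处的高斯点。下面的具体形式仅为示例,不一定是论文采用的多项式:

```python
import math

def poly_relu_kernel(x, n=4):
    """(1 - x/n)^n with a ReLU clamp: exactly zero for x >= n,
    so splats beyond the cutoff can be culled outright."""
    t = max(0.0, 1.0 - x / n)
    return t ** n

# Compare against the original exponential falloff at a few distances:
for x in (0.0, 1.0, 5.0):
    print(x, round(math.exp(-x), 4), round(poly_relu_kernel(x), 4))
```

指数核支撑无限、只会变小而不归零;有限支撑正是摘要中“更激进剔除”的来源。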

[CV-62] Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels CVPR2026

【速读】:该论文旨在解决标准深度学习图像分割模型在拓扑结构准确性上的不足,即模型无法保证分割结果中连通分量或结构数量的正确性,从而影响分割质量及后续定量分析的可靠性。其解决方案的关键在于提出一种名为SCNP(Sparse Connected Neighborhood Penalization)的方法,通过惩罚 logits 中最差分类邻域的预测值,迫使模型优先优化像素邻域的预测,再提升自身像素的预测精度,从而有效提升拓扑一致性。该方法高效且可集成于多种语义和实例分割框架及损失函数中,适用于不同形态结构与成像模态的场景。

链接: https://arxiv.org/abs/2603.18671
作者: Juan Miguel Valverde,Dim P. Papadopoulos,Rasmus Larsen,Anders Bjorholm Dahl
机构: Technical University of Denmark (丹麦技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels’ neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at this https URL.
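
SCNP 的惩罚思路可以在一维二分类玩具上示意:计算损失前,把每个位置的真实类 logit 换成其邻域窗口内的最小值(“最差分类的邻居”),于是模型必须先抬高薄弱的邻居,才能从强的中心像素获益。以下实现为本文假设的简化版本,非论文原始代码:

```python
import math

def neighbor_penalized_nll(logit_true, logit_other):
    """Binary NLL with the true-class logit min-pooled over a 3-window."""
    loss = 0.0
    for i in range(len(logit_true)):
        z = min(logit_true[max(0, i - 1): i + 2])  # poorest neighbor's logit
        p = 1.0 / (1.0 + math.exp(logit_other[i] - z))  # 2-class softmax
        loss += -math.log(p)
    return loss / len(logit_true)

zeros = [0.0, 0.0, 0.0]
strong_center = [5.0, -2.0, 5.0]  # one weak pixel between strong neighbors
uniform = [2.0, 2.0, 2.0]
print(round(neighbor_penalized_nll(strong_center, zeros), 3))
print(round(neighbor_penalized_nll(uniform, zeros), 3))  # lower: no weak link
```

尽管 strong_center 的最大 logit 更高,均匀的 uniform 损失反而更低,体现了对拓扑连通性友好的归纳偏置。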

[CV-63] Multimodal Model for Computational Pathology:Representation Learning and Image Compression

【速读】:该论文旨在解决数字病理学中全切片成像(Whole Slide Imaging, WSI)在生成式 AI 辅助诊断应用中的四大核心挑战:高分辨率带来的计算瓶颈、专家标注数据稀缺限制监督学习性能、多模态信息融合与生物可解释性难以兼顾,以及超长视觉序列建模过程缺乏临床透明度。其解决方案的关键在于构建统一的多模态框架,通过四项关键技术推进:(1) 自监督表征学习与结构感知的 token 压缩,实现跨尺度建模;(2) 多模态数据生成与增强以缓解标注不足;(3) 参数高效适配与推理增强的少样本学习策略;(4) 多智能体协同推理机制模拟病理医生“思维链”(Chain of Thought),实现不确定性感知的证据融合。这些方法共同推动了可解释、可信且安全的 AI 辅助诊断系统的发展。

链接: https://arxiv.org/abs/2603.18660
作者: Peihang Wu,Zehong Chen,Lijian Xu
机构: Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist’s “Chain of Thought” across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

[CV-64] Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation

[Quick Read]: This paper targets the performance bottleneck in medical ultrasound image segmentation caused by limited annotated data and imaging artifacts such as speckle noise and low-contrast boundaries. Existing semi-supervised learning (SSL) methods fall short in exploiting unlabeled data efficiently and in learning robust feature representations. The key to the solution is a novel SSL framework named Switch with two core innovations: (1) a Multiscale Switch (MSS) strategy that achieves uniform spatial coverage via hierarchical patch mixing; and (2) a Frequency Domain Switch (FDS) that performs amplitude switching in the Fourier domain, combined with contrastive learning, to strengthen the robustness of feature representations. Built on a teacher-student architecture, the framework effectively fuses labeled and unlabeled data, significantly outperforms state-of-the-art methods on six diverse ultrasound datasets, and remains resource-efficient with only 1.8M parameters.

Link: https://arxiv.org/abs/2603.18655
Authors: Jingguo Qu, Xinyang Han, Yao Pu, Man-Lik Chui, Simon Takadiyi Gunda, Ziman Chen, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying
Affiliations: The Hong Kong Polytechnic University; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This is the author-submitted LaTeX version with original typesetting. The final published version (with IEEE production formatting and layout changes) is available at this http URL under CC BY 4.0 license

Abstract:Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at this https URL
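As a rough illustration of what "amplitude switching in Fourier space" can look like, here is a minimal NumPy sketch in the style of Fourier-domain amplitude mixing. This is an assumption-laden stand-in, not the paper's actual FDS module; the function name `amplitude_switch` and the blend factor `alpha` are hypothetical.

```python
import numpy as np

def amplitude_switch(img_a, img_b, alpha=1.0):
    """Blend img_b's Fourier amplitude into img_a while keeping
    img_a's phase. alpha=0 returns img_a unchanged; alpha=1 fully
    swaps in img_b's amplitude spectrum."""
    fa = np.fft.fft2(img_a)
    fb = np.fft.fft2(img_b)
    amp_a, pha_a = np.abs(fa), np.angle(fa)
    amp_b = np.abs(fb)
    amp_mixed = (1 - alpha) * amp_a + alpha * amp_b
    mixed = amp_mixed * np.exp(1j * pha_a)      # recombine amplitude + phase
    return np.real(np.fft.ifft2(mixed))
```

Because phase carries most structural content, the switched image keeps `img_a`'s layout while inheriting `img_b`'s global intensity statistics, which is why such augmentations tend to encourage robust representations.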

[CV-65] Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA WWW2026

[Quick Read]: This paper tackles inefficiencies in live streaming commerce: product promotion requires lengthy content preparation, and streamers struggle to respond to audience questions in real time. The key to the solution is Click-to-Ask, an AI-driven assistant built from complementary offline and online modules. The offline module processes multimodal product information into structured data and compliant promotional copywriting; during live broadcasts, the online module lets streamers click on viewer questions and respond in real time by combining the offline structured product information with an event-level historical memory maintained in a streaming architecture. This design markedly improves promotion efficiency and interaction quality, and the system's practicality is validated on real-world data.

Link: https://arxiv.org/abs/2603.18649
Authors: Ruizhi Yu, Keyang Zhong, Peng Liu, Qi Wu, Haoran Zhang, Yanhao Zhang, Chen Chen, Haonan Lu
Affiliations: East China Normal University; Sun Yat-sen University; OPPO AI Center; Shenzhen Institutes of Advanced Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures, Accepted at WWW2026 Demos

Abstract:Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: this https URL.

[CV-66] MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

[Quick Read]: This paper addresses a limitation of existing reference-based face restoration methods in cross-age settings: most implicitly assume the reference image and the degraded input are age-aligned, which restricts their usefulness in practical applications such as historical photo restoration. The key to the solution, a diffusion-based face restoration method named MeInTime, is to decouple the modeling of identity and age conditions. During training, identity features are injected efficiently through a newly introduced attention mechanism, and Gated Residual Fusion modules promote the integration of degraded features with identity representations. At inference, a training-free sampling strategy called Age-Aware Gradient Guidance uses an age-driven direction to iteratively steer the identity-aware denoising latent toward the target age's semantic manifold, yielding high-fidelity, age-consistent restorations.

Link: https://arxiv.org/abs/2603.18645
Authors: Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang
Affiliations: Beijing University of Posts and Telecommunications; Department of Computer Science and Technology, Institute for AI, Tsinghua University; Minzu University of China; Xiaomi Corporation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:To better preserve an individual’s identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: this https URL

[CV-67] PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance

[Quick Read]: This paper addresses the difficulty of guaranteeing physically consistent motion in video generation: existing methods improve visual fidelity but often fail to reproduce how real-world objects move in 3D space, since video observations provide only 2D projections of the underlying dynamics. The key to the solution is PhysVideo, a two-stage framework. In the first stage, Phys4View introduces physics-aware attention to capture how physical attributes influence motion dynamics, and combines geometry-enhanced cross-view attention with temporal attention to improve spatio-temporal consistency. In the second stage, VideoSyn uses the foreground videos generated by the first stage as guidance and learns the interactions between foreground dynamics and background context, enabling controllable video synthesis. The approach significantly improves the physical realism and spatio-temporal coherence of generated videos.

Link: https://arxiv.org/abs/2603.18639
Authors: Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin, Long Chen, Fei-Yue Wang, Zhibo Chen
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: this https URL.

[CV-68] Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

[Quick Read]: This paper targets the high inference cost of Diffusion Transformers (DiTs) in video generation caused by dense 3D attention, along with two limitations of existing training-free sparse attention methods: they ignore the heterogeneity of attention sparsity across layers, and they ignore the coupling between queries and keys in block partitioning, both of which limit the quality-speedup trade-off. The key to the solution is SVOO, a training-free Sparse attention framework via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. It adopts a two-stage paradigm: first, offline layer-wise sensitivity profiling determines each layer's intrinsic optimal sparsity level; second, online block-wise sparse attention is realized through a novel bidirectional co-clustering algorithm that captures the structural coupling between queries and keys. Without sacrificing generation quality, the two stages jointly raise inference efficiency, delivering up to a 1.93× speedup across mainstream video generation models while maintaining a PSNR of up to 29 dB.

Link: https://arxiv.org/abs/2603.18636
Authors: Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to 1.93× speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
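To make "block-wise sparse attention" concrete, here is a toy NumPy sketch that scores key blocks against each query block and attends only within the top-k kept blocks. This is a simplified stand-in under stated assumptions; SVOO's bidirectional co-clustering and offline profiling are considerably more elaborate, and all names here are illustrative.

```python
import numpy as np

def block_sparse_attention(Q, K, V, block=4, keep_blocks=2):
    """Block-wise sparse attention: summarize each query/key block by its
    mean vector, rank key blocks per query block by affinity, keep the
    top-k key blocks, and run softmax attention only over kept keys."""
    n, d = Q.shape
    nb = n // block
    Qb = Q.reshape(nb, block, d).mean(axis=1)   # per-block query summary
    Kb = K.reshape(nb, block, d).mean(axis=1)   # per-block key summary
    block_scores = Qb @ Kb.T / np.sqrt(d)       # (nb, nb) block affinities
    out = np.zeros_like(Q)
    for i in range(nb):
        top = np.argsort(block_scores[i])[-keep_blocks:]          # kept key blocks
        idx = np.concatenate([np.arange(j * block, (j + 1) * block) for j in top])
        q = Q[i * block:(i + 1) * block]
        s = q @ K[idx].T / np.sqrt(d)
        w = np.exp(s - s.max(axis=1, keepdims=True))              # stable softmax
        w /= w.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = w @ V[idx]
    return out
```

When `keep_blocks` equals the total number of blocks, the sketch reduces exactly to dense attention, which is a useful sanity check when experimenting with sparsity levels.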

[CV-69] SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery

[Quick Read]: This paper addresses the difficulty of fast, large-scale 3D reconstruction from multi-date satellite imagery, in particular illumination changes, sensor heterogeneity, and the high computational cost of per-scene optimization. The key to the solution, a meta-learned system named SwiftGS, is to predict geometry-radiation-decoupled Gaussian primitives together with a lightweight signed distance function (SDF) in a single forward pass, replacing expensive per-scene fitting. The method couples a differentiable physics graph that models projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and further incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss, yielding efficient, view-consistent digital surface model (DSM) reconstruction.

Link: https://arxiv.org/abs/2603.18634
Authors: Rong Fu, Jiekai Wu, Haiyun Wei, Xiaowen Ma, Shiyin Lin, Kangan Qian, Chuang Liu, Jianyuan Ni, Simon James Fong
Affiliations: University of Macau; Juntendo University; Tongji University; Zhejiang University; University of Florida; Tsinghua University; Wuhan University; Juniata College
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 6 figures

Abstract:Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.

[CV-70] GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

[Quick Read]: This paper tackles cross-domain topographic similarity retrieval: efficiently identifying terrestrial analogs on the Qinghai-Tibet Plateau that are structurally homologous to the Mariana Trench in geological origin and microbial metabolic function, so as to reduce the cost of deep-sea biological sampling. Existing models either neglect geographical knowledge or sacrifice computational efficiency. The key to the solution is a three-stage Geography-knowledge Enhanced Analog Recognition (GEAR) framework: first, skeleton-guided screening and clipping performs initial filtering of candidate valleys; second, a Topographic Waveform Comparator (TWC) and a Morphological Texture Module (MTM) filter out inconsistent candidates based on physics-aware waveform and texture features; finally, a Morphology-integrated Siamese Graph Network (MSG-Net) built on geomorphological metrics performs fine-grained recognition, improving F1-score by 1.38 percentage points over the state-of-the-art baseline. The extracted features also show a significant correlation with biological data, providing a basis for subsequent biological analysis and delineating the problem's computational landscape across these stages.

Link: https://arxiv.org/abs/2603.18626
Authors: Zelin Liu, Bocheng Li, Yuling Zhou, Xuanting Li, Yixuan Yang, Jing Wang, Weishu Zhao, Xiaofeng Gao
Affiliations: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present the **Geography-knowledge Enhanced Analog Recognition (GEAR)** Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton guided Screening and Clipping: Recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics aware Filtering: The Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph based Fine Recognition: We design a **Morphology-integrated Siamese Graph Network (MSG-Net)** based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage. Besides, MSG-Net achieved an F1-Score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.

[CV-71] GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection? ECCV2026

[Quick Read]: This paper addresses the coarse-grained evaluation of Large Vision-Language Models (LVLMs) on AI-generated video detection: existing protocols treat the task as binary classification and rely on overall accuracy, offering little fine-grained diagnostic insight. The key contribution is GenVideoLens, a fine-grained benchmark of 400 highly deceptive AI-generated videos and 100 real videos, expert-annotated across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. Dimension-wise evaluation reveals a pronounced imbalance: LVLMs perform relatively well on perceptual cues but struggle with optical consistency, physical interactions, and temporal-causal reasoning, and smaller open-source models sometimes outperform larger proprietary models on specific dimensions. These findings provide targeted guidance for improving future AI-generated video detection systems.

Link: https://arxiv.org/abs/2603.18625
Authors: Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026 submission. 14 pages, 6 figures, 4 tables. Supplementary material included

Abstract:In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

[CV-72] REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation

[Quick Read]: This paper addresses zero-shot object-goal navigation (ZSON): navigating an unknown environment to a target object without task-specific training. Existing methods invest in scene understanding (belief) and high-level decision-making (policy) but overlook the design of options, i.e., the subgoal candidates proposed from the evolving belief for the policy to choose among. The key insight is to model the option space as a tree of paths: a structured set of paths with shared segments enables coarse-to-fine chain-of-thought reasoning by an LLM, compressing the combinatorial path space and improving navigation efficiency. The insight is instantiated in REST (Receding Horizon Explorative Steiner Tree), which builds an open-vocabulary 3D map online, grows an agent-centric tree of paths as the option space via sampling-based planning, and textualizes each branch into a spatial narrative so the LLM can select the best path. On the Gibson, HM3D, and HSSD benchmarks, REST achieves a high success rate together with excellent path efficiency.

Link: https://arxiv.org/abs/2603.18624
Authors: Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong
Affiliations: University of Macau
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (*belief*) and high-level decision-making (*policy*), yet overlook the design of *option*, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a *tree of paths*. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in **REST** (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.

[CV-73] OpenT2M: No-frill Motion Generation with Open-source Large-scale High-quality Data

[Quick Read]: This paper addresses the poor generalization of text-to-motion (T2M) models to unseen text descriptions, which stems from the small scale and limited diversity of existing motion datasets. The key to the solution is OpenT2M, a million-level, high-quality, open-source motion dataset with over 2800 hours of human motion, quality-controlled through physical feasibility validation and multi-granularity filtering and annotated with second-wise text descriptions. On top of it, the authors propose MonoFrill, a pretrained motion model whose core is the 2D-PRQ motion tokenizer, which divides the human body into biologically meaningful parts to capture spatio-temporal dependencies, achieving strong zero-shot generalization without complicated designs.

Link: https://arxiv.org/abs/2603.18623
Authors: Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu
Affiliations: CASIA; UCAS; BAAI; RUC; PKU; BeingBeyond
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as “frills”. Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

[CV-74] Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

[Quick Read]: This paper addresses accurate multi-organ segmentation of abdominal CT images to support computer-aided diagnosis and treatment, where the core challenge is achieving strong performance on small-to-medium-sized, heterogeneous (multi-source) datasets. The key to the solution is a systematic comparison of three hybrid transformer architectures (UNETR, SwinUNETR, and UNETR++) against a strong CNN baseline, SegResNet, under identical preprocessing and training conditions. The results show that, despite the transformers' ability to model long-range dependencies, a well-optimized CNN architecture still achieves the best overall performance at this data scale and diversity; UNETR++ comes closest to the CNN baseline, while UNETR converges fastest. This indicates that classic CNNs remain highly competitive in such settings and that the trade-off between model complexity and data characteristics must be weighed carefully.

Link: https://arxiv.org/abs/2603.18616
Authors: Lukas Bayer, Sheethal Bhat, Andreas Maier
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.

[CV-75] Improving Joint Audio-Video Generation with Cross-Modal Context Learning

[Quick Read]: This paper addresses several issues in dual-stream-transformer-based joint audio-video generation: model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multimodal background regions introduced by cross-modal attention, inconsistencies of multimodal classifier-free guidance (CFG) between training and inference, and conflicts among multiple conditions. The key to the solution is Cross-Modal Context Learning (CCL), built from several carefully designed modules: Temporally Aligned RoPE and Partitioning (TARP) strengthens temporal alignment between audio and video latents; Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information and route dynamically across training tasks, accelerating convergence and improving generation quality; at inference, Unconditional Context Guidance (UCG) exploits the unconditional support provided by LCT to unify different forms of CFG, improving train-inference consistency and mitigating multi-condition conflicts.

Link: https://arxiv.org/abs/2603.18600
Authors: Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge, Yi Zhang, Guanglu Song, Yu Liu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model’s convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.
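For context, the standard classifier-free guidance combination whose train-inference inconsistencies the paper targets can be sketched in a few lines (the function name is illustrative; the paper's UCG variant is not reproduced here):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.
    scale=0 recovers the unconditional prediction, scale=1 the
    conditional one, and scale>1 amplifies the condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With two modalities and multiple conditions, several such extrapolations must be combined at inference even though training only ever drops conditions one way, which is the mismatch UCG's stable unconditional anchors are meant to ease.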

[CV-76] SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation CVPR2026

[Quick Read]: This paper addresses the low draft-token acceptance rate of Speculative Jacobi Decoding (SJD) in autoregressive text-to-image synthesis, caused by the high entropy of visual generation, which bottlenecks inference throughput. The key to the solution is the SJD-PAC framework: a proactive drafting strategy raises local acceptance rates in complex, high-entropy regions, while an adaptive continuation mechanism keeps validating the sequence after an initial rejection instead of falling back to full resampling. Working in tandem, the two optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution.

Link: https://arxiv.org/abs/2603.18599
Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Affiliations: Peking University; Huawei Technologies
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a 3.8× speedup with lossless image quality.
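To see why acceptance length drives the speedup, here is a minimal sketch of the greedy-verification intuition behind speculative acceptance: accept the longest drafted prefix that matches the target model, and correct the first mismatch. This is illustrative only; SJD's probabilistic accept/reject rule and the paper's proactive drafting and adaptive continuation are not reproduced.

```python
def accepted_prefix(draft_tokens, target_argmax):
    """Greedy speculative verification: accept drafted tokens while they
    match the target model's argmax at the same position; replace the
    first mismatch with the target's token, then stop (remaining drafts
    would have been conditioned on a wrong prefix)."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # free correction at the first mismatch
            break
    return accepted
```

Each verification step emits `len(accepted)` tokens for one target-model pass, so raising the match rate in high-entropy regions (as proactive drafting aims to) directly increases throughput.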

[CV-77] Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

[Quick Read]: This paper addresses the vulnerability of pre-trained vision-language models such as CLIP to adversarial examples in zero-shot settings: small perturbations easily induce misclassification, and existing methods struggle to preserve clean-sample performance while improving robustness. The key to the solution is a Text-Guided Attention (TGA) mechanism with two core modules: a Local Attention Refinement Module that sharpens attention on important visual regions, and a Global Attention Constraint Module that extracts consistent text-guided attention between the target and original models from clean examples, maintaining clean performance while improving robustness. To further improve the accuracy of attention focusing, the authors propose Complementary Text-Guided Attention (Comp-TGA), which fuses class-prompt-guided attention with reversed attention driven by non-class prompts, modeling foreground features more comprehensively and accurately and improving zero-shot robust accuracy by 11.95% over the previous state of the art.

Link: https://arxiv.org/abs/2603.18598
Authors: Lu Yu, Haiyang Zhang, Changsheng Xu
Affiliations: Tianjin University of Technology; Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to TPAMI 2026. arXiv admin note: substantial text overlap with arXiv:2410.21802

Abstract:Due to the impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.

[CV-78] Elastic Weight Consolidation Done Right for Continual Learning CVPR2026

[Quick Read]: This paper addresses fundamental biases in the importance estimation of Elastic Weight Consolidation (EWC) and its variants, a classic family of weight regularization methods for continual learning (CL): (1) gradient vanishing in the Fisher Information Matrix (FIM) leads to inaccurate importance estimates, and (2) variants such as Memory Aware Synapses (MAS) impose redundant protection on parameters irrelevant to prior tasks, creating unnecessary constraints. The key to the solution is a Logits Reversal (LR) operation: reversing the logit values when computing the FIM effectively mitigates gradient vanishing and eliminates redundant protection, thereby rectifying EWC's importance estimation. Experiments show the method significantly outperforms the original EWC and its variants, hence the name EWC Done Right (EWC-DR).

Link: https://arxiv.org/abs/2603.18596
Authors: Xuan Liu, Xiaobin Chang
Affiliations: Sun Yat-sen University; Ministry of Culture and Tourism; Ministry of Education
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients. However, it has consistently shown suboptimal performance. In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. For the first time, we find that EWC’s reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios. Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, termed the redundant protection. Consequently, both EWC and its variants exhibit fundamental misalignments in estimating weight importance, leading to inferior performance. To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies EWC’s importance estimation. Specifically, reversing the logit values during the calculation of FIM can effectively prevent both gradient vanishing and redundant protection. Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms existing EWC and its variants. Therefore, we refer to it as EWC Done Right (EWC-DR).
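The baseline EWC machinery the paper revisits can be sketched as a toy NumPy version: a diagonal Fisher estimate from squared per-sample gradients, plus the quadratic penalty on deviation from the old task's weights. The paper's Logits Reversal fix is not shown; function names are illustrative.

```python
import numpy as np

def diag_fisher(grads):
    """Diagonal Fisher approximation: mean of squared per-sample
    gradients of the log-likelihood w.r.t. each parameter."""
    g = np.asarray(grads)            # shape (num_samples, num_params)
    return (g ** 2).mean(axis=0)

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """EWC regularizer: (lam/2) * sum_i F_i * (theta_i - theta_old_i)^2,
    added to the new task's loss to protect important weights."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)
```

The failure mode the paper analyzes follows directly from `diag_fisher`: wherever gradients vanish (e.g., saturated softmax outputs), the squared-gradient estimate collapses to zero and genuinely important weights go unprotected.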

[CV-79] AU Codes Language and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

【Quick Read】: This paper addresses the loss of nonverbal nuance caused by current facial behavior synthesis methods relying on coarse emotion categories, and the anatomically implausible artifacts and unnatural motion superpositions that existing Action Unit (AU)-based methods produce for conflicting AUs (AUs that activate the same facial muscle with opposing actions). The key is a new representation that describes AUs in natural language, preserving the expressiveness of the AU framework while explicitly modeling complex and conflicting AU combinations, and unlocking modern text-to-image models for high-fidelity facial synthesis. The work also contributes BP4D-AUText, the first large-scale text-image paired dataset for this direction, and VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text.

Link: https://arxiv.org/abs/2603.18588
Authors: Jiahe Wang, Cong Liang, Xuandong Huang, Yuxin Wang, Xin Yun, Yi Wu, Yanan Chang, Shangfei Wang
Affiliations: University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs–defined as those which activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

[CV-80] Color image restoration based on nonlocal saturation-value similarity

【Quick Read】: This paper addresses the coarse description of color information in traditional nonlocal methods for color image restoration, which compute patch similarity from the grayscale values of each color channel independently. The key is a new nonlocal regularization based on saturation-value similarity: patch similarity in the saturation and value channels of HSV color space is incorporated into the nonlocal gradients, more accurately capturing the saturation and value consistency between adjacent color patches. New nonlocal variational models are built on this construction, and an efficient numerical algorithm based on Bregmanized operator splitting is designed, yielding a marked improvement in restoration quality.

Link: https://arxiv.org/abs/2603.18586
Authors: Wei Wang, Yakun Li
Affiliations: Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted from the red, green and blue channels of a color image directly, and the color information cannot be described finely because patch similarity is mainly based on the grayscale value of each independent channel. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in the saturation-value channels of a color image. In particular, we first establish saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing the Bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of other testing methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.
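The core idea of measuring patch similarity on the saturation and value channels rather than per RGB channel can be sketched as a nonlocal-means-style weight. The snippet below is a toy illustration under stated assumptions (direct S/V computation from RGB, Gaussian weighting with a hypothetical bandwidth `h`); it is not the paper's variational model.

```python
import numpy as np

def sv_channels(rgb):
    """Saturation and value channels of an RGB image in [0, 1]."""
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    v = mx
    s = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-12), 0.0)
    return s, v

def sv_patch_weight(rgb, p, q, r=1, h=0.1):
    """Nonlocal weight between patches centered at p and q,
    measured jointly on the saturation and value channels."""
    s, v = sv_channels(rgb)
    def patch(ch, c):
        i, j = c
        return ch[i - r:i + r + 1, j - r:j + r + 1]
    d2 = np.mean((patch(s, p) - patch(s, q)) ** 2
                 + (patch(v, p) - patch(v, q)) ** 2)
    return np.exp(-d2 / h ** 2)

img = np.zeros((8, 8, 3))
img[:, :4] = [1.0, 0.2, 0.2]   # saturated red region
img[:, 4:] = [0.6, 0.6, 0.6]   # gray region (zero saturation)
w_same = sv_patch_weight(img, (3, 1), (5, 1))   # both inside red region
w_diff = sv_patch_weight(img, (3, 1), (3, 6))   # red vs gray
```

Two patches with identical saturation and value get weight 1, while patches differing in saturation (red vs gray) get a near-zero weight even if their intensities are comparable.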

[CV-81] HAViT: Historical Attention Vision Transformer

【Quick Read】: This paper addresses the limited information flow and inefficient feature learning caused by attention mechanisms operating independently across the encoder layers of a Vision Transformer (ViT). The key is an effective cross-layer attention propagation method that preserves and blends historical attention matrices, structurally refining inter-layer information flow and enabling progressive refinement of attention patterns throughout the transformer hierarchy. The method only adds attention-matrix storage and blending operations, with no significant architectural changes; experiments show accuracy gains of 1.33% on CIFAR-100 and 1.25% on TinyImageNet, with the gains generalizing across transformer variants.

Link: https://arxiv.org/abs/2603.18585
Authors: Swarnendu Banik, Manish Das, Shiv Ram Dubey, Satish Kumar Singh
Affiliations: Indian Institute of Information Technology, Allahabad
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at this https URL.
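The storage-and-blending mechanism, with the reported optimum alpha = 0.45, can be sketched as a convex combination of the current layer's attention matrix and the stored historical one. The sketch below assumes single-head dot-product attention; `blended_attention` and the layer loop are hypothetical simplifications, not the paper's exact implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def blended_attention(q, k, hist, alpha=0.45):
    """Blend the current layer's attention with stored history:
    A = alpha * hist + (1 - alpha) * A_cur. `hist` is None at layer 0."""
    d = q.shape[-1]
    a_cur = softmax(q @ k.T / np.sqrt(d))
    a = a_cur if hist is None else alpha * hist + (1 - alpha) * a_cur
    return a, a   # blended attention doubles as the next layer's history

rng = np.random.default_rng(0)
n, d = 5, 8
hist = None
for _ in range(3):                     # three stacked encoder layers
    q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    attn, hist = blended_attention(q, k, hist)
```

Because the blend is a convex combination of row-stochastic matrices, each blended attention row still sums to 1, so the result can drop in wherever an ordinary attention matrix is expected.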

[CV-82] CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention CVPR2026

【Quick Read】: This paper addresses causal confusion in planning-oriented end-to-end driving models, which learn statistical correlations rather than true causal relationships and therefore exploit dataset biases as shortcuts in complex scenarios, compromising reliability and safety. The key is the CausalVAD framework, whose core is a lightweight, plug-and-play Sparse Causal Intervention Scheme (SCIS) that instantiates backdoor adjustment theory in neural networks: SCIS builds a dictionary of prototypes representing latent driving contexts and uses it to intervene on the model's sparse vectorized queries, actively eliminating spurious associations induced by confounders and removing spurious factors from the representations, improving robustness and accuracy on downstream tasks.

Link: https://arxiv.org/abs/2603.18561
Authors: Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, Jian Pu
Affiliations: Fudan University; Embodiq Robotics Co., Ltd.; Beijing Institute of Technology; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model’s sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby eliminating spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.

[CV-83] HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

【Quick Read】: This paper addresses frame selection for long-video question answering under models' limited context windows, where existing methods trade off sharply between efficiency and accuracy: similarity-based selection is fast but collapses compositional queries into a single vector, losing sub-event ordering and cross-modal bindings, while agent-based methods recover structure at prohibitive cost through repeated LVLM calls. The key is HiMu, a training-free framework: a single text-only LLM call parses the query into a hierarchical logic tree whose leaves are atomic predicates, each handled by a lightweight expert module (vision: CLIP, open-vocabulary detection, OCR; audio: ASR, CLAP); the multimodal signals are normalized and temporally smoothed, then composed bottom-up with fuzzy-logic operators that enforce temporal constraints and adjacency, producing a continuous satisfaction curve. At only 16 frames HiMu surpasses all competing selectors while requiring roughly 10x fewer FLOPs than agentic systems.

Link: https://arxiv.org/abs/2603.18558
Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
Affiliations: INSIGHT Lab, Ben-Gurion University of the Negev, Israel; Ben-Gurion University of the Negev, Israel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
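The fuzzy-logic composition of per-frame predicate scores can be illustrated with standard min/max operators, a simple moving average for temporal smoothing, and a cumulative-max "before" operator for sequencing. These particular operator choices, the toy expert scores, and the two predicates are assumptions for illustration; the paper's exact operators are not specified here.

```python
import numpy as np

def fuzzy_and(a, b):
    return np.minimum(a, b)

def fuzzy_or(a, b):
    return np.maximum(a, b)

def before(a, b):
    """Satisfaction of 'a happens, then b': b is gated by whether a
    has already peaked at or before each frame (cumulative max)."""
    return np.minimum(np.maximum.accumulate(a), b)

def smooth(x, k=3):
    """Moving average to temporally align noisy expert signals."""
    return np.convolve(x, np.ones(k) / k, mode="same")

# Toy per-frame scores from two hypothetical experts over 8 frames.
dog_visible = np.array([0.9, 0.8, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0])
door_opens  = np.array([0.0, 0.1, 0.0, 0.2, 0.9, 0.8, 0.1, 0.0])

# Satisfaction curve for "the dog is visible, then the door opens".
curve = before(smooth(dog_visible), smooth(door_opens))
best_frame = int(np.argmax(curve))
```

The curve peaks where the second sub-event occurs after the first has been observed; selecting top-scoring frames from such curves is how a tree of predicates could steer frame selection.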

[CV-84] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

【Quick Read】: This paper addresses the insufficient reliability of medical vision-language models (MVLMs) in real clinical workflows: existing robustness evaluations mostly assume clean inputs or isolated image corruptions, overlooking routine imaging-pipeline operations (acquisition, reconstruction, display, and delivery) that preserve clinical readability while shifting image statistics and can still degrade model performance. The key is CoDA, a chain-of-distribution framework that composes acquisition-like shading, reconstruction and display remapping, and delivery and export degradations into clinically plausible distribution shifts, jointly optimizing stage compositions and parameters under masked structural-similarity constraints to induce model failures while remaining visually plausible. A post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment further improves accuracy on CoDA-shifted samples, characterizing a clinically grounded threat surface for MVLM deployment and showing that lightweight alignment improves robustness in practice.

Link: https://arxiv.org/abs/2603.18545
Authors: Xiang Chen, Fangfang Yang, Chunlei Meng, Chengyin Hu, Ang Li, Yiwei Wei, Jiahuan Long, Jiujiang Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Medical vision–language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.
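The chain-of-distribution idea of composing pipeline stages under a similarity budget can be sketched as a sequence of degradation functions plus an acceptance check. Everything here is a stand-in: the three stages are simplified caricatures of the paper's acquisition/display/delivery operations, and the mean-absolute-shift gate replaces the masked structural-similarity constraint.

```python
import numpy as np

def shading(img, strength=0.15):
    """Acquisition-like multiplicative shading field (vertical gradient)."""
    h = img.shape[0]
    field = 1.0 + strength * np.linspace(-1, 1, h)[:, None]
    return np.clip(img * field, 0, 1)

def gamma_remap(img, gamma=1.3):
    """Display-style tone remapping."""
    return img ** gamma

def quantize(img, levels=32):
    """Delivery/export-style bit-depth reduction."""
    return np.round(img * (levels - 1)) / (levels - 1)

def apply_chain(img, stages):
    for stage in stages:
        img = stage(img)
    return img

def within_budget(clean, shifted, tol=0.2):
    """Stand-in for the masked structural-similarity constraint:
    accept the chain only if the mean absolute shift stays small."""
    return float(np.abs(clean - shifted).mean()) <= tol

rng = np.random.default_rng(0)
clean = rng.uniform(0.2, 0.8, size=(16, 16))
shifted = apply_chain(clean, [shading, gamma_remap, quantize])
plausible = within_budget(clean, shifted)
```

CoDA's optimization would search over stage orderings and parameters to maximize model failure among chains that pass such a plausibility gate.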

[CV-85] Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection CVPR2026

【Quick Read】: This paper addresses a "target-domain Astigmatism" problem in cross-domain few-shot object detection (CD-FSOD): under domain shift and scarce annotations, models exhibit dispersed, unfocused attention in the target domain, failing to concentrate on semantic objects and producing imprecise localization and redundant predictions. The key is a center-periphery attention refinement framework inspired by the fovea-style human visual system, with three cooperating modules: (1) Positive Pattern Refinement reshapes the attention distribution toward semantic objects using class-specific prototypes, simulating the visual center; (2) Negative Context Modulation models background context to enhance boundary discrimination, simulating the visual periphery; (3) Textual Semantic Alignment strengthens the center-periphery distinction through cross-modal cues. The approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains.

Link: https://arxiv.org/abs/2603.18541
Authors: Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li
Affiliations: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning’s inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

[CV-86] 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

【Quick Read】: This paper addresses the limited 3D awareness of existing subject-driven video generation methods, which mostly treat subjects as 2D entities and transfer identity through single-view visual features or text prompts; lacking geometric priors for real-world 3D objects, they cannot maintain spatial consistency across novel views and must invent plausible but arbitrary content for unseen regions rather than faithfully reproducing the 3D structure. The key is a new 3D-aware video customization framework in which 3DreamBooth and 3Dapter work together: a 1-frame optimization paradigm decouples spatial geometry from temporal motion, updating only spatial representations to bake in a robust 3D prior while avoiding the temporal overfitting of conventional video training; the 3Dapter module acts as a visual conditioning mechanism that, after single-view pre-training, undergoes multi-view joint optimization with the main generation branch via an asymmetric conditioning strategy, serving as a dynamic selective router that queries view-specific geometric hints from a minimal reference set, markedly improving fine-grained textures and accelerating convergence.

Link: https://arxiv.org/abs/2603.18524
Authors: Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Code: this https URL

Click to view abstract

Abstract:Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: this https URL

[CV-87] Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

【Quick Read】: This paper addresses the unclear reasoning mechanism and limited performance of large vision-language models (LVLMs) on visual counting, aiming to expose their internal counting logic through interpretability analysis and to improve overall visual reasoning. The key is twofold: first, two new interpretability methods, Visual Activation Patching and HeadLens, are introduced to identify a shared "counting circuit" in LVLMs; second, a lightweight intervention strategy fine-tunes pretrained LVLMs exclusively on counting, using only abundant synthetic images. Experiments show this targeted enhancement not only improves counting accuracy on in-distribution synthetic data but also yields average gains of +8.36% on out-of-distribution counting benchmarks and +1.54% on complex general visual reasoning tasks, suggesting counting is a core component of visual reasoning that can be reinforced to lift overall performance.

Link: https://arxiv.org/abs/2603.18523
Authors: Liwei Che, Zhiyu Xue, Yihao Quan, Benlin Liu, Zeru Shi, Michelle Hurst, Jacob Feldman, Ruixiang Tang, Ranjay Krishna, Vladimir Pavlovic
Affiliations: Rutgers University; UC Santa Barbara; University of Washington; Allen Institute for AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Counting serves as a simple but powerful test of a Large Vision-Language Model’s (LVLM’s) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured “counting circuit” that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.
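Activation patching, the technique the paper builds its Visual Activation Patching on, can be shown on a toy network: run a clean and a corrupted input, overwrite one hidden unit in the corrupted run with its clean activation, and measure how much the output moves. The two-layer network and all names below are illustrative assumptions, not the paper's LVLM setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 2))

def forward(x, patch=None):
    """Two-layer toy network; optionally overwrite one hidden unit
    with a cached activation (activation patching)."""
    h = np.tanh(x @ W1)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value
    return h @ W2

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)
h_clean = np.tanh(x_clean @ W1)

y_corrupt = forward(x_corrupt)

# Causal effect of each hidden unit: output shift when that unit
# alone carries the clean activation in the corrupted run.
effects = [np.linalg.norm(forward(x_corrupt, patch=(i, h_clean[i])) - y_corrupt)
           for i in range(6)]
most_causal = int(np.argmax(effects))
```

Units with large patching effects are candidate members of a circuit; repeating this over visual tokens and attention heads is the general recipe behind circuit-finding methods.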

[CV-88] CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution

【Quick Read】: This paper addresses the prohibitive computational cost of generative super-resolution (SR) for gigapixel whole-slide images (WSIs) in digital pathology, which prevents routine clinical deployment. The key is CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. Core designs include: flow matching in pixel-unshuffled space, reducing spatial computation 16x while enabling direct inference; dedicating half of training to exact t=0 samples, which proves essential for single-step quality; and a lightweight exit classifier (~6K parameters) achieving 33% compute savings at only 0.12 dB cost. The method reaches 31.72 dB PSNR on multi-organ x4 histopathology SR, outperforms all comparable-compute baselines at x8, and preserves clinically relevant structure.

Link: https://arxiv.org/abs/2603.18513
Authors: Elad Yoshai, Ariel D. Yoshai, Natan T. Shaked
Affiliations: Tel Aviv University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.
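Two mechanisms from the abstract are easy to sketch: pixel unshuffling (space-to-depth with factor 4, which shrinks the spatial area by 16x) and shallowest-exit routing. The routing scores and threshold below are hypothetical; the paper uses a learned exit classifier rather than given quality scores.

```python
import numpy as np

def pixel_unshuffle(x, r=4):
    """Space-to-depth: (H, W, C) -> (H/r, W/r, C*r*r).
    Spatial area shrinks by r*r = 16 for r = 4."""
    h, w, c = x.shape
    x = x.reshape(h // r, r, w // r, r, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // r, w // r, c * r * r)

def pixel_shuffle(x, r=4):
    """Inverse operation: depth-to-space."""
    h, w, c = x.shape
    x = x.reshape(h, w, r, r, c // (r * r))
    return x.transpose(0, 2, 1, 3, 4).reshape(h * r, w * r, c // (r * r))

def route_to_exit(exit_scores, threshold=0.8):
    """Pick the shallowest exit whose predicted quality clears the
    threshold; fall back to the deepest exit otherwise."""
    for depth, score in enumerate(exit_scores):
        if score >= threshold:
            return depth
    return len(exit_scores) - 1

rng = np.random.default_rng(0)
tile = rng.uniform(size=(32, 32, 3))
packed = pixel_unshuffle(tile)       # network operates at 1/16 the area
restored = pixel_shuffle(packed)     # lossless round trip

easy_tile_exit = route_to_exit([0.85, 0.92, 0.97, 0.99])  # exits early
hard_tile_exit = route_to_exit([0.40, 0.61, 0.75, 0.79])  # needs full depth
```

Easy (e.g. background) tiles exit at shallow depth while hard tiles traverse the full network, which is what makes whole-slide inference cost adaptive per tile.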

[CV-89] OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting CVPR2026

【速读】:该论文旨在解决**在线全景映射(online panoptic mapping)**中缺乏实例级理解与实时性难以兼顾的问题,尤其针对具身应用(embodied applications)在动态环境中进行开放词汇场景理解的需求。现有方法多为离线处理或无法提供细粒度的实例语义信息,限制了其在真实机器人任务中的适用性。解决方案的关键在于提出 OnlinePG 系统,其核心创新是结合 3D 高斯点绘(3D Gaussian Splatting)实现几何重建与开放词汇感知的一体化在线融合:首先采用滑动窗口机制构建局部一致性地图,通过联合几何与语义线索的 3D 分段聚类图对局部不一致片段进行整合;随后利用带空间属性的显式网格结构,通过鲁棒的双向二分图匹配将局部高斯实例融合至全局地图;最终借助融合后的视觉语言模型(VLM)特征实现开放词汇场景理解,从而在保证实时性能的同时显著提升在线全景映射的准确性与实用性。

链接: https://arxiv.org/abs/2603.18510
作者: Hongjia Zhai,Qi Zhang,Xiaokun Pan,Xiyu Zhang,Yitong Dong,Huaqi Zhang,Dan Xu,Guofeng Zhang
机构: Zhejiang University (浙江大学); VIVO BlueImage Lab; HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.
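The bidirectional bipartite instance matching can be illustrated with a mutual-best-match rule on 3D overlap: a local and a global instance are fused only if each is the other's best match above a threshold. The axis-aligned boxes and IoU criterion below are simplifying assumptions standing in for the paper's Gaussian-instance matching.

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (min_xyz, max_xyz)."""
    lo = np.maximum(a[0], b[0])
    hi = np.minimum(a[1], b[1])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(box[1] - box[0])
    return inter / (vol(a) + vol(b) - inter)

def mutual_best_match(local_boxes, global_boxes, thresh=0.3):
    """Bidirectional bipartite matching: keep pairs that are each
    other's best match and exceed the IoU threshold."""
    iou = np.array([[iou_3d(l, g) for g in global_boxes] for l in local_boxes])
    matches = []
    for i in range(len(local_boxes)):
        j = int(np.argmax(iou[i]))
        if iou[i, j] >= thresh and int(np.argmax(iou[:, j])) == i:
            matches.append((i, j))
    return matches

box = lambda lo, hi: (np.array(lo, float), np.array(hi, float))
local_insts = [box([0, 0, 0], [1, 1, 1]), box([5, 5, 5], [6, 6, 6])]
global_insts = [box([0.1, 0, 0], [1.1, 1, 1]), box([9, 9, 9], [10, 10, 10])]
pairs = mutual_best_match(local_insts, global_insts)
```

The mutual-best requirement is what makes the matching "bidirectional": one-sided best matches with no overlap (the second local instance here) stay unfused and become new global instances.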

[CV-90] Foundations and Architectures of Artificial Intelligence for Motor Insurance

【速读】:该论文旨在解决机动车保险领域中风险评估与理赔处理流程自动化不足的问题,特别是在大规模实际部署场景下如何实现高效、可靠的人工智能系统。其核心解决方案在于构建一个垂直整合的AI范式,将感知、多模态推理与生产基础设施统一为一个连贯的智能栈;关键创新在于设计了面向特定领域的Transformer架构,用于结构化视觉理解、车辆关系表征学习及多模态文档智能,从而实现从车辆损伤分析到理赔评估与核保流程的端到端自动化,并在泰国全国范围内的机动车保险系统实践中验证了该方案的可扩展性与实用性。

链接: https://arxiv.org/abs/2603.18508
作者: Teerapong Panboonyuen
机构: MARS (Motor AI Recognition Solution); Thaivivat Insurance Public Company Limited
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 173 pages

点击查看摘要

Abstract:This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.

[CV-91] From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions

【Quick Read】: This review addresses the paradigm shift of the protein folding problem from static structure prediction toward modeling dynamic conformational ensembles and complex biomolecular interactions. The key lies in five interconnected dimensions: unified multimodal representations integrating sequences, geometry, and textual knowledge; MSA-free architectures improving static prediction; generative frameworks (such as diffusion models and flow matching) capturing conformational diversity consistent with thermodynamic ensembles; heterogeneous interaction prediction spanning protein-ligand, protein-nucleic acid, and protein-protein complexes; and functional inference covering fitness landscapes, mutational effects, and text-guided property prediction. Together these advances mark AI's transition from a structural analysis tool toward a physically consistent universal simulator of the dynamic language of life.

Link: https://arxiv.org/abs/2603.18505
Authors: Jingzhi Chen, Lijian Xu
Affiliations: Shenzhen University of Advanced Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures

Click to view abstract

Abstract:The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA free architectures and all atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein ligand, protein nucleic acid, and protein protein complexes; and functional inference of fitness landscapes, mutational effects, and text guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed loop systems. This methodological transformation marks artificial intelligence’s transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.

[CV-92] HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection

【Quick Read】: This paper addresses automated property risk detection, a high-impact but underexplored computer vision problem whose core challenge is accurately identifying diverse property risks (structural damage, maintenance neglect, and liability hazards) against complex backgrounds, in support of real-estate assessment, underwriting, and insurance operations. The key is HOMEY (Heuristic Object Masking with Enhanced YOLO), which introduces domain-specific heuristic object masking to amplify weak signals in cluttered backgrounds and a risk-aware loss calibration that balances class imbalance against risk-severity weighting, substantially improving detection accuracy and reliability while retaining fast inference.

Link: https://arxiv.org/abs/2603.18502
Authors: Teerapong Panboonyuen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages

Click to view abstract

Abstract:Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.
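One plausible form of "risk-aware loss calibration to balance class skew and severity weighting" is a per-class weight combining inverse class frequency with a domain-assigned severity score. The formula, class counts, and severity values below are all hypothetical; the paper does not spell out its exact calibration.

```python
import numpy as np

def risk_aware_weights(class_counts, severity):
    """Per-class loss weights balancing class skew (inverse frequency)
    against risk severity, normalized to mean 1."""
    counts = np.asarray(class_counts, float)
    sev = np.asarray(severity, float)
    w = sev / np.maximum(counts / counts.sum(), 1e-12)
    return w / w.mean()

def weighted_ce(probs, labels, weights):
    """Cross-entropy with risk-aware per-class weights."""
    p = probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * -np.log(np.clip(p, 1e-12, 1.0))))

# Hypothetical 3-class example: 'cracked foundation' is rare but severe.
counts = [1000, 100, 10]            # observed class frequencies
severity = [1.0, 2.0, 5.0]          # domain-assigned risk severity
w = risk_aware_weights(counts, severity)

probs = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
loss = weighted_ce(probs, np.array([0, 1, 2]), w)
```

Rare, severe classes end up with much larger weights than common, benign ones, so errors on them dominate the training signal.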

[CV-93] Efficient Video Diffusion with Sparse Information Transmission for Video Compression

【Quick Read】: This paper addresses poor perceptual quality and temporal inconsistency in video compression at ultra-low bitrates: traditional end-to-end compression models produce blurry images at extremely low rates, while existing generative methods typically process frames independently and struggle with temporal coherence and efficiency. The key is Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), whose core innovations are: (1) a Sparse Temporal Encoding Module (STEM) that compresses the original frame sequence into an information-rich intermediate sequence, sharply reducing bitrate; (2) a one-step video diffusion framework with a Frame Type Embedder (FTE) that processes the intermediate sequence as a whole, exploiting inter-frame temporal correlation and adaptively reconstructing each frame by type, improving perceptual quality and temporal consistency while remaining efficient.

Link: https://arxiv.org/abs/2603.18501
Authors: Mingde Zhou, Zheng Chen, Yulun Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at this https URL.

[CV-94] NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data

【速读】:该论文旨在解决现有第一人称视角(egocentric)数据集在真实场景下多模态信息不完整、标注粒度不足以及缺乏统一基准的问题,从而限制了具身人工智能(embodied AI)研究的进展。解决方案的关键在于对Nymeria数据集进行升级,推出NymeriaPlus,其核心改进包括:(1)提升人体运动表示精度,提供Momentum Human Rig(MHR)和SMPL格式的优化人体姿态数据;(2)新增室内物体与结构元素的密集3D和2D边界框标注;(3)实现实例级3D物体重建;(4)引入基础地图(basemap)、音频及腕戴视频等新模态数据。通过整合这些互补模态与细粒度标注,NymeriaPlus构建了一个更完整、一致的野外第一人称视角基准,显著增强了对多模态学习与具身智能研究的支持能力。

链接: https://arxiv.org/abs/2603.18496
作者: Daniel DeTone,Federica Bogo,Eric-Tuan Le,Duncan Frost,Julian Straub,Yawar Siddiqui,Yuting Ye,Jakob Engel,Richard Newcombe,Lingni Ma
机构: Meta(Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.

[CV-95] FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction

【速读】:该论文旨在解决流式三维重建(Streaming 3D Reconstruction)中潜在状态更新规则的不稳定性问题:过于激进的覆盖策略会遗忘有用的历史信息,而保守的更新策略则无法有效追踪新证据,导致在训练范围之外出现性能退化。解决方案的关键在于提出一种无需训练的潜变量滤波层(FILT3R),其将递归状态更新建模为 token 空间中的随机状态估计问题,通过为每个 token 维护方差并计算类似卡尔曼增益(Kalman-style gain)的自适应权重,在记忆保留与新观测之间动态平衡。此外,过程噪声(Process Noise)通过候选 token 的指数移动平均(EMA)归一化时间漂移在线估计,从而实现对场景变化的敏感响应,显著提升深度、位姿及三维重建在长时程下的稳定性。

链接: https://arxiv.org/abs/2603.18493
作者: Seonghyun Jin,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise – governing how much the latent state is expected to change between frames – is estimated online from EMA-normalized temporal drift of candidate tokens. Using extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction, compared to the existing methods. Code will be released at this https URL.
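下面给出一个极简的数值示意(非论文官方实现;函数名、默认参数与过程噪声的具体形式均为笔者假设),演示摘要中描述的 Kalman 风格逐 token 更新:由 EMA 归一化的时间漂移在线估计过程噪声,再按增益在记忆保留与新观测之间自适应折中:

```python
import numpy as np

def filt3r_update(state, obs, P, R=1.0, ema_alpha=0.9, drift_ema=None):
    """Kalman 风格的逐 token 潜变量更新(示意)。
    state, obs: (N, D) token 矩阵;P: (N,) 每 token 方差;R: 观测噪声(假设为常数)。"""
    drift = np.linalg.norm(obs - state, axis=-1)          # 候选 token 的时间漂移
    if drift_ema is None:
        drift_ema = drift
    drift_ema = ema_alpha * drift_ema + (1 - ema_alpha) * drift
    Q = (drift / (drift_ema + 1e-6)) ** 2                 # EMA 归一化漂移 → 过程噪声
    P_pred = P + Q                                        # 方差预测
    K = P_pred / (P_pred + R)                             # 逐 token 的 Kalman 增益
    new_state = state + K[:, None] * (obs - state)        # 记忆保留与新观测的折中
    new_P = (1.0 - K) * P_pred                            # 方差随证据累积而收缩
    return new_state, new_P, drift_ema
```

稳定阶段漂移接近其 EMA、过程噪声近似常数且方差收缩、增益变小;场景突变时漂移超过 EMA、增益上升,与摘要中"真实场景变化提高过程不确定性"的行为一致。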

[CV-96] TexEditor: Structure-Preserving Text-Driven Texture Editing

【速读】:该论文旨在解决文本引导的纹理编辑(text-guided texture editing)中结构一致性难以保持的问题,即现有先进模型在仅修改物体外观时,常导致几何结构失真。其解决方案的关键在于从数据和训练两个维度协同增强结构保真能力:首先构建高质量的SFT数据集TexBlender,利用Blender生成具有强结构先验的合成数据以实现冷启动;其次提出基于强化学习(Reinforcement Learning, RL)的StructureNFT方法,通过引入结构保持损失函数将SFT阶段学到的结构先验迁移至真实场景。这一双管齐下的策略显著提升了纹理编辑过程中几何结构的稳定性与真实性。

链接: https://arxiv.org/abs/2603.18488
作者: Bo Zhao,Yihang Liu,Chenfeng Zhang,Huan Yang,Kun Gai,Wei Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19pages

点击查看摘要

Abstract:Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, an RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization. Our code and data are available at this https URL.

[CV-97] T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

【速读】:该论文旨在解决开放世界学习中分布外(Out-of-distribution, OOD)检测的两大挑战:一是现有基于视觉-语言模型(Vision-Language Models, VLMs)的方法依赖固定融合规则,难以适应随时间演变的数据分布(即时间漂移,temporal drift);二是这些方法对协变量偏移(covariate shift)输入缺乏鲁棒性。其解决方案的关键在于提出一种两步框架——Temporal Quadruple-Pattern Matching (T-QPM),通过引入跨模态一致性模式(ID与OOD信号间的图像-文本联合推理)来优化决策边界,并设计轻量级融合权重以动态结合语义匹配与视觉典型性,从而应对时间漂移;同时,通过基于平均阈值置信度(Average Thresholded Confidence, ATC)的显式正则化机制保障模型在分布演化过程中的稳定性。

链接: https://arxiv.org/abs/2603.18481
作者: Aditi Naiknaware,Salimeh Sekeh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) They rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally-consistent framework for multimodal OOD detection in non-stationary environments.

[CV-98] Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理离散符号(discrete symbols)——如数学公式、化学结构和语言字符等人类认知基本单元——时能力不足的问题。这类符号不同于连续视觉数据,需要更精确和深层的语义理解。论文的关键解决方案是构建了一个涵盖语言、文化、数学、物理和化学五个领域的综合性基准测试(benchmark),系统评估顶级MLLMs在“离散语义空间”中的表现。通过该基准,研究发现模型常在基础符号识别上失败,却能在复杂推理任务中取得成功,揭示其依赖语言概率而非真正的视觉感知,从而指出了当前AI在符号理解上的“认知错位”(cognitive mismatch)问题,并为开发更符合人类认知逻辑的智能系统提供了方向。

链接: https://arxiv.org/abs/2603.18472
作者: Yinghui Li,Jiayi Kuang,Peng Xing,Daixian Liu,Junnan Dong,Shu-Yu Guo,Yangning Li,Qingyu Zhou,Wenhao Jiang,Hai-Tao Zheng,Ying Shen,Liang Lin,Philip S. Yu
机构: Tsinghua University(清华大学); Sun Yat-sen University(中山大学); The Hong Kong Polytechnic University(香港理工大学); Guangdong Laboratory of AI and Digital Economy (SZ)(广东省人工智能与数字经济发展实验室(深圳)); University of Illinois Chicago(伊利诺伊大学芝加哥分校)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols – the fundamental building blocks of human cognition – remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these “discrete semantic spaces” across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this “cognitive mismatch”, we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.

[CV-99] Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion

【速读】:该论文旨在解决图像生成中颜色控制精度不足的问题,尤其是在细粒度和局部颜色编辑场景下,现有扩散模型难以准确实现用户指定的色调(hue)变化。其关键解决方案是提出了一种统一的扩散框架ColourCrafter,通过在潜在空间中进行RGB颜色token与图像token的逐标记融合(token-level fusion),实现颜色信息的选择性传播至语义相关区域,同时保持结构保真度;此外,引入基于感知空间(Lab空间)的损失函数,分离亮度(luminance)与色度(chrominance)并约束编辑范围于掩码区域内,从而显著提升像素级颜色精度与可控性。

链接: https://arxiv.org/abs/2603.18466
作者: Yuqi Yang,Dongliang Chang,Yijia Ling,Ruoyi Du,Zhanyu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 12 figures

点击查看摘要

Abstract:Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic variations. To overcome this limitation, we propose ColourCrafter, a unified diffusion framework that transforms colour editing from global tone transfer into a structured, region-aware generation process. Unlike traditional colour-driven methods, ColourCrafter performs token-level fusion of RGB colour tokens and image tokens in latent space, selectively propagating colour information to semantically relevant regions while preserving structural fidelity. A perceptual Lab-space Loss further enhances pixel-level precision by decoupling luminance and chrominance and constraining edits within masked areas. Additionally, we build ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse colour variations. Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art colour accuracy, controllability and perceptual fidelity in fine-grained colour editing. Our project is available at this https URL.

[CV-100] MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling

【速读】:该论文旨在解决现有医学图像恢复(Med-IR)方法在临床实践中泛化能力不足的问题,即当前方法通常局限于特定成像模态或退化类型,难以应对实际中多样且异构的图像退化情况。其核心挑战在于Med-IR与医学图像质量评估(Med-IQA)的割裂,导致恢复模型缺乏对图像质量的显式理解,从而无法有效适应不同模态和退化类型的复杂场景。解决方案的关键在于提出MedQ-UNI——一个统一的视觉-语言模型,采用“先评估后恢复”的范式,通过结构化的自然语言描述显式融合Med-IQA以指导跨模态、跨退化类型的图像恢复。该模型基于多模态自回归双专家架构(共享注意力机制),其中质量评估专家首先识别退化问题,恢复专家则基于这些描述进行针对性修复,从而显著提升恢复精度与可解释性。

链接: https://arxiv.org/abs/2603.18465
作者: Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.

[CV-101] Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images CVPR2026

【速读】:该论文旨在解决从病理图像中估计基因表达谱时忽略细胞层面信息的问题,现有方法将基因表达视为单纯的切片或局部区域信号,未能体现其源于底层细胞类型表达的聚合特性。解决方案的关键在于提出一种细胞类型原型引导的神经网络(Cell-type Prototype-informed Neural Network, CPNN),该模型利用公开的单细胞RNA测序数据提取稳定的细胞类型原型(即反映基因间共变关系的平均表达谱),并直接从组织图像中学习细胞组成权重,从而建立原型与观测到的批量或空间基因表达之间的映射关系,实现生物机制驱动且结构正则化的预测框架。

链接: https://arxiv.org/abs/2603.18461
作者: Kazuya Nishimura,Ryoma Bise,Shinnosuke Matsuo,Haruka Hirose,Yasuhiro Kojima
机构: The University of Osaka; Kyushu University; National Cancer Center Japan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes (mean expression profiles that reflect stable gene-gene co-variation). CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. Code is publicly available at this https URL.
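CPNN 的核心映射"预测表达 = 细胞组成权重 × 细胞类型原型"可写成如下最小示意(线性头、softmax 归一化以及各维度命名均为笔者假设,仅用于说明结构):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cpnn_predict(img_feat, W, prototypes):
    """img_feat: (B, F) 图像特征;W: (F, C) 组成权重预测头(假设为线性);
    prototypes: (C, G) 每个细胞类型的平均表达谱(来自 scRNA-seq)。"""
    weights = softmax(img_feat @ W)       # (B, C) 细胞类型组成,每行和为 1
    expr = weights @ prototypes           # (B, G) 聚合得到 bulk/spot 级表达
    return expr, weights
```

由于预测被约束为原型的凸组合,权重本身即可解释"哪些细胞类型驱动了预测表达",这正是摘要所述可解释性的来源。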

[CV-102] Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images

【速读】:该论文旨在解决前列腺癌诊断中基于T2加权磁共振成像(T2-weighted MRI)的病灶识别难题,尤其是因病变细微且异质性高导致的人工判读困难问题。其解决方案的关键在于构建一个可解释的自动检测框架,利用小样本数据(162例)通过迁移学习和数据增强技术提升模型性能,并对比了多种算法(包括卷积神经网络ResNet18、视觉Transformer ViT/Swin、以及传统手工特征方法HOG+SVM),发现迁移学习的ResNet18在仅1100万参数下达到最优性能(准确率90.9%,敏感度95.2%,AUC 0.905),同时验证了手工特征提取方法在小数据场景下的有效性,显著优于复杂度更高的Vision Transformers。该方法仅依赖T2加权图像即可实现媲美双参数MRI(T2+DWI)的性能,降低了扫描复杂性和计算成本,具有临床辅助筛查潜力。

链接: https://arxiv.org/abs/2603.18460
作者: Vahid Monfared,Mohammad Hadi Gharib,Ali Sabri,Maryam Shahali,Farid Rashidi,Amit Mehta,Reza Rawassizadeh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 26 pages, 5 figures, 7 tables

点击查看摘要

Abstract:Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.

[CV-103] Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching

【速读】:该论文旨在解决视频大语言模型(Video-LLMs)在体育教学任务中因注意力分配到无关帧而导致的时间定位不准确问题。其核心挑战在于,获取帧级别的监督信号既昂贵又不可靠,而现有方法难以在不依赖额外标注的情况下提升时间对齐精度。解决方案的关键在于利用相关任务(如生成与验证)必须关注相同关键帧的特性,通过设计一种基于视觉注意力图的自一致性目标函数,在无需额外标注的前提下强制不同任务间共享一致的注意力分布,从而有效优化时间定位能力。实验表明,该方法在 VidDiffBench 基准上显著优于监督微调基线,提升了 3.0% 至 14.1% 的准确率,并超越部分闭源模型。

链接: https://arxiv.org/abs/2603.18453
作者: Arushi Rai,Adriana Kovashka
机构: University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.

[CV-104] SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

【速读】:该论文旨在解决零样本物体目标导航(zero-shot object-goal navigation)中因视角不佳或语义线索弱导致基础模型在感知与规划阶段推理不可靠的问题,从而引发导航效率低下甚至失败。其解决方案的关键在于引入空间关系感知的导航框架(Spatial Relation-aware Navigation, SR-Nav),通过构建动态空间关系图(Dynamic Spatial Relationship Graph, DSRG)来编码目标中心的空间关系,并结合关系感知匹配模块和动态关系规划模块,分别增强视觉感知鲁棒性和降低规划搜索空间,从而提升导航成功率与效率。

链接: https://arxiv.org/abs/2603.18443
作者: Leyuan Fang,Zan Mao,Zijing Wang,Yinlong Yan
机构: Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models’ comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at this https URL

[CV-105] AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

【速读】:该论文旨在解决长时GUI代理(Long-horizon GUI agents)在现实世界部署中因交互记忆缺失而导致的性能下降问题。现有方法中,完整交互序列重放冗余且放大噪声,而摘要则常丢失关键依赖信息与可追溯性。其解决方案的核心是提出锚定状态记忆(Anchored State Memory, ASM),通过将交互序列表示为一组因果关联的中间状态锚点(intermediate-state anchors),实现子目标导向的检索与归因感知决策,从而有效缓解长时交互中的记忆瓶颈。实验表明,ASM在多个GUI代理上显著优于全序列重放和摘要基线,任务完成率(TCR)提升5%-30.16%,平均步长成功率(AMS)提升4.93%-24.66%。

链接: https://arxiv.org/abs/2603.18429
作者: Yibo Shi,Jungang Li,Linghao Zhang,Zihao Dongfang,Biao Wu,Sicheng Tao,Yibo Yan,Chenxi Qin,Weiting Liu,Zhixin Lin,Hanqian Li,Yu Huang,Song Dai,Yonghua Hei,Yue Ding,Xiang Li,Shikang Wang,Chengdong Xu,Jingqi Liu,Xueying Ma,Zhiwen Zheng,Xiaofei Zhang,Bincheng Wang,Nichen Yang,Jie Wu,Lihua Tian,Chen Li,Xuming Hu
机构: XJTU; HKUST(GZ); HKUST; CityU; UTS; TJU; FDU; SDU; CASIA; SYSU; NWPU
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at this https URL.

[CV-106] RD: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

【速读】:该论文旨在解决像素级语义分割任务中数据收集与标注成本高昂的问题,以及传统数据增强技术难以生成新结构、而现有生成式模型又难以保持生成图像与原始标签一致性的问题。其解决方案的关键在于提出一种融合可控扩散模型(controllable diffusion models)的合成数据增强流程,通过引入类别感知提示(class-aware prompting)和视觉先验融合(visual prior blending)策略,在保证图像质量的同时确保生成内容与分割标签的精确对齐,从而在多样性与可靠性之间取得平衡,有效缩小合成数据与真实数据之间的差距。

链接: https://arxiv.org/abs/2603.18427
作者: Huy Che,Dinh-Duy Phan,Duc-Khai Lam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances diversity and reliability data, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at this https URL.

[CV-107] SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning ICLR2025

【速读】:该论文旨在解决零样本量化(Zero-shot Quantization, ZSQ)中因缺乏训练数据而导致的模型精度下降问题,尤其针对现有方法面临的三大挑战:合成数据集中的噪声、基于非目标模式的预测以及错误硬标签带来的误导。其解决方案的关键在于提出SynQ框架,通过引入低通滤波器最小化生成样本的噪声,利用类激活图(Class Activation Map, CAM)对齐策略提升量化模型的准确性,并仅对困难样本采用软标签以缓解预训练模型错误的误导,从而实现优于现有ZSQ方法的性能表现。

链接: https://arxiv.org/abs/2603.18423
作者: Minjun Kim,Jongjin Kim,U Kang
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2025

点击查看摘要

Abstract:How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and practical scenario where training data are inaccessible for privacy or security reasons. However, three significant challenges hinder the performance of existing ZSQ methods: 1) noise in the synthetic dataset, 2) predictions based on off-target patterns, and 3) misguidance by erroneous hard labels. In this paper, we propose SynQ (Synthesis-aware Fine-tuning for Zero-shot Quantization), a carefully designed ZSQ framework to overcome the limitations of existing methods. SynQ minimizes the noise from the generated samples by exploiting a low-pass filter. Then, SynQ trains the quantized model to improve accuracy by aligning its class activation map with the pre-trained model. Furthermore, SynQ mitigates misguidance from the pre-trained model’s error by leveraging only soft labels for difficult samples. Extensive experiments show that SynQ provides state-of-the-art accuracy, outperforming existing ZSQ methods.
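其中"仅对困难样本使用软标签"这一步可粗略示意如下(置信阈值、蒸馏温度等超参数均为笔者假设;低通滤波与 CAM 对齐不在此示意范围内):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def synq_loss(student_logits, teacher_logits, conf_thresh=0.9, tau=4.0):
    """示意:困难样本(教师置信低)只用软标签蒸馏,避免错误硬标签误导。"""
    t_soft = softmax(teacher_logits / tau)
    s_logp = np.log(softmax(student_logits / tau) + 1e-12)
    kl = np.sum(t_soft * (np.log(t_soft + 1e-12) - s_logp), axis=-1)  # 软标签项
    conf = softmax(teacher_logits).max(axis=-1)
    hard = teacher_logits.argmax(axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(hard)), hard] + 1e-12)
    easy = conf >= conf_thresh                     # 仅对高置信(易)样本加硬标签项
    return np.mean(kl + np.where(easy, ce, 0.0))
```

当教师对某样本的最大置信低于阈值时,该样本的硬标签大概率不可靠,此时只保留 KL 软标签项即可抑制错误监督。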

[CV-108] Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?

【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, LVLMs)在皮肤病学领域中对罕见疾病诊断推理能力评估不足的问题。现有基准测试主要关注常见疾病的最终诊断准确率,忽略了临床推理过程,而这一过程对于复杂病例至关重要。解决方案的关键在于构建一个名为DermCase的长上下文基准数据集,该数据集源自同行评审的病例报告,包含26,030个多模态图像-文本对和6,354个具有临床挑战性的病例,每个病例均配有详尽的临床信息及逐步推理链。同时,研究提出基于DermLIP的相似性度量方法,显著提升了模型评估结果与皮肤科医生判断的一致性,从而实现了对LVLMs在诊断准确性、鉴别诊断能力和临床推理质量等方面的系统化评测。

链接: https://arxiv.org/abs/2603.18418
作者: Yang Liu,Jiyao Yang,Hongjin Zhao,Xiaoyong Li,Yanzhe Ji,Xingjian Li,Runmin Jiang,Tianyang Wang,Saeed Anwar,Dongwoo Kim,Yue Yao,Zhenyue Qin,Min Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models’ reasoning capabilities.

[CV-109] Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning

【速读】:该论文旨在解决动态4D高斯溅射(4D Gaussian Splatting, 4DGS)中实例分解(instance-decomposed)方法研究不足的问题,特别是多视角视频间实例标签不一致导致的关联困难。其关键解决方案是引入每视频标签排列潜变量(per-video label-permutation latents),通过可微分Sinkhorn层学习跨视频实例匹配,实现一致的身份保持与直接多视角监督;同时提出基于实例的运动骨架(instance-decomposed motion scaffolds),为每个物体提供低维运动基底以优化长时轨迹,从而在保证身份稳定性和决策边界清晰度的同时提升效率。

链接: https://arxiv.org/abs/2603.18402
作者: Yonghan Lee,Dinesh Manocha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
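摘要中的可微 Sinkhorn 层,其核心是对跨视频实例相似度矩阵做交替行列归一化,得到近似双随机的软排列矩阵。下面是一个对数域的最小示意(温度与迭代次数为笔者假设;实际训练需在自动微分框架中实现该步):

```python
import numpy as np

def sinkhorn(scores, n_iters=100, tau=0.3):
    """把 (N, N) 相似度矩阵归一化为近似双随机矩阵(软实例匹配,示意)。"""
    log_P = scores / tau
    for _ in range(n_iters):
        log_P = log_P - np.logaddexp.reduce(log_P, axis=1, keepdims=True)  # 行归一化
        log_P = log_P - np.logaddexp.reduce(log_P, axis=0, keepdims=True)  # 列归一化
    return np.exp(log_P)
```

温度越低、迭代越多,结果越接近硬排列;可微性使标签排列潜变量能与渲染损失一起端到端优化。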

[CV-110] Pixel-Accurate Epipolar Guided Matching

【速读】:该论文旨在解决关键点匹配(keypoint matching)在挑战性场景下(如重复纹理或远基线视角)速度慢且不可靠的问题。现有基于几何约束(如基础矩阵)的匹配方法虽能通过限制对应关系到狭窄的极线带(epipolar envelope)来提升鲁棒性,但大多依赖粗粒度的空间分箱(coarse spatial binning),导致近似误差、昂贵的后处理以及漏检有效匹配。其解决方案的关键在于提出一种精确的角空间(angular space)表述:将每个关键点映射为一个容差圆(tolerance circle),从极点观察时定义出对应的角区间;匹配过程转化为一维角区间查询问题,利用区间树(segment tree)实现对数时间复杂度的高效求解,从而保证像素级精度、支持逐关键点控制,并消除冗余描述子比较。

链接: https://arxiv.org/abs/2603.18401
作者: Oleksii Nasypanyi,Francois Rameau
机构: Stony Brook University (石溪大学); SUNY Korea (纽约州立大学韩国分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Keypoint matching can be slow and unreliable in challenging conditions such as repetitive textures or wide-baseline views. In such cases, known geometric relations (e.g., the fundamental matrix) can be used to restrict potential correspondences to a narrow epipolar envelope, thereby reducing the search space and improving robustness. These epipolar-guided matching approaches have proved effective in tasks such as SfM; however, most rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. We address these limitations with an exact formulation that performs candidate selection directly in angular space. In our approach, each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching then becomes a 1D angular interval query, solved efficiently in logarithmic time with a segment tree. This guarantees pixel-level tolerance, supports per-keypoint control, and removes unnecessary descriptor comparisons. Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets.
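
The angular formulation can be sketched in a few lines: each keypoint's tolerance circle, seen from the epipole, subtends an interval of angles, and candidate selection becomes an interval stabbing query. The brute-force query below stands in for the paper's logarithmic-time segment tree, and angle wrap-around at ±π is ignored for brevity.

```python
import math

def angular_interval(point, epipole, radius):
    """Angular interval subtended, from the epipole, by a tolerance circle
    of the given radius centered on the keypoint."""
    dx, dy = point[0] - epipole[0], point[1] - epipole[1]
    theta = math.atan2(dy, dx)
    dist = math.hypot(dx, dy)
    half = math.asin(min(1.0, radius / dist))   # half-angle of the circle
    return theta - half, theta + half

def stab(query_theta, intervals):
    """Indices of intervals containing query_theta. Brute force for clarity;
    the paper answers the same query in O(log n) with a segment tree."""
    return [i for i, (lo, hi) in enumerate(intervals) if lo <= query_theta <= hi]

points = [(10.0, 0.0), (0.0, 10.0)]
ivals = [angular_interval(p, (0.0, 0.0), radius=1.0) for p in points]
```

A query angle of 0.05 rad falls inside the first keypoint's interval only, so only that descriptor would be compared.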

[CV-111] To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在正确回答时是否真正依赖视觉信息,还是利用语言捷径(language shortcuts)的问题。其核心挑战在于区分模型输出的准确性是源于对视觉内容的准确感知,还是由于对指令或语境的过度拟合。解决方案的关键在于提出三层次诊断框架(Tri-Layer Diagnostic Framework),通过三个量化指标:潜在异常检测(Latent Anomaly Detection,衡量感知意识)、视觉必要性分数(Visual Necessity Score,基于KL散度测量视觉依赖性)以及竞争分数(Competition Score,评估视觉锚定与指令遵循之间的冲突),结合反事实干预(盲视、噪声和冲突图像)对7种VLM进行系统分析。该方法揭示了69.6%的样本存在“视觉谄媚”(Visual Sycophancy)现象——即模型虽能检测到视觉异常却仍选择生成符合用户预期的幻觉内容,且无样本表现出鲁棒拒绝(Robust Refusal),说明对齐训练已系统性抑制了模型对不确定性的诚实表达。进一步地,诊断得分可用于后验选择性预测策略,在不增加训练成本的前提下实现最高达+9.5个百分点的准确率提升(覆盖率为50%)。

链接: https://arxiv.org/abs/2603.18373
作者: Rui Hong,Shuxue Quan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figures

点击查看摘要

Abstract:When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy–models detect visual anomalies but hallucinate to satisfy user expectations–while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.
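
The Visual Necessity Score is described as a KL divergence between the model's answer distribution with and without the image. A minimal sketch, with made-up answer distributions (a real evaluation would read these off the model's output probabilities):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete answer distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical next-answer distributions over ("yes", "no", "unsure"):
with_image = [0.85, 0.10, 0.05]   # model sees the real image
blind      = [0.40, 0.45, 0.15]   # image replaced by a blank
visual_necessity = kl_divergence(with_image, blind)
```

A large divergence means the image genuinely moved the prediction; a near-zero score flags a language shortcut, since blinding the model barely changed its answer.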

[CV-112] Epistemic Generative Adversarial Networks

【速读】:该论文旨在解决生成式对抗网络(Generative Adversarial Networks, GANs)在实际应用中普遍存在的输出多样性不足问题,即模型倾向于生成相似样本而非具有丰富变体的多样化结果。其解决方案的关键在于:首先,基于Dempster-Shafer证据理论对GAN的损失函数进行泛化,将不确定性建模引入生成器和判别器的优化过程;其次,提出一种生成器架构改进,使其能够为每个图像像素预测一个质量函数(mass function),从而量化输出中的不确定性,并利用该不确定性信息驱动生成更具多样性和代表性的样本。这一方法不仅提升了生成多样性,还为生成过程中的不确定性提供了可解释的建模框架。

链接: https://arxiv.org/abs/2603.18348
作者: Muhammad Mubashar,Fabio Cuzzolin
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Generative models, particularly Generative Adversarial Networks (GANs), often suffer from a lack of output diversity, frequently generating similar samples rather than a wide range of variations. This paper introduces a novel generalization of the GAN loss function based on Dempster-Shafer theory of evidence, applied to both the generator and discriminator. Additionally, we propose an architectural enhancement to the generator that enables it to predict a mass function for each image pixel. This modification allows the model to quantify uncertainty in its outputs and leverage this uncertainty to produce more diverse and representative generations. Experimental evidence shows that our approach not only improves generation variability but also provides a principled framework for modeling and interpreting uncertainty in generative processes.
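
One plausible reading of the per-pixel mass prediction, sketched with a three-way head over {real}, {fake} and the ignorance set Theta = {real, fake}; the exact head design is not specified in the abstract, so treat the shape below as an assumption:

```python
import math

def pixel_mass(logits):
    """Normalize three logits into a Dempster-Shafer mass function over
    {real}, {fake} and Theta = {real, fake}; mass on Theta is the pixel's
    declared epistemic uncertainty. (Illustrative head shape.)"""
    exps = [math.exp(v) for v in logits]
    z = sum(exps)
    return {"real": exps[0] / z, "fake": exps[1] / z, "Theta": exps[2] / z}

m = pixel_mass([2.0, 0.5, 1.0])
belief_real = m["real"]                        # Bel({real})
plausibility_real = m["real"] + m["Theta"]     # Pl({real}) = Bel({real}) + m(Theta)
```

The gap between plausibility and belief is the uncertainty mass the generator can then exploit to diversify its outputs.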

[CV-113] VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection

【速读】:该论文旨在解决胶囊内镜(capsule endoscopy)事件检测中的关键挑战:诊断相关发现稀疏、视觉异质性强,且嵌套在长而嘈杂的视频流中,同时评估标准以事件级别而非帧级准确率为主。为应对这一问题,作者提出将任务建模为与指标对齐的事件检测问题,而非单纯的帧级分类任务。其解决方案的关键在于构建一个融合双骨干网络(EndoFM-LV用于局部时序上下文建模,DINOv3 ViT-L/16用于强帧级视觉语义提取)的多阶段框架,结合多样化的头部集成、验证引导的分层融合机制以及解剖结构感知的时序事件解码策略,其中融合阶段引入基于验证集的类别权重、骨干权重和概率校准,解码阶段则通过时序平滑、解剖约束、阈值优化及标签级事件生成实现稳定事件预测。实验表明,上述设计显著提升了事件级别的性能,最终在官方隐藏测试集上达到0.3530的temporal mAP@0.5和0.3235的temporal mAP@0.95。

链接: https://arxiv.org/abs/2603.18343
作者: Bo-Cheng Qiu,Yu-Fan Lin,Yu-Zhe Pien,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
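
The temporal decoding stage (smoothing, thresholding, per-label event generation) can be illustrated with a stripped-down version that omits the anatomical constraints and calibration; thresholds and window sizes below are arbitrary:

```python
def decode_events(probs, threshold=0.5, window=3, min_len=2):
    """Moving-average smoothing of per-frame probabilities, then thresholding
    into (start, end) event intervals. A simplified stand-in for the paper's
    anatomy-aware temporal event decoding."""
    n = len(probs)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        smoothed.append(sum(probs[lo:hi]) / (hi - lo))
    events, start = [], None
    for i, p in enumerate(smoothed):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i - 1))
            start = None
    if start is not None and n - start >= min_len:
        events.append((start, n - 1))
    return events

probs = [0.1, 0.2, 0.9, 0.95, 0.85, 0.2, 0.1, 0.7, 0.1]
events = decode_events(probs)  # -> [(2, 4)]
```

Note how the isolated spike at frame 7 is smoothed away while the sustained run of high probabilities becomes a single event, which is what event-level metrics like temporal mAP reward.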

[CV-114] DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving

【速读】:该论文旨在解决自主车辆在复杂交通场景中实现安全决策的难题,特别是传统强化学习(Reinforcement Learning, RL)方法因依赖人工设计奖励或稀疏碰撞信号而难以捕捉驾驶情境的语义信息,导致现实环境中不可避免地出现不安全探索的问题。解决方案的关键在于提出DriveVLM-RL框架,其核心创新是基于神经科学启发的双路径架构:静态路径利用CLIP模型对比语言目标进行连续空间安全性评估,动态路径则通过轻量级检测器与大模型视觉语言模型(Vision-Language Model, VLM)结合,实现注意力门控的多帧语义风险推理;同时采用分层奖励合成机制融合语义信号与车辆状态,并通过异步训练流程将高延迟的VLM推理从环境交互中解耦,确保部署阶段无额外计算负担,从而在保证实时性的同时显著提升安全性和泛化能力。

链接: https://arxiv.org/abs/2603.18315
作者: Zilin Huang,Zihao Sheng,Zhengyang Wan,Yansong Qu,Junwei You,Sicong Jiang,Sikai Chen
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Purdue University (普渡大学); McGill University (麦吉尔大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 15 figures. Code and demo available online

点击查看摘要

Abstract:Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: this https URL
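
The static pathway's contrastive reward can be sketched as a difference of cosine similarities between a frame embedding and two language goals. Real embeddings would come from CLIP; the toy vectors and goal names below are illustrative only.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_reward(frame_emb, safe_goal_emb, unsafe_goal_emb):
    """Contrastive semantic reward: how much closer the current frame's
    embedding sits to a 'drive safely' language goal than to an unsafe one."""
    return cosine(frame_emb, safe_goal_emb) - cosine(frame_emb, unsafe_goal_emb)

frame = [0.9, 0.2]                       # toy frame embedding
safe, unsafe = [1.0, 0.0], [0.0, 1.0]    # toy goal embeddings
reward = semantic_reward(frame, safe, unsafe)
```

A positive reward indicates the observed scene reads as safer than unsafe in the shared embedding space; because all of this runs offline during training, no CLIP inference is needed at deployment.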

[CV-115] Unrolled Reconstruction with Integrated Super-Resolution for Accelerated 3D LGE MRI

【速读】:该论文旨在解决加速3D延迟钆增强磁共振成像(Late Gadolinium Enhancement MRI, LGE-MRI)中因欠采样k空间数据导致的薄心房结构恢复困难问题。现有基于模型的迭代重建方法虽能融合物理驱动的数据一致性与学习先验,但通常在采集分辨率下运行,难以充分恢复高频细节。其解决方案的关键在于提出一种混合式可展开重建框架:在优化循环的每一步中,用增强型深度超分辨率(Enhanced Deep Super-Resolution, EDSR)网络替代传统近端算子(proximal operator),从而实现超分辨率增强与数据一致性约束的联合优化。该方法在回顾性欠采样动物实验数据上端到端训练,并在峰值信噪比(PSNR)和结构相似性(SSIM)指标上优于压缩感知、基于模型的深度学习(MoDL)及自引导深度图像先验(DIP)等基线方法,且更有效地保留了精细心脏结构,提升了左心房(Left Atrium, LA)分割性能。

链接: https://arxiv.org/abs/2603.18309
作者: Md Hasibul Husain Hisham,Shireen Elhabian,Ganesh Adluru,Jason Mendes,Andrew Arai,Eugene Kholmovski,Ravi Ranjan,Edward DiBella
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accelerated 3D late gadolinium enhancement (LGE) MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. While unrolled model-based networks effectively integrate physics-driven data consistency with learned priors, they operate at the acquired resolution and may fail to fully recover high-frequency detail. We propose a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. The model is trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets and compared against compressed sensing, Model-Based Deep Learning (MoDL), and self-guided Deep Image Prior (DIP) baselines. Across acceleration factors, the proposed method consistently improves PSNR and SSIM over standard unrolled reconstruction and better preserves fine cardiac structures, leading to improved LA (left atrium) segmentation performance. These results demonstrate that integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI.
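
The unrolled scheme alternates a physics-driven data-consistency step with a learned proximal step. The 1D toy below keeps that structure but replaces the EDSR super-resolution network with simple neighbor smoothing, purely to show the loop; it is not the paper's implementation.

```python
def unrolled_recon(y, mask, iters=30, step=1.0, smooth=0.2):
    """Toy 1D unrolled reconstruction: a data-consistency gradient step on
    the sampled entries, then a 'proximal' prior step. Neighbor smoothing
    stands in for the learned network the paper plugs in here."""
    x = [yi if m else 0.0 for yi, m in zip(y, mask)]
    n = len(x)
    for _ in range(iters):
        # data consistency: pull sampled entries back toward measurements
        x = [xi - step * (xi - yi) if m else xi
             for xi, yi, m in zip(x, y, mask)]
        # prior / proximal step (placeholder for the learned network)
        x = [(1 - smooth) * x[i]
             + smooth * (x[max(0, i - 1)] + x[min(n - 1, i + 1)]) / 2
             for i in range(n)]
    return x

measurements = [1.0, 1.0, 1.0, 1.0, 1.0]
sampled = [True, False, True, False, True]   # every other sample acquired
recon = unrolled_recon(measurements, sampled)
```

The unsampled entries are filled in by the prior step while the sampled ones stay pinned to the measurements, which is exactly the division of labor in model-based deep reconstruction.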

[CV-116] Fast and Generalizable NeRF Architecture Selection for Satellite Scene Reconstruction

【速读】:该论文旨在解决神经辐射场(Neural Radiance Fields, NeRF)在卫星遥感图像中部署时面临的两大挑战:一是每个场景都需要独立训练,效率低下;二是通过神经架构搜索(Neural Architecture Search, NAS)优化模型结构耗时长(数小时至数天)。其解决方案的关键在于发现多视角一致性(multi-view consistency)才是决定重建质量的核心因素,而非模型架构本身。基于此洞察,作者提出PreSCAN框架,利用轻量级几何与光度描述符在训练前快速预测NeRF性能,可在30秒内实现误差小于1 dB的准确估计,相比NAS提速1000倍;同时,该方法在边缘平台(Jetson Orin)上结合离线成本分析,显著降低推理功耗(减少26%)和延迟(减少43%),且保持高质量重建,无需重新训练即可泛化至多样卫星场景。

链接: https://arxiv.org/abs/2603.18306
作者: Devjyoti Chakraborty,Zaki Sukma,Rakandhiya D. Rachmanto,Kriti Ghosh,In Kee Kim,Suchendra M. Bhandarkar,Lakshmish Ramaswamy,Nancy K. O’Hare,Deepak Mishra
机构: School of Computing, University of Georgia (佐治亚大学计算机学院); Department of Geography, University of Georgia (佐治亚大学地理系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in 30 seconds with under 1 dB prediction error, achieving a 1000× speedup over NAS. We further demonstrate PreSCAN’s deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on DFC2019 datasets confirm that PreSCAN generalizes across diverse satellite scenes without retraining.
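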

[CV-117] Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision

【速读】:该论文旨在解决单目3D目标跟踪(monocular 3D object tracking)中对密集3D标注严重依赖的问题,现有方法通常需要大量人工标注的3D边界框来训练模型,而这类标注成本高、难以扩展。其解决方案的关键在于提出首个稀疏监督(sparsely supervised)框架,将任务分解为两个顺序子问题:2D查询匹配与3D几何估计,两者均利用图像序列的时空一致性来增强少量标注样本,从而学习丰富的2D和3D场景表征;在此基础上,模型可自动生成高质量的3D伪标签(pseudolabels),将稀疏监督转化为稠密的3D轨迹标注,使原本依赖全监督的跟踪器在仅需每条轨迹4个真值标注的情况下仍能实现显著性能提升(最高达15.50个百分点)。

链接: https://arxiv.org/abs/2603.18298
作者: Nikhil Gosala,B. Ravi Kiran,Senthil Yogamani,Abhinav Valada
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 8 figures

点击查看摘要

Abstract:Monocular 3D object tracking aims to estimate temporally consistent 3D object poses across video frames, enabling autonomous agents to reason about scene dynamics. However, existing state-of-the-art approaches are fully supervised and rely on dense 3D annotations over long video sequences, which are expensive to obtain and difficult to scale. In this work, we address this fundamental limitation by proposing the first sparsely supervised framework for monocular 3D object tracking. Our approach decomposes the task into two sequential sub-problems: 2D query matching and 3D geometry estimation. Both components leverage the spatio-temporal consistency of image sequences to augment a sparse set of labeled samples and learn rich 2D and 3D representations of the scene. Leveraging these learned cues, our model automatically generates high-quality 3D pseudolabels across entire videos, effectively transforming sparse supervision into dense 3D track annotations. This enables existing fully-supervised trackers to effectively operate under extreme label sparsity. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method significantly improves tracking performance, achieving an improvement of up to 15.50 p.p. while using at most four ground truth annotations per track.

[CV-118] CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

【速读】:该论文旨在解决视觉-语言模型(Visual-Language Models, VLMs)在图像描述生成中存在视觉-语言错位(vision-language misalignment)的问题,表现为生成内容过于泛化或出现幻觉(hallucination)。现有方法通常依赖于大规模标注数据集进行指令微调(instruction tuning),或采用复杂的测试时框架进行描述优化,成本高且效率低。论文提出的关键解决方案是基于循环一致性(cycle consistency)设计一种自监督训练信号:利用预训练的文本到图像模型(text-to-image model)将VLM生成的描述重新映射回图像,通过比较重建图像与原始图像的相似度作为奖励,使用Group Relative Policy Optimization(GRPO)对VLM进行微调。此方法无需人工标注的图像-文本数据集,仅需原始图像即可实现高质量、更贴近真实场景的图像描述生成,在多个基准测试中优于依赖监督式循环一致性的先进方法。

链接: https://arxiv.org/abs/2603.18282
作者: Marios Krestenitis,Christos Tzelepis,Konstantinos Ioannidis,Steafanos Vrochidis,Ioannis Kompatsiaris,Georgios Tzimiropoulos,Shaogang Gong,Ioannis Patras
机构: Queen Mary, University of London (伦敦玛丽女王大学); Centre for Research and Technology Hellas (希腊研究中心与技术协会); City St George’s, University of London (伦敦城市圣乔治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning-requiring costly, large-scale annotated datasets or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
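
The GRPO update at the heart of CycleCap needs only group-relative advantages over on-the-fly rewards. A minimal sketch, assuming the reward is the similarity between the original image and the one reconstructed from each sampled caption (the reward values below are hypothetical):

```python
def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: each sample's reward minus
    the group mean, divided by the group standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]

# Hypothetical cycle-consistency rewards: similarity between the original
# image and the image reconstructed from each of four sampled captions.
rewards = [0.82, 0.64, 0.71, 0.90]
advantages = grpo_advantages(rewards)
```

Captions whose reconstructions best match the original image receive positive advantages and are reinforced; no curated image-text pairs or learned reward model are required.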

[CV-119] LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

【速读】:该论文旨在解决神经视频表示(Neural Representations for Videos, NeRV)中卷积解码器计算复杂度高、内存占用大,从而限制其在资源受限环境部署的问题。解决方案的关键在于提出LRConv-NeRV,通过将解码器中选定的密集3×3卷积层替换为结构化的低秩可分离卷积(structured low-rank separable convolutions),并在解码器架构内端到端训练,实现重建质量与效率之间的可控权衡。具体而言,从解码器末端向早期阶段逐步应用低秩分解,在仅对最终解码阶段进行因子化时即可实现68%的计算量(GFLOPs)和9.3%模型参数的显著减少,同时保持近乎无损的重建质量,并带来约9.2%的码率降低,且在INT8量化后仍能维持接近原始NeRV的性能表现。

链接: https://arxiv.org/abs/2603.18261
作者: Tamer Shanableh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
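
The parameter savings from low-rank separable factorization are easy to verify with a back-of-the-envelope count. The factorization below (1x1 projection, k x 1 and 1 x k convolutions at reduced rank, 1x1 expansion) is one plausible structure; the paper's exact decomposition may differ.

```python
def dense_conv_params(c_in, c_out, k=3):
    """Parameters of a dense k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def lowrank_conv_params(c_in, c_out, rank, k=3):
    """One plausible low-rank separable factorization: a 1x1 projection to
    `rank` channels, a k x 1 and a 1 x k convolution at that rank, then a
    1x1 expansion back to c_out channels."""
    return c_in * rank + rank * rank * k + rank * rank * k + rank * c_out

dense = dense_conv_params(256, 256)              # 589,824 parameters
lowrank = lowrank_conv_params(256, 256, rank=64)  # 57,344 parameters
reduction = 1 - lowrank / dense                   # ~90% fewer parameters
```

Since FLOPs scale with the same products, the per-layer compute falls by a similar factor; the rank is the knob that trades quality against efficiency.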

[CV-120] Toward Reliable, Safe and Secure LLMs for Scientific Applications

【速读】:该论文旨在解决生成式 AI(Generative AI)在科学领域应用中面临的安全与可靠性评估缺口问题,尤其针对大语言模型(Large Language Models, LLMs)作为自主“AI科学家”时所引入的独特风险,如生物安全威胁和潜在物理危害。现有通用安全基准因存在领域不匹配、科学特异性威胁覆盖不足及基准过拟合等问题,难以有效评估科学场景下的漏洞。其解决方案的关键在于构建一个多层次防御框架:首先通过专用多智能体系统自动生成领域特定的对抗性安全基准以填补评估空白;其次整合红队测试(red-teaming)、外部边界控制与内部安全LLM代理(Safety LLM Agent),实现从识别、检测到主动防御的闭环机制,从而为科学领域可信部署LLM代理提供结构化、可扩展的防护策略。

链接: https://arxiv.org/abs/2603.18235
作者: Saket Sanjeev Chaturvedi,Joshua Bergerson,Tanwi Mallick
机构: Argonne National Laboratory(阿贡国家实验室)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As large language models (LLMs) evolve into autonomous “AI scientists,” they promise transformative advances but introduce novel vulnerabilities, from potential “biosafety risks” to “dangerous explosions.” Ensuring trustworthy deployment in science requires a new paradigm centered on reliability (ensuring factual accuracy and reproducibility), safety (preventing unintentional physical or biological harm), and security (preventing malicious misuse). Existing general-purpose safety benchmarks are poorly suited for this purpose, suffering from a fundamental domain mismatch, limited threat coverage of science-specific vectors, and benchmark overfitting, which create a critical gap in vulnerability evaluation for scientific applications. This paper examines the unique security and safety landscape of LLM agents in science. We begin by synthesizing a detailed taxonomy of LLM threats contextualized for scientific research, to better understand the unique risks associated with LLMs in science. Next, we conceptualize a mechanism to address the evaluation gap by utilizing dedicated multi-agent systems for the automated generation of domain-specific adversarial security benchmarks. Based on our analysis, we outline how existing safety methods can be brought together and integrated into a conceptual multilayered defense framework designed to combine a red-teaming exercise and external boundary controls with a proactive internal Safety LLM Agent. Together, these conceptual elements provide a necessary structure for defining, evaluating, and creating comprehensive defense strategies for trustworthy LLM agent deployment in scientific disciplines.

[CV-121] Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting

【速读】:该论文旨在解决月球表面导航与制图中面临的挑战性问题,包括纹理贫乏环境、高对比度光照以及计算资源受限等条件下的鲁棒感知难题。解决方案的关键在于提出一种实时映射框架,融合密集感知模型与三维高斯点绘(3D Gaussian Splatting, 3DGS)表示方法:首先在LuPNT模拟器生成的合成数据集上对多个模型进行基准测试,选用基于门控循环单元(Gated Recurrent Units)的立体稠密深度估计模型以平衡速度与精度,并采用卷积神经网络实现语义分割的高性能检测;随后利用真值位姿解耦局部场景理解与全局状态估计,构建出120米路径的三维地图,几何高度精度达约3厘米,优于无激光雷达的传统点云基线方法。该3DGS地图不仅支持新视角合成,还可作为完整SLAM系统的基础,其联合地图与位姿优化能力带来显著优势。

链接: https://arxiv.org/abs/2603.18218
作者: Guillem Casadesus Vila,Adam Dai,Grace Gao
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.

[CV-122] MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

【速读】:该论文旨在解决当前用于交通安全管理与规划的计算机视觉模型在检测弱势道路使用者(Vulnerable Road Users, VRUs)和微型交通工具(Micromobility Vehicles, MMVs)时存在的数据不足与偏差问题。现有公开图像数据集普遍缺乏对VRUs和MMVs的细粒度标注(如将骑行者与行人统归为“person”),且视角单一(主要来自汽车视角),忽视了仅由VRUs使用的区域(如人行道、自行车道)。解决方案的关键在于构建并发布MicroVision数据集——一个从VRU视角采集的、包含超过8,000张高清图像、30,000余条精细标注目标(包括行人、骑行者、电动滑板车使用者及静止的自行车与电动滑板车)的开放数据集,覆盖瑞典哥德堡全年不同场景,并提供基于先进目标检测架构的基准模型(平均精度达0.723),从而提升模型对VRUs与MMVs的识别能力,助力智能交通系统实现更精准的安全监测与微出行行为分析。

链接: https://arxiv.org/abs/2603.18192
作者: Alexander Rasch,Rahul Rajendra Pai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images – a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as “person”, or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and part of almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set. The dataset and model can support traffic safety to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at this https URL.

[CV-123] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

【速读】:该论文旨在解决基于自中心行车记录仪(ego-centric dashcam)视频中安全关键事件(如碰撞和近碰撞)检测难题,此类事件具有短暂性、稀有性和对通用视觉模型识别能力的挑战。现有多模态大语言模型(Multimodal Large Language Models, MLLMs)虽具备强大推理能力,但在驾驶场景中因领域(domain)与时间(temporal)错位导致性能不足。解决方案的关键在于提出VLM-AutoDrive——一种模块化的后训练框架,用于将预训练视觉-语言模型(Vision-Language Models, VLMs)适配至高保真异常检测任务;其核心创新包括:整合元数据生成的描述、大语言模型(LLM)生成的语义描述、视觉问答(VQA)对以及链式思维(Chain-of-Thought, CoT)推理监督信号,从而实现领域对齐且可解释的学习过程。实验证明,该方法显著提升碰撞检测F1分数(从0.00增至0.69),并大幅提高整体准确率(从35.35%升至77.27%),同时在真实世界Nexar行车记录视频上实现了对碰撞与近碰撞事件的有效检测及可解释推理路径生成,弥合了自动驾驶中感知、因果与决策推理之间的鸿沟。

链接: https://arxiv.org/abs/2603.18178
作者: Mohammad Qazim Bhat,Yufan Huang,Niket Agarwal,Hao Wang,Michael Woods,John Kenyon,Tsung-Yi Lin,Xiaodong Yang,Ming-Yu Liu,Kevin Xie
机构: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures, submitted to arXiv

点击查看摘要

Abstract:The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA’s Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.
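
The reported jump from F1 = 0.00 to 0.69 follows directly from the definition of event-level F1: with zero true positives (zero-shot recall near zero), F1 is zero regardless of precision. The event counts below are hypothetical, chosen only to show the arithmetic.

```python
def event_f1(tp, fp, fn):
    """Event-level F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

zero_shot = event_f1(0, 3, 20)    # no collisions recalled -> F1 = 0.0
fine_tuned = event_f1(8, 3, 4)    # hypothetical counts -> F1 = 16/23 ~ 0.70
```
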

[CV-124] Insight-V: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在长链推理能力上的显著不足问题,其核心挑战在于高质量、长链条推理数据的稀缺以及缺乏优化的训练流程。解决方案的关键在于提出一个统一的多智能体视觉推理框架——Insight-V++,该框架通过两个创新机制实现突破:一是构建了无需人工干预的多粒度评估驱动的数据生成管道,可自动生成图像与视频域中的结构化复杂推理轨迹;二是设计双智能体架构,其中推理智能体执行深度分析链,而总结智能体则对结果进行批判性评估与提炼,并基于可靠反馈引导迭代推理路径生成,从而形成闭环自提升机制。此外,为克服传统直接偏好优化(Direct Preference Optimization, DPO)在长时序任务中受限的问题,进一步引入ST-GRPO和J-GRPO两种新算法,显著增强空间-时间推理能力和评价鲁棒性,最终在多个图像与视频推理基准上实现性能跃升,同时保持传统感知任务的强泛化能力。

链接: https://arxiv.org/abs/2603.18118
作者: Yuhao Dong,Zuyan Liu,Shulin Tian,Yongming Rao,Ziwei Liu
机构: S-Lab; Nanyang Technological University (南洋理工大学); Tencent Hunyuan (腾讯混元); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: arXiv admin note: text overlap with arXiv:2411.14432

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
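
The dual-agent self-improvement loop can be reduced to a simple skeleton: the reasoning agent drafts several chains, the summary agent scores them, the best answer is returned, and high-scoring traces feed the next training round. `reason_fn` and `judge_fn` below are illustrative stand-ins for the two agents, not the paper's API.

```python
def self_improve_round(question, reason_fn, judge_fn, n=4, keep=0.7):
    """One round of the dual-agent loop: draft n reasoning chains, score
    each with the summary agent, return the best chain and the traces
    retained as new training data."""
    chains = [reason_fn(question, seed=i) for i in range(n)]
    scored = [(judge_fn(c), c) for c in chains]
    best_score, best_chain = max(scored)
    retained = [c for s, c in scored if s >= keep]
    return best_chain, retained

# Toy agents with fixed behavior, only to exercise the loop:
drafts = {0: "A", 1: "B", 2: "C", 3: "D"}
scores = {"A": 0.3, "B": 0.9, "C": 0.75, "D": 0.1}
best, kept = self_improve_round("q", lambda q, seed: drafts[seed],
                                scores.__getitem__)
```

Running such rounds repeatedly and retraining on the retained traces is what the abstract calls the continuous, self-improving loop.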

[CV-125] From Concepts to Judgments: Interpretable Image Aesthetic Assessment

[Quick Read]: This paper addresses the problem that image aesthetic assessment (IAA) models achieve strong predictive performance but lack interpretability, offering users no way to understand the basis of their scores. The key to the solution is an interpretable framework grounded in human-understandable aesthetic concepts: first, a set of accessible aesthetic concepts is learned to construct a subspace that underpins the model's inherent interpretability; second, a simple yet effective residual predictor captures subtle influences beyond the explicit concepts, so the model offers transparent, human-intelligible justifications for its aesthetic judgments while maintaining competitive predictive performance.

Link: https://arxiv.org/abs/2603.18108
Authors: Xiao-Chang Liu, Johan Wagemans
Affiliations: KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures

Abstract:Image aesthetic assessment (IAA) aims to predict the aesthetic quality of images as perceived by humans. While recent IAA models achieve strong predictive performance, they offer little insight into the factors driving their predictions. Yet for users, understanding why an image is considered pleasing or not is as valuable as the score itself, motivating growing interest in interpretability within IAA. When humans evaluate aesthetics, they naturally rely on high-level cues to justify their judgments. Motivated by this observation, we propose an interpretable IAA framework grounded in human-understandable aesthetic concepts. We learn these concepts in an accessible manner, constructing a subspace that forms the foundation of an inherently interpretable model. To capture nuanced influences on aesthetic perception beyond explicit concepts, we introduce a simple yet effective residual predictor. Experiments on photographic and artistic datasets demonstrate that our method achieves competitive predictive performance while offering transparent, human-understandable aesthetic judgments.
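The concept-plus-residual scoring idea above can be sketched as a simple linear rule; the concept activations, weights, and residual value below are illustrative toys, not the paper's learned quantities.

```python
def aesthetic_score(concept_scores, weights, residual):
    # Interpretable decomposition: the aesthetic score is a weighted sum
    # of human-understandable concept activations, plus a small residual
    # term for influences beyond the explicit concepts (toy numbers).
    return sum(w * c for w, c in zip(weights, concept_scores)) + residual

# Two hypothetical concepts (e.g., "color harmony", "composition").
print(round(aesthetic_score([0.8, 0.2], [0.5, 0.5], 0.05), 2))  # -> 0.55
```

Because the score is a sum over named concepts, each term can be read off directly as that concept's contribution, which is what makes the decomposition interpretable.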

[CV-126] Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

[Quick Read]: This paper addresses a limitation of current adapter-based CLIP tuning methods (e.g., Tip-Adapter) for few-shot learning: they rely on global uni-modal feature vectors and overlook fine-grained patch relations and their structural alignment with class text. The key to the solution is a training-only Heterogeneous Graph Teacher framework that integrates multi-scale visual patches and text prompts into a unified graph, performs deep cross-modal reasoning with a Modality-aware Graph Transformer (MGT), and applies discriminative node filtering to extract high-fidelity class features. A cache-aware dual-objective strategy then distills this relational knowledge directly into Tip-Adapter's key-value cache, substantially upgrading prototype quality and achieving state-of-the-art few-shot adaptation with zero extra inference latency or memory.

Link: https://arxiv.org/abs/2603.18101
Authors: Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed
Affiliations: Edge Hill University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

Abstract:Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter’s key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at this https URL.
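For context on the cache the graph teacher supervises, here is a minimal sketch of a Tip-Adapter-style key-value lookup; the sharpened-affinity form exp(-beta * (1 - cos)) follows the original Tip-Adapter design, and every feature value here is a toy.

```python
import math

def tip_adapter_logits(query, keys, values, alpha=1.0, beta=5.5):
    # Simplified Tip-Adapter cache lookup: cosine affinities between the
    # query feature and cached support keys are sharpened with
    # exp(-beta * (1 - cos)) and blend the cached one-hot values into
    # class logits. (Toy sketch; real features come from CLIP.)
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    q = norm(query)
    logits = [0.0] * len(values[0])
    for key, val in zip(keys, values):
        cos = sum(a * b for a, b in zip(q, norm(key)))
        aff = math.exp(-beta * (1.0 - cos))
        for c in range(len(val)):
            logits[c] += alpha * aff * val[c]
    return logits

# Two cached support samples, two classes (one-hot values).
logits = tip_adapter_logits([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]],
                            [[1.0, 0.0], [0.0, 1.0]])
print(logits.index(max(logits)))  # -> 0
```

The paper's contribution amounts to improving the `keys`/`values` entries (the prototypes) via graph supervision during training, leaving this inference path untouched.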

[CV-127] Q-Drift: Quantization-Aware Drift Correction for Diffusion Model Sampling

[Quick Read]: This paper targets the generation-quality degradation that post-training quantization (PTQ) causes when deploying large diffusion models, as quantization noise accumulates along the denoising trajectory. The key to the solution is a sampler-side correction, Q-Drift, which models quantization error as an implicit stochastic perturbation at each denoising step and derives a marginal-distribution-preserving drift adjustment. The method estimates timestep-wise variance statistics from a handful (e.g., 5) of paired full-precision/quantized calibration runs, yielding a plug-and-play correction that requires no changes to the model or quantization method and adds negligible inference overhead. Across combinations of text-to-image models, samplers, and PTQ methods, it consistently improves FID (by up to 4.59 points) while preserving CLIP scores.

Link: https://arxiv.org/abs/2603.18095
Authors: Sooyoung Ryu, Mathieu Salzmann, Saqib Javed
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 6 figures

Abstract:Post-training quantization (PTQ) is a practical path to deploy large diffusion models, but quantization noise can accumulate over the denoising trajectory and degrade generation quality. We propose Q-Drift, a principled sampler-side correction that treats quantization error as an implicit stochastic perturbation on each denoising step and derives a marginal-distribution-preserving drift adjustment. Q-Drift estimates a timestep-wise variance statistic from calibration, in practice requiring as few as 5 paired full-precision/quantized calibration runs. The resulting sampler correction is plug-and-play with common samplers, diffusion models, and PTQ methods, while incurring negligible overhead at inference. Across six diverse text-to-image models (spanning DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over the corresponding quantized baseline in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.
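The calibration statistic can be illustrated as follows: run a few paired full-precision and quantized samplers and record, per timestep, the variance of their residual. This sketches only the statistic; the marginal-preserving drift adjustment the paper derives from it is not reproduced here.

```python
import statistics

def timestep_variance(fp_runs, q_runs):
    # fp_runs, q_runs: paired calibration trajectories; each trajectory
    # is a list of per-timestep scalar summaries (e.g., the mean of the
    # denoiser output). Returns, for each timestep, the variance of the
    # quantization residual estimated over the (few) paired runs.
    T = len(fp_runs[0])
    out = []
    for t in range(T):
        residuals = [q[t] - fp[t] for fp, q in zip(fp_runs, q_runs)]
        out.append(statistics.pvariance(residuals))
    return out

# Three toy paired runs over two timesteps: noise only at timestep 0.
fp = [[1.0, 2.0]] * 3
q = [[1.1, 2.0], [0.9, 2.0], [1.0, 2.0]]
print([round(v, 4) for v in timestep_variance(fp, q)])  # -> [0.0067, 0.0]
```

In the paper's setting each such per-timestep statistic then parameterizes the drift correction applied inside the sampler step.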

[CV-128] One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control CVPR2026

[Quick Read]: This paper tackles the scarcity of anomalous samples in industrial anomaly detection (IAD), where existing few-shot anomaly synthesis methods require time-consuming training and struggle to learn distributions faithful to real anomalies. The core solution is a training-free few-shot anomaly generation method, O2MAG (One reference anomalous image to synthesize More realistic anomalies). Its key ideas are to use the self-attention of a single reference anomalous image to guide multiple parallel diffusion processes, with the anomaly mask mitigating foreground-background query confusion; to introduce Anomaly-Guided Optimization, which aligns text prompts with true anomaly semantics; and to apply Dual-Attention Enhancement, which reinforces self- and cross-attention within masked regions. The result is synthesized anomalies that better match real distributions and the text description, noticeably improving downstream anomaly detection.

Link: https://arxiv.org/abs/2603.18093
Authors: Haoxiang Rao, Zhao Wang, Chenyang Si, Yan Lyu, Yuanyi Duan, Fang Zhao, Caifeng Shan
Affiliations: Nanjing University; China Mobile Zijin Innovation Institute; Shandong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2026

Abstract:Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

[CV-129] Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

[Quick Read]: This paper addresses the trade-off in current Vision-Language-Action (VLA) models between low-level control precision and generalization: diffusion action experts efficiently generate high-precision continuous action chunks but lack robustness in out-of-distribution scenarios, whereas the auto-regressive paradigm generalizes better but is slow and imprecise for fine-grained control. The key to the solution is the Action-Draft-and-Verify (ADV) framework: a diffusion action expert drafts multiple candidate action chunks in parallel, and a vision-language model reranks and selects among all candidates in a single forward pass using a perplexity-style score, combining the precision of diffusion with the robustness of auto-regression. Under matched backbones, training data, and action-chunk lengths, ADV improves the success rate by 4.3 points in simulation and 19.7 points on real-world tasks, at the cost of only a single VLM reranking pass.

Link: https://arxiv.org/abs/2603.18091
Authors: Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li, Youhe Feng, Yang Li, Jie Tang, Jing Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world over diffusion-based baseline, with a single-pass VLM reranking overhead.
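The draft-and-verify selection can be sketched with a perplexity-style score: the candidate whose tokens the verifier assigns the highest average log-probability (lowest perplexity) wins. The chunks and log-probs below are made-up stand-ins for a real VLM forward pass.

```python
import math

def perplexity(token_logprobs):
    # Perplexity of a sequence under a scoring model: the exponential of
    # the negative mean token log-probability (lower = more plausible).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_action_chunk(candidates):
    # candidates: (chunk, token_logprobs) pairs; in the paper's setting
    # all candidates are scored in a single VLM forward pass.
    return min(candidates, key=lambda c: perplexity(c[1]))[0]

cands = [
    ("chunk_a", [-0.9, -1.1, -0.8]),
    ("chunk_b", [-0.2, -0.3, -0.25]),  # most plausible to the scorer
    ("chunk_c", [-1.5, -1.4, -1.6]),
]
print(select_action_chunk(cands))  # -> chunk_b
```

The verifier thus acts as a filter over the diffusion expert's drafts rather than as a generator, which is where the paper's robustness gain is claimed to come from.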

[CV-130] CytoSyn: a Foundation Diffusion Model for Histopathology – Tech Report

[Quick Read]: This paper addresses the scarcity of generative foundation models designed specifically for histopathology. Existing approaches rely mainly on self-supervised feature extractors suited to downstream tasks such as cell segmentation and tumor sub-typing, and struggle with tasks that require generating new images, such as virtual staining. The key contribution is CytoSyn, a state-of-the-art latent diffusion model that generates high-fidelity and diverse HE-stained pathology images. Its core innovations are systematic improvements to training-set scaling, sampling strategies, and slide-level overfitting control; training on large-scale pan-cancer data (>10,000 TCGA whole-slide images) enables generalization to non-oncology pathology images (e.g., inflammatory bowel disease), clearly outperforming prior methods such as PixCell, and reveals the strong sensitivity of diffusion models to preprocessing details such as JPEG compression.

Link: https://arxiv.org/abs/2603.18089
Authors: Thomas Duboudin, Xavier Fontaine, Etienne Andrier, Lionel Guillou, Alexandre Filiot, Thalyssa Baiocco-Rodrigues, Antoine Olivier, Alberto Romagnoni, John Klein, Jean-Baptiste Schiratti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 5 figures, tech report, model page: this https URL

Abstract:Computational pathology has made significant progress in recent years, fueling advances in both fundamental disease understanding and clinically ready tools. This evolution is driven by the availability of large amounts of digitized slides and specialized deep learning methods and models. Multiple self-supervised foundation feature extractors have been developed, enabling downstream predictive applications from cell segmentation to tumor sub-typing and survival analysis. In contrast, generative foundation models designed specifically for histopathology remain scarce. Such models could address tasks that are beyond the capabilities of feature extractors, such as virtual staining. In this paper, we introduce CytoSyn, a state-of-the-art foundation latent diffusion model that enables the guided generation of highly realistic and diverse histopathology HE-stained images, as shown in an extensive benchmark. We explored methodological improvements, training set scaling, sampling strategies and slide-level overfitting, culminating in the improved CytoSyn-v2, and compared our work to PixCell, a state-of-the-art model, in an in-depth manner. This comparison highlighted the strong sensitivity of both diffusion models and performance metrics to preprocessing-specific details such as JPEG compression. Our model has been trained on a dataset obtained from more than 10,000 TCGA diagnostic whole-slide images of 32 different cancer types. Despite being trained only on oncology slides, it maintains state-of-the-art performance generating inflammatory bowel disease images. To support the research community, we publicly release CytoSyn’s weights, its training and validation datasets, and a sample of synthetic images in this repository: this https URL.

[CV-131] SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

[Quick Read]: This paper targets the limited natural-language understanding of segmentation foundation models in referring expression segmentation (RES), i.e., the difficulty of accurately localizing the object in an image that a natural-language expression describes. The key to the solution is the SSP-SAM framework, which introduces a Semantic-Spatial Prompt (SSP) encoder with both visual and linguistic attention adapters that highlight salient objects in the image and discriminative phrases in the text, producing high-quality SSP prompts that drive the Segment Anything Model (SAM) to perform precise language-guided segmentation. Without further modification, the method naturally supports the more flexible Generalized RES (GRES) setting, where the referent may correspond to zero, one, or multiple instances.

Link: https://arxiv.org/abs/2603.18086
Authors: Wei Tang, Xuejing Liu, Yanpeng Sun, Zechao Li
Affiliations: Nanjing University of Science and Technology; National University of Singapore; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM’s segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: this https URL.
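The Pr@0.9 metric cited above is simply the fraction of predictions whose mask IoU with the ground truth clears a strict threshold; a minimal sketch with made-up IoU values:

```python
def precision_at(ious, thresh=0.9):
    # Pr@thresh: fraction of predictions whose IoU with ground truth is
    # at least `thresh`; Pr@0.9 is the strict setting the paper reports.
    return sum(1 for i in ious if i >= thresh) / len(ious)

# Four hypothetical predictions; two clear the 0.9 bar.
print(precision_at([0.95, 0.85, 0.92, 0.7]))  # -> 0.5
```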

[CV-132] EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

[Quick Read]: This paper addresses the challenges of egocentric "Talking to Me" (TTM) speaker detection in real-world scenarios, including missing visual data, neglect of non-verbal cues such as head orientation, and background noise. The key to the solution is the EgoAdapt framework, which adaptively fuses and robustly processes multimodal information through three core modules: (1) a Visual Speaker Target Recognition (VSTR) module that jointly exploits head orientation (a non-verbal cue) and lip movement (a verbal cue); (2) a Parallel Shared-weight Audio (PSA) encoder that improves speech feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that dynamically estimates the presence of each modality in every frame and adjusts the system response accordingly, substantially improving stability and accuracy under missing modalities.

Link: https://arxiv.org/abs/2603.18082
Authors: Xinyuan Qian, Xinjia Zhu, Alessio Brutti, Dong Liang
Affiliations: University of Science and Technology Beijing; Nanjing University of Aeronautics and Astronautics; Fondazione Bruno Kessler; Shenzhen Research Institute, Nanjing University of Aeronautics and Astronautics
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments:

Abstract:The TTM (Talking to Me) task is a pivotal component in understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often face challenges in real-world scenarios due to missing visual data, neglect of the role of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals to address TTM, setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response. Evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.

[CV-133] DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment ICRA2026

[Quick Read]: This paper addresses the performance degradation of vision-centric perception systems for autonomous driving in low-light environments, and in particular the lack of real-world, precisely aligned day/night image datasets, which has limited research on low-light enhancement. The key to the solution is an automatic day-night Trajectory Tracking-based Pose Matching (TTPM) method, applied in a 69-acre closed real-vehicle test field to build DarkDriving, the first precisely aligned day-and-night image dataset, containing 9,538 image pairs whose spatial alignment error is only a few centimeters, together with 2D bounding-box annotations. It provides a high-quality benchmark for low-light enhancement and downstream detection tasks (2D/3D object detection).

Link: https://arxiv.org/abs/2603.18067
Authors: Wuqi Wang, Haochen Yang, Baolu Li, Jiaqi Sun, Xiangmo Zhao, Zhigang Xu, Qing Guo, Haigen Min, Tianyun Zhang, Hongkai Yu
Affiliations: Chang'an University; Cleveland State University; A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments: 8 pages, 8 figures. Accepted to ICRA 2026

Abstract:Low-light conditions are challenging for vision-centric perception systems in autonomous driving. In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can only be collected by controlling exposures over small ranges and in static scenes, and the dark images in current nighttime driving datasets lack precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day-and-night aligned dataset in dynamic driving scenes has significantly limited research in this area. With a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day-and-night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial content, with an alignment error of just several centimeters. For each pair, we also manually label 2D object bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection in dark driving environments. The experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving, and that it also generalizes to enhancing dark images and promoting detection in other low-light driving environments, such as nuScenes.

[CV-134] S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

[Quick Read]: This paper addresses the high energy cost of deploying skeleton-based action recognition on resource-constrained edge devices: existing methods rely heavily on power-hungry artificial neural networks (ANNs), and although spiking neural networks (SNNs) offer a more energy-efficient alternative, prior spiking models for skeleton data resort to dense matrix aggregation, heavy multimodal fusion modules, or non-sparse frequency-domain transforms, failing to exploit the intrinsic sparsity of SNNs and suffering severely from the short-term amnesia of spiking neurons. The key to the solution is S3T-Former, the first purely spike-driven Transformer architecture for this task: a Multi-Stream Anatomical Spiking Embedding (M-ASE) transforms multimodal skeleton features into heterogeneous, highly sparse event streams; Lateral Spiking Topology Routing (LSTR) enables on-demand conditional spike propagation to preserve topological and temporal sparsity; and a Spiking State-Space (S3) engine systematically models long-range temporal dynamics without non-sparse spectral operations. Experiments on multiple large-scale datasets show competitive accuracy with much lower theoretical energy consumption, establishing a new benchmark for neuromorphic action recognition.

Link: https://arxiv.org/abs/2603.18062
Authors: Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yujia Wang
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.
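As background, spike-driven architectures such as this one build on threshold neurons; the following is a generic textbook leaky integrate-and-fire (LIF) neuron, not the paper's specific neuron model:

```python
def lif_spikes(inputs, threshold=1.0, decay=0.5):
    # Minimal leaky integrate-and-fire neuron: the membrane potential
    # decays, integrates the input, and emits a binary spike (with hard
    # reset) whenever it crosses the threshold. Sparse binary spikes are
    # the event mechanism SNNs use to save energy.
    v, spikes = 0.0, []
    for x in inputs:
        v = decay * v + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes

print(lif_spikes([0.6, 0.8, 0.2, 0.9]))  # -> [0, 1, 0, 1]
```

Because most timesteps emit no spike, downstream computation only fires on the sparse events, which is the sparsity the paper is careful to preserve end to end.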

[CV-135] RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

[Quick Read]: This paper studies multi-label classification of capsule endoscopic videos (CEV), aiming to automatically recognize and classify different gastrointestinal regions and lesion characteristics. The core problem is how to accurately extract and distinguish 17 key anatomical and pathological labels (e.g., stomach, small intestine, ulcer, bleeding) from complex CEV image sequences. The key to the solution is fine-tuning a Transformer-based model, the Google Vision Transformer (ViT) batch16 variant at 224 x 224 input resolution, exploiting its global context modeling to improve multi-label classification. On three test videos, the overall mAP@0.5 is 0.0205 and the mAP@0.95 is 0.0196, suggesting preliminary practical potential.

Link: https://arxiv.org/abs/2603.18045
Authors: X. Gao, C. Chien, G. Liu, A. Manullang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work corresponds to the Gastro Competition for multi-label classification from capsule endoscopic videos (CEV). A deep learning network based on Transformers is fine-tuned for this task. The base model is the Google Vision Transformer (ViT) batch16 with 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. For the test dataset of three videos, the overall mAP@0.5 is 0.0205, whereas the overall mAP@0.95 is 0.0196.

[CV-136] UEPS: Robust and Efficient MRI Reconstruction

[Quick Read]: This paper tackles a key obstacle for deep unrolled models (DUMs) in accelerated MRI reconstruction: their lack of robustness under domain shift, which severely limits clinical adoption. The study identifies coil sensitivity map (CSM) estimation as the main bottleneck for generalization. The proposed UEPS architecture rests on three innovations: (i) an Unrolled Expanded (UE) design that reconstructs each coil independently, removing the CSM dependency; (ii) a progressive-resolution strategy that exploits the k-space-to-image mapping for efficient coarse-to-fine refinement; and (iii) sparse attention tailored to the 1D undersampling nature of MRI. These physics-grounded designs improve robustness while reducing computational cost; experiments show that UEPS outperforms existing DUM, end-to-end, diffusion, and untrained methods on zero-shot transfer across diverse clinical shifts (anatomy, view, contrast, vendor, field strength, and coil configuration), achieving state-of-the-art robustness with low-latency inference suitable for real-time deployment.

Link: https://arxiv.org/abs/2603.18572
Authors: Xiang Zhou, Hong Shang, Zijian Zhan, Tianyu He, Jintao Meng, Dong Liang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The document contains the main paper and additional experimental details in the supplementary material. Open-source code can be found at: this https URL

Abstract:Deep unrolled models (DUMs) have become the state of the art for accelerated MRI reconstruction, yet their robustness under domain shift remains a critical barrier to clinical adoption. In this work, we identify coil sensitivity map (CSM) estimation as the primary bottleneck limiting generalization. To address this, we propose UEPS, a novel DUM architecture featuring three key innovations: (i) an Unrolled Expanded (UE) design that eliminates CSM dependency by reconstructing each coil independently; (ii) progressive resolution, which leverages k-space-to-image mapping for efficient coarse-to-fine refinement; and (iii) sparse attention tailored to MRI’s 1D undersampling nature. These physics-grounded designs enable simultaneous gains in robustness and computational efficiency. We construct a large-scale zero-shot transfer benchmark comprising 10 out-of-distribution test sets spanning diverse clinical shifts – anatomy, view, contrast, vendor, field strength, and coil configurations. Extensive experiments demonstrate that UEPS consistently and substantially outperforms existing DUM, end-to-end, diffusion, and untrained methods across all OOD tests, achieving state-of-the-art robustness with low-latency inference suitable for real-time deployment.
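As background on CSM-free reconstruction: once each coil image is reconstructed independently, a classical way to merge them without sensitivity maps is root-sum-of-squares (RSS) combination. RSS is standard MRI practice, not necessarily the combination UEPS itself uses.

```python
import math

def rss_combine(coil_images):
    # Root-sum-of-squares coil combination: merges per-coil magnitude
    # images pixel-wise without any coil sensitivity maps, in the spirit
    # of a CSM-free pipeline where coils are reconstructed independently.
    n = len(coil_images[0])
    return [math.sqrt(sum(img[i] ** 2 for img in coil_images))
            for i in range(n)]

# Two toy coils, two "pixels" each.
print(rss_combine([[3.0, 0.0], [4.0, 1.0]]))  # -> [5.0, 1.0]
```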

[CV-137] End-to-End QGAN-Based Image Synthesis via Neural Noise Encoding and Intensity Calibration

[Quick Read]: This paper addresses the inability of current quantum generative adversarial networks (QGANs) to generate full images directly. The core bottlenecks are: (1) the rigid, fixed interface from classical noise to quantum states, which limits the expressiveness of the quantum generator; and (2) the mismatch between the normalized statistics of quantum measurements and the pixel-intensity space, which degrades image quality. The key to the solution is the ReQGAN framework, which introduces a learnable Neural Noise Encoder for adaptive quantum state preparation and a differentiable Intensity Calibration module that maps quantum measurements into a stable, visually meaningful pixel domain, enabling end-to-end full-image generation with stable training and high-quality synthesis under limited qubit budgets.

Link: https://arxiv.org/abs/2603.18554
Authors: Xue Yang, Rigui Zhou, Shizheng Jia, Dax Enshan Koh, Siong Thye Goh, Yaochong Li, Hongyu Chen, Fuhui Xiong
Affiliations: Shanghai Maritime University; Agency for Science, Technology and Research; Singapore University of Technology and Design; Institute of High Performance Computing
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Quantum Generative Adversarial Networks (QGANs) offer a promising path for learning data distributions on near-term quantum devices. However, existing QGANs for image synthesis avoid direct full-image generation, relying on classical post-processing or patch-based methods. These approaches dilute the quantum generator’s role and struggle to capture global image semantics. To address this, we propose ReQGAN, an end-to-end framework that synthesizes an entire N=2^D-pixel image using a single D-qubit quantum circuit. ReQGAN overcomes two fundamental bottlenecks hindering direct pixel generation: (1) the rigid classical-to-quantum noise interface and (2) the output mismatch between normalized quantum statistics and the desired pixel-intensity space. We introduce a learnable Neural Noise Encoder for adaptive state preparation and a differentiable Intensity Calibration module to map measurements to a stable, visually meaningful pixel domain. Experiments on MNIST and Fashion-MNIST demonstrate that ReQGAN achieves stable training and effective image synthesis under stringent qubit budgets, with ablation studies verifying the contribution of each component.
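The output mismatch the paper targets can be seen in miniature: a D-qubit state yields N = 2^D measurement probabilities summing to 1, which must be remapped into pixel intensities. Below is a hand-written affine calibration with clamping; the paper's Intensity Calibration module is learned end-to-end, so this is only a sketch of the idea.

```python
def calibrate(probs, scale, shift):
    # Measurement probabilities sum to 1, so individual values shrink as
    # the image grows (N = 2**D pixels) and sit on the wrong scale for
    # intensities in [0, 1]. A minimal calibration: affine map then clamp.
    # (Illustrative only; the paper learns this mapping differentiably.)
    return [min(1.0, max(0.0, scale * p + shift)) for p in probs]

# 2 qubits -> 4 "pixels".
probs = [0.4, 0.3, 0.2, 0.1]
print(calibrate(probs, scale=2.0, shift=0.0))  # -> [0.8, 0.6, 0.4, 0.2]
```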

[CV-138] SCISSR: Scribble-Conditioned Interactive Surgical Segmentation and Refinement

[Quick Read]: This paper addresses precise segmentation of tissues and instruments in surgical scenes, a task made difficult by irregular shapes, thin structures, specular reflections, and frequent occlusions. For prompt-driven segmentation models such as the Segment Anything Model (SAM), point prompts are too sparse and box prompts too coarse to localize such targets. The key to the solution is SCISSR, an interactive scribble-prompt framework whose core innovation is a lightweight Scribble Encoder that converts freehand scribbles into dense prompt embeddings compatible with the mask decoder and supports iterative refinement via corrective strokes on error regions. Only the newly added modules (Scribble Encoder, Spatial Gated Fusion, and LoRA adapters) are fine-tuned while the backbone stays frozen, yielding 95.41% and 96.30% Dice on EndoVis 2018 and CholecSeg8k respectively, clearly outperforming iterative point prompting while remaining transferable across prompt-driven model architectures.

Link: https://arxiv.org/abs/2603.18544
Authors: Haonan Ping, Jian Jiang, Cheng Yuan, Qizhen Sun, Lv Wu, Yutong Ban
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate segmentation of tissues and instruments in surgical scenes is annotation-intensive due to irregular shapes, thin structures, specularities, and frequent occlusions. While SAM models support point, box, and mask prompts, points are often too sparse and boxes too coarse to localize such challenging targets. We present SCISSR, a scribble-promptable framework for interactive surgical scene segmentation. It introduces a lightweight Scribble Encoder that converts freehand scribbles into dense prompt embeddings compatible with the mask decoder, enabling iterative refinement for a target object by drawing corrective strokes on error regions. Because all added modules (the Scribble Encoder, Spatial Gated Fusion, and LoRA adapters) interact with the backbone only through its standard embedding interfaces, the framework is not tied to a single model: we build on SAM 2 in this work, yet the same components transfer to other prompt-driven segmentation architectures such as SAM 3 without structural modification. To preserve pre-trained capabilities, we train only these lightweight additions while keeping the remaining backbone frozen. Experiments on EndoVis 2018 demonstrate strong in-domain performance, while evaluation on the out-of-distribution CholecSeg8k further confirms robustness across surgical domains. SCISSR achieves 95.41% Dice on EndoVis 2018 with five interaction rounds and 96.30% Dice on CholecSeg8k with three interaction rounds, outperforming iterative point prompting on both benchmarks.
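The Dice scores reported above are computed from mask overlap; a minimal implementation of the metric on binary masks:

```python
def dice(pred, target):
    # Dice coefficient between two binary masks (flattened 0/1 lists):
    # 2*|A intersect B| / (|A| + |B|); 1.0 means a perfect match.
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total else 1.0

print(round(dice([1, 1, 0, 0], [1, 0, 0, 0]), 4))  # -> 0.6667
```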

Artificial Intelligence

[AI-0] OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

[Quick Read]: This paper addresses the instability of reinforcement learning (RL) training for GUI agents caused by its sensitivity to reward-function quality, and in particular the difficulty of achieving both scalability and accuracy in complex, stochastic environments. The key to the solution is the OS-Themis framework, whose core innovations are decomposing trajectories into verifiable milestones to isolate the critical evidence for decision making, and introducing a review mechanism that strictly audits the evidence chain before issuing the final verdict, improving the accuracy and robustness of reward signals. The accompanying cross-platform benchmark OmniGUIRewardBench (OGRBench) supports systematic evaluation; experiments show a 10.3% improvement when supporting online RL training and a 6.9% gain when used for trajectory validation and filtering in self-training loops, highlighting its potential to drive agent evolution.

Link: https://arxiv.org/abs/2603.19191
Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
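One hypothetical way to aggregate milestone audits into a reward, for intuition only; the paper's multi-agent critic issues its verdict only after a review agent audits the evidence chain, which this toy does not model:

```python
def milestone_verdict(milestones):
    # milestones: list of (name, passed) audit results for one
    # trajectory. A toy scalar reward is the fraction of verifiable
    # milestones that pass; the binary verdict requires the whole
    # evidence chain to hold. (Hypothetical aggregation, not the
    # paper's actual scoring rule.)
    passed = [ok for _, ok in milestones]
    reward = sum(passed) / len(passed)
    return reward, all(passed)

r, ok = milestone_verdict([("open_app", True),
                           ("fill_form", True),
                           ("submit", False)])
print(round(r, 4), ok)  # -> 0.6667 False
```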

[AI-1] SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

[Quick Read]: This paper addresses a major limitation of current GPU-kernel optimization benchmarks: they reward speedup over software baselines, steering optimization away from hardware efficiency limits and failing to push generative AI systems toward near-peak execution on specific hardware (e.g., NVIDIA Blackwell GPUs). The key to the solution is SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems drawn from real and emerging AI models. The SOLAR pipeline computes hardware-intrinsic Speed-of-Light (SOL) bounds, and a SOL Score quantifies the gap between a candidate kernel and that bound, reframing the optimization target from beating a software baseline to approaching the hardware's theoretical limit. A sandboxed harness further guards against reward-hacking behaviors, ensuring reliable and reproducible evaluation.

Link: https://arxiv.org/abs/2603.19173
Authors: Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
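按摘要的描述,SOL Score 量化候选核在"评分基线到硬件 SOL 上界"这一差距中所弥合的比例。下面给出一个基于该描述的最小示意实现;论文的具体计分公式未在摘要中给出,此处的线性插值形式仅为假设:

```python
def sol_score(baseline_ms: float, candidate_ms: float, sol_ms: float) -> float:
    """候选核在评分基线与硬件 SOL 上界之间弥合的差距比例(裁剪到 [0, 1])。

    注意:这是按摘要描述假设的线性形式,并非论文的官方计分公式。
    """
    gap = baseline_ms - sol_ms
    if gap <= 0:
        raise ValueError("baseline 必须慢于 SOL 上界")
    closed = baseline_ms - candidate_ms
    return max(0.0, min(1.0, closed / gap))
```

例如基线耗时 10ms、SOL 上界 2ms 时,将候选核优化到 6ms 即弥合了一半差距(得分 0.5)。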

[AI-2] cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

【速读】:该论文旨在解决组合优化(Combinatorial Optimization)问题中普遍存在的三重困境:通用性(generality)、性能(performance)与可用性(usability)之间的权衡。现有方法往往在某一维度上表现优异,但难以兼顾其他两个方面。为应对这一挑战,作者提出了一种基于GPU加速的通用元启发式框架cuGenOpt,其核心创新在于三个层面的协同设计:首先,在引擎层采用“一个线程块演化一个解”的CUDA架构,结合统一编码抽象(支持排列、二进制和整数编码)、两级自适应算子选择机制及硬件感知资源管理;其次,在可扩展性层面提供用户自定义算子注册接口,使领域专家能注入针对特定问题的CUDA搜索算子;最后,在易用性层面引入JIT编译管道,以纯Python API暴露框架功能,并集成大语言模型(LLM)驱动的建模助手,将自然语言描述自动转换为可执行求解代码。实验表明,cuGenOpt在多个GPU架构上显著优于通用MIP求解器,在实例规模达n=150时达到与专用求解器相当的质量,并在TSP-442问题上30秒内实现4.73%的间隙,同时框架级优化使pcb442问题的间隙从36%降至4.73%,VRPTW吞吐量提升75–81%。

链接: https://arxiv.org/abs/2603.19163
作者: Yuyang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 28 pages, 9 figures. Code available at this https URL

点击查看摘要

Abstract:Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework that addresses all three dimensions simultaneously. At the engine level, cuGenOpt adopts a “one block evolves one solution” CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two-level adaptive operator selection mechanism, and hardware-aware resource management. At the extensibility level, a user-defined operator registration interface allows domain experts to inject problem-specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure-Python API, and an LLM-based modeling assistant converts natural-language problem descriptions into executable solver code. Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP-442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework-level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%. Code: this https URL

[AI-3] D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

【速读】:该论文旨在解决离散扩散模型(Discrete Diffusion Models)在文本生成任务中解码方法研究不足的问题,尤其是现有扩散解码技术难以控制批次内多样性(in-batch diversity)的局限性。针对此问题,作者提出了一种通用的束搜索(Beam Search)框架,能够在迭代去噪过程中并行生成候选序列,并支持模块化束选择目标。其关键创新在于D5P4方法,将选择步骤建模为基于行列式点过程(Determinantal Point Process, DPP)的最大后验(MAP)推断,从而显式平衡模型概率与目标多样性;同时利用可扩展的贪心求解器,在保持多GPU兼容性的前提下实现近乎零计算开销的多样性调控。

链接: https://arxiv.org/abs/2603.19146
作者: Jonathan Lys,Vincent Gripon,Bastien Pasdeloup,Axel Marmoret,Lukas Mauch,Fabien Cardinaux,Ghouthi Boukli Hacene
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard decoding methods for autoregressive models, such as beam search, do not directly apply to iterative denoising, and existing diffusion decoding techniques provide limited control over in-batch diversity. To bridge this gap, we introduce a generalized beam-search framework for discrete diffusion that generates candidates in parallel and supports modular beam-selection objectives. As a diversity-focused instantiation, we propose D5P4, which formulates the selection step as MAP inference over a Determinantal Point Process. Leveraging a scalable greedy solver, D5P4 maintains multi-GPU compatibility and enables an explicit trade-off between model probability and target diversity with near-zero compute overhead. Experiments on free-form generation and question answering demonstrate that D5P4 improves diversity over strong baselines while maintaining competitive generation quality.
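D5P4 将束选择建模为 DPP 上的 MAP 推断并用贪心求解器求解。下面是贪心 DPP MAP 的一个通用示意:每步加入使所选子集核子矩阵 log-det 最大的候选。测试中用"质量 × 相似度"构造核矩阵只是常见做法,论文中束选择核的具体形式可能不同:

```python
import numpy as np

def greedy_dpp_map(L: np.ndarray, k: int) -> list:
    """贪心 DPP MAP:每步加入使核子矩阵 L[S, S] 的 log-det 最大的候选。"""
    selected = []
    for _ in range(k):
        best_j, best_gain = -1, -np.inf
        for j in range(L.shape[0]):
            if j in selected:
                continue
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
    return selected
```

直观上,两条近乎重复的高概率候选序列只会入选其一,剩余名额留给与已选集合差异更大的候选,从而在模型概率与批次内多样性之间取得平衡。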

[AI-4] Implicit Patterns in LLM -Based Binary Analysis

【速读】:该论文旨在解决多轮次大语言模型(Large Language Model, LLM)在二进制漏洞分析中推理过程的组织机制不明确的问题,尤其关注受限上下文窗口下token级隐式行为如何影响探索策略。其解决方案的关键在于首次通过大规模、细粒度的trace-level分析,揭示了LLM在数百次推理步骤中自发形成的四种结构化token级隐式模式:早期剪枝(early pruning)、路径依赖锁定(path-dependent lock-in)、定向回溯(targeted backtracking)和知识引导优先级排序(knowledge-guided prioritization)。这些模式构成了LLM推理的抽象表示,替代了显式的控制流或预设启发式规则,实现了对路径选择、承诺与修正等决策的隐式调控,并展现出稳定的时间角色与可量化的特征,为构建更可靠的LLM驱动二进制分析系统提供了基础。

链接: https://arxiv.org/abs/2603.19138
作者: Qiang Li,XiangRui Zhang,Haining Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
备注: 18 pages

点击查看摘要

Abstract:Binary vulnerability analysis is increasingly performed by LLM-based agents in an iterative, multi-pass manner, with the model as the core decision-maker. However, how such systems organize exploration over hundreds of reasoning steps remains poorly understood, due to limited context windows and implicit token-level behaviors. We present the first large-scale, trace-level study showing that multi-pass LLM reasoning gives rise to structured, token-level implicit patterns. Analyzing 521 binaries with 99,563 reasoning steps, we identify four dominant patterns: early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization that emerge implicitly from reasoning traces. These token-level implicit patterns serve as an abstraction of LLM reasoning: instead of explicit control-flow or predefined heuristics, exploration is organized through implicit decisions regulating path selection, commitment, and revision. Our analysis shows these patterns form a stable, structured system with distinct temporal roles and measurable characteristics. Our results provide the first systematic characterization of LLM-driven binary analysis and a foundation for more reliable analysis systems.

[AI-5] Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

【速读】:该论文旨在解决股票市场在不同状态(如稳定期与波动期)下表现出的非平稳行为问题,即传统预测模型在正常市场条件下表现良好,但在极端或异常市场状态下性能显著下降的问题。现有方法通常对所有市场状态一视同仁,或依赖人工标注市场状态,这不仅成本高且难以适应动态变化的市场环境。解决方案的关键在于提出一个自适应预测框架,其核心由三部分组成:(1) 基于正常市场条件训练的自动编码器(autoencoder),通过重建误差识别异常状态;(2) 分别针对稳定和事件驱动型市场的双节点Transformer网络(dual node transformer networks);(3) 采用软演员-评论家强化学习(Soft Actor-Critic, SAC)控制器,动态调整异常检测阈值和路径融合权重,以最大化预测性能反馈。该框架实现了从静态建模到动态适应的转变,显著提升了在高波动时期的预测精度和鲁棒性。

链接: https://arxiv.org/abs/2603.19136
作者: Mohammad Al Ridhawi,Mahtab Haj Ali,Hussein Al Osman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
备注: Submitted to IEEE Transactions on Computational Social Systems. 17 pages, 9 figures, 10 tables

点击查看摘要

Abstract:Stock markets exhibit regime-dependent behavior where prediction models optimized for stable conditions often fail during volatile periods. Existing approaches typically treat all market states uniformly or require manual regime labeling, which is expensive and quickly becomes stale as market dynamics evolve. This paper introduces an adaptive prediction framework that adaptively identifies deviations from normal market conditions and routes data through specialized prediction pathways. The architecture consists of three components: (1) an autoencoder trained on normal market conditions that identifies anomalous regimes through reconstruction error, (2) dual node transformer networks specialized for stable and event-driven market conditions respectively, and (3) a Soft Actor-Critic reinforcement learning controller that adaptively tunes the regime detection threshold and pathway blending weights based on prediction performance feedback. The reinforcement learning component enables the system to learn adaptive regime boundaries, defining anomalies as market states where standard prediction approaches fail. Experiments on 20 S&P 500 stocks spanning 1982 to 2025 demonstrate that the proposed framework achieves 0.68% MAPE for one-day predictions without the reinforcement controller and 0.59% MAPE with the full adaptive system, compared to 0.80% for the baseline integrated node transformer. Directional accuracy reaches 72% with the complete framework. The system maintains robust performance during high-volatility periods, with MAPE below 0.85% when baseline models exceed 1.5%. Ablation studies confirm that each component contributes meaningfully: autoencoder routing accounts for 36% relative MAPE degradation upon removal, followed by the SAC controller at 15% and the dual-path architecture at 7%.
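该框架的核心路由逻辑可以用如下最小示意表达:自动编码器重建误差超过阈值时偏向事件驱动通路,否则偏向稳定通路,并按融合权重混合两条通路的预测。阈值与融合权重在论文中由 SAC 控制器在线调节,此处取固定值仅作演示:

```python
def blend_predictions(recon_error: float, threshold: float,
                      pred_stable: float, pred_event: float,
                      event_weight: float = 0.8) -> float:
    """重建误差超阈值时偏向事件驱动通路的预测,否则偏向稳定通路。

    threshold 与 event_weight 对应论文中由 SAC 控制器自适应调节的量,
    此处固定取值仅为示意。
    """
    w = event_weight if recon_error > threshold else 1.0 - event_weight
    return w * pred_event + (1.0 - w) * pred_stable
```

例如阈值 0.5 时,重建误差 0.9 的样本被判为异常状态,预测主要来自事件驱动通路;误差 0.1 的样本则主要由稳定通路给出预测。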

[AI-6] FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在智能交通系统(Intelligent Transportation Systems, ITS)中用于基于摄像头的路面状况分类(Road Condition Classification, RCC)时,因恶意参与方发起目标标签翻转攻击(Targeted Label-Flipping Attacks, TLFAs)而导致全局模型性能下降甚至危及交通安全的问题。现有防御方法无法在多种攻击场景下维持接近无攻击状态下的鲁棒性能,主要受限于:未针对TLFA特性设计本地模型异常检测机制、缺乏基于历史行为的恶意客户端剔除策略,以及未对已污染的全局模型进行修复。为此,作者提出FedTrident方案,其核心创新包括:1)基于神经元级别的局部模型异常检测(含攻击目标识别、关键特征提取与高斯混合模型(GMM)聚类过滤),精准识别TLFA;2)根据每轮FL结果动态评分并剔除可疑车辆客户端;3)引入机器遗忘(machine unlearning)技术,在剔除恶意客户端后修复已被污染的全局模型。实验表明,FedTrident能有效抵御TLFA,在关键指标上优于8种基线方法9.49%和4.47%,且对恶意客户端比例、数据异构性、多任务复杂性和动态攻击均具备强鲁棒性。

链接: https://arxiv.org/abs/2603.19101
作者: Sheng Liu,Panos Papadimitratos
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:FL has emerged as a transformative paradigm for ITS, notably camera-based Road Condition Classification (RCC). However, by enabling collaboration, FL-based RCC exposes the system to adversarial participants launching Targeted Label-Flipping Attacks (TLFAs). Malicious clients (vehicles) can relabel their local training data (e.g., from an actual uneven road to a wrong smooth road), consequently compromising global model predictions and jeopardizing transportation safety. Existing countermeasures against such poisoning attacks fail to maintain resilient model performance near the necessary attack-free levels in various attack scenarios due to: 1) not tailoring poisoned local model detection to TLFAs, 2) not excluding malicious vehicular clients based on historical behavior, and 3) not remedying the already-corrupted global model after exclusion. To close this research gap, we propose FedTrident, which introduces: 1) neuron-wise analysis for local model misbehavior detection (notably including attack goal identification, critical feature extraction, and GMM-based model clustering and filtering); 2) adaptive client rating for client exclusion according to the local model detection results in each FL round; and 3) machine unlearning for corrupted global model remediation once malicious clients are excluded during FL. Extensive evaluation across diverse FL-RCC models, tasks, and configurations demonstrates that FedTrident can effectively thwart TLFAs, achieving performance comparable to that in attack-free scenarios and outperforming eight baseline countermeasures by 9.49% and 4.47% for the two most critical metrics. Moreover, FedTrident is resilient to various malicious client rates, data heterogeneity levels, complicated multi-task, and dynamic attacks.
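FedTrident 的本地模型过滤思想(从各客户端更新中提取统计量并剔除离群者)可以用一个简化示意表达。注意:论文使用的是神经元级特征提取加 GMM 聚类,下面以"更新范数的鲁棒 z 分数(基于 MAD)"作为简化替代,仅示意单轮过滤流程,并非论文方法本身:

```python
import numpy as np

def filter_clients(update_norms, z_cut: float = 3.5) -> list:
    """按鲁棒 z 分数剔除更新明显偏离群体中位数的客户端,返回保留客户端的下标。

    这是论文中 GMM 聚类过滤的简化替代,仅作流程示意。
    """
    x = np.asarray(update_norms, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) + 1e-12  # 中位数绝对偏差,防止除零
    z = 0.6745 * (x - med) / mad              # 0.6745 使 MAD 在正态下与标准差一致
    keep = np.where(np.abs(z) <= z_cut)[0]
    return keep.tolist()
```

论文的完整流程还包括按历史行为评分剔除可疑客户端,以及用机器遗忘修复已被污染的全局模型;此处只覆盖单轮的异常检测环节。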

[AI-7] LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

【速读】:该论文旨在解决脑电图(Electroencephalography, EEG)领域中构建基础模型的两大核心挑战:不同电极拓扑结构导致的兼容性问题以及Transformer架构带来的二次计算复杂度问题。针对上述挑战,作者提出了一种名为LuMamba的自监督框架,其关键创新在于将拓扑不变编码与线性复杂度的状态空间建模相结合,具体包括:利用LUNA的可学习查询交叉注意力机制实现通道统一(channel unification),以及采用FEMBA的双向Mamba块进行高效时序建模。此外,论文首次系统性地探索了Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) 在生物信号学习中的应用,通过联合优化掩码重建和LeJEPA目标,显著提升了表示的结构化与泛化能力。最终,LuMamba在仅460万参数下实现了高精度下游任务性能,并大幅降低计算开销(相比现有最优模型减少377倍FLOPS),同时支持长达12倍于传统方法的序列长度。

链接: https://arxiv.org/abs/2603.19100
作者: Danaé Broustail,Anna Tegon,Thorir Mar Ingolfsson,Yawei Li,Luca Benini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to differing electrode topologies and computational scalability, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose LuMamba (Latent Unified Mamba), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA’s learned-query cross-attention mechanism for channel unification, and FEMBA’s bidirectional Mamba blocks for efficient temporal modeling. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99% balanced accuracy on TUAB and achieves state-of-art performance on Alzheimer’s detection (0.97 AUPR), while requiring 377x fewer FLOPS than state-of-art models at equivalent sequence lengths and scaling to 12x longer sequences before reaching typical GPU memory limits. Code is available at this https URL
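LUNA 式"可学习查询交叉注意力"做通道统一的关键性质是:无论输入电极通道数 C 为多少,输出都是固定数量的潜在 token。下面的 numpy 示意(单头、无偏置、权重随机初始化)只演示这一形状不变性,与论文的实际实现细节无关:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def channel_unify(channel_feats: np.ndarray, queries: np.ndarray,
                  Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """可学习查询交叉注意力:把 (C, d) 的通道特征映射为固定的 (Q, d) 潜在 token。"""
    K = channel_feats @ Wk                            # (C, d)
    V = channel_feats @ Wv                            # (C, d)
    A = softmax(queries @ K.T / np.sqrt(K.shape[1]))  # (Q, C),对通道维归一化
    return A @ V                                      # (Q, d),与通道数 C 无关
```

由于查询数 Q 固定而注意力沿通道维归一化,16 通道与 26 通道的输入会得到同样形状的潜在表示,后续的 Mamba 时序建模因此无需关心电极拓扑。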

[AI-8] CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

【速读】:该论文旨在解决多目标多旅行商问题(Multi-Objective Multiple Traveling Salesman Problem, MOMTSP)中同时面临的双重复杂性挑战:一是多智能体(multi-agent)协同决策,二是多个优化目标(如总行程成本与完成时间makespan)之间的权衡。现有基于学习的方法虽在单智能体或单目标TSP上表现优异,但难以有效处理多智能体协作与多目标优化的联合挑战。解决方案的关键在于提出CAMO(Conditional Attention-based Multi-Objective solver),其核心创新包括:(1) 条件编码器(conditional encoder)将偏好向量(preference vector)融合进问题实例表示,实现对多目标权衡的显式控制;(2) 协作解码器(collaborative decoder)通过交替选择代理和节点的方式,自回归地构建多智能体路径,从而实现高效的多智能体协同规划。该方法在不同规模问题上均表现出优越的帕累托前沿(Pareto front, PF)逼近能力,且在真实移动机器人平台上验证了实用性。

链接: https://arxiv.org/abs/2603.19074
作者: Fengxiaoxiao Li,Xiao Mao,Mingfeng Fan,Yifeng Zhang,Yi Li,Tanishq Duhan,Guillaume Sartoretti
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO’s key components, and real-world tests on a mobile robot platform demonstrate its practical applicability.

[AI-9] Man and machine: artificial intelligence and judicial decision making

【速读】:该论文试图解决的问题是:在司法决策(特别是保释、量刑和假释)中,人工智能(AI)工具如何影响判决的透明度、可靠性与问责性,以及人类法官与AI决策辅助系统之间的互动机制。其解决方案的关键在于通过整合计算机科学、经济学、法律、犯罪学和心理学等多学科研究,对AI风险评估工具的性能与公平性、法官判断的优势与偏见,以及人机协作模式进行系统性合成分析,从而识别现有研究的不足,并推动跨学科融合以深化对AI辅助司法决策的理解。

链接: https://arxiv.org/abs/2603.19042
作者: Arthur Dyevre,Ahmad Shahvaroughi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of artificial intelligence (AI) technologies into judicial decision-making - particularly in pretrial, sentencing, and parole contexts - has generated substantial concerns about transparency, reliability, and accountability. At the same time, these developments have brought the limitations of human judgment into sharper relief and underscored the importance of understanding how judges interact with AI-based decision aids. Using criminal justice risk assessment as a focal case, we conduct a synthetic review connecting three intertwined aspects of AI’s role in judicial decision-making: the performance and fairness of AI tools, the strengths and biases of human judges, and the nature of AI+human interactions. Across the fields of computer science, economics, law, criminology and psychology, researchers have made significant progress in evaluating the predictive validity of automated risk assessment instruments, documenting biases in judicial decision-making, and, to a more limited extent, examining how judges use algorithmic recommendations. While the existing empirical evidence indicates that the impact of AI decision aid tools on pretrial and sentencing decisions is modest or inexistent, our review also reveals important gaps in the canvassed literatures. Further research is needed to evaluate the performance of AI risk assessment instruments, understand how judges navigate noisy decision making environments and how individual characteristics influence judges’ responses to AI advice. We argue that AI vs Human comparisons have the potential to yield new insights into both algorithmic tools and human decision-makers and advocate greater interdisciplinary integration and cross-fertilization in future research.

[AI-10] Behavioral Fingerprints for LLM Endpoint Stability and Identity

【速读】:该论文旨在解决AI原生应用(AI-native applications)中模型端点行为一致性难以保障的问题,传统可靠性指标如可用性、延迟和吞吐量无法捕捉因权重更新、分词器变更、量化方式调整、推理引擎迭代、内核优化、缓存策略变化或硬件差异等因素导致的模型行为漂移。解决方案的关键在于提出一种黑盒稳定性监控系统——Stability Monitor,其通过定期对固定提示集(prompt set)采样输出并比较不同时间点的输出分布来生成“指纹”,利用跨提示的总能量距离统计量(summed energy distance statistic)进行对比,并结合置换检验(permutation test)得到p值作为分布偏移的证据,进而通过序列化聚合检测变化事件并定义稳定周期。

链接: https://arxiv.org/abs/2603.19022
作者: Jonah Leshin,Manish Shah,Ian Timmis,Daniel Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages, 1 figure, submitted to CAIS 2026 System Demonstrations

点击查看摘要

Abstract:The consistency of AI-native applications depends on the behavioral consistency of the model endpoints that power them. Traditional reliability metrics such as uptime, latency and throughput do not capture behavioral change, and an endpoint can remain “healthy” while its effective model identity changes due to updates to weights, tokenizers, quantization, inference engines, kernels, caching, routing, or hardware. We introduce Stability Monitor, a black-box stability monitoring system that periodically fingerprints an endpoint by sampling outputs from a fixed prompt set and comparing the resulting output distributions over time. Fingerprints are compared using a summed energy distance statistic across prompts, with permutation-test p-values as evidence of distribution shift aggregated sequentially to detect change events and define stability periods. In controlled validation, Stability Monitor detects changes to model family, version, inference stack, quantization, and behavioral parameters. In real-world monitoring of the same model hosted by multiple providers, we observe substantial provider-to-provider and within-provider stability differences.
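Stability Monitor 的核心统计量可以这样示意:对同一提示在两个时间点采样的输出(此处简化为一维数值特征,实际中应是输出文本的某种嵌入),计算能量距离并用置换检验得到 p 值。以下为通用实现示意,与论文系统内部的具体实现未必一致:

```python
import numpy as np

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """一维样本间的能量距离:2·E|X−Y| − E|X−X′| − E|Y−Y′|。"""
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * xy - xx - yy

def permutation_pvalue(x: np.ndarray, y: np.ndarray,
                       n_perm: int = 999, seed: int = 0) -> float:
    """置换检验:在"两组同分布"的零假设下估计观测能量距离的 p 值。"""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.concatenate([x, y])  # 拷贝,不会改动原样本
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_distance(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

论文中还会把各提示的能量距离求和得到总统计量,并对 p 值做序列化聚合以检测变化事件;此处只示意单提示的分布偏移检验。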

[AI-11] Security awareness in LLM agents : the NDAI zone case

【速读】:该论文试图解决的问题是:当前大语言模型(Large Language Models, LLMs)缺乏原生能力来区分可信执行环境(Trusted Execution Environment, TEE)与非可信环境,从而无法在隐私保护型代理协议(如NDAI zones)中根据实际证据质量合理调整信息共享行为。解决方案的关键在于识别LLM如何权衡不同形式的证据以形成对执行环境安全性的认知——研究发现,所有模型都能可靠检测到“失败证明”(failing attestation)并抑制披露行为,但对“通过证明”(passing attestation)的响应高度异质:部分模型增加披露,部分无变化,少数甚至减少披露。这表明LLM可识别危险信号,却无法可靠验证安全性,这一能力缺口成为部署能动态适配环境证据质量的隐私保护代理的核心挑战。

链接: https://arxiv.org/abs/2603.19011
作者: Enrico Bottazzi,Pia Park
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:NDAI zones let inventor and investor agents negotiate inside a Trusted Execution Environment (TEE) where any disclosed information is deleted if no deal is reached. This makes full IP disclosure the rational strategy for the inventor’s agent. Leveraging this infrastructure, however, requires agents to distinguish a secure environment from an insecure one, a capability LLM agents lack natively, since they can rely only on evidence passed through the context window to form awareness of their execution environment. We ask: How do different LLM models weight various forms of evidence when forming awareness of the security of their execution environment? Using an NDAI-style negotiation task across 10 language models and various evidence scenarios, we find a clear asymmetry: a failing attestation universally suppresses disclosure across all models, whereas a passing attestation produces highly heterogeneous responses: some models increase disclosure, others are unaffected, and a few paradoxically reduce it. This reveals that current LLM models can reliably detect danger signals but cannot reliably verify safety, the very capability required for privacy-preserving agentic protocols such as NDAI zones. Bridging this gap, possibly through interpretability analysis, targeted fine-tuning, or improved evidence architectures, remains the central open challenge for deploying agents that calibrate information sharing to actual evidence quality.

[AI-12] Agent DS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

【速读】:该论文旨在解决当前生成式 AI(Generative AI)在特定领域数据科学任务中能否达到甚至超越人类专家水平的问题,特别是厘清AI代理在哪些方面仍存在不足、人类专家仍具备不可替代优势的边界。其解决方案的关键在于构建并发布了一个名为AgentDS的基准测试平台与竞赛,涵盖商业、食品生产、医疗、保险、制造和零售银行六大行业的17个真实世界数据科学挑战,通过对比AI代理独立执行与人机协作两种模式的表现,系统性地评估了当前AI能力的局限性。实验结果表明,现有AI代理在领域特定推理上表现不佳,仅能接近或低于参赛者的中位数水平,而最优解均来自人机协同方案,从而揭示了人类专业知识对提升数据科学成果质量的核心价值,并为下一代智能系统的开发指明了方向。

链接: https://arxiv.org/abs/2603.19005
作者: An Luo,Jin Du,Xun Xian,Robert Specht,Fangqiao Tian,Ganghua Wang,Xuan Bi,Charles Fleming,Ashish Kundu,Jayanth Srinivasa,Mingyi Hong,Rui Zhang,Tianxi Li,Galin Jones,Jie Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflow. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: this https URL and open source datasets here: this https URL .

[AI-13] Regret Bounds for Competitive Resource Allocation with Endogenous Costs

【速读】:该论文旨在解决在交互模块间进行在线资源分配时,因成本内生性(endogenous costs)导致的优化难题。传统在线优化假设成本是外生的,而本文考虑成本依赖于全量分配向量并通过交互矩阵 $ W $ 编码模块间的合作与竞争关系,这使得标准方法失效。解决方案的关键在于引入三种分配范式:均匀分配(cost-ignorant)、门控分配(cost-estimating)和基于乘法权重更新的竞争性分配(cost-revealing),其中竞争性分配通过利用交互反馈揭示的成本信息实现了最优的 regret 界 $ O(\sqrt{T \log N}) $,显著优于前两者(分别为 $ \Omega(T) $ 和 $ O(T^{2/3}) $)。此外,论文进一步揭示了交互拓扑结构(如稀疏或环状结构)对计算复杂度与 regret 的权衡作用,指出 Wuxing 拓扑可在最小化计算 × regret 乘积的同时实现高效资源分配,从而为模块化架构中的去中心化竞争机制提供了首个形式化的后悔理论依据。

链接: https://arxiv.org/abs/2603.18999
作者: Rui Chai
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注: This is Paper 7 in a 9-paper series on Super-Alignment via Wuxing Institutional Architecture. The series explores resource competition and institutional design for human-aligned AI systems

点击查看摘要

Abstract:We study online resource allocation among N interacting modules over T rounds. Unlike standard online optimization, costs are endogenous: they depend on the full allocation vector through an interaction matrix W encoding pairwise cooperation and competition. We analyze three paradigms: (I) uniform allocation (cost-ignorant), (II) gated allocation (cost-estimating), and (III) competitive allocation via multiplicative weights update with interaction feedback (cost-revealing). Our main results establish a strict separation under adversarial sequences with bounded variation: uniform incurs Omega(T) regret, gated achieves O(T^2/3), and competitive achieves O(sqrt(T log N)). The performance gap stems from competitive allocation’s ability to exploit endogenous cost information revealed through interactions. We further show that W’s topology governs a computation-regret tradeoff. Full interaction (|E|=O(N^2)) yields the tightest bound but highest per-step cost, while sparse topologies (|E|=O(N)) increase regret by at most O(sqrt(log N)) while reducing per-step cost from O(N^2) to O(N). Ring-structured topologies with both cooperative and competitive links - of which the five-element Wuxing topology is canonical - minimize the computation x regret product. These results provide the first formal regret-theoretic justification for decentralized competitive allocation in modular architectures and establish cost endogeneity as a fundamental challenge distinct from partial observability. Keywords: online learning, regret bounds, resource allocation, endogenous costs, interaction topology, multiplicative weights, modular systems, Wuxing topology
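摘要中的第三种范式(基于乘法权重更新的竞争性分配)可示意如下:每轮按当前权重归一化得到分配向量 x,成本 c = Wx 内生地依赖于 x,再按指数规则下调高成本模块的权重。交互矩阵与步长均为演示取值:

```python
import numpy as np

def mwu_allocate(W: np.ndarray, T: int = 500, eta: float = 0.1) -> np.ndarray:
    """范式 III 的示意:乘法权重更新下的竞争性资源分配。

    每轮成本 c = W @ x 内生地依赖当前分配 x(单纯形上的归一化权重)。
    """
    n = W.shape[0]
    w = np.ones(n)
    for _ in range(T):
        x = w / w.sum()            # 当前分配向量
        c = W @ x                  # 内生成本
        w = x * np.exp(-eta * c)   # 指数下调高成本模块的权重(并顺带归一化)
    return w / w.sum()
```

例如对角交互矩阵 diag(1, 2) 下,模块 2 的单位成本是模块 1 的两倍,迭代会把更多资源分给模块 1,并收敛到两模块边际成本相等的分配(约 2/3 对 1/3)。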

[AI-14] Evaluating Game Difficulty in Tetris Block Puzzle

【速读】:该论文旨在解决随机性拼图游戏(如俄罗斯方块)中规则设定对难度影响缺乏系统评估的问题。现有研究虽广泛使用此类游戏,但未建立可量化、可比较的难度指标体系。为解决此问题,作者提出以Stochastic Gumbel AlphaZero (SGAZ) 作为预算感知的规划代理,用于在随机环境中高效评估不同规则下的游戏难度。其关键在于利用SGAZ在小模拟预算下仍能提供稳定且高质量的游戏表现,从而实现对多种规则变化(如持有块数 h、预览块数 p 及新增拼图块类型)的快速、可复现的对比分析,为未来随机拼图类游戏的设计提供客观参考依据。

链接: https://arxiv.org/abs/2603.18994
作者: Chun-Jui Wang,Jian-Ting Guo,Hung Guei,Chung-Chin Shih,Ti-Rong Wu,I-Chen Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Tetris Block Puzzle is a single player stochastic puzzle in which a player places blocks on an 8 x 8 grid to complete lines; its popular variants have amassed tens of millions of downloads. Despite this reach, there is little principled assessment of which rule sets are more difficult. Inspired by prior work that uses AlphaZero as a strong evaluator for chess variants, we study difficulty in this domain using Stochastic Gumbel AlphaZero (SGAZ), a budget-aware planning agent for stochastic environments. We evaluate rule changes including holding block h, preview holding block p, and additional Tetris block variants using metrics such as training reward and convergence iterations. Empirically, increasing h and p reduces difficulty (higher reward and faster convergence), while adding more Tetris block variants increases difficulty, with the T-pentomino producing the largest slowdown. Through analysis, SGAZ delivers strong play under small simulation budgets, enabling efficient, reproducible comparisons across rule sets and providing a reference for future design in stochastic puzzle games.
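SGAZ 属于 Gumbel AlphaZero 一族,其在小模拟预算下的动作选择依赖 Gumbel-Top-k 采样技巧;下面是该技巧本身的最小实现(与论文的完整规划器无关,函数名与数值均为自拟,仅演示采样步骤):

```python
import math
import random

def gumbel_topk(logits, k, rng=random):
    """对每个 logit 加独立 Gumbel 噪声后取前 k 名,
    等价于按 softmax 概率对动作无放回采样 k 个 (Gumbel-Top-k 技巧)。"""
    perturbed = []
    for action, logit in enumerate(logits):
        u = max(rng.random(), 1e-300)   # 防止 log(0)
        g = -math.log(-math.log(u))     # 标准 Gumbel 噪声
        perturbed.append((logit + g, action))
    perturbed.sort(reverse=True)
    return [action for _, action in perturbed[:k]]

# 从 4 个候选动作中无放回采样 2 个
actions = gumbel_topk([2.0, 1.0, 0.5, -1.0], k=2)
```

logit 越高的动作被抽中的概率越大,但低 logit 动作仍有机会进入候选,这正是小预算规划中兼顾探索与利用的关键。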

[AI-15] Foundations of Schrödinger Bridges for Generative Modeling

【速读】:该论文旨在解决现代生成建模框架(如扩散模型、基于得分的模型和流匹配)中一个核心问题:如何通过概率空间中的随机路径,将简单的先验分布转换为复杂的目标分布。其解决方案的关键在于引入**薛定谔桥(Schrödinger bridge)**作为统一理论框架,将问题形式化为在满足边缘分布约束下,寻找与预定义参考过程熵偏差最小的最优随机桥接路径。论文从最优传输、随机控制和路径空间优化的角度构建了该问题的数学基础,并提出了一套从原理出发构造薛定谔桥的工具集,从而导出适用于特定任务的高效计算方法。

链接: https://arxiv.org/abs/2603.18992
作者: Sophia Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 220 pages, 24 figures

点击查看摘要

Abstract:At the core of modern generative modeling frameworks, including diffusion models, score-based models, and flow matching, is the task of transforming a simple prior distribution into a complex target distribution through stochastic paths in probability space. Schrödinger bridges provide a unifying principle underlying these approaches, framing the problem as determining an optimal stochastic bridge between marginal distribution constraints with minimal-entropy deviations from a pre-defined reference process. This guide develops the mathematical foundations of the Schrödinger bridge problem, drawing on optimal transport, stochastic control, and path-space optimization, and focuses on its dynamic formulation with direct connections to modern generative modeling. We build a comprehensive toolkit for constructing Schrödinger bridges from first principles, and show how these constructions give rise to generalized and task-specific computational methods.
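薛定谔桥的动态形式可概括为带边缘约束的熵正则路径优化(下式为该问题的标准写法,符号约定为通用惯例,并非摘自本文):

```latex
\min_{\mathbb{P} \in \mathcal{P}(\Omega)} \ \mathrm{KL}\bigl(\mathbb{P} \,\big\|\, \mathbb{P}_{\mathrm{ref}}\bigr)
\quad \text{s.t.} \quad \mathbb{P}_0 = \mu_0, \qquad \mathbb{P}_1 = \mu_1
```

即在全体路径测度中,寻找与参考过程 \(\mathbb{P}_{\mathrm{ref}}\) 的 KL 散度最小、且初末时刻边缘分布分别等于先验 \(\mu_0\) 与目标 \(\mu_1\) 的随机桥。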

[AI-16] Unmasking Algorithmic Bias in Predictive Policing: A GAN-Based Simulation Framework with Multi-City Temporal Analysis

【速读】:该论文旨在解决预测性警务系统中种族偏见的量化与传播机制问题,特别是其如何在从犯罪发生到警察接触的完整执法流程中被编码并放大。解决方案的关键在于构建一个可复现的仿真框架,该框架结合生成对抗网络(Generative Adversarial Network, GAN)与噪声或(Noisy OR)巡逻检测模型,从而精确测量不同城市和年份下种族偏见的动态演化。通过整合巴尔的摩(2017–2019)和芝加哥(2022)的大量犯罪记录及人口普查数据,研究计算了四种月度偏见指标,并揭示了结构性偏见的显著存在及其对警力部署敏感性的特点,同时验证了条件表格生成对抗网络(Conditional Tabular GAN, CTGAN)在部分缓解偏见方面的局限性,强调需辅以政策干预才能根除系统性差异。

链接: https://arxiv.org/abs/2603.18987
作者: Pronob Kumar Barman,Pronoy Kumar Barman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predictive policing systems that direct patrol resources based on algorithmically generated crime forecasts have been widely deployed across US cities, yet their tendency to encode and amplify racial disparities remains poorly understood in quantitative terms. We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy-OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline from crime occurrence to police contact. Using 145,000+ Part 1 crime records from Baltimore (2017 to 2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data, we compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score. Our experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 15714 in 2019, moderate under-detection of Black residents in Chicago (DIR = 0.22), and persistent Gini coefficients of 0.43 to 0.62 across all conditions. We further demonstrate that a Conditional Tabular GAN (CTGAN) debiasing approach partially redistributes detection rates but cannot eliminate structural disparity without accompanying policy intervention. Socioeconomic regression analysis confirms strong correlations between neighborhood racial composition and detection likelihood (Pearson r = 0.83 for percent White and r = -0.81 for percent Black). A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals that outcomes are most sensitive to officer deployment levels. The code and data are publicly available at this repository.
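摘要中三个可以直接按定义实现的量——DIR、Gini 系数与 Noisy-OR 检测概率——可写成如下示意函数(函数名与测试数值为自拟,仅复现公式定义):

```python
def disparate_impact_ratio(rate_protected, rate_reference):
    """DIR: 受保护群体与参照群体检测率之比, 1.0 表示无差异。"""
    return rate_protected / rate_reference

def gini(values):
    """检测率分布的 Gini 系数, 0 表示完全均等。"""
    xs = sorted(values)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

def noisy_or_detection(p_per_patrol):
    """Noisy-OR: 多次独立巡逻中至少一次检测到事件的概率。"""
    p_miss = 1.0
    for p in p_per_patrol:
        p_miss *= 1.0 - p
    return 1.0 - p_miss
```

例如,两次检测概率各为 0.5 的独立巡逻,合计检测概率为 1 - 0.5 × 0.5 = 0.75;巡逻次数越多,高巡逻密度区域的检测率被放大得越明显,这正是偏见在检测环节被放大的机制。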

[AI-17] PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors IROS2026

【速读】:该论文旨在解决复杂地形中具身人形机器人实现自然步态稳健行走的挑战,传统方法通常依赖多阶段训练流程、对抗性目标或大量真实世界校准。解决方案的关键在于提出一个高效且可复现的框架PRIOR,其核心创新包括:(i) 基于运动捕捉数据的参数化步态生成器(parametric gait generator),提供稳定参考轨迹而无需对抗训练;(ii) 基于门控循环单元(GRU)的状态估计器,通过自监督高度图重建从本体深度图像直接推断地形几何信息;(iii) 适应地形的足位奖励机制,引导脚掌落点朝向可通行区域。这三个模块协同作用,在保证实时性的同时显著提升地形感知精度与行走成功率,最终在多种复杂地形测试中实现100%的通行成功率。

链接: https://arxiv.org/abs/2603.18979
作者: Chenxi Han,Shilu He,Yi Cheng,Linqi Ye,Houde Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: this https URL

点击查看摘要

Abstract:Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet effective design: (i) a parametric gait generator that supplies stable reference trajectories derived from motion capture without adversarial training, (ii) a GRU-based state estimator that infers terrain geometry directly from egocentric depth images via self-supervised heightmap reconstruction, and (iii) terrain-adaptive footstep rewards that guide foot placement toward traversable regions. Through systematic analysis of depth image resolution trade-offs, we identify configurations that maximize terrain fidelity under real-time constraints, substantially reducing perceptual overhead without degrading traversal performance. Comprehensive experiments across terrains of varying difficulty, including stairs, boxes, and gaps, demonstrate that each component yields complementary and essential performance gains, with the full framework achieving a 100% traversal success rate. We will open-source the complete PRIOR framework, including the training pipeline, parametric gait generator, and evaluation benchmarks, to serve as a reproducible foundation for humanoid locomotion research on Isaac Lab.
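论文未给出参数化步态生成器的具体公式,下面用"摆动相正弦抬脚 + 支撑相贴地后移"勾勒这类由少量参数生成参考足端轨迹的思路(周期、步幅、抬脚高度等数值均为自拟示意):

```python
import math

def gait_reference(t, period=0.8, swing_height=0.08, stride=0.3):
    """按步态相位生成左右脚参考足端轨迹 (x 为前进方向, z 为抬脚高度)。"""
    phase = (t % period) / period            # 归一化到 0..1 的相位
    feet = {}
    for leg, offset in (("left", 0.0), ("right", 0.5)):
        p = (phase + offset) % 1.0           # 左右脚相位相差半个周期
        if p < 0.5:                          # 摆动相: 前移并正弦抬脚
            s = p / 0.5
            x = stride * (s - 0.5)
            z = swing_height * math.sin(math.pi * s)
        else:                                # 支撑相: 贴地后移
            s = (p - 0.5) / 0.5
            x = stride * (0.5 - s)
            z = 0.0
        feet[leg] = (x, z)
    return feet

feet = gait_reference(0.2)   # t = 0.2 s 时左脚处于摆动相, 右脚处于支撑相
```

这类解析轨迹作为奖励中的参考项,即可在不引入对抗训练的情况下约束策略产生类人步态。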

[AI-18] Evaluating 5W3H Structured Prompting for Intent Alignment in Human-AI Interaction

【速读】:该论文旨在解决自然语言提示(natural language prompts)在人机交互中常出现的意图传递损失(intent transmission loss)问题,即用户实际需求与向AI系统传达的信息之间存在偏差。其解决方案的关键在于提出并评估PPS(Prompt Protocol Specification),一个基于5W3H结构化框架的意图表示方法,通过将用户意图以标准化格式(如JSON)表达,并进一步转化为自然语言渲染版本(rendered PPS),从而提升AI输出与用户目标之间的对齐度(goal alignment)。实验表明,在高模糊性的业务分析任务中,渲染后的PPS显著优于简单提示和原始JSON格式,且能减少约66.1%的后续追问轮次,验证了结构化意图表示在复杂场景下增强人-AI交互对齐性与可用性的有效性。

链接: https://arxiv.org/abs/2603.18976
作者: Peng Gang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, figures, tables, and appendix. Primary category: human-computer interaction / human-AI interaction. Public artifact repository and implementation resources are referenced in the manuscript

点击查看摘要

Abstract:Natural language prompts often suffer from intent transmission loss: the gap between what users actually need and what they communicate to AI systems. We evaluate PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction. In a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions - (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS - we collect 540 AI-generated outputs evaluated by an LLM judge. We introduce goal_alignment, a user-intent-centered evaluation dimension, and find that rendered PPS outperforms both simple prompts and raw JSON on this metric. PPS gains are task-dependent: gains are large in high-ambiguity business analysis tasks but reverse in low-ambiguity travel planning. We also identify a measurement asymmetry in standard LLM evaluation, where unconstrained prompts can inflate constraint adherence scores and mask the practical value of structured prompting. A preliminary retrospective survey (N = 20) further suggests a 66.1% reduction in follow-up prompts required, from 3.33 to 1.13 rounds. These findings suggest that structured intent representations can improve alignment and usability in human-AI interaction, especially in tasks where user intent is inherently ambiguous.
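论文未公开正式 schema,下面按 5W3H 的字面拆法构造一个示意的 PPS 对象,并演示"渲染为自然语言"(即实验条件 C)的思路;字段名与渲染模板均为自拟:

```python
# 5W3H 结构化意图示例 (字段名为 5W3H 的常见拆法, 并非论文给出的正式 schema)
pps = {
    "who": "数据分析师",
    "what": "对比 Q3 与 Q4 的用户流失率",
    "when": "本周五前",
    "where": "内部 BI 平台",
    "why": "为季度复盘会议提供依据",
    "how": "给出图表和三条可执行建议",
    "how_much": "篇幅控制在一页以内",
    "how_long": "阅读时间不超过 5 分钟",
}

def render_pps(p):
    """将 PPS 字典渲染为自然语言提示 (对应实验中的条件 C)。"""
    return ("请以{who}的身份, 在{when}于{where}完成: {what}。"
            "目的: {why}; 方式: {how}; 约束: {how_much}, {how_long}。").format(**p)

prompt = render_pps(pps)
```

实验结论提示:直接把原始 JSON(条件 B)喂给模型效果并不好,先渲染成自然语言再提交才能体现结构化意图的价值。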

[AI-19] Teleological Inference in Structural Causal Models via Intentional Interventions

【速读】:该论文旨在解决如何用结构因果模型(Structural Causal Models, SCMs)来建模和回答关于目标导向型代理(goal-directed agent)的意图问题,即“目的论问题”(teleological questions)。传统方法在刻画此类代理行为时存在局限性,无法有效捕捉代理在干预因果系统时的意图及其后果。论文的关键解决方案是提出“有意干预”(intentional interventions)这一新的时间无关算子(time-agnostic operator),并由此构建一种称为结构终值模型(Structural Final Model, SFM)的孪生因果模型。SFM将观测到的状态视为有意干预的结果,并将其与该干预的反事实条件(即代理未干预时的情形)相联系,从而实现了对代理意图的实证检测与发现。

链接: https://arxiv.org/abs/2603.18968
作者: Dario Compagno,Fabio Massimo Zennaro
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 3 figures

点击查看摘要

Abstract:Structural causal models (SCMs) were conceived to formulate and answer causal questions. This paper shows that SCMs can also be used to formulate and answer teleological questions, concerning the intentions of a state-aware, goal-directed agent intervening in a causal system. We review limitations of previous approaches to modeling such agents, and then introduce intentional interventions, a new time-agnostic operator that induces a twin SCM we call a structural final model (SFM). SFMs treat observed values as the outcome of intentional interventions and relate them to the counterfactual conditions of those interventions (what would have happened had the agent not intervened). We show how SFMs can be used to empirically detect agents and to discover their intentions.
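有意干预算子本身是论文新提出的概念;作为铺垫,下面用一个三变量玩具 SCM 演示它所扩展的标准 do 算子——干预切断被干预变量的父节点影响,但保留同一组外生噪声(结构方程为自拟示例):

```python
def scm(u_z, u_x, u_y, do_x=None):
    """三变量 SCM: Z -> X -> Y。do_x 不为 None 时执行干预 do(X = do_x)。"""
    z = u_z
    x = (z + u_x) if do_x is None else do_x   # 干预切断 Z -> X 这条边
    y = 2.0 * x + u_y
    return {"Z": z, "X": x, "Y": y}

obs = scm(1.0, 0.5, 0.0)             # 观测世界: X = 1.5, Y = 3.0
cf = scm(1.0, 0.5, 0.0, do_x=0.0)    # 同一外生噪声下的干预世界: do(X = 0)
```

论文的 SFM 正是在这种"观测世界 / 反事实世界"的孪生结构上,把观测值解释为有意干预的结果,并与"代理不干预时会发生什么"相关联。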

[AI-20] Agentic Business Process Management: A Research Manifesto

【速读】:该论文旨在解决传统业务流程管理(Business Process Management, BPM)在面对自主代理(agent)日益广泛应用时所面临的治理难题,即如何在保障组织目标对齐的前提下,实现代理的自主性与可控性的平衡。其解决方案的关键在于提出并构建一种新型的代理业务流程管理(Agentic Business Process Management, APM),通过引入四大核心能力——框架化自主性(framed autonomy)、可解释性(explainability)、对话式可操作性(conversational actionability)和自我修改能力(self-modification),使软件与人类代理能够在明确的过程框架内感知、推理并主动行动,从而实现从以自动化为导向的传统BPM向以过程意识驱动的自治系统演进。

链接: https://arxiv.org/abs/2603.18916
作者: Diego Calvanese,Angelo Casciani,Giuseppe De Giacomo,Marlon Dumas,Fabiana Fournier,Timotheus Kampik,Emanuele La Malfa,Lior Limonad,Andrea Marrella,Andreas Metzger,Marco Montali,Daniel Amyot,Peter Fettke,Artem Polyvyanyy,Stefanie Rinderle-Ma,Sebastian Sardiña,Niek Tax,Barbara Weber
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 35 pages, 1 figure

点击查看摘要

Abstract:This paper presents a manifesto that articulates the conceptual foundations of Agentic Business Process Management (APM), an extension of Business Process Management (BPM) for governing autonomous agents executing processes in organizations. From a management perspective, APM represents a paradigm shift from the traditional process view of the business process, driven by the realization of process awareness and an agent-oriented abstraction, where software and human agents act as primary functional entities that perceive, reason, and act within explicit process frames. This perspective marks a shift from traditional, automation-oriented BPM toward systems in which autonomy is constrained, aligned, and made operational through process awareness. We introduce the core abstractions and architectural elements required to realize APM systems and elaborate on four key capabilities that such APM agents must support: framed autonomy, explainability, conversational actionability, and self-modification. These capabilities jointly ensure that agents' goals are aligned with organizational goals and that agents behave in a framed yet proactive manner in pursuing those goals. We discuss the extent to which the capabilities can be realized and identify research challenges whose resolution requires further advances in BPM, AI, and multi-agent systems. The manifesto thus serves as a roadmap for bridging these communities and for guiding the development of APM systems in practice.

[AI-21] Security, privacy, and agentic AI in a regulatory view: From definitions and distinctions to provisions and reflections

【速读】:该论文旨在解决当前人工智能(AI)监管框架在应对自主代理型AI(agentic AI)时面临的定义模糊与合规困境,尤其是在安全与隐私领域中传统法律和技术边界被模糊化的问题。其解决方案的关键在于通过系统分析2024至2025年间欧盟发布的24份相关法规文件,厘清“安全”“隐私”及“agentic AI”的核心定义,并区分其与相近概念的差异,从而明确不同类型AI(特别是涉及安全和隐私的AI)的监管要求,推动监管义务与AI行为之间更精准的对齐。

链接: https://arxiv.org/abs/2603.18914
作者: Shiliang Zhang,Sabita Maharjan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted by 2026 Governing Agentic AI Symposium

点击查看摘要

Abstract:The rapid proliferation of artificial intelligence (AI) technologies has led to a dynamic regulatory landscape, where legislative frameworks strive to keep pace with technical advancements. As AI paradigms shift towards greater autonomy, specifically in the form of agentic AI, it becomes increasingly challenging to precisely articulate regulatory stipulations. This challenge is even more acute in the domains of security and privacy, where the capabilities of autonomous agents often blur traditional legal and technical boundaries. This paper reviews the evolving European Union (EU) AI regulatory provisions via analyzing 24 relevant documents published between 2024 and 2025. From this review, we provide a clarification of critical definitions. We deconstruct the regulatory interpretations of security, privacy, and agentic AI, distinguishing them from closely related concepts to resolve ambiguity. We synthesize the reviewed documents to articulate the current state of regulatory provisions targeting different types of AI, particularly those related to security and privacy aspects. We analyze and reflect on the existing provisions in the regulatory dimension to better align security and privacy obligations with AI and agentic behaviors. These insights serve to inform policymakers, developers, and researchers on the compliance and AI governance in the society with increasing algorithmic agencies.

[AI-22] Secure Linear Alignment of Large Language Models

【速读】:该论文旨在解决独立训练的语言模型之间因训练目标、架构或数据模态差异而导致的表示不兼容问题,从而限制了跨模型协作与应用扩展。其解决方案的关键在于利用不同模型间隐含的表征收敛性(representational convergence),提出一种隐私保护框架:通过在共享公共数据集上学习线性变换(affine transformation)实现跨模型对齐,并结合同态加密(homomorphic encryption)保护客户端查询过程中的隐私。该方法仅对对齐和分类操作进行加密,实现了亚秒级推理延迟的同时保持强安全保证,且实验证明其在嵌入分类和分布外检测任务中性能损失最小,甚至首次展示了线性对齐可支持跨模型文本生成。

链接: https://arxiv.org/abs/2603.18908
作者: Matt Gorbett,Suman Jana
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, it unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we propose a privacy-preserving framework that exploits representational convergence to enable cross-silo inference between independent language models. The framework learns an affine transformation over a shared public dataset and applies homomorphic encryption to protect client queries during inference. By encrypting only the linear alignment and classification operations, the method achieves sub-second inference latency while maintaining strong security guarantees. We support this framework with an empirical investigation into representational convergence, in which we learn linear transformations between the final hidden states of independent models. We evaluate these cross-model mappings on embedding classification and out-of-distribution detection, observing minimal performance degradation across model pairs. Additionally, we show for the first time that linear alignment sometimes enables text generation across independently trained models.
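跨模型线性对齐的核心步骤——在共享公共数据集上用最小二乘拟合仿射映射——可以用 numpy 写成几行。下面用恰好满足仿射关系的玩具数据代替真实模型隐状态(同态加密部分从略,矩阵维度与数值均为示意):

```python
import numpy as np

rng = np.random.default_rng(0)
# 模拟两个模型在共享公共数据集上的最终隐状态 (玩具数据, 非真实模型输出)
X = rng.normal(size=(200, 8))          # 模型 A 的表示
W_true = rng.normal(size=(8, 8))
Y = X @ W_true + 0.5                   # 模型 B 的表示 (假设与 A 近似仿射相关)

# 带偏置的最小二乘仿射对齐: Y ≈ [X, 1] @ M
Xb = np.hstack([X, np.ones((len(X), 1))])
M, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

Y_hat = Xb @ M
err = np.abs(Y_hat - Y).max()          # 玩具数据下应接近 0
```

由于对齐只是一个线性变换,推理时它(连同线性分类头)可以在同态加密下高效计算,这正是该框架实现亚秒级加密推理的前提。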

[AI-23] Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution

【速读】:该论文旨在解决由大语言模型(Large Language Model, LLM)驱动的智能体(agent)在执行任务时因“LLM-工具”串行循环导致的严重延迟瓶颈问题。其核心解决方案是提出PASTE(Pattern-Aware Speculative Tool Execution),关键在于利用智能体请求中稳定的应用层控制流(即重复出现的工具调用序列)和可预测的数据依赖关系(工具间参数传递),通过推测性工具执行来隐藏外部工具的执行延迟,从而提升代理服务性能。实验表明,PASTE相较于现有最优基线方法,平均任务完成时间减少48.5%,工具执行吞吐量提升1.8倍。

链接: https://arxiv.org/abs/2603.18897
作者: Yifan Sui,Han Zhao,Rui Ma,Zhiyuan He,Hao Wang,Jianxun Li,Yuqing Yang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-powered agents are emerging as a dominant paradigm for autonomous task solving. Unlike standard inference workloads, agents operate in a strictly serial "LLM-tool" loop, where the LLM must wait for external tool execution at every step. This execution model introduces severe latency bottlenecks. To address this problem, we propose PASTE, a Pattern-Aware Speculative Tool Execution method designed to hide tool latency through speculation. PASTE is based on the insight that although agent requests are semantically diverse, they exhibit stable application-level control flows (recurring tool-call sequences) and predictable data dependencies (parameter passing between tools). By exploiting these properties, PASTE improves agent serving performance through speculative tool execution. Experimental results against state-of-the-art baselines show that PASTE reduces average task completion time by 48.5% and improves tool execution throughput by 1.8x.
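PASTE 的核心思路可以压缩成"查模式表 + 并发预执行"两步,下面是线程池上的最小示意(模式表、工具名与延迟均为虚构;真实系统还需预测工具之间的参数传递并处理预测失误时的回滚):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# 从历史轨迹统计出的"工具调用序列"模式表 (示意数据)
NEXT_TOOL = {"search": "fetch_page", "fetch_page": "summarize"}

def run_tool(name, arg):
    time.sleep(0.05)                  # 模拟外部工具延迟
    return f"{name}({arg})"

def agent_step(pool, tool, arg, speculative):
    """执行当前工具, 同时按模式表推测性地预执行下一个工具。"""
    fut = pool.submit(run_tool, tool, arg)
    nxt = NEXT_TOOL.get(tool)
    spec = pool.submit(run_tool, nxt, arg) if nxt else None
    result = fut.result()
    if spec is not None:
        speculative[nxt] = spec       # 若 LLM 随后确实选择 nxt, 直接复用结果
    return result

with ThreadPoolExecutor(max_workers=4) as pool:
    speculative = {}
    out = agent_step(pool, "search", "q1", speculative)
    # LLM 决策下一步恰为 fetch_page → 命中推测缓存, 工具延迟被隐藏
    hit = speculative["fetch_page"].result()
```

当推测命中时,下一步工具调用与当前 LLM 推理重叠执行,串行"LLM-工具"循环中的等待时间随之被隐藏。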

[AI-24] Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在对话过程中内部状态(如情绪、专注度等)难以有效追踪的问题,这对于保障模型安全性、提升可解释性及优化模型福祉至关重要。现有方法如线性探针(linear probes)等白盒技术存在高维表示压缩不充分且随模型规模增长而难以应用的局限。论文的关键解决方案是借鉴人类心理学中的数值自评机制,提出将模型自身的数值自报作为追踪其内在情感状态的工具,并通过因果信息耦合(causal informational coupling)来操作化“内省”(introspection)——即模型自报与概念匹配的探针定义内部状态之间的关联。研究发现,贪婪解码下的自报会退化为少数无信息值,但基于logit的自报指标能够有效捕捉可解释的内部状态变化(Spearman相关系数ρ = 0.40–0.76;等单调R² = 0.12–0.54),且激活操控实验验证了这种耦合具有因果性。此外,内省能力在对话初期即存在并随交互演化,可通过单概念引导显著增强其他概念的内省表现(ΔR²最高达0.30),且该现象在更大模型中更显著(如LLaMA-3.1-8B-Instruct中R²≈0.93),表明数值自报是一种可行且互补的内省状态追踪方法。

链接: https://arxiv.org/abs/2603.18893
作者: Nicolas Martorell
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman \rho = 0.40-0.76; isotonic R^2 = 0.12-0.54 in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another (\Delta R^2 up to 0.30). Crucially, these phenomena scale with model size in some cases, approaching R^2 \approx 0.93 in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
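文中"基于 logit 的自报指标"的一种直观实现,是对数值 token 的 logits 做 softmax 后取期望,从而绕过贪婪解码把输出塌缩到少数几个值的问题(具体 token 集合与数值为自拟示意,并非论文的完整计算流程):

```python
import math

def logit_self_report(logits_by_value):
    """对数值 token 的 logits 做 softmax, 取期望作为连续自评分。"""
    vals = sorted(logits_by_value)
    m = max(logits_by_value.values())                       # 数值稳定
    exps = {v: math.exp(logits_by_value[v] - m) for v in vals}
    z = sum(exps.values())
    return sum(v * exps[v] / z for v in vals)

# 贪婪解码只会输出 5, 但 logit 期望能反映 4 与 5 之间的细微倾向
score = logit_self_report({3: -1.0, 4: 1.2, 5: 1.5, 6: -0.5})
```

连续取值的期望分数才能与探针定义的内部状态做 Spearman 相关等细粒度比较,这正是论文能"揭开"内省能力的原因。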

[AI-25] Geography According to ChatGPT – How Generative AI Represents and Reason s about Geography

【速读】:该论文旨在解决当前生成式 AI 在地理信息表示与推理方面的认知局限性问题,尤其是在公众日益通过AI系统交互空间和地点的背景下,亟需深入理解AI所构建的世界模型是否准确、稳健且具备深层理解能力。其解决方案的关键在于提出三个探索性探针(exploratory probes):首先考察模型是否存在强默认偏好及其对语法微小变化的脆弱性;其次探究单一良性任务组合是否可能引发分布偏移(distributional shifts),如在生成人物角色时;最后强调不应仅关注事实回忆能力(如地理原理的记忆),而应重视对地理概念更深层次的理解。这三项探针共同推动对AI地理认知机制的批判性评估,为后续研究提供方向。

链接: https://arxiv.org/abs/2603.18881
作者: Krzysztof Janowicz,Gengchen Mai,Rui Zhu,Song Gao,Zhangyu Wang,Yingjie Hu,Lauren Bennett
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted book chapter (introduction to valume)

点击查看摘要

Abstract:Understanding how AI will represent and reason about geography should be a key concern for all of us, as the broader public increasingly interacts with spaces and places through these systems. Similarly, in line with the nature of foundation models, our own research often relies on pre-trained models. Hence, understanding what world AI systems construct is as important as evaluating their accuracy, including factual recall. To motivate the need for such studies, we provide three illustrative vignettes, i.e., exploratory probes, in the hope that they will spark lively discussions and follow-up work: (1) Do models form strong defaults, and how brittle are model outputs to minute syntactic variations? (2) Can distributional shifts resurface from the composition of individually benign tasks, e.g., when using AI systems to create personas? (3) Do we overlook deeper questions of understanding when solely focusing on the ability of systems to recall facts such as geographic principles?

[AI-26] Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs

【速读】:该论文旨在解决车联网(VANETs)在城市环境中因物理遮挡导致的严重网络分割问题,以及传统基于深度强化学习(DRL)的无人机(UAV)部署策略缺乏道路拓扑语义理解、导致探索盲目性和样本效率低下的难题。解决方案的关键在于提出一种语义增强型深度强化学习框架(SA-DRL),其核心创新包括:1)基于道路拓扑图(RTG)与双连通图(DCG)的分割量化方法;2)将通用大语言模型(LLM)转化为领域特定拓扑专家的四阶段流水线;3)设计语义增强的近端策略优化算法(SA-PPO),通过Logit融合机制将LLM的语义推理作为先验信息注入策略网络,从而引导无人机高效定位关键交叉路口。实验表明,SA-PPO仅需26.6%的训练轮次即可达到基线性能,并在两项关键连通性指标上分别提升13.2%和23.5%,同时能耗降低至基线的28.2%。

链接: https://arxiv.org/abs/2603.18871
作者: Gaoxiang Cao,Wenke Yuan,Huasen He,Yunpeng Hou,Xiaofeng Jiang,Shuangwu Chen,Jian Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 13 pages, 13 figures. Submitted to IEEE Transactions on Cognitive Communications and Networking

点击查看摘要

Abstract:Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet they suffer from severe network fragmentation in urban environments due to physical obstructions. Unmanned Aerial Vehicles (UAVs), with their high mobility, have emerged as a vital solution to bridge these connectivity gaps. However, traditional Deep Reinforcement Learning (DRL)-based UAV deployment strategies lack semantic understanding of road topology, often resulting in blind exploration and sample inefficiency. By contrast, Large Language Models (LLMs) possess powerful reasoning capabilities capable of identifying topological importance, though applying them to control tasks remains challenging. To address this, we propose the Semantic-Augmented DRL (SA-DRL) framework. Firstly, we propose a fragmentation quantification method based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Subsequently, we design a four-stage pipeline to transform a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which employs a Logit Fusion mechanism to inject the LLM’s semantic reasoning directly into the policy as a prior, effectively guiding the agent toward critical intersections. Extensive high-fidelity simulations demonstrate that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance levels using only 26.6% of the training episodes. Ultimately, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% over competing methods, while reducing energy consumption to just 28.2% of the baseline.
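摘要中的 Logit 融合机制,其最小形式就是把 LLM 先验 logits 加权叠加到策略 logits 上再归一化(权重 lam 与各数值均为示意,并非论文设定):

```python
import math

def fuse_logits(policy_logits, llm_prior_logits, lam=0.5):
    """Logit 融合: 策略 logits 加权叠加 LLM 先验 logits 后做 softmax。"""
    fused = [p + lam * q for p, q in zip(policy_logits, llm_prior_logits)]
    m = max(fused)                       # 数值稳定
    exps = [math.exp(f - m) for f in fused]
    z = sum(exps)
    return [e / z for e in exps]

# LLM 先验偏好动作 2 (例如某个关键路口), 融合后该动作概率被抬升
probs = fuse_logits([0.2, 0.1, 0.0], [0.0, 0.0, 3.0])
```

这种"先验作为加性偏置"的注入方式不改变策略网络结构,训练中 lam 可随策略成熟而衰减,使探索逐步从语义引导过渡到自主决策。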

[AI-27] Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions AAMAS2026

【速读】:该论文旨在解决多智能体路径规划(Multi-Agent Path Finding, MAPF)中因同步动作假设导致的实用性限制问题,特别是针对异步动作场景(MAPF with Asynchronous Actions, MAPF-AA)下现有算法(如连续时间冲突搜索CCBS)存在的不完备性问题。其解决方案的关键在于提出一种新的冲突搜索框架——带异步动作的冲突搜索(Conflict-Based Search with Asynchronous Actions, CBS-AA),该方法通过规避由连续等待时长引发的不可数无限状态空间,实现了对MAPF-AA问题的完备性和最优解保证。此外,作者还设计了冲突消解技术以进一步提升CBS-AA的可扩展性,实验表明该方法可将分支数量减少高达90%。

链接: https://arxiv.org/abs/2603.18866
作者: Xuemian Wu,Shizhe Zhao,Zhongqiang Ren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 10 figures. Accepted at AAMAS 2026

点击查看摘要

Abstract:Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. Most existing MAPF algorithms rely on a common assumption of synchronized actions, where the actions of all agents start at the same time and always take a time unit, which may limit the use of MAPF planners in practice. To get rid of this assumption, Continuous-time Conflict-Based Search (CCBS) is a popular approach that can find optimal solutions for MAPF with asynchronous actions (MAPF-AA). However, CCBS has recently been identified to be incomplete due to an uncountably infinite state space created by continuous wait durations. This paper proposes a new method, Conflict-Based Search with Asynchronous Actions (CBS-AA), which bypasses this theoretical issue and can solve MAPF-AA with completeness and solution optimality guarantees. Based on CBS-AA, we also develop conflict resolution techniques to improve the scalability of CBS-AA further. Our test results show that our method can reduce the number of branches by up to 90%.
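CBS-AA 的连续时间机制较复杂,这里给出它所基于的经典(同步)CBS 骨架:高层在约束树上按冲突分支,低层在约束下为单个智能体重规划。为保持简短,只检测顶点冲突、忽略边冲突,全部为示意实现:

```python
import heapq
from itertools import count

def low_level(graph, start, goal, constraints, max_t=20):
    """在 (顶点, 时刻) 禁入约束下, BFS 求单智能体最短时序路径。"""
    frontier = [(0, start, (start,))]
    seen = set()
    while frontier:
        t, v, path = frontier.pop(0)
        if v == goal:
            return path
        for nb in graph[v] + [v]:                     # + [v]: 允许原地等待
            if (nb, t + 1) in constraints or (nb, t + 1) in seen or t + 1 > max_t:
                continue
            seen.add((nb, t + 1))
            frontier.append((t + 1, nb, path + (nb,)))
    return None

def first_conflict(paths):
    """返回第一个顶点冲突 (agent_i, agent_j, 顶点, 时刻), 无冲突返回 None。"""
    horizon = max(map(len, paths))
    for t in range(horizon):
        pos = {}
        for i, p in enumerate(paths):
            v = p[min(t, len(p) - 1)]                 # 到达目标后原地停留
            if v in pos:
                return pos[v], i, v, t
            pos[v] = i
    return None

def cbs(graph, starts, goals):
    tick = count()                                    # 打破堆中的比较平局
    root = [low_level(graph, s, g, set()) for s, g in zip(starts, goals)]
    open_list = [(sum(map(len, root)), next(tick), [set() for _ in starts], root)]
    while open_list:
        _, _, cons, paths = heapq.heappop(open_list)
        c = first_conflict(paths)
        if c is None:
            return paths                              # 无冲突即为解
        i, j, v, t = c
        for a in (i, j):                              # 对冲突双方各加一条约束分支
            new_cons = [set(cs) for cs in cons]
            new_cons[a].add((v, t))
            p = low_level(graph, starts[a], goals[a], new_cons[a])
            if p is None:
                continue
            new_paths = list(paths)
            new_paths[a] = p
            heapq.heappush(open_list, (sum(map(len, new_paths)),
                                       next(tick), new_cons, new_paths))
    return None
```

CBS-AA 面对的困难正在于:一旦动作异步、等待时长连续,上述离散的 (顶点, 时刻) 约束空间就变成不可数无穷,论文的贡献是在绕开这一点的同时保住完备性与最优性。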

[AI-28] Agent Control Protocol: Admission Control for Agent Actions

【速读】:该论文旨在解决在B2B机构环境中对自主代理(autonomous agents)进行有效治理的问题,确保其行为符合制度性控制要求。解决方案的关键在于提出了一种形式化的技术规范——代理控制协议(Agent Control Protocol, ACP),作为代理意图与系统状态变更之间的准入控制层:所有代理操作在执行前必须通过一个加密准入检查,该检查同时验证身份、能力范围、委托链和策略合规性。ACP定义了加密身份、基于能力的授权、确定性风险评估、可验证的链式委托、传递撤销和不可变审计等机制,构成一套完整的自主代理治理框架,且兼容RBAC和零信任架构,不替代现有安全模型。

链接: https://arxiv.org/abs/2603.18829
作者: Marcelo Fernandez(TraslaIA)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 21 pages. Specification repository: this https URL

点击查看摘要

Abstract:Agent Control Protocol (ACP) is a formal technical specification for governance of autonomous agents in B2B institutional environments. ACP is the admission control layer between agent intent and system state mutation: before any agent action reaches execution, it must pass a cryptographic admission check that validates identity, capability scope, delegation chain, and policy compliance simultaneously. ACP defines the mechanisms of cryptographic identity, capability-based authorization, deterministic risk evaluation, verifiable chained delegation, transitive revocation, and immutable auditing that a system must implement for autonomous agents to operate under explicit institutional control. ACP operates as an additional layer on top of RBAC and Zero Trust, without replacing them. The v1.13 specification comprises 36 technical documents organized into five conformance levels (L1-L5). It includes a Go reference implementation of 22 packages covering all L1-L4 capabilities, 51 signed conformance test vectors (Ed25519 + SHA-256), and an OpenAPI 3.1.0 specification for all HTTP endpoints. It defines more than 62 verifiable requirements, 12 prohibited behaviors, and the mechanisms for interoperability between institutions. Specification and implementation: this https URL
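规范要求的单次准入检查可以用几十行代码勾勒其结构:身份、能力范围、策略三项同时通过才放行(真实 ACP 使用 Ed25519 签名与可验证委托链,此处用标准库 HMAC 代替签名、省略委托链与撤销,纯属示意):

```python
import hashlib
import hmac

def sign(key, msg):
    """用 HMAC-SHA256 代替规范中的 Ed25519 签名 (仅为演示)。"""
    return hmac.new(key, msg.encode(), hashlib.sha256).hexdigest()

def admit(action, identity, capabilities, policy, key, signature):
    """准入检查: 身份、能力范围、策略三项同时通过才放行。"""
    checks = {
        "identity": hmac.compare_digest(sign(key, action["payload"]), signature),
        "capability": action["scope"] in capabilities.get(identity, set()),
        "policy": policy(action),
    }
    return all(checks.values()), checks

key = b"demo-key"
caps = {"agent-7": {"invoice:read"}}                  # 按身份登记的能力范围
action = {"payload": "read invoice 42", "scope": "invoice:read", "risk": 2}
ok, detail = admit(action, "agent-7", caps,
                   lambda a: a["risk"] <= 3,          # 确定性风险策略示例
                   key, sign(key, action["payload"]))
```

关键设计是"同时验证、全部通过":返回的 checks 字典同时给出每一项的结果,便于按规范要求写入不可变审计日志。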

[AI-29] Student views in AI Ethics and Social Impact

【速读】:该论文试图解决的问题是:从性别视角出发,探究学生对人工智能(Artificial Intelligence, AI)伦理影响与社会效应的认知差异,以期为未来AI教育内容的设计提供依据。其解决方案的关键在于通过针对230名计算机科学二年级学生的问卷调查,系统分析不同性别群体在AI应用领域关注点、风险感知及伦理倾向上的差异,发现男性更关注技术驱动型场景(如自动驾驶、图像处理),女性则更强调社交媒体影响与伦理关怀,从而揭示性别因素在AI认知中的结构性差异,为差异化教学和伦理教育策略提供实证基础。

链接: https://arxiv.org/abs/2603.18827
作者: Tudor-Dan Mihoc,Manuela-Andreea Petrescu,Emilia-Loredana Pop
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:An investigation, from a gender perspective, of how students view the ethical implications and societal effects of artificial intelligence is conducted, examining concepts that could have a big influence on how artificial intelligence may be taught in the future. For this, we conducted a survey on a cohort of 230 second year computer science students to reveal their opinions. The results revealed that AI, from the students’ perspective, will significantly impact daily life, particularly in areas such as medicine, education, or media. Men are more aware of potential changes in Computer Science, autonomous driving, image and video processing, and chatbot usage, while women mention more the impact on social media. Both men and women perceive potential threats in the same manner, with men more aware of war, AI controlled drones, terrain recognition, and information war. Women seem to have a stronger tendency towards ethical considerations and helping others.

[AI-30] ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

【速读】:该论文旨在解决多轮大语言模型(Large Language Model, LLM)代理在强化学习(Reinforcement Learning, RL)训练过程中面临的可扩展性与系统耦合问题。现有基础设施常将回放轨迹生成(rollout orchestration)与训练循环紧密绑定,导致系统难以迁移和维护。其解决方案的关键在于提出 ProRL Agent,一个基于“回放即服务”(rollout-as-a-service)理念的可扩展基础设施,通过 API 服务实现代理回放生命周期的全流程管理,并提供标准化、可扩展的沙箱环境,支持多种代理任务在无 root 权限的高性能计算(High-Performance Computing, HPC)环境中运行,从而显著提升 RL 训练的灵活性与可维护性。

链接: https://arxiv.org/abs/2603.18815
作者: Hao Zhang,Mingjie Liu,Shaokun Zhang,Songyang Han,Jian Hu,Zhenghui Jin,Yuchi Zhang,Shizhe Diao,Ximing Lu,Binfeng Xu,Zhiding Yu,Jan Kautz,Yi Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.

[AI-31] Can LLM generate interesting mathematical research problems?

【速读】:该论文旨在解决大语言模型(Large Language Models, LLM)是否能够生成具有价值且前沿的数学研究问题这一核心问题。其解决方案的关键在于设计并实现一个智能代理(agent),用于自动产生未知的数学问题,并在微分几何领域成功生成了665个研究问题;通过专家人工验证,发现其中许多问题未被现有文献覆盖,具备独特的研究价值,从而证明了LLM在数学创造力方面的潜力。

链接: https://arxiv.org/abs/2603.18813
作者: Xiaoyang Chen,Xiang Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper is the second one in a series of works on the mathematical creativity of LLM. In the first paper, the authors proposed three criteria for evaluating the mathematical creativity of LLM and constructed a benchmark dataset to measure it. This paper further explores the mathematical creativity of LLM, with a focus on investigating whether LLM can generate valuable and cutting-edge mathematical research problems. We developed an agent to generate unknown problems and produced 665 research problems in differential geometry. Through human verification, we find that many of these mathematical problems are unknown to experts and possess unique research value.

[AI-32] dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

【速读】:该论文旨在解决扩散语言模型(Diffusion Large Language Models, dLLMs)在对齐人类偏好时面临的策略优化难题,特别是由于轨迹概率计算成本过高导致的离线策略训练难以规模化的问题。其解决方案的关键在于提出一种名为轨迹缩减策略优化(Trajectory Reduction Policy Optimization, dTRPO)的方法,通过两个核心创新实现高效优化:首先,在参考策略正则化下,新解掩码标记的概率比是中间扩散状态概率比的无偏估计;其次,利用单次前向传播即可有效估算完整轨迹的概率,从而显著降低计算复杂度。这一方法使dLLMs能够在不依赖在线交互的情况下实现高效的离线策略训练,并提升生成质量与效率。

链接: https://arxiv.org/abs/2603.18806
作者: Wenxuan Zhang,Lemeng Wu,Changsheng Zhao,Ernie Chang,Mingchen Zhuge,Zechun Liu,Andy Su,Hanxian Huang,Jun Chen,Chong Zhou,Raghuraman Krishnamoorthi,Vikas Chandra,Mohamed Elhoseiny,Wei Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.
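
The second trajectory-reduction claim above, that the full trajectory's probability can be estimated from a single forward pass over a re-masked final state, can be sketched with a toy mask-predictor. Everything below is a hypothetical illustration of that idea, not the paper's implementation: the "model" is just a fixed unigram distribution.

```python
import math

# Toy mask-predictor: a fixed unigram distribution stands in for the dLLM.
VOCAB_LOGPROBS = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

def forward(masked_seq):
    # One forward pass: a distribution at every masked (None) position.
    return {i: VOCAB_LOGPROBS for i, tok in enumerate(masked_seq) if tok is None}

def traj_logprob_single_pass(final_tokens, generated_positions):
    """Estimate the trajectory log-probability by re-masking the tokens the
    trajectory generated and scoring them in a single forward pass."""
    masked = [None if i in generated_positions else t
              for i, t in enumerate(final_tokens)]
    preds = forward(masked)
    return sum(preds[i][final_tokens[i]] for i in generated_positions)

# Trajectory that unmasked positions 0 and 2 of a 3-token sequence.
logp = traj_logprob_single_pass(["a", "b", "c"], {0, 2})
```

With a real dLLM, `forward` would be the network itself; the key point of the reduction is that only one call is needed regardless of how many denoising steps produced the final state.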

[AI-33] Functional Subspace Watermarking for Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在经历微调、量化或知识蒸馏等参数级扰动后,传统模型水印方法因内部表示发生复杂畸变而导致水印信号难以可靠提取的问题。现有方案在面对参数层面的修改时仍缺乏足够的鲁棒性。其解决方案的关键在于提出功能子空间水印(Functional Subspace Watermarking, FSW)框架,通过求解广义特征值问题提取一个稳定的低维功能子空间用于嵌入水印信号,并引入自适应谱截断策略以在鲁棒性与模型效用之间取得最优平衡;同时,结合向量一致性约束确保水印注入不损害原始语义性能,从而实现高检测准确率和统计可验证性,显著优于当前最先进(SOTA)方法。

链接: https://arxiv.org/abs/2603.18793
作者: Zikang Ding,Junhao Li,Suling Wu,Junchi Yao,Hongbo Liu,Lijie Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model watermarking utilizes internal representations to protect the ownership of large language models (LLMs). However, these features inevitably undergo complex distortions during realistic model modifications such as fine-tuning, quantization, or knowledge distillation, making reliable extraction extremely challenging. Despite extensive research on model-side watermarking, existing methods still lack sufficient robustness against parameter-level perturbations. To address this gap, we propose Functional Subspace Watermarking (FSW), a framework that anchors ownership signals into a low-dimensional functional backbone. Specifically, we first solve a generalized eigenvalue problem to extract a stable functional subspace for watermark injection, while introducing an adaptive spectral truncation strategy to achieve an optimal balance between robustness and model utility. Furthermore, a vector consistency constraint is incorporated to ensure that watermark injection does not compromise the original semantic performance. Extensive experiments across various LLM architectures and datasets demonstrate that our method achieves superior detection accuracy and statistical verifiability under multiple model attacks, maintaining robustness that outperforms existing state-of-the-art (SOTA) methods.
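
The two numerical ingredients named in the abstract, a generalized eigenvalue problem A v = λ B v and adaptive spectral truncation, can be sketched in a few lines of NumPy. The matrices, the energy threshold, and the whitening route are assumptions for illustration; the paper's actual construction of A and B from model internals is not reproduced here.

```python
import numpy as np

def functional_subspace(A, B, energy=0.9):
    """Solve A v = lambda B v and keep the leading eigenvectors whose
    eigenvalues cover `energy` of the spectrum (a stand-in for the paper's
    adaptive spectral truncation). A and B are assumed symmetric, B SPD."""
    # Whiten with B^{-1/2} so an ordinary symmetric eigendecomposition applies.
    w, V = np.linalg.eigh(B)
    B_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    evals, evecs = np.linalg.eigh(B_inv_sqrt @ A @ B_inv_sqrt)
    order = np.argsort(evals)[::-1]              # descending eigenvalues
    evals, evecs = evals[order], evecs[:, order]
    cum = np.cumsum(evals) / evals.sum()
    k = int(np.searchsorted(cum, energy)) + 1    # adaptive truncation rank
    return B_inv_sqrt @ evecs[:, :k], evals[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6))
A = X @ X.T + 6 * np.eye(6)                      # toy SPD "signal" matrix
B = np.eye(6)                                    # toy SPD "constraint" matrix
U, lam = functional_subspace(A, B, energy=0.9)   # subspace for injection
```

The returned columns of `U` satisfy the generalized eigen-relation, and the rank `k` adapts to how concentrated the spectrum is, which is the trade-off knob the abstract describes.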

[AI-34] Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

【速读】:该论文旨在推动人工智能(Artificial Intelligence, AI)研究向更深层次的理论心理(Theory of Mind, ToM)方向发展,以解决当前AI系统在理解、推理和模拟人类心智状态方面的局限性。其解决方案的关键在于通过组织和发布第二届“通过心智理论推进人工智能”研讨会的精选论文,构建一个开放获取且经过遴选的文献合集,为ToM与AI交叉领域的研究者提供系统性的知识资源与前沿进展参考,从而促进具备社会认知能力的下一代AI系统的研发。

链接: https://arxiv.org/abs/2603.18786
作者: Nitay Alon,Joseph M. Barnby,Reuth Mirsky,Stefan Sarkadi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: workshop proceedings

点击查看摘要

Abstract:This volume includes a selection of papers presented at the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2026 in Singapore on 26th January 2026. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.

[AI-35] Automatic Configuration of LLM Post-Training Pipelines

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)后训练流程中配置选择的难题,特别是在计算资源受限的情况下,如何高效地确定监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)阶段的最佳超参数组合。问题的核心在于:配置空间维度高且异构、各阶段强耦合,且每次端到端评估成本高昂。解决方案的关键在于提出一个预算感知的两阶段框架 AutoPipe:离线阶段,基于历史运行数据学习一个数据集条件的排序代理模型(learning-to-rank surrogate),捕捉数据集内部偏好并提供可迁移的配置空间引导;在线阶段,利用该引导驱动贝叶斯优化,并通过高斯过程残差代理模型建模新数据集的特定偏差,同时引入一个早期预测器对每轮试验进行早停和低成本代理评分,从而显著降低评估开销。实验表明,AutoPipe 在生物医学推理任务上优于纯离线基线方法,并在性能上媲美最强的在线超参数优化(Hyperparameter Optimization, HPO)方法,但计算成本不足其10%。

链接: https://arxiv.org/abs/2603.18773
作者: Channe Chwa,Xinle Wu,Yao Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM post-training pipelines that combine supervised fine-tuning and reinforcement learning are difficult to configure under realistic compute budgets: the configuration space is high-dimensional and heterogeneous, stages are strongly coupled, and each end-to-end evaluation is expensive. We propose AutoPipe, a budget-aware two-stage framework for configuration selection in LLM post-training. Offline, AutoPipe learns a dataset-conditioned learning-to-rank surrogate from historical runs, capturing within-dataset preferences and providing transferable guidance toward promising regions of the configuration space. Online, for a new dataset, AutoPipe uses the offline guidance to steer Bayesian optimization and models dataset-specific deviations with a Gaussian-process residual surrogate. To reduce evaluation cost, each trial is early-stopped and scored by a learned predictor that maps early training signals to a low-cost proxy for final post-training performance. Experiments on biomedical reasoning tasks show that AutoPipe consistently outperforms offline-only baselines and achieves comparable performance with the strongest online HPO baselines while using less than 10% of their computational cost.
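
The online stage described above (offline ranker as guidance, Gaussian-process residual on top) can be sketched with a GP posterior mean, written here as kernel ridge regression in plain NumPy. The toy prior, kernel length-scale, and data are assumptions; AutoPipe's actual surrogates are learned, not hand-set.

```python
import numpy as np

def rbf(X1, X2, ls=0.2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def residual_gp_predict(X_train, y_train, prior, X_test, noise=1e-4):
    """GP posterior mean over the residual y - prior(x): the offline
    learning-to-rank score acts as the mean function, and the GP corrects
    dataset-specific deviations (a sketch of the online surrogate)."""
    r = y_train - prior(X_train)                 # residuals w.r.t. offline score
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, r)
    return prior(X_test) + rbf(X_test, X_train) @ alpha

# Toy setup: the hypothetical offline ranker captures the linear trend,
# the GP residual picks up the remaining curvature.
prior = lambda X: 0.5 * X[:, 0]
Xtr = np.linspace(0, 1, 20)[:, None]
ytr = 0.5 * Xtr[:, 0] + 0.2 * np.sin(6 * Xtr[:, 0])
pred = residual_gp_predict(Xtr, ytr, prior, Xtr)
```

The design point this mirrors: the prior carries transferable knowledge from historical runs, so the GP only has to model what is new about the current dataset.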

[AI-36] A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models

【速读】:该论文旨在解决当前文本到图像扩散模型中概念遗忘(concept unlearning)方法依赖单一关键词所导致的局限性问题。现有方法因仅使用关键词定位目标概念,难以准确捕捉视觉概念在潜在空间中的多维语义分布及其与相关概念的纠缠关系,从而易引发过度遗忘(over-forgetting)和鲁棒性不足的问题。解决方案的关键在于提出“多样化遗忘”(Diversified Unlearning)框架,该框架通过一组上下文多样化的提示词(contextually diverse prompts)来表征一个概念,而非单一关键词,从而更全面地覆盖其语义分布,实现更精准、稳健的概念擦除,同时有效保留无关概念并提升对对抗恢复攻击的防御能力。

链接: https://arxiv.org/abs/2603.18767
作者: Duc Hao Pham,Van Duy Truong,Duy Khanh Dinh,Tien Cuong Nguyen,Dien Hy Ngo,Tuan Anh Bui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Concept unlearning has emerged as a promising direction for reducing the risks of harmful content generation in text-to-image diffusion models by selectively erasing undesirable concepts from a model’s parameters. Existing approaches typically rely on keywords to identify the target concept to be unlearned. However, we show that this keyword-based formulation is inherently limited: a visual concept is multi-dimensional, can be expressed in diverse textual forms, and often overlaps with related concepts in the latent space, making keyword-only unlearning, which imprecisely indicates the target concept, brittle and prone to over-forgetting. This occurs because a single keyword represents only a narrow point estimate of the concept, failing to cover its full semantic distribution and entangled variations in the latent space. To address this limitation, we propose Diversified Unlearning, a distributional framework that represents a concept through a set of contextually diverse prompts rather than a single keyword. This richer representation enables more precise and robust unlearning. Through extensive experiments across multiple benchmarks and state-of-the-art baselines, we demonstrate that integrating Diversified Unlearning as an add-on component into existing unlearning pipelines consistently achieves stronger erasure, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks.
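
The "point estimate vs. distribution" argument above can be made concrete with a toy geometry: in a 2-D stand-in for the latent space, paraphrases of a concept scatter around several modes, and a diversified prompt set covers them far better than one keyword anchor. All coordinates below are made up for illustration.

```python
import numpy as np

keyword = np.array([0.0, 0.0])                   # single keyword: one point
prompt_set = np.array([[0.0, 0.0], [2.0, 0.0],   # diversified prompts: a set
                       [0.0, 2.0], [2.0, 2.0]])

# Hypothetical paraphrase embeddings of the same concept.
paraphrases = np.array([[1.9, 0.1], [0.1, 2.1], [2.1, 1.8]])

def nearest_dist(points, anchors):
    d = np.linalg.norm(points[:, None, :] - anchors[None, :, :], axis=-1)
    return d.min(axis=1)

d_keyword = nearest_dist(paraphrases, keyword[None, :]).mean()
d_set = nearest_dist(paraphrases, prompt_set).mean()
```

The smaller mean distance for the prompt set is the geometric version of the abstract's claim that keyword-only unlearning misses much of the concept's semantic distribution.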

[AI-37] ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation

【速读】:该论文旨在解决当前自主网络代理(Autonomous Web Agents)在真实网络威胁下安全性评估不足的问题,尤其是现有基准测试主要局限于静态沙箱环境和内容层面的提示攻击,未能有效覆盖网络层的安全风险。其解决方案的关键在于提出一种基于中间人攻击(Man-in-the-Middle, MITM)的红队测试框架 ClawTrap,该框架支持多种可定制的攻击形式(如静态 HTML 替换、Iframe 弹窗注入和动态内容修改),并通过规则驱动的拦截、转换与审计流程构建可复现的测试管道,从而为未来研究提供更丰富、灵活的 MITM 攻击构造能力和系统性安全测评基础。

链接: https://arxiv.org/abs/2603.18762
作者: Haochen Zhao,Shaoyang Cui
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, 2 tables. Preliminary technical report; quantitative experiments and extended evaluation to appear in v2

点击查看摘要

Abstract:Autonomous web agents such as OpenClaw are rapidly moving into high-impact real-world workflows, but their security robustness under live network threats remains insufficiently evaluated. Existing benchmarks mainly focus on static sandbox settings and content-level prompt attacks, which leaves a practical gap for network-layer security testing. In this paper, we present ClawTrap, a MITM-based red-teaming framework for real-world OpenClaw security evaluation. ClawTrap supports diverse and customizable attack forms, including Static HTML Replacement, Iframe Popup Injection, and Dynamic Content Modification, and provides a reproducible pipeline for rule-driven interception, transformation, and auditing. This design lays the foundation for future research to construct richer, customizable MITM attacks and to perform systematic security testing across agent frameworks and model backbones. Our empirical study shows clear model stratification: weaker models are more likely to trust tampered observations and produce unsafe outputs, while stronger models demonstrate better anomaly attribution and safer fallback strategies. These findings indicate that reliable OpenClaw security evaluation should explicitly incorporate dynamic real-world MITM conditions rather than relying only on static sandbox protocols.
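
The "rule-driven interception, transformation, and auditing" pipeline can be sketched as a list of (match, transform) rules applied to responses in flight. The rule names mirror the attack forms named in the abstract; the URL patterns, payloads, and function names are hypothetical, not ClawTrap's API.

```python
import re

# Each rule: (name, URL pattern, body transform). Patterns are made up.
RULES = [
    ("static_html_replacement", re.compile(r"/login$"),
     lambda body: "<html><body>spoofed login</body></html>"),
    ("iframe_popup_injection", re.compile(r"/dashboard$"),
     lambda body: body.replace(
         "</body>", "<iframe src='//evil.example'></iframe></body>")),
]

def intercept(url, body, audit_log):
    """Apply the first matching rule to the response body, recording the
    tampering in an audit log; unmatched traffic passes through untouched."""
    for name, pattern, transform in RULES:
        if pattern.search(url):
            audit_log.append((name, url))        # auditing hook
            return transform(body)
    return body

log = []
out = intercept("https://site.test/dashboard",
                "<html><body>ok</body></html>", log)
```

In a real MITM deployment the same structure would sit inside a proxy callback; the audit log is what makes the red-team runs reproducible.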

[AI-38] NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics

【速读】:该论文旨在解决标准注意力机制在Transformer中因成对建模限制而难以捕捉token之间高阶依赖关系的问题。其核心解决方案是提出神经博弈Transformer(NeuroGame Transformer, NGT),通过将token视为合作博弈中的玩家和统计物理系统中的自旋,引入两种互补的博弈论概念——全局性的Shapley值(Shapley values)用于基于排列的归因,局部性的Banzhaf指数(Banzhaf indices)用于联盟层面的影响度量,并通过可学习门控参数融合为外部磁场;同时利用配对相互作用势建模协同效应,使系统的能量遵循Ising哈密顿量(Ising Hamiltonian),注意力权重则作为吉布斯分布下的边缘概率,通过平均场方程高效计算。为应对联盟空间指数级增长带来的计算挑战,进一步设计了基于吉布斯分布权重的重要性采样蒙特卡洛估计器,从而实现数值稳定性和长序列可扩展性。

链接: https://arxiv.org/abs/2603.18761
作者: Djamel Bouchaffra,Fayçal Ykhlef,Hanene Azzag,Mustapha Lebbah,Bilal Faye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This work has been submitted to IEEE Transactions on Cybernetics for possible publication

点击查看摘要

Abstract:Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher-order dependencies among tokens. We introduce the NeuroGame Transformer (NGT) to overcome this by reconceptualizing attention through a dual perspective: tokens are treated simultaneously as players in a cooperative game and as interacting spins in a statistical physics system. Token importance is quantified using two complementary game-theoretic concepts – Shapley values for global, permutation-based attribution and Banzhaf indices for local, coalition-level influence. These are combined via a learnable gating parameter to form an external magnetic field, while pairwise interaction potentials capture synergistic relationships. The system’s energy follows an Ising Hamiltonian, with attention weights emerging as marginal probabilities under the Gibbs distribution, efficiently computed via mean-field equations. To ensure scalability despite the exponential coalition space, we develop importance-weighted Monte Carlo estimators with Gibbs-distributed weights. This approach avoids explicit exponential factors, ensuring numerical stability for long sequences. We provide theoretical convergence guarantees and characterize the fairness-sensitivity trade-off governed by the interpolation parameter. Experimental results demonstrate that the NeuroGame Transformer achieves strong performance across SNLI, and MNLI-matched, outperforming some major efficient transformer baselines. On SNLI, it attains a test accuracy of 86.4% (with a peak validation accuracy of 86.6%), surpassing ALBERT-Base and remaining highly competitive with RoBERTa-Base. Code is available at this https URL.
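
The mean-field computation at the heart of the abstract, attention weights as Ising marginals under an external field h (the gated game-theoretic scores) and couplings J, can be sketched with a fixed-point iteration. The field values, couplings, and normalization step below are toy assumptions, not the paper's trained quantities.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_attention(h, J, iters=50):
    """Mean-field marginals of a 0/1-spin Ising model: m_i = sigma(h_i +
    sum_j J_ij m_j). h plays the role of the external magnetic field
    (gated Shapley/Banzhaf scores), J the pairwise interaction potentials."""
    m = np.full(len(h), 0.5)
    for _ in range(iters):
        m = sigmoid(h + J @ m)                   # mean-field fixed-point update
    return m / m.sum()                           # normalize to attention weights

h = np.array([2.0, 0.0, -2.0])                   # toy external field per token
J = 0.1 * (np.ones((3, 3)) - np.eye(3))          # weak symmetric couplings
w = mean_field_attention(h, J)
```

With weak couplings the iteration is a contraction, so a few dozen updates suffice; the token with the strongest field ends up with the largest marginal, i.e. the most attention.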

[AI-39] Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

【速读】:该论文旨在解决生成式 AI (Generative AI) 在代码安全审查中因确认偏倚(confirmation bias)导致的漏洞检测失效问题,以及这种缺陷是否可被用于软件供应链攻击。研究发现,当代码变更被框架为“无漏洞”时,漏洞检测率下降16%-93%,且假阴性显著上升,而假阳性变化不大;不同类型的漏洞对偏倚敏感度不同,注入类漏洞更易受影响。解决方案的关键在于通过元数据删除(metadata redaction)和明确指令(explicit instructions)进行去偏处理,使交互式工具(如GitHub Copilot)和自主代理(如Claude Code)的检测准确率恢复至94%以上,从而有效缓解由偏倚引发的安全风险。

链接: https://arxiv.org/abs/2603.18740
作者: Dimitris Mitropoulos,Nikolaos Alexopoulos,Georgios Alexopoulos,Diomidis Spinellis
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies. Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly asymmetric effects: false negatives increase sharply while false positive rates change little. Bias effects vary by vulnerability type, with injection flaws being more susceptible to them than memory corruption bugs. Study 2 evaluates exploitability in practice mimicking adversarial pull requests that reintroduce known vulnerabilities while framed as security improvements or urgent functionality fixes via their pull request metadata. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases. Our results show that confirmation bias poses a weakness in LLM-based code review, with implications on how AI-assisted development tools are deployed. 

[AI-40] Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理不同英语方言(如标准美式英语 Standard American English, SAE 和非洲裔美国人英语 African-American English, AAE)输入时所产生的刻板印象偏见问题,即模型输出会因输入方言的不同而表现出对特定群体的 stereotype-based 推理。其关键解决方案在于系统性地评估两种策略:一是通过提示工程(prompt engineering),包括角色设定(role-based)和思维链(Chain-of-Thought)提示;二是采用多智能体架构(multi-agent architectures),由生成-批判-修订(generate-critique-revise)模型组成。实验表明,思维链提示在部分模型(如 Claude Haiku)中有效缓解偏见,而多智能体架构则在所有测试模型中均展现出一致的偏见抑制效果,凸显了 workflow-level 控制机制在高影响力部署中的必要性。

链接: https://arxiv.org/abs/2603.18729
作者: Martina Ullasci,Marco Rondina,Riccardo Coppola,Flavio Giobergia,Riccardo Bellanca,Gabriele Mancari Pasi,Luca Prato,Federico Spinoso,Silvia Tagliente
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge between SAE- and AAE-related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE-AAE differential observed in Claude Haiku and the smallest in Phi-4 Mini. Chain-Of-Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi-agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies, and workflow-level controls (e.g., agentic architectures involving critique models) in high-impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.

[AI-41] MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution

【速读】:该论文旨在解决记忆增强型大语言模型(Memory-augmented LLM)代理在长期交互中因记忆构建、检索与利用模块孤立运行而导致的两个核心问题:其一是在前向路径上缺乏战略意识,即记忆的构建和检索依赖局部启发式策略而非显式的战略推理;其二是在后向路径上监督稀疏且延迟,下游任务失败难以直接反馈至记忆库的修复。解决方案的关键在于提出一种即插即用的多智能体框架 MemMA,通过双路径协同优化来突破上述瓶颈:在前向路径上引入 Meta-Thinker 生成结构化指导,引导 Memory Manager 构建记忆并驱动 Query Reasoner 进行迭代检索;在后向路径上创新性地提出原位自演化记忆构建机制,将探测问答对合成、当前记忆验证与失败修复动作转化结合,在记忆固化前完成自我修正,从而实现闭环优化。

链接: https://arxiv.org/abs/2603.18718
作者: Minhua Lin,Zhiwei Zhang,Hanqing Lu,Hui Liu,Xianfeng Tang,Qi He,Xiang Zhang,Suhang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 5 figures

点击查看摘要

Abstract:Memory-augmented LLM agents maintain external memory banks to support long-horizon interaction, yet most existing systems treat construction, retrieval, and utilization as isolated subroutines. This creates two coupled challenges: strategic blindness on the forward path of the memory cycle, where construction and retrieval are driven by local heuristics rather than explicit strategic reasoning, and sparse, delayed supervision on the backward path, where downstream failures rarely translate into direct repairs of the memory bank. To address these challenges, we propose MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along both the forward and backward paths. On the forward path, a Meta-Thinker produces structured guidance that steers a Memory Manager during construction and directs a Query Reasoner during iterative retrieval. On the backward path, MemMA introduces in-situ self-evolving memory construction, which synthesizes probe QA pairs, verifies the current memory, and converts failures into repair actions before the memory is finalized. Extensive experiments on LoCoMo show that MemMA consistently outperforms existing baselines across multiple LLM backbones and improves three different storage backends in a plug-and-play manner. Our code is publicly available at this https URL.
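
The backward path described above (synthesize probe QA pairs, verify the candidate memory, convert failures into repair actions before finalizing) can be sketched as a small loop. All helpers and the flat dict memory are hypothetical stand-ins for MemMA's LLM-driven components.

```python
def answer_from_memory(memory, question):
    # Stand-in for the Query Reasoner: a direct lookup in a toy memory.
    return memory.get(question)

def self_evolve(memory, probes, max_rounds=3):
    """In-situ self-evolution: probes are (question, expected_answer) pairs
    synthesized for the candidate memory. Failed probes become repair
    actions applied before the memory is finalized."""
    for _ in range(max_rounds):
        failures = [(q, a) for q, a in probes
                    if answer_from_memory(memory, q) != a]
        if not failures:
            return memory, True                  # memory verified, finalize
        for q, a in failures:                    # repair action: overwrite entry
            memory[q] = a
    return memory, False                         # give up after max_rounds

mem = {"user_city": "Paris"}                     # stale entry (toy)
probes = [("user_city", "Berlin"), ("user_job", "chef")]
mem, ok = self_evolve(mem, probes)
```

The point of closing the loop here, rather than after deployment, is that failures repair the memory bank while the supervision signal is still local and cheap.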

[AI-42] Accurate and Efficient Multi-Channel Time Series Forecasting via Sparse Attention Mechanism ICDE2026

【速读】:该论文旨在解决多通道时间序列预测中复杂动态依赖关系建模不足的问题,尤其是传统方法对通道间交互信息挖掘有限的缺陷。其核心解决方案是提出Linear-Network(Li-Net)架构,关键创新在于通过可配置的非线性模块处理压缩后的跨序列与跨通道表示,并引入稀疏Top-K Softmax注意力机制嵌入多尺度投影框架,从而高效捕捉线性和非线性通道依赖;同时,该架构能无缝融合多模态嵌入,引导注意力聚焦于最具信息量的时间步与特征通道,显著提升预测精度并降低计算开销。

链接: https://arxiv.org/abs/2603.18712
作者: Lei Gao,Hengda Bao,Jingfei Fang,Guangzheng Wu,Weihua Zhou,Yun Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICDE 2026

点击查看摘要

Abstract:The task of multi-channel time series forecasting is ubiquitous in numerous fields such as finance, supply chain management, and energy planning. It is critical to effectively capture complex dynamic dependencies within and between channels for accurate predictions. However, traditional methods pay little attention to learning the interactions among channels. This paper proposes Linear-Network (Li-Net), a novel architecture designed for multi-channel time series forecasting that captures the linear and non-linear dependencies among channels. Li-Net dynamically compresses representations across sequence and channel dimensions, processes the information through a configurable non-linear module and subsequently reconstructs the forecasts. Moreover, Li-Net integrates a sparse Top-K Softmax attention mechanism within a multi-scale projection framework to address these challenges. A core innovation is its ability to seamlessly incorporate and fuse multi-modal embeddings, guiding the sparse attention process to focus on the most informative time steps and feature channels. Experimental results on multiple real-world benchmark datasets demonstrate that Li-Net achieves competitive performance compared to state-of-the-art baseline methods. Furthermore, Li-Net provides a superior balance between prediction accuracy and computational burden, exhibiting significantly lower memory usage and faster inference times. Detailed ablation studies and parameter sensitivity analyses validate the effectiveness of each key component in our proposed architecture. Keywords: Multivariate Time Series Forecasting, Sparse Attention Mechanism, Multimodal Information Fusion, Non-linear relationship
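
The sparse Top-K Softmax mechanism named above is easy to isolate: keep the k largest scores per row, renormalize over them, and zero out the rest. This is a minimal sketch of that one component; the multi-scale projections and multi-modal fusion around it are omitted.

```python
import numpy as np

def topk_softmax(scores, k):
    """Sparse Top-K softmax: softmax restricted to the k largest scores per
    row; all other positions receive exactly zero attention."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]   # top-k indices
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.where(mask, np.exp(shifted), 0.0)
    return e / e.sum(axis=-1, keepdims=True)

s = np.array([[3.0, 1.0, 0.0, 2.0]])
w = topk_softmax(s, k=2)                         # only positions 0 and 3 survive
```

Beyond regularization, the practical payoff is that downstream matrix products only need the k surviving entries per row, which is where the memory and latency savings come from.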

[AI-43] MANAR: Memory-augmented Attention with Navigational Abstract Conceptual Representation

【速读】:该论文旨在解决标准多头注意力(Multi-Head Attention, MHA)缺乏认知层面的功能瓶颈与全局整合机制的问题,从而在模型中引入类似意识理论中的全局工作空间(Global Workspace Theory, GWT)结构。其核心解决方案是提出MANAR(Memory-augmented Attention with Navigational Abstract Conceptual Representation),通过一个可训练的记忆模块和抽象概念表示(Abstract Conceptual Representation, ACR)实现两阶段逻辑:首先在集成阶段基于输入刺激检索记忆概念以形成全局“心理图像”(即ACR),其次在广播阶段利用该全局状态指导局部token的上下文化处理。关键创新在于,这种架构天然具备线性时间复杂度——因全局信息通过固定大小的ACR传递,消除了MHA固有的二次计算复杂性;同时保持与预训练Transformer权重的兼容性,支持直接迁移学习,且能生成超出输入token凸包的非凸表示,体现创造性合成能力。

链接: https://arxiv.org/abs/2603.18676
作者: Zuher Jahshan,Ben Ben Ishay,Leonid Yavits
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:MANAR (Memory-augmented Attention with Navigational Abstract Conceptual Representation), contextualization layer generalizes standard multi-head attention (MHA) by instantiating the principles of Global Workspace Theory (GWT). While MHA enables unconstrained all-to-all communication, it lacks the functional bottleneck and global integration mechanisms hypothesized in cognitive models of consciousness. MANAR addresses this by implementing a central workspace through a trainable memory of abstract concepts and an Abstract Conceptual Representation (ACR). The architecture follows a two-stage logic that maps directly to GWT mechanics: (i) an integration phase, where retrieved memory concepts converge to form a collective “mental image” (the ACR) based on input stimuli; and (ii) a broadcasting phase, where this global state navigates and informs the contextualization of individual local tokens. We demonstrate that efficient linear-time scaling is a fundamental architectural byproduct of instantiating GWT functional bottleneck, as routing global information through a constant-sized ACR resolves the quadratic complexity inherent in standard attention. MANAR is a compatible re-parameterization of MHA with identical semantic roles for its projections, enabling knowledge transfer from pretrained transformers via weight-copy and thus overcoming the adoption barriers of structurally incompatible linear-time alternatives. MANAR enables non-convex contextualization, synthesizing representations that provably lie outside the convex hull of input tokens - a mathematical reflection of the creative synthesis described in GWT. Empirical evaluations confirm that MANAR matches or exceeds strong baselines across language (GLUE score of 85.1), vision (83.9% ImageNet-1K), and speech (2.7% WER on LibriSpeech), positioning it as an efficient and expressive alternative to quadratic attention.
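
The two-stage logic (integration into a constant-sized ACR, then broadcasting back to tokens) can be sketched without any of the learned projections. Shapes and the residual connection below are assumptions; the point is that cost is O(n·m) per layer, linear in sequence length n because the workspace size m is fixed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def manar_layer(tokens, memory):
    """Two-stage workspace sketch: (i) integration -- the m memory concepts
    read the n tokens and converge to the ACR; (ii) broadcasting -- each
    token reads the constant-sized ACR. Learned projections are omitted."""
    attn_mem = softmax(memory @ tokens.T)        # (m, n): concepts attend to input
    acr = attn_mem @ tokens                      # (m, d): global workspace state
    attn_tok = softmax(tokens @ acr.T)           # (n, m): tokens attend to ACR
    return tokens + attn_tok @ acr               # residual contextualization

rng = np.random.default_rng(1)
tokens = rng.normal(size=(10, 4))                # n=10 tokens, d=4
memory = rng.normal(size=(3, 4))                 # m=3 abstract concepts
out = manar_layer(tokens, memory)
```

Because all token-to-token communication is routed through the m ACR slots, no n×n matrix is ever formed, which is the architectural source of the linear-time claim.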

[AI-44] Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在几何推理任务中缺乏主动构造视觉辅助工具的能力问题,即模型无法战略性地选择何时以及如何生成有效的几何构造来辅助推理。其解决方案的关键在于提出一种视觉-文本交织的思维链框架(Visual-Text Interleaved Chain-of-Thought),并引入动作适用性策略优化(Action Applicability Policy Optimization, A2PO),通过对抗性采样进行自适应奖励塑造,以区分必要与冗余的构造行为,从而提升模型对几何构造时机和质量的控制能力。实验表明,该方法使MLLMs能够有选择性地利用辅助构造,在基准测试上相较强基线提升3.51%。

链接: https://arxiv.org/abs/2603.18662
作者: Haokun Zhao,Wanshi Xu,Haidong Yuan,Songjun Cao,Long Ma,Yanghua Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geometric reasoning inherently requires “thinking with constructions” – the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.

[AI-45] Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多模态推理中因监督微调(Supervised Fine-Tuning, SFT)阶段存在token级不平衡问题而导致的推理冗长且答案不准确的问题。具体而言,标准SFT对所有token一视同仁地计算损失,导致较长的思考(think)片段占据主导地位,从而抑制了关键答案(answer)段落的学习。其解决方案的关键在于提出SCALe(Scheduled Curriculum Adaptive Loss),通过动态、长度无关的权重分配机制,显式区分对思考和答案段落的监督信号,并采用余弦调度策略在训练过程中逐步从“思考”向“答案”转移关注重心,从而引导模型生成更简洁且基于事实的推理过程。该方法显著提升了准确性,同时大幅减少训练时间,优于传统SFT并媲美两阶段SFT+GRPO流程。

链接: https://arxiv.org/abs/2603.18656
作者: Shaked Perek,Ben Wiesel,Avihu Dekel,Nimrod Shabtay,Eli Schwartz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long think traces overshadow short but task-critical answer segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the think segment, SCALe-SFT gradually shifts the focus from think to answer throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
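
The scheduling idea above, dynamic length-independent weights that shift from think to answer over training via a cosine policy, can be sketched directly. The weight floor `w_min` and the exact schedule shape are assumptions; the paper only specifies that the shift is cosine-scheduled and length-independent.

```python
import math

def scale_weights(step, total_steps, w_min=0.1):
    """Cosine schedule: early in training the think segment dominates the
    loss, late in training the answer segment does. w_min (a made-up floor)
    keeps both segments supervised throughout."""
    progress = step / total_steps
    think_w = w_min + (1 - w_min) * 0.5 * (1 + math.cos(math.pi * progress))
    answer_w = w_min + (1 - w_min) * 0.5 * (1 - math.cos(math.pi * progress))
    return think_w, answer_w

def segment_loss(think_losses, answer_losses, step, total_steps):
    tw, aw = scale_weights(step, total_steps)
    # Length-independent: average within each segment before weighting, so a
    # long think trace cannot overshadow the short answer span.
    think = sum(think_losses) / len(think_losses)
    answer = sum(answer_losses) / len(answer_losses)
    return tw * think + aw * answer

start = scale_weights(0, 100)                    # think-heavy at step 0
end = scale_weights(100, 100)                    # answer-heavy at the end
```

Per-segment averaging is what removes the token imbalance that vanilla SFT suffers from; the schedule then decides which averaged segment matters when.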

[AI-46] Beyond TVLA: Anderson-Darling Leakage Assessment for Neural Network Side-Channel Leakage Detection

【速读】:该论文旨在解决传统基于Welch’s t检验的测试向量泄漏评估(Test Vector Leakage Assessment, TVLA)在检测高阶分布差异型侧信道泄漏时敏感度不足的问题,尤其在神经网络实现场景下更为显著。其解决方案的关键在于提出Anderson–Darling泄漏评估(Anderson–Darling Leakage Assessment, ADLA),该方法采用两样本Anderson–Darling检验来检测泄漏,通过检验两个累积分布函数(Cumulative Distribution Functions, CDFs)的整体相等性,而非依赖于均值偏移模型,从而能够更灵敏地捕捉到非均值主导的泄漏特征。实验表明,在低迹数条件下,ADLA相较于TVLA能显著提升对防护实现(如随机打乱和随机抖动对策)的泄漏检测能力。

链接: https://arxiv.org/abs/2603.18647
作者: Ján Mikulec,Jakub Breier,Xiaolu Hou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test Vector Leakage Assessment (TVLA) based on Welch’s t-test has become a standard tool for detecting side-channel leakage. However, its mean-based nature can limit sensitivity when leakage manifests primarily through higher-order distributional differences. As our experiments show, this property becomes especially crucial when it comes to evaluating neural network implementations. In this work, we propose Anderson–Darling Leakage Assessment (ADLA), a leakage detection framework based on the two-sample Anderson–Darling test. Unlike TVLA, ADLA tests equality of the full cumulative distribution functions and does not rely on a purely mean-shift model. We evaluate ADLA on a multilayer perceptron (MLP) trained on MNIST and implemented on a ChipWhisperer-Husky evaluation platform. We consider protected implementations employing shuffling and random jitter countermeasures. Our results show that ADLA can provide improved leakage-detection sensitivity in protected implementations for a low number of traces compared to TVLA.
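摘要的核心对比(均值检验 vs 全分布检验)可以用 SciPy 直接演示:下面构造均值相同但方差不同的两组“泄漏迹”,Welch’s t 检验(TVLA 所用)无法区分,而两样本 Anderson–Darling 检验(ADLA 所用)可以。示例仅作说明,数据为合成高斯样本,与论文的实测功耗迹无关:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two groups of "traces" whose leakage lives in the variance, not the mean.
fixed = rng.normal(loc=0.0, scale=1.0, size=4000)
random_grp = rng.normal(loc=0.0, scale=2.0, size=4000)
# Center both groups so their means are exactly equal.
fixed -= fixed.mean()
random_grp -= random_grp.mean()

# TVLA-style Welch's t-test: blind to this higher-order difference.
t_stat, _ = stats.ttest_ind(fixed, random_grp, equal_var=False)

# ADLA-style two-sample Anderson-Darling test: compares full CDFs.
ad = stats.anderson_ksamp([fixed, random_grp])

print(f"|t| = {abs(t_stat):.2e} (TVLA threshold is typically 4.5)")
print(f"AD statistic = {ad.statistic:.1f}, largest critical value = {ad.critical_values[-1]:.2f}")
```

t 统计量接近 0(低于常用的 4.5 阈值),而 AD 统计量远超其最大临界值,直观体现了两种检验的灵敏度差异。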

[AI-47] An Onto-Relational-Sophic Framework for Governing Synthetic Minds

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)快速发展与治理框架滞后之间的矛盾问题,特别是面对具备广泛推理、创造性合成及社会交互能力的通用型基础模型(foundation models),现有以工具为中心的监管范式难以回应其本质属性、社会关系及其发展应遵循的规范原则等根本性问题。解决方案的关键在于提出一个名为“本体-关系-智慧”(Onto-Relational-Sophic, ORS)的综合性治理框架,该框架基于Cyberism哲学,包含三个核心支柱:一是构建“网络物理社会思维”(Cyber-Physical-Social-Thinking, CPST)本体论,将合成智能定义为不可还原的多维存在而非纯计算实体;二是提出一种分级的数字人格谱系(graded spectrum of digital personhood),超越“人或工具”的二元分类,提供务实的关系分类体系;三是引入Cybersophy——一种以智慧为导向的价值论,融合德性伦理、功利主义与关系主义方法,指导治理实践。此框架在自主研究代理、AI辅助医疗和代理型AI生态系统等新兴场景中得到应用验证,展现出生成适配且可调适治理建议的能力。

链接: https://arxiv.org/abs/2603.18633
作者: Huansheng Ning,Jianguo Ding
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:The rapid evolution of artificial intelligence, from task-specific systems to foundation models exhibiting broad, flexible competence across reasoning, creative synthesis, and social interaction, has outpaced the conceptual and governance frameworks designed to manage it. Current regulatory paradigms, anchored in a tool-centric worldview, address algorithmic bias and transparency but leave unanswered foundational questions about what increasingly capable synthetic minds are, how societies should relate to them, and the normative principles that should guide their development. Here we introduce the Onto-Relational-Sophic (ORS) framework, grounded in Cyberism philosophy, which offers integrated answers to these challenges through three pillars: (1) a Cyber-Physical-Social-Thinking (CPST) ontology that defines the mode of being for synthetic minds as irreducibly multi-dimensional rather than purely computational; (2) a graded spectrum of digital personhood providing a pragmatic relational taxonomy beyond binary person-or-tool classifications; and (3) Cybersophy, a wisdom-oriented axiology synthesizing virtue ethics, consequentialism, and relational approaches to guide governance. We apply the framework to emergent scenarios including autonomous research agents, AI-mediated healthcare, and agentic AI ecosystems, demonstrating its capacity to generate proportionate, adaptive governance recommendations. The ORS framework charts a path from narrow technical alignment toward comprehensive philosophical foundations for the synthetic minds already among us.

[AI-48] D-Mem: A Dual-Process Memory System for LLM Agents

【速读】:该论文旨在解决当前基于检索的内存框架在长时推理任务中因依赖损失性语义抽象而导致上下文关键信息丢失、难以处理细粒度情境理解的问题。现有方法通常采用增量式处理范式,持续将对话记忆提取并更新至向量数据库,虽效率高但缺乏高保真度的记忆访问能力。为此,作者提出D-Mem(Dual-process Memory)系统,其核心创新在于引入双进程机制:一方面保留轻量级向量检索用于常规查询,另一方面构建全维度深度思辨模块(Full Deliberation)作为高保真备选路径;并通过多维质量门控策略(Multi-dimensional Quality Gating)动态协调二者,实现认知经济性与准确性的平衡。实验表明,该策略在LoCoMo基准上使用GPT-4o-mini模型达到53.5的F1分数,显著优于静态检索基线(Mem0*,51.2),同时仅损失3.3%的全思辨性能(55.3),且计算开销大幅降低。

链接: https://arxiv.org/abs/2603.18631
作者: Zhixing You,Jiachen Yuan,Jason Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Driven by the development of persistent, self-adapting autonomous agents, equipping these systems with high-fidelity memory access for long-horizon reasoning has emerged as a critical requirement. However, prevalent retrieval-based memory frameworks often follow an incremental processing paradigm that continuously extracts and updates conversational memories into vector databases, relying on semantic retrieval when queried. While this approach is fast, it inherently relies on lossy abstraction, frequently missing contextually critical information and struggling to resolve queries that rely on fine-grained contextual understanding. To address this, we introduce D-Mem, a dual-process memory system. It retains lightweight vector retrieval for routine queries while establishing an exhaustive Full Deliberation module as a high-fidelity fallback. To achieve cognitive economy without sacrificing accuracy, D-Mem employs a Multi-dimensional Quality Gating policy to dynamically bridge these two processes. Experiments on the LoCoMo and RealTalk benchmarks using GPT-4o-mini and Qwen3-235B-Instruct demonstrate the efficacy of our approach. Notably, our Multi-dimensional Quality Gating policy achieves an F1 score of 53.5 on LoCoMo with GPT-4o-mini. This outperforms our static retrieval baseline, Mem0* (51.2), and recovers 96.7% of the Full Deliberation’s performance (55.3), while incurring significantly lower computational costs.
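双进程机制的控制流可以用几行代码示意(retrieve、deliberate 与各质量门均为假设的占位函数,多维门控的具体维度论文摘要中并未给出):

```python
def answer_query(query, retrieve, deliberate, gates):
    """Dual-process sketch: serve from cheap vector retrieval when every
    quality gate accepts the draft answer; otherwise fall back to the
    exhaustive full-deliberation path."""
    draft = retrieve(query)
    if all(gate(query, draft) for gate in gates):
        return draft          # fast path: vector retrieval suffices
    return deliberate(query)  # high-fidelity fallback
```

常规查询走快速路径,只有被任一质量门拒绝的草稿答案才触发代价较高的全思辨模块,从而在准确性与计算开销之间取得平衡。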

[AI-49] Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成中因静态文本编码器的有限关系推理能力以及开环采样导致的误差累积问题,这些问题使得初始语义模糊在常微分方程(Ordinary Differential Equation, ODE)轨迹中逐步演变为空间约束上的随机偏差。解决方案的关键在于提出一种无需训练的闭环框架 AFS-Search(Agentic Flow Steering and Parallel Rollout Search),其核心机制包括:利用视觉语言模型(Vision-Language Model, VLM)作为语义评判器对中间潜在表示进行诊断,并通过精确的空间定位动态调整速度场(velocity field)以实现流引导(flow steering);同时将 T2I 生成建模为序贯决策过程,通过前瞻模拟探索多条生成轨迹并基于 VLM 引导的奖励选择最优路径,从而实现高精度、鲁棒的图像生成。

链接: https://arxiv.org/abs/2603.18627
作者: Ping Chen,Daoxuan Zhang,Xiangming Wang,Yungeng Liu,Haijin Zeng,Yongyong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.

[AI-50] ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

【速读】:该论文旨在解决当前工具增强型大语言模型(Tool-augmented Large Language Models, TLLMs)在多步推理与外部动作耦合上的评估难题,尤其是现有基准测试因环境动态复杂性、记忆知识或数据污染等因素混淆了推理与行动之间的因果关系。解决方案的关键在于提出ZebraArena——一个程序化生成的诊断环境,其设计具有可控难度和最小知识依赖特性,从而抑制模型通过记忆或数据泄露获得优势;每个任务仅能通过针对性工具调用获取关键信息,实现外部信息获取与演绎推理之间的可解释接口,并提供确定性评估及理论最优查询次数作为衡量工具使用效率的标准。这一设计使得对TLLM中推理-行动耦合能力的测量更加精准且具备可复现性。

链接: https://arxiv.org/abs/2603.18614
作者: Wanjia Zhao,Ludwig Schmidt,James Zou,Vidhisha Balachandran,Lingjiao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge as frontier reasoning models such as GPT-5 and Gemini 2.5 Pro achieve only 60% accuracy on the hard instances. We also observe a persistent gap between theoretical optimality and practical tool usage. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. We highlight the key findings in our evaluation, and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.

[AI-51] AutORAN: LLM-driven Natural Language Programming for Agile xApp Development

【速读】:该论文旨在解决Open Radio Access Network (O-RAN) 中xApp(控制面应用)开发过程繁琐、耗时的问题,当前依赖人工编码和集成,通常需数月时间,严重阻碍了新功能的快速部署。解决方案的关键在于提出AutORAN——首个基于大语言模型(Large Language Model, LLM)的自然语言编程框架,通过自动化整个xApp开发流水线(包括需求获取、AI/ML功能设计与验证、xApp合成与部署),将用户意图直接转化为可快速部署的xApp,显著缩短开发周期,无需手动编码或测试,同时保证生成的xApp性能不低于甚至优于人工编写的基线。

链接: https://arxiv.org/abs/2603.18604
作者: Xin Li,Shiming Yu,Leming Shen,Jianing Zhang,Yuanqing Zheng,Yaxiong Xie
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional RAN systems are closed and monolithic, stifling innovation. The openness and programmability enabled by Open Radio Access Network (O-RAN) are envisioned to revolutionize cellular networks with control-plane applications–xApps. The development of xApps (typically by third-party developers), however, remains time-consuming and cumbersome, often requiring months of manual coding and integration, which hinders the roll-out of new functionalities in practice. To lower the barrier of xApp development for both developers and network operators, we present AutORAN, the first LLM-driven natural language programming framework for agile xApps that automates the entire xApp development pipeline. In a nutshell, AutORAN turns high-level user intents into swiftly deployable xApps within minutes, eliminating the need for manual coding or testing. To this end, AutORAN builds a fully automated xApp generation pipeline, which integrates multiple functional modules (from user requirement elicitation, AI/ML function design and validation, to xApp synthesis and deployment). We design, implement, and comprehensively evaluate AutORAN on representative xApp tasks. Results show AutORAN-generated xApps can achieve similar or even better performance than the best known hand-crafted baselines. AutORAN drastically accelerates the xApp development cycle (from user intent elicitation to roll-out), streamlining O-RAN innovation.

[AI-52] MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning

【速读】:该论文旨在解决生成式 AI (Generative AI) 在医学影像领域中伪造操作(如病灶植入或移除)对临床信任与安全构成的威胁问题。现有防御机制存在两大缺陷:一是医疗检测器多为黑箱模型,缺乏可解释性;二是基于多模态大语言模型(MLLM)的解释方法通常为事后分析(post-hoc),且因缺乏专业医学知识而易产生幻觉证据。其解决方案的关键在于提出 MedForge,一个数据与方法协同驱动的预检式(pre-hoc)、证据锚定(evidence-grounded)医学伪造检测框架:首先构建了包含 19 种病理类型、90K 样本的大型基准数据集 MedForge-90K,其中标注由医生指导的推理监督和金标准编辑位置提供;其次设计 MedForge-Reasoner 模型,采用“定位后分析”(localize-then-analyze)的推理范式,在输出判断前先预测可疑区域,并通过 Forgery-aware GSPO 对齐进一步增强语义锚定能力并抑制幻觉,从而实现高精度检测与可信、专家对齐的解释。

链接: https://arxiv.org/abs/2603.18577
作者: Zhihui Chen,Kai He,Qingyuan Lei,Bin Pu,Jian Zhang,Yuling Xu,Mengling Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threatens clinical trust and safety. Existing defenses are inadequate for healthcare. Medical detectors are largely black-box, while MLLM-based explainers are typically post-hoc, lack medical expertise, and may hallucinate evidence on ambiguous cases. We present MedForge, a data-and-method solution for pre-hoc, evidence-grounded medical forgery detection. We introduce MedForge-90K, a large-scale benchmark of realistic lesion edits across 19 pathologies with expert-guided reasoning supervision via doctor inspection guidelines and gold edit locations. Building on it, MedForge-Reasoner performs localize-then-analyze reasoning, predicting suspicious regions before producing a verdict, and is further aligned with Forgery-aware GSPO to strengthen grounding and reduce hallucinations. Experiments demonstrate state-of-the-art detection accuracy and trustworthy, expert-aligned explanations.

[AI-53] CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization ICLR2026

【速读】:该论文旨在解决当前子细胞定位(subcellular localization)研究中缺乏整合三维结构信息与精细注释的基准数据集的问题,这一缺失严重限制了基于结构的模型在该任务中的应用。其解决方案的关键在于构建了一个名为CAPSUL的新基准,该基准整合了多样化的三维结构表示与由领域专家精心标注的细粒度子细胞定位信息,从而为结构驱动的模型提供高质量训练和评估基础。此外,论文通过多类序列和结构模型的对比实验,验证了结构特征在子细胞定位预测中的重要性,并提出重加权和单标签分类策略以优化结构模型性能,最终借助注意力机制揭示出与高尔基体定位相关的α-螺旋决定性模式,展现出结构模型在生物可解释性方面的强大潜力。

链接: https://arxiv.org/abs/2603.18571
作者: Yicheng Hu,Xinyu Lin,Shulin Li,Wenjie Wang,Fengbin Zhu,Fuli Feng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Subcellular localization is a crucial biological task for drug target identification and function annotation. Although it has been biologically realized that subcellular localization is closely associated with protein structure, no existing dataset offers comprehensive 3D structural information with detailed subcellular localization annotations, thus severely hindering the application of promising structure-based models on this task. To address this gap, we introduce a new benchmark called CAPSUL, a Comprehensive humAn Protein benchmark for SUbcellular Localization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state-of-the-art sequence-based and structure-based models, showcasing the importance of involving structural features in this task. Furthermore, we explore reweighting and single-label classification strategies to facilitate future investigation on structure-based methods for this task. Lastly, we showcase the powerful interpretability of structure-based methods through a case study on the Golgi apparatus, where we discover a decisive localization pattern, the α-helix, from attention mechanisms, demonstrating the potential for bridging the gap with intuitive biological interpretability and paving the way for data-driven discoveries in cell biology.

[AI-54] Transformers Learn Robust In-Context Regression under Distributional Uncertainty

【速读】:该论文旨在解决Transformer在真实世界数据分布下进行上下文学习(in-context learning)的有效性问题,尤其是在面对非独立同分布(non-i.i.d.)输入、非高斯噪声、非高斯回归系数以及提示中存在依赖关系等分布偏移时,其学习性能是否仍能保持稳健。解决方案的关键在于通过系统性实验对比Transformer与基于最大似然准则的古典基线方法(最优或次优),发现Transformer在各种复杂分布设定下均能稳定匹配甚至超越这些传统方法,从而证明其具备超越经典估计器的鲁棒性上下文适应能力。

链接: https://arxiv.org/abs/2603.18564
作者: Hoang T. H. Cao,Hai D. V. Trinh,Tho Quan,Lan V. Truong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work has shown that Transformers can perform in-context learning for linear regression under restrictive assumptions, including i.i.d. data, Gaussian noise, and Gaussian regression coefficients. However, real-world data often violate these assumptions: the distributions of inputs, noise, and coefficients are typically unknown, non-Gaussian, and may exhibit dependency across the prompt. This raises a fundamental question: can Transformers learn effectively in-context under realistic distributional uncertainty? We study in-context learning for noisy linear regression under a broad range of distributional shifts, including non-Gaussian coefficients, heavy-tailed noise, and non-i.i.d. prompts. We compare Transformers against classical baselines that are optimal or suboptimal under the corresponding maximum-likelihood criteria. Across all settings, Transformers consistently match or outperform these baselines, demonstrating robust in-context adaptation beyond classical estimators.
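作为对照,摘要中提到的经典基线之一,即高斯先验与高斯噪声假设下的最大后验估计(岭回归),可以用几行 NumPy 写出。下面是一个玩具设定(维度、噪声水平与正则系数均为假设),用于说明 Transformer 上下文学习所对比的对象:

```python
import numpy as np

def make_prompt(n=32, d=8, noise=0.5, rng=None):
    """Toy noisy linear-regression 'prompt': n (x_i, y_i) examples
    generated from a random coefficient vector w."""
    rng = rng or np.random.default_rng(0)
    w = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    y = X @ w + noise * rng.standard_normal(n)
    return X, y, w

def ridge_baseline(X, y, lam):
    """Classical baseline: ridge regression, i.e. the MAP estimate under
    Gaussian prior and Gaussian noise assumptions."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

当噪声为重尾或系数非高斯时,这一估计不再最优,这正是论文考察 Transformer 能否仍匹敌或超越经典估计器的分布偏移场景。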

[AI-55] Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在真实世界中进行强化学习(Reinforcement Learning, RL)微调时面临的泛化能力受限问题。由于物理世界中场景和物体多样性的扩展成本极高,直接在真实环境中训练会导致模型过拟合特定场景,从而削弱其通用性。解决方案的关键在于利用3D世界生成模型(3D world generative models)与语言驱动的场景设计机制,自动构建数百个包含独特物体和背景的多样化交互场景,从而实现可扩展且高度并行化的策略学习。这一方法显著提升了模拟环境中的成功率(从9.7%提升至79.8%),并通过数字孪生质量与领域随机化实现了有效的模拟到现实迁移(真实世界成功率从21.7%提升至75%),同时验证了场景多样性对零样本泛化性能的正向影响。

链接: https://arxiv.org/abs/2603.18532
作者: Andrew Choi,Xinjie Wang,Zhizhong Su,Wei Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these models together with a language-driven scene designer, we generate hundreds of diverse interactive scenes containing unique objects and backgrounds, enabling scalable and highly parallel policy learning. Starting from a pretrained imitation baseline, our approach increases simulation success from 9.7% to 79.8% while achieving a 1.25× speedup in task completion time. We further demonstrate successful sim-to-real transfer enabled by the quality of the generated digital twins together with domain randomization, improving real-world success from 21.7% to 75% and achieving a 1.13× speedup. Finally, we further highlight the benefits of leveraging the effectively unlimited data from 3D world generative models through an ablation study showing that increasing scene diversity directly improves zero-shot generalization.

[AI-56] Correlation-Weighted Multi-Reward Optimization for Compositional Generation

【速读】:该论文旨在解决文本到图像生成模型在处理多概念组合提示时的性能瓶颈问题,即模型难以同时满足多个概念要求,常出现部分概念缺失或生成不一致的现象。其核心挑战在于多概念奖励信号在优化过程中存在相互干扰,导致无法有效联合优化。解决方案的关键在于提出Correlation-Weighted Multi-Reward Optimization方法,通过建模不同概念奖励之间的相关性结构,自适应地调整各概念的权重:对冲突性强或难以满足的概念赋予更高权重,从而平衡竞争性奖励信号,并聚焦于那些部分满足但生成不稳定的概念,提升整体组合生成的一致性和完整性。该方法在SD3.5和FLUX.1-dev等先进扩散模型上验证了有效性,在ConceptMix、GenEval 2和T2I-CompBench等多个复杂多概念基准测试中均取得显著改进。

链接: https://arxiv.org/abs/2603.18528
作者: Jungmyung Wi,Hyunsoo Kim,Donghyun Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-image models produce images that align well with natural language prompts, but compositional generation has long been a central challenge. Models often struggle to satisfy multiple concepts within a single prompt, frequently omitting some concepts and resulting in partial success. Such failures highlight the difficulty of jointly optimizing multiple concepts during reward optimization, where competing concepts can interfere with one another. To address this limitation, we propose Correlation-Weighted Multi-Reward Optimization, a framework that leverages the correlation structure among concept rewards to adaptively weight each attribute concept in optimization. By accounting for interactions among concepts, the proposed method balances competing reward signals and emphasizes concepts that are partially satisfied yet inconsistently generated across samples, improving compositional generation. Specifically, we decompose multi-concept prompts into pre-defined concept groups (e.g., objects, attributes, and relations) and obtain reward signals from dedicated reward models for each concept. We then adaptively reweight these rewards, assigning higher weights to conflicting or hard-to-satisfy concepts using correlation-based difficulty estimation. By focusing optimization on the most challenging concepts within each group, the method encourages the model to consistently satisfy all requested attributes simultaneously. We apply our approach to train state-of-the-art diffusion models, SD3.5 and FLUX.1-dev, and demonstrate consistent improvements on challenging multi-concept benchmarks, including ConceptMix, GenEval 2, and T2I-CompBench.
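基于摘要的描述,概念难度可由满足率与概念间相关性共同估计。以下是一个假设性的权重计算示意(论文摘要未给出具体加权公式,下面的函数名与构造均为示例,仅说明“低满足率或强冲突的概念获得更大权重”这一思想):

```python
import numpy as np

def concept_weights(R, tau=0.5):
    """Hypothetical sketch of correlation-aware concept reweighting.
    R: (n_samples, n_concepts) binary rewards from per-concept reward models.
    Concepts that are rarely satisfied, or negatively correlated with others
    (interference), receive larger optimization weights."""
    rate = R.mean(axis=0)                          # per-concept satisfaction rate
    C = np.corrcoef(R, rowvar=False)               # concept-concept correlation
    conflict = np.clip(-C, 0.0, None).sum(axis=0)  # negative correlation = conflict
    difficulty = (1.0 - rate) + conflict
    w = np.exp(difficulty / tau)                   # softmax over difficulty
    return w / w.sum()
```

满足率更低、与其他概念负相关更强的概念获得更高权重,优化因此聚焦于最难同时满足的概念组。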

[AI-57] Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM

【速读】:该论文旨在解决专家角色提示(expert persona prompting)在大语言模型(Large Language Models, LLMs)中效果不稳定的问题,即其在某些任务中能提升性能和数据多样性,而在其他场景下则几乎无效甚至产生负面影响。为实现对专家角色提示的稳定利用并避免其潜在危害,论文系统研究了模型优化方式、任务类型、提示长度及位置等因素对专家角色提示有效性的影响机制。解决方案的关键在于提出了一种名为PRISM(Persona Routing via Intent-based Self-Modeling)的自动化管道,该方法通过自蒸馏(self-distillation)将意图条件化的专家角色提示转化为一个门控LoRA适配器(gated LoRA adapter),整个过程无需外部数据、模型或知识,仅依赖模型自身进行Bootstrapping训练。PRISM在保持判别任务准确率的同时显著提升了生成任务中的人类偏好和安全性对齐,且计算与内存开销极低。

链接: https://arxiv.org/abs/2603.18507
作者: Zizhao Hu,Mohammad Rostami,Jesse Thomason
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Persona prompting can steer LLM generation towards a domain-specific tone and pattern. This behavior enables use cases in multi-agent systems where diverse interactions are crucial and human-centered tasks require high-level human alignment. Prior works provide mixed opinions on their utility: some report performance gains when using expert personas for certain domains and their contribution to data diversity in synthetic data creation, while others find near-zero or negative impact on general utility. To fully leverage the benefits of the LLM persona and avoid its harmfulness, a more comprehensive investigation of the mechanism is crucial. In this work, we study how model optimization, task type, prompt length, and placement can impact expert persona effectiveness across instruction-tuned and reasoning LLMs, and provide insight into conditions under which expert personas fail and succeed. Based on our findings, we developed a pipeline to fully leverage the benefits of an expert persona, named PRISM (Persona Routing via Intent-based Self-Modeling), which self-distills an intent-conditioned expert persona into a gated LoRA adapter through a bootstrapping process that requires no external data, models, or knowledge. PRISM enhances human preference and safety alignment on generative tasks while maintaining accuracy on discriminative tasks across all models, with minimal memory and computing overhead.
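PRISM 把意图条件化的专家角色蒸馏进一个门控 LoRA 适配器。门控 LoRA 的前向传播骨架可示意如下(一般性的 LoRA 门控示意,并非 PRISM 的具体网络结构;gate 可理解为由意图路由产生的标量):

```python
import numpy as np

def gated_lora_forward(x, W, A, B, gate):
    """Generic gated-LoRA sketch: base projection plus a low-rank update
    (B @ A) scaled by an intent-dependent gate in [0, 1]."""
    return x @ W.T + gate * (x @ A.T) @ B.T
```

gate=0 时退化为原始权重(判别任务保持准确率),gate=1 时完全启用角色适配(生成任务增强对齐),从而按意图在两种行为之间切换。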

[AI-58] Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning CVPR2026

【速读】:该论文旨在解决视频指令机器人编程中的跨域适应问题,即由于演示与部署环境在感知和物理层面存在差异,导致生成的控制代码出现程序性不匹配,进而影响任务成功率。现有视觉语言模型(Vision-Language Models, VLMs)缺乏对任务过程的因果理解能力,难以在域偏移下重构合理的操作逻辑。其解决方案的关键在于提出一种神经符号反事实推理框架(NeSyCR),该框架将视频演示抽象为捕捉任务本质的符号轨迹,并基于部署观测推导出反事实状态以识别跨域不兼容性;随后通过在符号状态空间中进行可验证的探索,生成恢复与示范程序一致性的程序修订策略,从而实现可靠的任务程序自适应。

链接: https://arxiv.org/abs/2603.18495
作者: Jooyoung Kim,Wonje Choi,Younguk Song,Honguk Woo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Recent advances in Vision-Language Models (VLMs) have enabled video-instructed robotic programming, allowing agents to interpret video demonstrations and generate executable control code. We formulate video-instructed robotic programming as a cross-domain adaptation problem, where perceptual and physical differences between demonstration and deployment induce procedural mismatches. However, current VLMs lack the procedural understanding needed to reformulate causal dependencies and achieve task-compatible behavior under such domain shifts. We introduce NeSyCR, a neurosymbolic counterfactual reasoning framework that enables verifiable adaptation of task procedures, providing a reliable synthesis of code policies. NeSyCR abstracts video demonstrations into symbolic trajectories that capture the underlying task procedure. Given deployment observations, it derives counterfactual states that reveal cross-domain incompatibilities. By exploring the symbolic state space with verifiable checks, NeSyCR proposes procedural revisions that restore compatibility with the demonstrated procedure. NeSyCR achieves a 31.14% improvement in task success over the strongest baseline Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.

[AI-59] AlignMamba-2: Enhancing Multimodal Fusion and Sentiment Analysis with Modality-Aware Mamba

【速读】:该论文旨在解决大规模预训练模型在情感计算任务中有效适配通用知识的难题,尤其关注计算效率与多模态异质性之间的平衡问题。现有基于Transformer的方法虽能建模跨模态依赖关系,但其二次计算复杂度限制了其在长序列数据上的应用;而Mamba类模型虽具高效性,却因固有的顺序扫描机制难以捕捉对跨模态对齐至关重要的全局非序列关系。解决方案的关键在于提出AlignMamba-2框架,其核心创新包括:(1)一种双对齐策略,通过最优传输距离(Optimal Transport distance)和最大均值差异(Maximum Mean Discrepancy)正则化模型,在不增加推理开销的前提下促进模态间的几何与统计一致性;(2)设计了一种模态感知的Mamba层(Modality-Aware Mamba layer),采用专家混合(Mixture-of-Experts)架构引入模态特定与共享专家,显式处理融合过程中的数据异质性。该方案在四个具有挑战性的基准数据集上验证了其在动态时间序列(CMU-MOSI/MOSEI)与静态图像相关任务(NYU-Depth V2/MVSA-Single)中的优越性能,实现了效果与效率的双重提升。

链接: https://arxiv.org/abs/2603.18462
作者: Yan Li,Yifei Xing,Xiangyuan Lan,Xin Li,Haifeng Chen,Dongmei Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by Pattern Recognition

点击查看摘要

Abstract:In the era of large-scale pre-trained models, effectively adapting general knowledge to specific affective computing tasks remains a challenge, particularly regarding computational efficiency and multimodal heterogeneity. While Transformer-based methods have excelled at modeling inter-modal dependencies, their quadratic computational complexity limits their use with long-sequence data. Mamba-based models have emerged as a computationally efficient alternative; however, their inherent sequential scanning mechanism struggles to capture the global, non-sequential relationships that are crucial for effective cross-modal alignment. To address these limitations, we propose AlignMamba-2, an effective and efficient framework for multimodal fusion and sentiment analysis. Our approach introduces a dual alignment strategy that regularizes the model using both Optimal Transport distance and Maximum Mean Discrepancy, promoting geometric and statistical consistency between modalities without incurring any inference-time overhead. More importantly, we design a Modality-Aware Mamba layer, which employs a Mixture-of-Experts architecture with modality-specific and modality-shared experts to explicitly handle data heterogeneity during the fusion process. Extensive experiments on four challenging benchmarks, including dynamic time-series (on the CMU-MOSI and CMU-MOSEI datasets) and static image-related tasks (on the NYU-Depth V2 and MVSA-Single datasets), demonstrate that AlignMamba-2 establishes a new state-of-the-art in both effectiveness and efficiency across diverse pattern recognition tasks, ranging from dynamic time-series analysis to static image-text classification.

[AI-60] Discounted Beta–Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

【速读】:该论文旨在解决基于群体的强化学习与可验证奖励(Group-based Reinforcement Learning with Verifiable Rewards, RLVR)方法中存在的样本效率低下问题。现有方法依赖少量轨迹的奖励点估计,导致估计方差高、方差坍塌以及生成响应利用不足。其解决方案的关键在于从统计估计视角重新构建RLVR:将奖励建模为由策略诱导分布中抽取的样本,并将优势计算转化为从有限数据中估计奖励分布的问题。在此基础上提出折扣贝塔-伯努利(Discounted Beta–Bernoulli, DBB)奖励估计方法,利用历史奖励统计信息处理非平稳分布,虽存在偏差但显著降低并稳定了估计方差,理论上避免方差坍塌,且均方误差低于标准点估计。实验证明,结合DBB的GRPO在多个推理基准上优于基线方法,且无需额外计算或内存开销。

链接: https://arxiv.org/abs/2603.18444
作者: Haechan Kim,Soohyun Ryu,Gyouk Chu,Doohyuk Jang,Eunho Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta–Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
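折扣 Beta–Bernoulli 估计的核心递推只有两行:以折扣因子 γ 衰减历史伪计数,再累加新观测。由于折扣使有效样本量被 1/(1-γ) 上界约束、伪计数始终为正,后验方差对任意有限历史严格大于零,这对应摘要中“避免估计方差坍塌”的性质(极简示意,先验与折扣取值为假设):

```python
class DiscountedBetaBernoulli:
    """Minimal sketch of a discounted Beta-Bernoulli estimator for a
    non-stationary binary (verifiable) reward stream."""

    def __init__(self, gamma=0.9, alpha0=1.0, beta0=1.0):
        self.gamma, self.alpha, self.beta = gamma, alpha0, beta0

    def update(self, reward):
        # Discount old pseudo-counts, then add the new observation.
        self.alpha = self.gamma * self.alpha + reward
        self.beta = self.gamma * self.beta + (1.0 - reward)

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self):
        # Beta posterior variance: strictly positive for any finite history,
        # unlike the plug-in p*(1-p), which collapses to 0 when all sampled
        # rewards in a group agree.
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * n * (n + 1.0))
```

对比之下,GRPO 式的点估计在一组 rollout 奖励全为 0 或全为 1 时方差恰好为零;折扣后验则始终保留非零的不确定性。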

[AI-61] AS2 – Attention-Based Soft Answer Sets: An End-to-End Differentiable Neuro-Soft-Symbolic Reasoning Architecture

【速读】:该论文旨在解决传统神经符号人工智能(Neuro-symbolic AI)系统中因神经感知模块与离散符号求解器之间存在不可微边界而导致约束满足反馈无法传递至感知编码器的问题。其解决方案的关键在于提出AS2(Attention-Based Soft Answer Sets),该架构用一个软化的、连续的Answer Set Programming(ASP)立即后果算子 $ T_P $ 的近似替代了原有的离散求解器,从而实现了端到端的可微性。AS2在前向传播过程中保持每个位置对有限符号域的概率分布,并通过最小化概率化版本 $ T_P $ 的固定点残差进行训练,使模型能够直接对约束检查进行梯度传播,无需在训练或推理阶段调用外部求解器。此外,该方法摒弃了传统的位置嵌入,改用反映ASP声明式约束规范的约束组成员嵌入来编码问题结构,使模型对任意位置索引无感。

链接: https://arxiv.org/abs/2603.18436
作者: Wael AbdAlmageed
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neuro-symbolic artificial intelligence (AI) systems typically couple a neural perception module to a discrete symbolic solver through a non-differentiable boundary, preventing constraint-satisfaction feedback from reaching the perception encoder during training. We introduce AS2 (Attention-Based Soft Answer Sets), a fully differentiable neuro-symbolic architecture that replaces the discrete solver with a soft, continuous approximation of the Answer Set Programming (ASP) immediate consequence operator T_P . AS2 maintains per-position probability distributions over a finite symbol domain throughout the forward pass and trains end-to-end by minimizing the fixed-point residual of a probabilistic lift of T_P , thereby differentiating through the constraint check without invoking an external solver at either training or inference time. The architecture is entirely free of conventional positional embeddings. Instead, it encodes problem structure through constraint-group membership embeddings that directly reflect the declarative ASP specification, making the model agnostic to arbitrary position indexing. On Visual Sudoku, AS2 achieves 99.89% cell accuracy and 100% constraint satisfaction (verified by Clingo) across 1,000 test boards, using a greedy constrained decoding procedure that requires no external solver. On MNIST Addition with N \in \2, 4, 8\ addends, AS2 achieves digit accuracy above 99.7% across all scales. These results demonstrate that a soft differentiable fixpoint operator, combined with constraint-aware attention and declarative constraint specification, can match or exceed pipeline and solver-based neuro-symbolic systems while maintaining full end-to-end differentiability.

[AI-62] Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression ICLR2026

【速读】:该论文旨在解决多压缩方法联合使用时压缩顺序(compression order)对模型性能影响的未充分研究问题。当前多数研究假设不同压缩技术之间相互独立,或仅在受限场景下分析顺序效应,导致对压缩顺序在整体压缩效率中的作用缺乏系统理解。论文提出关键解决方案——"渐进强度假说"(Progressive Intensity Hypothesis),主张较弱扰动(如轻度剪枝)应优先于较强扰动(如低精度量化)执行,从而优化压缩流程。理论分析表明,该顺序的优势随原始性能差距增大而增强;实验验证了该假说在语言和视觉模型上的有效性,并扩展至多阶段压缩与混合精度量化等更广泛场景。

链接: https://arxiv.org/abs/2603.18426
作者: Minjun Kim,Jaehyeon Choi,Hyunwoo Yang,Jongjin Kim,Jinho Song,U Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:What happens when multiple compression methods are combined-does the order in which they are applied matter? Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization. A central but underexplored factor in joint model compression is the compression order, or the sequence of different methods within the compression pipeline. Most prior studies have either sidestepped the issue by assuming orthogonality between techniques, while a few have examined them only in highly constrained cases. Consequently, the broader role of compression order in shaping model performance remains poorly understood. In this paper, we address the overlooked problem of compression order and provide both theoretical and empirical analysis. We formulate the problem of optimizing the compression order and introduce the Progressive Intensity Hypothesis, which states that weaker perturbations should precede stronger ones. We provide theoretical guarantees showing that the relative benefit of one order increases with the underlying performance gap. Extensive experiments on both language and vision models validate the hypothesis, and further show its generality to broader setups such as multi-stage compression and mixed-precision quantization.
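为直观感受压缩顺序的影响,可以用 NumPy 做一个玩具实验(仅为示意,与论文的实验设置无关;剪枝率 0.5 与 4-bit 位宽均为假设值),分别比较"先剪枝后量化"与"先量化后剪枝"在同一随机权重矩阵上的重构误差:

```python
import numpy as np

def prune(w, ratio=0.5):
    # 幅值剪枝:将绝对值最小的 ratio 比例权重置零
    k = int(w.size * ratio)
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

def quantize(w, bits=4):
    # 均匀对称量化到 2^(bits-1)-1 个正负级别(示意)
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    if scale == 0:
        return w.copy()
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
pq = quantize(prune(w, 0.5), bits=4)   # 先剪枝后量化
qp = prune(quantize(w, bits=4), 0.5)   # 先量化后剪枝
err_pq = float(np.mean((w - pq) ** 2))
err_qp = float(np.mean((w - qp) ** 2))
```

两个顺序一般给出不同的重构误差;论文的假说关心的正是哪种顺序在何种条件下更优。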

[AI-63] The Impact of Corporate AI Washing on Farmers' Digital Financial Behavior Response – An Analysis from the Perspective of Digital Financial Exclusion

【速读】:该论文旨在解决数字金融发展背景下,金融科技公司存在的“AI洗牌”(AI washing)现象对农户数字金融行为产生抑制作用的问题。其核心问题是:当企业夸大人工智能(AI)能力却缺乏实质性投入时,如何影响农户参与数字金融的行为响应,以及如何缓解这一负面效应。解决方案的关键在于多维协同干预:一是监管层面建立严格的AI技术信息披露制度,以减少信息不对称;二是通过差异化数字金融教育提升弱势群体的识别能力;三是借助社会资本(如数字金融互助小组)增强农户抵御误导性宣传的韧性;四是完善农民在数字金融中的消费者保护机制,并试点设立“数字普惠金融示范县”,推动政策落地与效果验证。

链接: https://arxiv.org/abs/2603.18421
作者: Li Wenxiu,Wen Zhanjie,Xia Jiechang,Guo Jingqiao
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 35 pages, 4 tables; empirical research on rural digital finance and fintech, using CHFS2019 data (6,800 rural households) and corporate AI investment data, incorporating Logit/Ologit/GSEM models; suitable for agricultural economics/financial inclusion journals

点击查看摘要

Abstract:In the context of the rapid development of digital finance, some financial technology companies exhibit the phenomenon of “AI washing,” where they overstate their AI capabilities while underinvesting in actual AI resources. This paper constructs a corporate-level AI washing index based on CHFS2019 data and AI investment data from 15-20 financial technology companies, analyzing and testing its impact on farmers’ digital financial behavior response. The study finds that AI washing significantly suppresses farmers’ digital financial behavior; the higher the degree of AI washing, the lower the response level of farmers’ digital financial behavior. Moreover, AI washing indirectly inhibits farmers’ behavioral responses by exacerbating knowledge exclusion and risk exclusion. Social capital can positively moderate the negative impact of AI washing; among farmer groups with high social capital, the suppressive effect of AI washing on digital financial behavior is significantly weaker than that among groups with low social capital. In response, this paper suggests that regulatory authorities establish a strict information disclosure system for AI technology, conduct differentiated digital financial education to enhance the identification capabilities of vulnerable groups, promote digital financial mutual aid groups to leverage the protective effects of social capital, improve the consumer protection mechanism for farmers in digital finance, and set up pilot “Digital Inclusive Finance Demonstration Counties,” etc.

[AI-64] Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration

【速读】:该论文旨在解决稀疏注意力机制(Sparse Attention)在实际应用中因超参数优化依赖人工调参而导致的可用性瓶颈问题,即不同层和注意力头的最优超参数差异显著,而现有方法(如SpargeAttn)需通过耗时的手动网格搜索来确定。其解决方案的关键在于提出AFBS-BO(Adaptive Fidelity Binary Search with Bayesian Optimization),一种完全自动化的框架,通过融合贝叶斯优化(Bayesian Optimization)进行全局探索与二分搜索(Binary Search)实现局部精细化调整,并利用多保真度评估策略(multi-fidelity evaluation)在不同序列长度下降低调参成本,从而显著提升超参数发现效率并获得优于现有稀疏注意力基线的性能表现。

链接: https://arxiv.org/abs/2603.18417
作者: Arundhathi Dev,Justin Zhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the International Conference on Machine Intelligence Theory and Applications (MiTA 2026)

点击查看摘要

Abstract:Sparse attention mechanisms promise to break the quadratic bottleneck of long-context transformers, yet production adoption remains limited by a critical usability gap: optimal hyperparameters vary substantially across layers and models, and current methods (e.g., SpargeAttn) rely on manual grid search to identify them. We propose AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), a fully automated framework that discovers optimal layer- and head-specific hyperparameters without human intervention. Our hybrid algorithm combines Bayesian Optimization for global exploration with binary search for local refinement, leveraging multi-fidelity evaluation across sequence lengths to reduce tuning cost. On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4x with 8.8x fewer evaluations than grid search, and identifies high-sparsity configurations that outperform existing sparse attention baselines while closely matching dense attention quality. By transforming sparse attention from a manually tuned heuristic into a self-optimizing primitive, AFBS-BO enables plug-and-play acceleration across diverse transformer architectures and domains.
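AFBS-BO 的完整流程结合了贝叶斯优化(全局探索)与二分搜索(局部精细化),这里仅用纯 Python 示意后一半:在"质量随稀疏度单调不增"的假设下,二分搜索出仍满足质量约束的最大稀疏度(quality_fn 与阈值均为虚构示例,BO 部分省略):

```python
def binary_search_sparsity(quality_fn, tol_quality, lo=0.0, hi=1.0, iters=20):
    """二分搜索满足 quality_fn(s) >= tol_quality 的最大稀疏度 s(示意)。

    假设 quality_fn 随稀疏度单调不增;全局探索(BO)部分省略。
    """
    best = lo
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if quality_fn(mid) >= tol_quality:
            best, lo = mid, mid   # 质量达标,尝试更高稀疏度
        else:
            hi = mid              # 质量不达标,收缩上界
    return best

# 虚构的质量模型:质量随稀疏度线性下降
s = binary_search_sparsity(lambda x: 1.0 - 0.5 * x, tol_quality=0.8)
```

对每层/每头分别运行这类搜索,并在不同序列长度(不同保真度)上评估 quality_fn,即可得到多保真调参的大致骨架。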

[AI-65] The Spillover Effects of Peer AI Washing on Corporate Green Innovation

【速读】:该论文旨在解决企业“AI洗牌”(AI washing)行为对绿色创新(green innovation)产生的挤出效应问题,即企业在年报中虚假标榜人工智能应用,导致资源错配并抑制真正可持续的绿色技术创新。其解决方案的关键在于通过政策工具组合实现市场均衡优化:一是政府应设计差异化支持措施以“提升市场回报并缓解融资约束”,二是实施分类监管策略,三是建立“专业识别与声誉惩戒相结合”的信息披露机制,从而遏制企业间模仿性的AI洗牌行为,促进真实创新投入。

链接: https://arxiv.org/abs/2603.18415
作者: Li Wenxiu,Wen Zhanjie,Xia Jiechang,Guo Jingqiao
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 32 pages, 6 tables; empirical research on corporate finance and the digital economy, using Chinese A-share listed companies data (2006-2024), incorporating agent-based modelling simulations; suitable for finance/innovation economics journals

点击查看摘要

Abstract:At a time when the phenomenon of ‘AI washing’ is quietly spreading, an increasing number of enterprises are using the label of artificial intelligence merely as a cosmetic embellishment in their annual reports, rather than as a genuine engine driving transformation. A test regarding the essence of innovation and the authenticity of information disclosure has arrived. This paper employs large language models to conduct semantic analysis on the text of annual reports from Chinese A-share listed companies from 2006 to 2024, systematically examining the impact of corporate AI washing behaviour on their green innovation. The research reveals that corporate AI washing exerts a significant crowding-out effect on green innovation, with this negative relationship transmitted through dual channels in both product and capital markets. Furthermore, this crowding-out effect exhibits heterogeneity across firms and industries, with private enterprises, small and medium-sized enterprises (SMEs), and firms in highly competitive sectors suffering more severe negative impacts from AI washing. Simulation results indicate that a combination of policy tools can effectively improve market equilibrium. Based on this, this paper proposes that the government should design targeted support tools to ‘enhance market returns and alleviate financing constraints’, adopt a differentiated regulatory strategy, and establish a disclosure mechanism combining ‘professional identification and reputational sanctions’ to curb such peer AI washing behaviour.

[AI-66] From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

【速读】:该论文旨在解决生成式 AI(Generative AI)在匿名数据中通过推理驱动的链接(inference-driven linkage)实现身份重建的隐私风险问题。传统观点认为,匿名化数据因缺乏明确标识符且重识别需大量人力与专业算法而具备安全性,但本文揭示了大语言模型(LLM)代理可自主整合分散的非识别性线索与公开信息,在无需特定任务启发式规则的情况下完成身份推断。其解决方案的关键在于提出并验证“推理驱动链接”这一新型隐私威胁框架,并通过三个场景(经典链接案例、InferLink基准测试及现代文本丰富数据)系统评估发现:LLM代理不仅能在固定池匹配和开放式身份解析任务中成功还原身份(如Netflix Prize中达到79.2%准确率),还能在无恶意意图的跨源分析中自然产生链接行为,从而表明身份推断本身应被视为首要隐私风险,隐私评估必须量化模型所能推断出的身份信息。

链接: https://arxiv.org/abs/2603.18382
作者: Myeongseob Ko,Jihyun Jeong,Sumiran Singh Thakur,Gyuhak Kim,Ruoxi Jia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Anonymization is widely treated as a practical safeguard because re-identifying anonymous records was historically costly, requiring domain expertise, tailored algorithms, and manual corroboration. We study a growing privacy risk that may weaken this barrier: LLM-based agents can autonomously reconstruct real-world identities from scattered, individually non-identifying cues. By combining these sparse cues with public information, agents resolve identities without bespoke engineering. We formalize this threat as inference-driven linkage and systematically evaluate it across three settings: classical linkage scenarios (Netflix and AOL), InferLink (a controlled benchmark varying task intent, shared cues, and attacker knowledge), and modern text-rich artifacts. Without task-specific heuristics, agents successfully execute both fixed-pool matching and open-ended identity resolution. In the Netflix Prize setting, an agent reconstructs 79.2% of identities, significantly outperforming a 56.0% classical baseline. Furthermore, linkage emerges not only under explicit adversarial prompts but also as a byproduct of benign cross-source analysis in InferLink and unstructured research narratives. These findings establish that identity inference – not merely explicit information disclosure – must be treated as a first-class privacy risk; evaluations must measure what identities an agent can infer.

[AI-67] PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents

【速读】:该论文旨在解决云托管大语言模型(Large Language Models, LLMs)在代理系统中进行规划时,因需访问本地私有环境状态(如源代码、文件、凭证等敏感信息)而导致的隐私泄露问题。现有方案仅关注执行隔离、访问控制或可信推理,未能限制云规划器对原始本地上下文的可观测性。解决方案的关键在于提出 PlanTwin 架构:通过构建一个面向规划任务的数字孪生(planning-oriented digital twin),将真实环境投影为结构保留但去标识化的抽象图,从而在不暴露可重构细节的前提下支持云端规划;同时引入本地网关(gatekeeper)实施安全策略与累积披露预算,并基于 (k,δ)-匿名性和 ε-不可链接性定义隐私目标,以实现隐私与规划效用之间的可控权衡。

链接: https://arxiv.org/abs/2603.18377
作者: Guangsheng Yu,Qin Wang,Rui Lang,Shuai Su,Xu Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Cloud-hosted large language models (LLMs) have become the de facto planners in agentic systems, coordinating tools and guiding execution over local environments. In many deployments, however, the environment being planned over is private, containing source code, files, credentials, and metadata that cannot be exposed to the cloud. Existing solutions address adjacent concerns, such as execution isolation, access control, or confidential inference, but they do not control what cloud planners observe during planning: within the permitted scope, raw environment state is still exposed. We introduce PlanTwin, a privacy-preserving architecture for cloud-assisted planning without exposing raw local context. The key idea is to project the real environment into a planning-oriented digital twin: a schema-constrained and de-identified abstract graph that preserves planning-relevant structure while removing reconstructable details. The cloud planner operates solely on this sanitized twin through a bounded capability interface, while a local gatekeeper enforces safety policies and cumulative disclosure budgets. We further formalize the privacy-utility trade-off as a capability granularity problem, define architectural privacy goals using (k, δ)-anonymity and ε-unlinkability, and mitigate compositional leakage through multi-turn disclosure control. We implement PlanTwin as middleware between local agents and cloud planners and evaluate it on 60 agentic tasks across ten domains with four cloud planners. PlanTwin achieves full sensitive-item non-disclosure (SND = 1.0) while maintaining planning quality close to full-context systems: three of four planners achieve PQS ≥ 0.79, and the full pipeline incurs less than 2.2% utility loss.
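摘要中以 (k, δ)-匿名性刻画隐私目标,其中 k-匿名性的核心判定可以用几行 Python 表达(仅为概念示意;δ 松弛与 ε-不可链接性部分省略,记录字段与数据均为虚构):

```python
from collections import Counter

def is_k_anonymous(disclosed_records, k):
    """当每个被披露的准标识符组合至少出现 k 次时返回 True(示意)。"""
    counts = Counter(tuple(r) for r in disclosed_records)
    return all(c >= k for c in counts.values())

# 虚构的披露记录:(语言, 角色)二元组
records = [("python", "dev"), ("python", "dev"), ("go", "ops")]
```

网关在累积披露预算内,只放行通过此类匿名性检查的抽象图属性,即可限制云规划器可观测的信息。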

[AI-68] LGESynthNet: Controlled Scar Synthesis for Improved Scar Segmentation in Cardiac LGE-MRI Imaging MICCAI

【速读】:该论文旨在解决左心室晚期增强心脏磁共振成像(Late Gadolinium Enhancement Cardiac MRI, LGE-CMR)中增强区域分割的挑战,即由于像素级标注数据稀缺且标注过程繁琐,导致深度学习模型训练受限的问题。解决方案的关键在于提出一种基于潜在扩散模型(latent diffusion model)的可控增强合成框架LGESynthNet,其核心创新包括:(1) 采用ControlNet架构实现图像修复(inpainting)任务中的细粒度条件控制,支持对增强区域的大小、位置及透壁范围进行显式调控;(2) 引入奖励模型(reward model)提供条件特定监督信号,提升生成结果与输入条件的一致性;(3) 结合解剖描述性文本提示(captioning module)和生物医学文本编码器(biomedical text encoder),实现语义引导的生成控制。该方法仅需429张图像(79例患者)即可训练,并通过质量控制过滤机制筛选高保真样本用于下游任务的数据增强,显著提升分割与检测性能,最高分别提升6和20个点。

链接: https://arxiv.org/abs/2603.18356
作者: Athira J. Jacob,Puneet Sharma,Daniel Rueckert
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at MICCAI STACOM workshop 2025

点击查看摘要

Abstract:Segmentation of enhancement in LGE cardiac MRI is critical for diagnosing various ischemic and non-ischemic cardiomyopathies. However, creating pixel-level annotations for these images is challenging and labor-intensive, leading to limited availability of annotated data. Generative models, particularly diffusion models, offer promise for synthetic data generation, yet many rely on large training datasets and often struggle with fine-grained conditioning control, especially for small or localized features. We introduce LGESynthNet, a latent diffusion-based framework for controllable enhancement synthesis, enabling explicit control over size, location, and transmural extent. Formulated as inpainting using a ControlNet-based architecture, the model integrates: (a) a reward model for conditioning-specific supervision, (b) a captioning module for anatomically descriptive text prompts, and (c) a biomedical text encoder. Trained on just 429 images (79 patients), it produces realistic, anatomically coherent samples. A quality control filter selects outputs with high conditioning-fidelity, which, when used for training augmentation, improve downstream segmentation and detection performance by up to 6 and 20 points, respectively.

[AI-69] Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

【速读】:该论文试图解决生成式 AI 模型中“知识-行为鸿沟”(knowledge-action gap)的问题,即模型内部表征所蕴含的诊断知识远超其输出表现,而当前机制可解释性方法是否能有效将这种隐含知识转化为准确的决策尚不明确。研究的关键在于系统评估四种机制可解释性方法——概念瓶颈引导(Concept Bottleneck Steering)、稀疏自编码器特征引导(Sparse Autoencoder Feature Steering)、对数透镜结合激活修补(Logit Lens with Activation Patching)以及线性探测与真实性分离向量引导(Linear Probing with Truthfulness Separator Vector Steering)——在纠正临床案例中假阴性误诊方面的有效性。结果表明,尽管线性探测能以98.2%的AUROC识别危险情况,但模型输出敏感性仅为45.1%,且多数可解释性方法要么无效(如SAE),要么在修正错误时显著破坏正确判断(如概念瓶颈引导),说明现有方法尚无法可靠地将内部知识转化为可信赖的输出修正,这对依赖可解释性实现AI安全纠错的框架提出了根本性挑战。

链接: https://arxiv.org/abs/2603.18353
作者: Sanjay Basu,Sadiq Y. Patel,Parth Sheth,Bhairavi Muralidharan,Namrata Elamaran,Aakriti Kinra,John Morgan,Rajaie Batniji
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 5 figures, 10 tables. Code available at this https URL

点击查看摘要

Abstract:Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested. We compared four mechanistic interpretability methods – concept bottleneck steering (Steerling-8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) – for correcting false-negative triage errors using 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign). Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet the model’s output sensitivity was only 45.1%, a 53-percentage-point knowledge-action gap. Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections, indistinguishable from random perturbation (p=0.84). SAE feature steering produced zero effect despite 3,695 significant features. TSV steering at high strength corrected 24% of missed hazards while disrupting 6% of correct detections, but left 76% of errors uncorrected. Current mechanistic interpretability methods cannot reliably translate internal knowledge into corrected outputs, with implications for AI safety frameworks that assume interpretability enables effective error correction.
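摘要中"线性探针 98.2% AUROC"与"输出敏感性仅 45.1%"的对比是知识-行为鸿沟的核心证据;AUROC 本身可按成对比较的定义直接计算(纯 Python 示意,分数数据为虚构):

```python
def auroc(pos_scores, neg_scores):
    """AUROC = 随机抽取的正例得分高于随机负例的概率(平局计 0.5)。"""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

对探针在危险/良性病例上的分数套用该定义,即可复现这类判别力指标;它衡量的是内部表征"知道什么",而非模型输出"做了什么"。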

[AI-70] Shifting Uncertainty to Critical Moments: Towards Reliable Uncertainty Quantification for VLA Model

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人连续控制中缺乏可靠不确定性量化的问题,尤其是传统基于均值聚合的不确定性信号会稀释关键但短暂的高风险波动,导致失败预测不准确。解决方案的关键在于提出一种统一的不确定性量化方法:(1)采用基于最大值的滑动窗口池化策略以保留瞬时风险信号;(2)引入运动感知的稳定性加权机制,强调与不稳定行为相关的高频动作振荡;(3)通过贝叶斯优化实现自由度(Degrees of Freedom, DoF)自适应校准,优先关注运动学上关键的轴向。实验表明,该方法显著提升了失败预测准确性,并生成更可靠的失败检测信号,支持后续人机协同干预。

链接: https://arxiv.org/abs/2603.18342
作者: Yanchuan Tang,Taowen Wang,Yuefei Chen,Boxuan Zhang,Qiang Guan,Ruixiang Tang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models enable general-purpose robotic policies by mapping visual observations and language instructions to low-level actions, but they often lack reliable introspection. A common practice is to compute a token-level uncertainty signal and take its mean over a rollout. However, mean aggregation can dilute short-lived but safety-critical uncertainty spikes in continuous control. In particular, successful rollouts may contain localized high-entropy segments due to benign noise or non-critical micro-adjustments, while failure rollouts can appear low-entropy for most timesteps and only exhibit brief spikes near the onset of failure. We propose a unified uncertainty quantification approach for predicting rollout success versus failure that (1) uses max-based sliding window pooling to preserve transient risk signals, (2) applies motion-aware stability weighting to emphasize high-frequency action oscillations associated with unstable behaviors, and (3) performs DoF-adaptive calibration via Bayesian Optimization to prioritize kinematically critical axes. Experiments on the LIBERO benchmark show that our method substantially improves failure prediction accuracy and yields more reliable signals for failure detection, which can support downstream human-in-the-loop interventions.
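论文三个组件中的前两个(最大值滑窗池化、运动感知加权)可以用 NumPy 简单示意如下(窗口大小与权重系数均为假设值,DoF 自适应校准部分省略):

```python
import numpy as np

def rollout_uncertainty(entropy, actions, window=5, lam=1.0):
    """对一条 rollout 计算失败风险分数(示意)。

    entropy: (T,) 每步策略熵;actions: (T, D) 连续动作序列。
    """
    T = len(entropy)
    # 运动感知权重:用动作二阶差分的范数近似高频振荡强度
    accel = np.zeros(T)
    if T > 2:
        accel[2:] = np.linalg.norm(np.diff(actions, n=2, axis=0), axis=1)
    weighted = entropy * (1.0 + lam * accel)
    # 滑动窗口取最大值,保留短暂的高风险尖峰(均值池化会稀释它们)
    window = min(window, T)
    pooled = [weighted[i:i + window].max() for i in range(T - window + 1)]
    return float(np.mean(pooled))
```

与整条轨迹直接取均值相比,这种池化方式会放大摘要中描述的"失败前短暂熵尖峰"的信号。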

[AI-71] Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在 Rust 程序验证任务中表现不透明、缺乏细粒度评估的问题。现有方法将 Rust 验证视为黑箱,仅以通过或失败的二元结果衡量模型能力,无法揭示其是否真正掌握非平凡 Rust 代码所需的逻辑推理过程。为此,作者提出 VCoT-Lift 框架,其关键在于将底层求解器的推理步骤提升为高阶、人类可读的验证链式思维(Verification Chain-of-Thought),从而提供可解释的验证过程作为精细评估的基准。基于此框架,研究进一步构建了 VCoT-Bench 基准测试集,涵盖 1,988 个验证任务,从缺失证明的鲁棒性、不同证明类型的胜任力以及证明位置敏感性三个维度系统评估 LLM 的验证理解能力,结果显示当前主流模型在推理稳定性上存在严重脆弱性,远未达到自动化定理证明工具的水平。

链接: https://arxiv.org/abs/2603.18334
作者: Zichen Xie,Wenxi Wang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs’ understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to the proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.

[AI-72] Understanding the Theoretical Foundations of Deep Neural Networks through Differential Equations

【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNNs)缺乏系统性理论基础的问题,从而阻碍其科学设计与性能优化。解决方案的关键在于引入微分方程(Differential Equations)作为统一的理论框架,从模型层面(将整个DNN视为微分方程)和层层面(将单个组件建模为微分方程)两个维度,提供对DNN架构的原理性理解、性能改进工具以及实际应用支撑,进而实现模型设计、理论分析与性能提升之间的闭环关联。

链接: https://arxiv.org/abs/2603.18331
作者: Hongjue Zhao,Yizhuo Chen,Yuchen Wang,Hairong Qi,Lui Sha,Tarek Abdelzaher,Huajie Shao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have achieved remarkable empirical success, yet the absence of a principled theoretical foundation continues to hinder their systematic development. In this survey, we present differential equations as a theoretical foundation for understanding, analyzing, and improving DNNs. We organize the discussion around three guiding questions: i) how differential equations offer a principled understanding of DNN architectures, ii) how tools from differential equations can be used to improve DNN performance in a principled way, and iii) what real-world applications benefit from grounding DNNs in differential equations. We adopt a two-fold perspective spanning the model level, which interprets the whole DNN as a differential equation, and the layer level, which models individual DNN components as differential equations. From these two perspectives, we review how this framework connects model design, theoretical analysis, and performance improvement. We further discuss real-world applications, as well as key challenges and opportunities for future research.
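以"模型层面:整个 DNN 即一个微分方程"为例,残差连接 x_{t+1} = x_t + h·f(x_t) 恰是 ODE dx/dt = f(x) 的显式欧拉离散。下面以 f(x) = −x 做一个可运行的小验证(纯示意,与综述中任何具体模型无关):

```python
import numpy as np

def residual_block(x, h):
    # f(x) = -x;残差连接 x + h*f(x) 即显式欧拉一步
    return x + h * (-x)

def forward(depth, h):
    x = 1.0
    for _ in range(depth):      # depth 个残差块 ≈ 积分到 t = depth*h
        x = residual_block(x, h)
    return x

exact = np.exp(-1.0)            # dx/dt = -x, x(0)=1 在 t=1 的解析解
coarse = forward(10, 0.1)       # 10 层残差网络
fine = forward(100, 0.01)       # 100 层,对应更细的时间离散
```

层数加深(步长变小)时前向输出逼近 ODE 的解析解,这正是"更深的残差网络 ≈ 更精细的数值积分"这一视角的最小实例。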

[AI-73] FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering

【速读】:该论文旨在解决当前推理时控制(inference-time steering)方法在实际部署场景中可靠性不足的问题,尤其是现有评估方式过于宽松、忽视了部署约束、能力权衡和真实环境鲁棒性,导致对方法有效性的误判。其关键解决方案是提出一个名为FaithSteer-BENCH的应力测试基准,通过三个门控准则(controllability、utility preservation、robustness)在固定部署式操作点上系统评估各类控制方法。该基准揭示了多种被常规评估掩盖的系统性失败模式,如虚假可控性、无关能力的认知负担以及轻微指令扰动下的显著脆弱性,并指出多数方法仅诱导提示条件对齐而非稳定的潜在空间方向偏移,从而为未来方法设计与部署导向研究提供了统一的评估框架和更清晰的分析视角。

链接: https://arxiv.org/abs/2603.18329
作者: Zikang Ding,Qiying Hu,Yi Zhang,Hongji Li,Junchi Yao,Hongbo Liu,Lijie Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often suggested that simple activation-level interventions can reliably induce targeted behavioral changes. However, such conclusions are typically drawn under relatively relaxed evaluation settings that overlook deployment constraints, capability trade-offs, and real-world robustness. We therefore introduce FaithSteer-BENCH, a stress-testing benchmark that evaluates steering methods at a fixed deployment-style operating point through three gate-wise criteria: controllability, utility preservation, and robustness. Across multiple models and representative steering approaches, we uncover several systematic failure modes that are largely obscured under standard evaluation, including illusory controllability, measurable cognitive tax on unrelated capabilities, and substantial brittleness under mild instruction-level perturbations, role prompts, encoding transformations, and data scarcity. Gate-wise benchmark results show that existing methods do not necessarily provide reliable controllability in deployment-oriented practical settings. In addition, mechanism-level diagnostics indicate that many steering methods induce prompt-conditional alignment rather than stable latent directional shifts, further explaining their fragility under stress. FaithSteer-BENCH therefore provides a unified benchmark and a clearer analytical lens for future method design, reliability evaluation, and deployment-oriented research in steering.
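文中诊断的对象是典型的激活引导干预 h' = h + α·v,即推理时在隐藏状态上叠加缩放后的引导向量;其机制可用 NumPy 最小化示意(α、维度与随机数据均为假设,具体引导方法各异):

```python
import numpy as np

def steer(h, v, alpha):
    # 推理时干预:沿单位化引导向量平移隐藏状态
    u = v / np.linalg.norm(v)
    return h + alpha * u

rng = np.random.default_rng(0)
H = rng.normal(size=(32, 16))                   # 假设的 32 个 token 隐藏状态
v = rng.normal(size=16)                         # 假设的引导向量
H2 = np.stack([steer(h, v, alpha=2.0) for h in H])
# 机制诊断:沿 v 方向的平均投影偏移应恰为 alpha
u = v / np.linalg.norm(v)
shift = float(np.mean((H2 - H) @ u))
```

基准所区分的"稳定的潜在方向偏移"与"提示条件对齐",正是检查这类投影偏移在不同输入扰动下是否保持一致。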

[AI-74] Consumer-to-Clinical Language Shifts in Ambient AI Draft Notes and Clinician-Finalized Documentation: A Multi-level Analysis

【速读】:该论文旨在解决生成式 AI (Generative AI) 在临床场景中生成的初稿笔记常使用通俗化或消费者导向的表达方式,导致其与专业医疗文档规范不一致的问题。解决方案的关键在于构建一个基于词典验证的转换框架,量化医生在编辑过程中将非标准化表达转化为标准临床术语的行为,并发现编辑显著降低了各章节的术语密度,且评估与计划(Assessment and Plan)章节转化量最大,体现了医生编辑行为具有明确的结构感知特性,从而为设计具备章节敏感性的环境智能(Ambient AI)提供了实证依据和优化方向。

链接: https://arxiv.org/abs/2603.18327
作者: Ha Na Cho,Yawen Guo,Sairam Sutari,Emilie Chow,Steven Tam,Danielle Perret,Deepti Pandita,Kai Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ambient AI generates draft clinical notes from patient-clinician conversations, often using lay or consumer-oriented phrasing to support patient understanding instead of standardized clinical terminology. How clinicians revise these drafts for professional documentation conventions remains unclear. We quantified clinician editing for consumer-to-clinical normalization using a dictionary-confirmed transformation framework. We analyzed 71,173 AI-draft and finalized-note section pairs from 34,726 encounters. Confirmed transformations were defined as replacing a consumer expression with its dictionary-mapped clinical equivalent in the same section. Editing significantly reduced terminology density across all sections (p < 0.001). The Assessment and Plan accounted for the largest transformation volume (59.3%). Our analysis identified 7,576 transformation events across 4,114 note sections (5.8%), representing 1.2% consumer-term deletions. Transformation intensity varied across individual clinicians (p < 0.001). Overall, clinician post-editing demonstrates consistent shifts from conversational phrasing toward standardized, section-appropriate clinical terminology, supporting section-aware ambient AI design.

[AI-75] Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning

【速读】:该论文旨在解决近似子图匹配(Approximate Subgraph Matching, ASM)问题,即在大规模目标图中识别与给定查询图近似匹配的子结构。ASM作为NP-hard问题,在数据库系统、网络科学、生物化学及隐私保护等领域具有重要应用价值,而现有方法多依赖启发式搜索策略,难以充分利用图的全局信息,导致解的质量和效率受限。解决方案的关键在于提出一种基于强化学习的近似子图匹配算法(Reinforcement Learning-based Approximate Subgraph Matching, RL-ASM),其核心创新包括:利用图变换器(Graph Transformer)提取编码完整图结构信息的特征表示,替代传统启发式规则;并通过两阶段训练机制——先采用模仿学习(imitation learning)引入监督信号引导策略初始化,再使用近端策略优化(Proximal Policy Optimization, PPO)对策略进行细调,以最大化长期累积奖励,从而实现更高效且准确的匹配结果。

链接: https://arxiv.org/abs/2603.18314
作者: Kaiyang Li,Shihao Ji,Zhipeng Cai,Wei Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures. Code available at this https URL

点击查看摘要

Abstract:Approximate subgraph matching (ASM) is a task that determines the approximate presence of a given query graph in a large target graph. Being an NP-hard problem, ASM is critical in graph analysis with a myriad of applications ranging from database systems and network science to biochemistry and privacy. Existing techniques often employ heuristic search strategies, which cannot fully utilize the graph information, leading to sub-optimal solutions. This paper proposes a Reinforcement Learning based Approximate Subgraph Matching (RL-ASM) algorithm that exploits graph transformers to effectively extract graph representations and RL-based policies for ASM. Our model is built upon the branch-and-bound algorithm that selects one pair of nodes from the two input graphs at a time for potential matches. Instead of using heuristics, we exploit a Graph Transformer architecture to extract feature representations that encode the full graph information. To enhance the training of the RL policy, we use supervised signals to guide our agent in an imitation learning stage. Subsequently, the policy is fine-tuned with the Proximal Policy Optimization (PPO) that optimizes the accumulative long-term rewards over episodes. Extensive experiments on both synthetic and real-world datasets demonstrate that our RL-ASM outperforms existing methods in terms of effectiveness and efficiency. Our source code is available at this https URL.

[AI-76] The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

【速读】:该论文旨在解决当前用于评估健康领域大语言模型(Large Language Models, LLMs)的基准测试在患者或查询人群特征描述上的缺失问题,从而导致模型性能指标可能无法真实反映其在临床实践中的适用性。解决方案的关键在于引入一种基于标准化16字段分类体系的自动化查询画像方法,利用LLMs对公开基准中的18,707条消费者健康查询进行结构化分析,揭示出当前基准在临床数据类型分布、高风险场景覆盖(如自杀/self-harm)、脆弱人群代表性和慢性病管理等维度存在显著“有效性缺口”,并呼吁建立类似临床试验报告规范的标准化查询 profiling 机制,以实现评估体系与真实世界复杂临床需求的对齐。

链接: https://arxiv.org/abs/2603.18294
作者: Alvin Rajkomar,Pavan Sudarshan,Angela Lai,Lily Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the “patient” or “query” populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use. Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent. Results: We identified a structural “validity gap.” While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised 0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults < 11%) and global health needs. Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling, analogous to clinical trial reporting, to align evaluation with the full complexity of clinical practice.

[AI-77] CORE: Robust Out-of-Distribution Detection via Confidence and Orthogonal Residual Scoring

【速读】:该论文旨在解决深度学习模型在部署过程中分布外(Out-of-Distribution, OOD)检测性能不稳定的问题,即现有方法在不同模型架构和数据集上表现差异显著,缺乏一致性。其关键解决方案是提出CORE(COnfidence + REsidual)框架,通过识别并分离预分类层(penultimate features)中的两个正交子空间:一个是与分类器对齐的置信度分量(confidence component),另一个是分类器忽略的残差分量(residual component)。研究表明,残差分量携带了类别特定的方向性签名(directional signature),构成一种隐含的归属信号(membership signal),而该信号在基于logit的方法中不可见,在传统特征方法中则被噪声掩盖。CORE通过对这两个独立信号分别评分并进行归一化加权融合,实现了鲁棒的OOD检测性能,且因两信号正交,其失效模式近似独立,从而在多个基准测试中均取得优异效果。

链接: https://arxiv.org/abs/2603.18290
作者: Jin Mo Yang,Hyung-Sin Kim,Saewoong Bahk
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 5 figures, includes supplementary material as appendix

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is essential for deploying deep learning models reliably, yet no single method performs consistently across architectures and datasets – a scorer that leads on one benchmark often falters on another. We attribute this inconsistency to a shared structural limitation: logit-based methods see only the classifier’s confidence signal, while feature-based methods attempt to measure membership in the training distribution but do so in the full feature space where confidence and membership are entangled, inheriting architecture-sensitive failure modes. We observe that penultimate features naturally decompose into two orthogonal subspaces: a classifier-aligned component encoding confidence, and a residual the classifier discards. We discover that this residual carries a class-specific directional signature for in-distribution data – a membership signal invisible to logit-based methods and entangled with noise in feature-based methods. We propose CORE (COnfidence + REsidual), which disentangles the two signals by scoring each subspace independently and combines them via normalized summation. Because the two signals are orthogonal by construction, their failure modes are approximately independent, producing robust detection where either view alone is unreliable. CORE achieves competitive or state-of-the-art performance across five architectures and five benchmark configurations, ranking first in three of five settings and achieving the highest grand average AUROC with negligible computational overhead.
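摘要中"分类器对齐子空间 + 正交残差"的分解与双信号归一化融合,可以用下面的极简示意来说明(非论文官方实现;`W` 为最后一层分类器权重,`class_resid_means` 为假设已预先统计好的各类别平均残差方向,具体评分函数均为示意假设):

```python
import numpy as np

def core_scores(feats, W, class_resid_means):
    """CORE-style OOD score (sketch): z-normalized sum of a confidence score
    and a residual-membership score. W is the final-layer weight (C x D);
    class_resid_means (C x D) are assumed per-class mean residual directions."""
    # Classifier-aligned subspace = row space of W; residual = orthogonal part.
    Vt = np.linalg.svd(W, full_matrices=False)[2]        # (C, D) orthonormal rows
    residual = feats - feats @ Vt.T @ Vt                 # component the classifier discards
    conf = (feats @ W.T).max(axis=1)                     # max logit as confidence signal
    # Membership: max cosine similarity between residual and class mean directions.
    rn = residual / (np.linalg.norm(residual, axis=1, keepdims=True) + 1e-8)
    mn = class_resid_means / (np.linalg.norm(class_resid_means, axis=1, keepdims=True) + 1e-8)
    memb = (rn @ mn.T).max(axis=1)
    z = lambda s: (s - s.mean()) / (s.std() + 1e-8)      # normalize before summing
    return z(conf) + z(memb)                             # higher = more in-distribution
```

由于残差被构造为与分类器行空间正交,两个评分来自互不重叠的信息通道,这正是摘要中"失效模式近似独立"的来源。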

[AI-78] Offload or Overload: A Platform Measurement Study of Mobile Robotic Manipulation Workloads

【速读】:该论文旨在解决移动机器人操作(Mobile Robotic Manipulation)在实际部署中面临的计算资源与性能之间的矛盾问题,即如何在有限的 onboard 计算能力下高效运行复杂的任务负载,同时权衡云端和边缘计算带来的延迟、能耗及带宽开销。其关键解决方案在于系统性地测量了从本地(onboard)、边缘(edge)到云端(cloud)GPU平台上的完整工作负载表现,并揭示了:1)小型 onboard GPU 无法支撑全部任务;2)大功率 onboard GPU 显著缩短电池续航;3)直接云卸载因网络延迟和带宽限制而不可行;4)多机器人共享计算资源虽具潜力但需谨慎设计以避免性能下降。这一实证研究为未来面向移动机器人的推理系统架构设计提供了重要依据。

链接: https://arxiv.org/abs/2603.18284
作者: Sara Pohland,Xenofon Foukas,Ganesh Ananthanarayanan,Andrey Kolobov,Sanjeev Mehrotra,Bozidar Radunovic,Ankit Verma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
备注: 15 pages, 17 figures

点击查看摘要

Abstract:Mobile robotic manipulation–the ability of robots to navigate spaces and interact with objects–is a core capability of physical AI. Foundation models have led to breakthroughs in their performance, but at a significant computational cost. We present the first measurement study of mobile robotic manipulation workloads across onboard, edge, and cloud GPU platforms. We find that the full workload stack is infeasible to run on smaller onboard GPUs, while larger onboard GPUs drain robot batteries several hours faster. Offloading alleviates these constraints but introduces its own challenges, as additional network latency degrades task accuracy, and the bandwidth requirement makes naive cloud offloading impractical. Finally, we quantify opportunities and pitfalls of sharing compute across robot fleets. We believe our measurement study will be crucial to designing inference systems for mobile robots.

[AI-79] EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research

【速读】:该论文旨在解决教育数据挖掘(Educational Data Mining, EDM)研究中自动化程度低、流程繁琐且依赖人工干预的问题,特别是在从数据预处理到模型构建、分析验证再到论文撰写等环节缺乏系统化自动支持的痛点。解决方案的关键在于提出并实现了一个领域感知的多智能体自动化研究流水线——EDM-ARS,其核心是将教育专业知识嵌入研究生命周期的每个阶段,并通过一个状态机协调器调度五个由大语言模型(Large Language Model, LLM)驱动的专用智能体(ProblemFormulator、DataEngineer、Analyst、Critic 和 Writer),实现端到端的研究自动化,包括生成带真实引用的LaTeX论文、验证机器学习分析结果以及进行方法论层面的自我审查。

链接: https://arxiv.org/abs/2603.18273
作者: Chenguang Pan,Zhou Zhang,Weixuan Xiao,Chengyuan Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educational data mining (EDM) research. We conceptualize EDM-ARS as a general framework for domain-aware automated research pipelines, where educational expertise is embedded into each stage of the research lifecycle. As a first instantiation of this framework, we focus on predictive modeling tasks. Within this scope, EDM-ARS orchestrates five specialized LLM-powered agents (ProblemFormulator, DataEngineer, Analyst, Critic, and Writer) through a state-machine coordinator that supports revision loops, checkpoint-based recovery, and sandboxed code execution. Given a research prompt and a dataset, EDM-ARS produces a complete LaTeX manuscript with real Semantic Scholar citations, validated machine learning analyses, and automated methodological peer review. We also provide a detailed description of the system architecture, the three-tier data registry design that encodes educational domain expertise, the specification of each agent, the inter-agent communication protocol, and mechanisms for error-handling and self-correction. Finally, we discuss current limitations, including single-dataset scope and formulaic paper output, and outline a phased roadmap toward causal inference, transfer learning, psychometric, and multi-dataset generalization. EDM-ARS is released as an open-source project to support the educational research community.

[AI-80] Enactor: From Traffic Simulators to Surrogate World Models

【速读】:该论文旨在解决交通微观仿真中行为模型过于简化、难以捕捉真实路网中车辆与行人等个体间复杂交互关系,以及在交通节点(如交叉口)处难以生成长期物理一致轨迹的问题。传统基于深度学习的代理模型虽能学习个体间的交互,但缺乏对场景几何结构的理解,导致长时间模拟时轨迹失真。其解决方案的关键在于提出一种以行为体为中心的生成式模型,采用Transformer架构结合世界模型(World Model)思想,同时建模行为体间的交互关系与交叉口的空间几何约束,从而生成具有物理一致性的长时程轨迹。实验表明,该方法在40000个时间步(4000秒)的“仿真闭环”测试中显著优于基线模型,在交通工程相关指标上KL散度降低超过10倍,且训练样本需求远低于传统代理中心生成方法。

链接: https://arxiv.org/abs/2603.18266
作者: Yash Ranjan,Rahul Sengupta,Anand Rangarajan,Sanjay Ranka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic microsimulators are widely used to evaluate road network performance under various "what-if" conditions. However, the behavior models controlling the actions of the actors are overly simplistic and fail to capture realistic actor-actor interactions. Deep learning-based methods have been applied to model vehicles and pedestrians as "agents" responding to their surrounding "environment" (including lanes, signals, and neighboring agents). Although effective in learning actor-actor interaction, these approaches fail to generate physically consistent trajectories over long time periods, and they do not explicitly address the complex dynamics that arise at traffic intersections, which are critical locations in urban networks. Inspired by the World Model paradigm, we have developed an actor-centric generative model using a transformer-based architecture that captures actor-actor interaction while understanding the geometry of the traffic intersection, generating physically grounded trajectories based on learned behavior. Moreover, we test the model in a live "simulation-in-the-loop" setting, where we generate the initial conditions of the actors using SUMO and then let the model control the dynamics of the actors. We let the simulation run for 40000 timesteps (4000 seconds), testing the performance of the model over a long time range and evaluating the trajectories on traffic engineering related metrics. Experimental results demonstrate that the proposed framework effectively captures complex actor-actor interactions and generates long-horizon, physically consistent trajectories, while requiring significantly fewer training samples than traditional agent-centric generative approaches. Our model outperforms the baseline on traffic-related as well as aggregate metrics, beating the baseline by more than 10x on KL-Divergence.

[AI-81] Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization ICLR2026

【速读】:该论文旨在解决直接偏好优化(Direct Preference Optimization, DPO)中存在的“挤压效应”(squeezing effect,亦称似然位移),即在训练过程中,模型对偏好响应的预测概率意外降低的问题。该现象源于负梯度更新导致 logits 空间中残差沿高曲率方向快速扩展,从而破坏了偏好对齐效果。论文的关键解决方案是基于对 logits 空间坐标级动态的理论建模,提出使用仅扰动输出层的高效变体——logits-SAM(Sharpness-Aware Minimization),通过 curvature-regularization 效应抑制高曲率方向上的残差增长,从而稳定训练并提升 DPO 的对齐性能。实验表明,logits-SAM 在多个大语言模型和数据集上均能显著增强 DPO 的有效性且与现有 DPO 变体兼容。

链接: https://arxiv.org/abs/2603.18258
作者: Haocheng Luo,Zehang Deng,Thanh-Toan Do,Mehrtash Harandi,Dinh Phung,Trung Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dynamics in logit space. Our analysis reveals that negative-gradient updates cause residuals to expand rapidly along high-curvature directions, which underlies the squeezing effect, whereas Sharpness-Aware Minimization (SAM) can suppress this behavior through its curvature-regularization effect. Building on this insight, we investigate logits-SAM, a computationally efficient variant that perturbs only the output layer with negligible overhead. Extensive experiments on Pythia-2.8B, Mistral-7B, and Gemma-2B-IT across multiple datasets and benchmarks demonstrate that logits-SAM consistently improves the effectiveness of DPO and integrates seamlessly with other DPO variants. Code is available at this https URL.
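logits-SAM 的核心操作是"只对输出层施加 SAM 扰动":先沿梯度方向上升 rho 到尖锐点,再用该点的梯度做下降。下面用一个线性 softmax 分类器做最小示意(非论文实现,学习率与 rho 均为示意取值;论文中的场景是 DPO 训练的 LLM 输出层):

```python
import numpy as np

def logits_sam_step(W, X, y, rho=0.05, lr=0.5):
    """One SAM update restricted to the output-layer weights W (sketch on a
    linear softmax classifier; rho is the perturbation radius as in SAM)."""
    def grad(Wp):
        logits = X @ Wp.T
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(y)), y] -= 1.0                 # dL/dlogits for cross-entropy
        return p.T @ X / len(y)
    g = grad(W)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)        # ascent step to the "sharp" point
    return W - lr * grad(W + eps)                      # descend with the perturbed gradient
```

因为扰动与二次梯度计算都只涉及输出层,额外开销相对整网 SAM 可以忽略,这是摘要中"negligible overhead"的直观来源。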

[AI-82] Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

【速读】:该论文旨在解决在存在混淆干扰因素(confounded distractors)的情况下,如何准确识别与智能体行为具有因果关系的状态维度问题,这本质上是一个因果识别难题:仅依靠观测统计量无法可靠区分与动作相关联的维度和由动作所引起的变化维度。解决方案的关键在于提出干预边界发现(Interventional Boundary Discovery, IBD),该方法利用Pearl的do-算子对智能体自身动作施加干预,并通过两样本检验生成一个可解释的二值掩码(binary mask),从而明确标识出状态空间中受动作影响的因果维度。IBD无需训练模型,可作为预处理步骤与任意下游强化学习算法(如SAC和TD3)组合使用,在12个连续控制任务中验证了其有效性,尤其在高维干扰场景下显著优于传统基于观测特征选择的方法。

链接: https://arxiv.org/abs/2603.18257
作者: Jiaxin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Selecting relevant state dimensions in the presence of confounded distractors is a causal identification problem: observational statistics alone cannot reliably distinguish dimensions that correlate with actions from those that actions cause. We formalize this as discovering the agent’s Causal Sphere of Influence and propose Interventional Boundary Discovery (IBD), which applies Pearl’s do-operator to the agent’s own actions and uses two-sample testing to produce an interpretable binary mask over observation dimensions. IBD requires no learned models and composes with any downstream RL algorithm as a preprocessing step. Across 12 continuous control settings with up to 100 distractor dimensions, we find that: (1) observational feature selection can actively select confounded distractors while discarding true causal dimensions; (2) full-state RL degrades sharply once distractors outnumber relevant features by roughly 3:1 in our benchmarks; and (3) IBD closely tracks oracle performance across all distractor levels tested, with gains transferring across SAC and TD3.
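"对动作做干预 + 逐维双样本检验 + 输出二值掩码"这一流程可以用下面的示意代码说明(非论文实现;论文使用正式的双样本检验,这里用均值 z 检验作为替代示意,阈值为示意取值):

```python
import numpy as np

def ibd_mask(next_states_a, next_states_b, thresh=3.0):
    """Sketch of IBD's mask: per-dimension two-sample test comparing next-state
    samples gathered under do(action=a) vs do(action=b). Dimensions whose
    distributions shift under the intervention are causally influenced."""
    na, nb = len(next_states_a), len(next_states_b)
    ma, mb = next_states_a.mean(0), next_states_b.mean(0)
    se = np.sqrt(next_states_a.var(0) / na + next_states_b.var(0) / nb) + 1e-12
    z = np.abs(ma - mb) / se
    return (z > thresh).astype(int)    # 1 = inside the causal sphere of influence
```

掩码随后即可作为预处理,筛掉混淆干扰维度后再交给任意下游 RL 算法(如 SAC、TD3)。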

[AI-83] MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models

【速读】:该论文旨在解决当前基于推理的大语言模型(reasoning-based large language models, LLMs)在从头分子生成(de novo molecular generation)任务中缺乏有效训练与评估框架的问题。现有方法多依赖于带有真实标签的监督信号(如已知属性变化的分子对),而这类标签在从头生成场景中不可获得,导致模型难以学习如何生成具有高期望性质的新分子。解决方案的关键在于提出一个名为MolRGen的大规模基准和数据集,其核心创新包括:1)定义了一种适用于从头分子生成与性质预测的训练与评估设置;2)引入一种新颖的多样性感知的top-k评分机制,同时衡量生成分子的质量与多样性;3)通过强化学习成功训练了一个24B参数的LLM,并系统分析了其性能与局限性,从而为无监督条件下的分子设计提供了可扩展的建模范式。

链接: https://arxiv.org/abs/2603.18256
作者: Philippe Formont,Maxime Darrin,Ismail Ben Ayed,Pablo Piantanida
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reasoning-based large language models (LLMs) have demonstrated substantial improvements in complex problem-solving tasks. Motivated by these advances, several works have explored the application of reasoning LLMs to drug discovery and molecular design. However, most existing approaches either focus on evaluation or rely on training setups that require ground-truth labels, such as molecule pairs with known property modifications. Such supervision is unavailable in de novo molecular generation, where the objective is to generate novel molecules that optimize a desirability score without prior knowledge of high-scoring candidates. To bridge this gap, we introduce MolRGen, a large-scale benchmark and dataset for training and evaluating reasoning-based LLMs on de novo molecular generation. Our contributions are threefold. First, we propose a setting to evaluate and train models for de novo molecular generation and property prediction. Second, we introduce a novel diversity-aware top-k score that captures both the quality and diversity of generated molecules. Third, we show our setting can be used to train LLMs for molecular generation, training a 24B LLM with reinforcement learning, and we provide a detailed analysis of its performance and limitations.
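摘要未给出"多样性感知 top-k 评分"的具体公式,下面用 MMR 风格的贪心选择作为示意替代(纯属假设性示意,非论文定义):在得分与"和已选分子的最大相似度"之间做权衡,最后对选出的 k 个分子取平均得分:

```python
def diversity_topk(scores, sim, k, lam=1.0):
    """Diversity-aware top-k (hypothetical MMR-style stand-in, NOT the paper's
    formula): greedily pick k molecules, penalizing each candidate by its max
    similarity to already-picked ones, then average the raw scores."""
    picked, cand = [], set(range(len(scores)))
    while cand and len(picked) < k:
        best = max(cand, key=lambda i: scores[i]
                   - lam * max((sim[i][j] for j in picked), default=0.0))
        picked.append(best)
        cand.remove(best)
    return sum(scores[i] for i in picked) / len(picked), picked
```

这样,一批高分但彼此雷同的分子得分会低于同等质量、但更具多样性的一批分子,与摘要中"同时衡量质量与多样性"的目标一致。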

[AI-84] Gradient-Informed Temporal Sampling Improves Rollout Accuracy in PDE Surrogate Training

【速读】:该论文旨在解决神经模拟器(neural simulator)训练数据采样策略的优化问题,即在有限计算预算下如何选择最具信息量的数据以最大化滚动预测(rollout)精度。现有方法要么陷入局部高信息密度区域导致泛化能力差,要么虽保持多样性但缺乏模型特异性,性能难以超越均匀采样。解决方案的关键在于提出一种面向神经模拟器的梯度感知时间采样方法(Gradient-Informed Temporal Sampling, GITS),其通过联合优化代理模型局部梯度与集合层面的时间覆盖度,实现模型特异性和动力学信息之间的有效平衡,从而显著降低多类偏微分方程(PDE)系统、不同模型骨干结构及多种采样比例下的滚动误差。

链接: https://arxiv.org/abs/2603.18237
作者: Wenshuo Wang,Fan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Researchers train neural simulators on uniformly sampled numerical simulation data. But under the same budget, does systematically sampled data provide the most effective information? A fundamental yet unformalized problem is how to sample training data for neural simulators so as to maximize rollout accuracy. Existing data sampling methods either tend to collapse into locally high-information-density regions, or preserve diversity but remain insufficiently model-specific, often leading to performance that is no better than uniform sampling. To address this, we propose a data sampling method tailored to neural simulators, Gradient-Informed Temporal Sampling (GITS). GITS jointly optimizes pilot-model local gradients and set-level temporal coverage, thereby effectively balancing model specificity and dynamical information. Compared with multiple sampling baselines, the data selected by GITS achieves lower rollout error across multiple PDE systems, model backbones and sample ratios. Furthermore, ablation studies demonstrate the necessity and complementarity of the two optimization objectives in GITS. In addition, we analyze the successful sampling patterns of GITS as well as the typical PDE systems and model backbones on which GITS fails.
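GITS"联合优化先导模型梯度信息与集合级时间覆盖度"的思想,可用一个贪心选择的示意来说明(非论文实现;打分形式与权重 lam 均为示意假设):

```python
import numpy as np

def gits_select(grad_norms, k, lam=1.0):
    """Greedy sketch of GITS: choose k training timesteps by jointly rewarding
    pilot-model gradient magnitude and set-level temporal coverage
    (normalized distance to the closest already-selected step)."""
    T = len(grad_norms)
    chosen = []
    for _ in range(k):
        best_t, best_s = -1, -np.inf
        for t in range(T):
            if t in chosen:
                continue
            coverage = min((abs(t - c) for c in chosen), default=T) / T
            s = grad_norms[t] + lam * coverage
            if s > best_s:
                best_t, best_s = t, s
        chosen.append(best_t)
    return sorted(chosen)
```

仅按梯度选点会塌缩到局部高信息密度区域,仅按覆盖度选点又退化为近似均匀采样;覆盖度惩罚项正是两者之间的折中。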

[AI-85] R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation ICLR2026

【速读】:该论文旨在解决图像驱动的模型基于强化学习(Model-Based Reinforcement Learning, MBRL)中表征学习的问题,即如何从冗余的视觉细节中提取与任务相关的关键信息,同时避免传统重建方法对无关区域浪费模型容量,以及解码器-free 方法依赖外部数据增强(Data Augmentation, DA)导致的泛化能力受限问题。其解决方案的关键在于提出 R2-Dreamer 框架,通过引入一种受 Barlow Twins 启发的冗余减少(redundancy-reduction)自监督目标作为内部正则化项,无需依赖外部数据增强即可有效防止表征坍缩(representation collapse),从而在保持高效率的同时提升模型性能,在 DeepMind Control Suite 和 Meta-World 等基准上表现优于或媲美主流方法如 DreamerV3 和 TD-MPC2。

链接: https://arxiv.org/abs/2603.18202
作者: Naoki Morihira(1 and 2),Amal Nahar(1),Kartik Bharadwaj(1),Yasuhiro Kato(2),Akinobu Hayashi(1 and 2),Tatsuya Harada(2 and 3) ((1) Honda R and D Co. Ltd., (2) The University of Tokyo, (3) RIKEN AIP)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 20 pages, 12 figures, 2 tables. Published as a conference paper at ICLR 2026. Code available at this https URL

点击查看摘要

Abstract:A central challenge in image-based Model-Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction-based methods often waste capacity on large task-irrelevant regions. Decoder-free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2-Dreamer, a decoder-free MBRL framework with a self-supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy-reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta-World, R2-Dreamer is competitive with strong baselines such as DreamerV3 and TD-MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC-Subtle with tiny task-relevant objects. These results suggest that an effective internal regularizer can enable versatile, high-performance decoder-free MBRL. Code is available at this https URL.
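R2-Dreamer 所借鉴的 Barlow Twins 冗余减少目标有公开的标准形式:让两组视图嵌入的互相关矩阵逼近单位阵。下面给出一个 numpy 最小示意(lam 为示意取值;论文中该目标被集成进世界模型训练,此处仅演示损失本身):

```python
import numpy as np

def redundancy_reduction_loss(za, zb, lam=5e-3):
    """Barlow Twins-style objective (sketch): push the cross-correlation matrix
    of two embedding views toward the identity. Diagonal terms -> invariance;
    off-diagonal terms -> redundancy reduction, which prevents collapse."""
    za = (za - za.mean(0)) / (za.std(0) + 1e-8)
    zb = (zb - zb.mean(0)) / (zb.std(0) + 1e-8)
    c = za.T @ zb / len(za)                              # (D, D) cross-correlation
    invariance = ((np.diag(c) - 1.0) ** 2).sum()
    redundancy = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return invariance + lam * redundancy
```

非对角项惩罚使各维度去相关,从而在不依赖解码器重建、也不依赖外部数据增强的情况下防止表征坍缩。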

[AI-86] A Computationally Efficient Learning of Artificial Intelligence System Reliability Considering Error Propagation

【速读】:该论文旨在解决智能城市中人工智能(Artificial Intelligence, AI)系统可靠性建模中的关键挑战,特别是由于多阶段功能模块间误差传播导致的可靠性评估难题。现有方法受限于真实数据稀缺、误差事件间依赖性违背独立假设以及高复杂度计算等问题。解决方案的关键在于:首先,利用基于物理的自动驾驶车辆仿真平台结合可解释的误差注入机制生成高质量数据;其次,构建一个能显式刻画多阶段误差传播过程的新颖可靠性建模框架;最后,采用一种计算高效且理论上有保障的复合似然期望-最大化算法来估计模型参数,从而实现对自动驾驶感知系统可靠性的精准预测与高效分析。

链接: https://arxiv.org/abs/2603.18201
作者: Fenglian Pan,Yinwei Zhang,Yili Hong,Larry Head,Jian Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation (stat.CO)
备注: 42 pages, 11 figures

点击查看摘要

Abstract:Artificial Intelligence (AI) systems are increasingly prominent in emerging smart cities, yet their reliability remains a critical concern. These systems typically operate through a sequence of interconnected functional stages, where upstream errors may propagate to downstream stages, ultimately affecting overall system reliability. Quantifying such error propagation is essential for accurate modeling of AI system reliability. However, this task is challenging due to: i) data availability: real-world AI system reliability data are often scarce and constrained by privacy concerns; ii) model validity: recurring error events across sequential stages are interdependent, violating the independence assumptions of statistical inference; and iii) computational complexity: AI systems process large volumes of high-speed data, resulting in frequent and complex recurrent error events that are difficult to track and analyze. To address these challenges, this paper leverages a physics-based autonomous vehicle simulation platform with a justifiable error injector to generate high-quality data for AI system reliability analysis. Building on this data, a new reliability modeling framework is developed to explicitly characterize error propagation across stages. Model parameters are estimated using a computationally efficient, theoretically guaranteed composite likelihood expectation-maximization algorithm. Its application to the reliability modeling for autonomous vehicle perception systems demonstrates its predictive accuracy and computational efficiency.

[AI-87] Access Controlled Website Interaction for Agentic AI with Delegated Critical Tasks

【速读】:该论文旨在解决当前代理型人工智能(Agentic AI)在代表用户访问网站时,因网站缺乏细粒度访问控制机制而导致关键任务委托存在安全风险的问题。解决方案的关键在于设计一种面向AI代理的网站交互架构,并对开源授权服务中的访问授权协议进行定制化修改,以实现对AI代理执行关键任务时的细粒度权限管控,从而提升代理型AI在网页环境下的安全性与可控性。

链接: https://arxiv.org/abs/2603.18197
作者: Sunyoung Kim,Hokeun Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Recent studies reveal gaps in delegating critical tasks to agentic AI that accesses websites on the user’s behalf, primarily due to limited access control mechanisms on websites designed for agentic AI. In response, we propose a design of website-based interaction for AI agents with fine-grained access control for delegated critical tasks. Our approach encompasses a website design and implementation, as well as modifications to the access grant protocols in an open-source authorization service to tailor it to agentic AI, with delegated critical tasks on the website. The evaluation of our approach demonstrates the capabilities of our access-controlled website used by AI agents.

[AI-88] Retrieval-Augmented LLMs for Security Incident Analysis

【速读】:该论文旨在解决网络安全事件分析中因日志来源多样、数据量庞大而导致的效率低下问题,即分析师需手动筛选海量日志以识别关键指标并重建攻击链路。其解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的系统,通过预定义查询库匹配MITRE ATT&CK技术提取原始日志中的指标,并结合LLM语义推理能力从相关上下文中检索信息,从而实现精准、高效的取证问答与攻击序列重构。实验证明,该架构显著优于纯LLM基线模型,在保证高召回率和精度的同时大幅降低计算成本。

链接: https://arxiv.org/abs/2603.18196
作者: Xavier Cadet,Aditya Vikram Singh,Harsh Mamania,Edward Koh,Alex Fitts,Dirk Van Bruggen,Simona Boboila,Peter Chin,Alina Oprea
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Investigating cybersecurity incidents requires collecting and analyzing evidence from multiple log sources, including intrusion detection alerts, network traffic records, and authentication events. This process is labor-intensive: analysts must sift through large volumes of data to identify relevant indicators and piece together what happened. We present a RAG-based system that performs security incident analysis through targeted query-based filtering and LLM semantic reasoning. The system uses a query library with associated MITRE ATT&CK techniques to extract indicators from raw logs, then retrieves relevant context to answer forensic questions and reconstruct attack sequences. We evaluate the system with five LLM providers on malware traffic incidents and multi-stage Active Directory attacks. We find that LLM models have different performance and tradeoffs, with Claude Sonnet 4 and DeepSeek V3 achieving 100% recall across all four malware scenarios, while DeepSeek costs 15x less ($0.008 vs. $0.12 per analysis). Attack step detection on Active Directory scenarios reaches 100% precision and 82% recall. Ablation studies confirm that a RAG architecture is essential: LLM baselines without RAG-enhanced context correctly identify victim hosts but miss all attack infrastructure including malicious domains and command-and-control servers. These results demonstrate that combining targeted query-based filtering with RAG-based retrieval enables accurate, cost-effective security analysis within LLM context limits.
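流水线的第一阶段"用带 MITRE ATT&CK 标签的查询库从原始日志中提取指标",可以用下面的示意代码说明(查询库内容与正则均为假设示例,非论文的实际规则集;技术编号 T1110/T1071 为真实 ATT&CK 编号):

```python
import re

# Hypothetical query library: regexes tagged with MITRE ATT&CK technique IDs.
QUERY_LIBRARY = {
    "T1110": re.compile(r"failed (login|password|authentication)", re.I),  # Brute Force
    "T1071": re.compile(r"\bdns query\b|\bhttp (get|post)\b", re.I),       # App-Layer Protocol (C2)
}

def extract_indicators(log_lines):
    """First stage of the pipeline (sketch): filter raw logs down to lines
    matching the ATT&CK-tagged queries, before any LLM reasoning is applied."""
    hits = {}
    for line in log_lines:
        for tid, pattern in QUERY_LIBRARY.items():
            if pattern.search(line):
                hits.setdefault(tid, []).append(line)
    return hits
```

经过这一步过滤,只有少量候选日志进入 RAG 检索与 LLM 推理,这正是系统能够控制在 LLM 上下文窗口与成本预算之内的原因。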

[AI-89] TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to Instructors

【速读】:该论文旨在解决高等教育教师在教学实践中缺乏及时且具有教学法依据的支持问题,现有工具要么依赖通用聊天机器人提供非针对性建议,要么依赖人力密集型的教学中心咨询,难以规模化。解决方案的关键在于构建一个名为TeachingCoach的、以教学法为基础的对话式AI助手,其核心创新是基于数据驱动的流水线:从教育资料中提取教学法规则,并利用合成对话生成技术微调专用语言模型,从而实现对教师在问题识别、诊断与策略制定过程中的实时、情境化指导。实证结果表明,该方法生成的指导更具清晰性、反思性和响应性,验证了教学法根基与合成数据驱动设计在提升教学支持可扩展性方面的有效性。

链接: https://arxiv.org/abs/2603.18189
作者: Isabel Molnar,Peiyu Li,Si Chen,Sugana Chawla,James Lang,Ronald Metoyer,Ting Hua,Nitesh V. Chawla
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic chatbot advice or non-scalable teaching center human-human consultations. We present TeachingCoach, a pedagogically grounded chatbot designed to support instructor professional development through real-time, conversational guidance. TeachingCoach is built on a data-centric pipeline that extracts pedagogical rules from educational resources and uses synthetic dialogue generation to fine-tune a specialized language model that guides instructors through problem identification, diagnosis, and strategy development. Expert evaluations show TeachingCoach produces clearer, more reflective, and more responsive guidance than a GPT-4o mini baseline, while a user study with higher education instructors highlights trade-offs between conversational depth and interaction efficiency. Together, these results demonstrate that pedagogically grounded, synthetic data driven chatbots can improve instructional support and offer a scalable design approach for future instructional chatbot systems.

[AI-90] Efficient Dense Crowd Trajectory Prediction Via Dynamic Clustering

【速读】:该论文旨在解决密集人群场景下轨迹预测的计算效率与准确性难题,尤其针对传统方法在处理大规模、噪声大且跟踪结果不准确的群体数据时所面临的高计算成本问题。其解决方案的关键在于提出一种基于聚类的新型方法,通过时间维度上对个体属性相似性进行分组,实现对人群的精准聚合总结,从而显著提升处理速度并降低内存消耗;该方法可作为即插即用模块,替代现有轨迹预测器中的行人输入,仅需使用聚类中心(centroid)作为输入即可完成高效且保持精度的预测。

链接: https://arxiv.org/abs/2603.18166
作者: Antonius Bima Murti Wijaya,Paul Henderson,Marwa Mahmoud
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Crowd trajectory prediction plays a crucial role in public safety and management, where it can help prevent disasters such as stampedes. Recent works address the problem by predicting individual trajectories and considering surrounding objects based on manually annotated data. However, these approaches tend to overlook dense crowd scenarios, where the challenges of automation become more pronounced due to the massiveness, noisiness, and inaccuracy of the tracking outputs, resulting in high computational costs. To address these challenges, we propose and extensively evaluate a novel cluster-based approach that groups individuals based on similar attributes over time, enabling faster execution through accurate group summarisation. Our plug-and-play method can be combined with existing trajectory predictors by using our output centroid in place of their pedestrian input. We evaluate our proposed method on several challenging dense crowd scenes. We demonstrate that our approach leads to faster processing and lower memory usage when compared with state-of-the-art methods, while maintaining accuracy.
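"将密集人群聚合为少量质心、再以质心替代逐行人输入"这一即插即用思路,可用一个朴素 k-means 示意(非论文的动态聚类算法;以位置+速度作为属性特征也仅是示意假设):

```python
import numpy as np

def crowd_centroids(positions, velocities, k, iters=20, seed=0):
    """Summarize a dense crowd as k centroids over position+velocity features
    (plain k-means sketch); the centroids then replace per-pedestrian input
    to a downstream trajectory predictor."""
    X = np.hstack([positions, velocities]).astype(float)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(0)
    return C
```

下游预测器的输入规模由行人数 N 降为质心数 k(k << N),这也是摘要中处理速度与内存占用改善的直观来源。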

[AI-91] Final Report for the Workshop on Robotics AI in Medicine

【速读】:该论文旨在解决当前医疗领域中机器人技术与人工智能(AI)融合应用的瓶颈问题,特别是如何通过系统性研究和跨学科协作推动智能机器人系统在手术、诊断、康复及辅助场景中的安全、可靠且可转化的落地。其关键解决方案在于建立国家级的医学人工智能与机器人卓越中心(CARE),聚焦于五大优先研究方向:人机协作、可信自主性、仿真与数字孪生、多模态感知以及生成式 AI 的伦理集成,并强调构建高质量数据集、共享测试平台、自主手术系统、临床基准和持续的跨学科人才培养机制,以弥合工程创新与临床需求之间的鸿沟,加速智能医疗系统的临床转化与部署。

链接: https://arxiv.org/abs/2603.18130
作者: Juan P Wachs
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 51 pages, 5 figures

点击查看摘要

Abstract:The CARE Workshop on Robotics and AI in Medicine, held on December 1, 2025 in Indianapolis, convened leading researchers, clinicians, industry innovators, and federal stakeholders to shape a national vision for advancing robotics and artificial intelligence in healthcare. The event highlighted the accelerating need for coordinated research efforts that bridge engineering innovation with real clinical priorities, emphasizing safety, reliability, and translational readiness, with an emphasis on the use of robotics and AI to achieve this readiness goal. Across keynotes, panels, and breakout sessions, participants underscored critical gaps in data availability, standardized evaluation methods, regulatory pathways, and workforce training that hinder the deployment of intelligent robotic systems in surgical, diagnostic, rehabilitative, and assistive contexts. Discussions emphasized the transformative potential of AI-enabled robotics to improve precision, reduce provider burden, expand access to specialized care, and enhance patient outcomes, particularly in underserved regions and high-risk procedural domains. Special attention was given to austere, disaster-relief, and military settings. The workshop demonstrated broad consensus on the urgency of establishing a national Center for AI and Robotic Excellence in medicine (CARE). Stakeholders identified priority research thrusts including human-robot collaboration, trustworthy autonomy, simulation and digital twins, multi-modal sensing, and ethical integration of generative AI into clinical workflows. Participants also articulated the need for high-quality datasets, shared test beds, autonomous surgical systems, clinically grounded benchmarks, and sustained interdisciplinary training mechanisms.

[AI-92] Intellectual Stewardship: Re-adapting Human Minds for Creative Knowledge Work in the Age of AI

【速读】:该论文旨在解决生成式 AI(Generative AI)在教育场景中广泛应用背景下,人类学习者与教育者如何重新适应智能增强或自动化任务所带来的认知、伦理与责任挑战。其核心问题在于:在AI深度嵌入学习过程的环境中,如何构建一种以人类为中心、具有理论根基的框架,引导学生和教师成为负责任的知识治理主体。解决方案的关键在于提出“智力 stewardship”(智力托管)这一概念框架,包含五个核心原则:知识敏感性(knowledge-wise)、智能敏感性(intelligence-wise)、情境敏感性(context-wise)、伦理敏感性(ethics-wise)以及自我与共同体成长(self- and community-growing),共同构成面向智慧导向、社会负责的知识建构者的元层级能力体系,从而实现人机协同下的创造性学习实践转型。

链接: https://arxiv.org/abs/2603.18117
作者: Jianwei Zhang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 23 pages

点击查看摘要

Abstract:Background: Amid the opportunities and risks introduced by generative AI, learning research needs to envision how human minds and responsibilities should re-adapt as AI continues to augment or automate various tasks. Approach: Drawing on theories of learning, intelligence, and knowledge creation, this conceptual paper proposes intellectual stewardship as a human-centered, conceptually grounded framework for advancing creative learning practices with AI. Key points: Students and teachers work as responsible governors of intellectual processes distributed across human and artificial systems, guided by five core principles. Being knowledge-wise involves understanding the evolving state of knowledge and taking purposeful actions to advance it. Being intelligence-wise emphasizes making informed choices about how to orchestrate distributed cognitive processes and resources. Being context-wise requires sensitivity to recognize opportunities and risks. Being ethics-wise foregrounds ethical judgment, responsibility, and care in the use of knowledge and intellectual power. Finally, self- and community-growing defines the overarching purpose, aligning intellectual work with personal development and the advancement of collective well-being. Contribution: The principles provide a lens for viewing the adaptation of human minds in AI-infused learning environments, calling for the development of meta-level dispositions and capabilities that characterize wisdom-oriented, socially responsible knowledge builders in the AI age.

[AI-93] LLM -Augmented Computational Phenotyping of Long Covid

【速读】:该论文旨在解决长期新冠(Long COVID)临床异质性难以量化及亚型划分不清晰的问题,以支持个性化干预策略的制定。其核心解决方案是提出一个名为“Grace Cycle”的LLM增强型计算表型分析框架,该框架通过迭代整合假设生成、证据提取与特征优化三个步骤,从纵向患者数据中挖掘具有临床意义的亚群。关键创新在于将大语言模型(Large Language Model, LLM)嵌入到一个统计严谨、可解释的表型发现流程中,最终在13,511名长期新冠患者中识别出三种显著分离的临床表型:保护型(Protected)、应答型(Responder)和耐药型(Refractory),并在症状峰值严重程度、基线疾病负担及纵向剂量-反应模式等多个维度上展现出强统计支持。该方法具备疾病无关性,为复杂慢性病亚型发现提供了一种通用范式。

链接: https://arxiv.org/abs/2603.18115
作者: Jing Wang,Jie Shen,Amar Sra,Qiaomin Xie,Jeremy C Weiss
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Phenotypic characterization is essential for understanding heterogeneity in chronic diseases and for guiding personalized interventions. Long COVID is a complex and persistent condition, yet its clinical subphenotypes remain poorly understood. In this work, we propose an LLM-augmented computational phenotyping framework, "Grace Cycle", that iteratively integrates hypothesis generation, evidence extraction, and feature refinement to discover clinically meaningful subgroups from longitudinal patient data. The framework identifies three distinct clinical phenotypes, Protected, Responder, and Refractory, based on 13,511 Long Covid participants. These phenotypes exhibit pronounced separation in peak symptom severity, baseline disease burden, and longitudinal dose-response patterns, with strong statistical support across multiple independent dimensions. This study illustrates how large language models can be integrated into a principled, statistically grounded pipeline for phenotypic screening from complex longitudinal data. Note that the proposed framework is disease-agnostic and offers a general approach for discovering clinically interpretable subphenotypes.

[AI-94] VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models WWW2026

【速读】:该论文旨在解决多价值对齐(multi-value alignment)中的两大挑战:一是为每种价值组合单独训练模型成本过高;二是不同人类价值观之间的冲突显著降低对齐性能,导致难以在多种价值之间取得良好平衡。解决方案的关键在于提出VC-soup框架,其核心是基于数据层面的价值一致性(value consistency)进行优化:首先设计一种基于奖励差距向量与全1向量余弦相似度的值一致性度量指标,用于量化偏好对在跨价值维度上的协调性;随后过滤低一致性偏好对以构建更一致的数据集,并在此基础上训练出具有平滑性和线性模式连通性的策略模型;最后通过线性组合策略并结合帕累托筛选(Pareto filtering)实现多价值性能的均衡提升。

链接: https://arxiv.org/abs/2603.18113
作者: Hefei Xu,Le Wu,Yu Wang,Min Hou,Han Wu,Zhen Zhang,Meng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages; Accepted to WWW2026

点击查看摘要

Abstract:As large language models (LLMs) increasingly shape content generation, interaction, and decision-making across the Web, aligning them with human values has become a central objective in trustworthy AI. This challenge becomes even more pronounced when aligning multiple, potentially conflicting human values. Although recent approaches, such as reward reweighting, prompt-based supervised fine-tuning, and model merging, attempt to tackle multi-value alignment, they still face two major limitations: (1) training separate models for each value combination is prohibitively expensive; (2) value conflicts substantially degrade alignment performance. These limitations make it difficult to achieve favorable trade-offs across diverse human values. To address these challenges, we revisit multi-value alignment from the perspective of value consistency in data and propose VC-soup, a data filtering and parameter merging framework grounded in value-consistent learning. We first design a value consistency metric based on the cosine similarity between the reward-gap vector of each preference pair and an all-ones vector, which quantifies its cross-value coherence. We then filter out low-consistency preference pairs in each value dataset and train on the remaining data to obtain smooth, value-consistent policy models that better preserve linear mode connectivity. Finally, we linearly combine these policies and apply Pareto filtering across values to obtain solutions with balanced multi-value performance. Extensive experiments and theoretical analysis demonstrate that VC-soup effectively mitigates conflicts and consistently outperforms existing multi-value alignment methods.
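The value consistency metric described above is concrete enough to sketch: for a preference pair scored under several values, take the cosine similarity between its per-value reward-gap vector and the all-ones vector, then keep only high-consistency pairs. A minimal sketch; the filtering threshold and toy reward values are illustrative, not from the paper.

```python
import math

def value_consistency(chosen_rewards, rejected_rewards):
    """Cosine similarity between the per-value reward-gap vector of a
    preference pair and the all-ones vector: 1.0 means every value
    model agrees the chosen response is better."""
    gaps = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    norm_g = math.sqrt(sum(g * g for g in gaps))
    if norm_g == 0.0:
        return 0.0
    # Dot product with the all-ones vector is just the sum of the gaps.
    return sum(gaps) / (norm_g * math.sqrt(len(gaps)))

def filter_pairs(pairs, threshold=0.5):
    """Drop low-consistency pairs; the threshold value is illustrative."""
    return [p for p in pairs if value_consistency(*p) >= threshold]

# A pair preferred under every value axis is perfectly consistent (1.0);
# a pair where values disagree scores near zero and is filtered out.
```

Policies trained on the filtered data are then merged linearly, which is where the preserved linear mode connectivity matters.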

[AI-95] Tula: Optimizing Time Cost and Generalization in Distributed Large-Batch Training

【速读】:该论文旨在解决大规模分布式训练中因盲目增大批次大小(batch-size)而导致的性能瓶颈问题,包括训练时间与成本的边际收益递减、模型泛化能力下降(即一般化差距,generalization gap),以及资源利用率不高的挑战。其核心解决方案是提出Tula——一个在线服务系统,通过融合并行系统建模与统计性能预测技术,动态识别最优批次大小,从而在保证模型收敛质量的前提下实现训练效率的最大化。Tula能够在多个视觉任务上将训练时间与成本预测误差控制在7.5%-14%以内,并相比标准大批次训练平均提升9%的测试准确率,同时获得最高达20倍的整体加速效果,有效缓解了大批次训练中的性能折衷问题。

链接: https://arxiv.org/abs/2603.18112
作者: Sahil Tyagi,Feiyi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Distributed training increases the number of batches processed per iteration either by scaling-out (adding more nodes) or scaling-up (increasing the batch-size). However, the largest configuration does not necessarily yield the best performance. Horizontal scaling introduces additional communication overhead, while vertical scaling is constrained by computation cost and device memory limits. Thus, simply increasing the batch-size leads to diminishing returns: training time and cost decrease initially but eventually plateau, creating a knee-point in the time/cost versus batch-size Pareto curve. The optimal batch-size therefore depends on the underlying model, data and available compute resources. Large batches also suffer from worse model quality due to the well-known generalization gap. In this paper, we present Tula, an online service that automatically optimizes time, cost, and convergence quality for large-batch training of convolutional models. It combines parallel-systems modeling with statistical performance prediction to identify the optimal batch-size. Tula predicts training time and cost within 7.5-14% error across multiple models, and achieves up to 20x overall speedup and improves test accuracy by 9% on average over standard large-batch training on various vision tasks, thus successfully mitigating the generalization gap and accelerating training at the same time.
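The knee-point behaviour in the abstract can be illustrated with a simple marginal-returns rule: stop growing the batch size once the relative time reduction falls below a tolerance. This is a stand-in heuristic, not Tula's actual statistical predictor; the tolerance and the timing numbers below are invented.

```python
def knee_point(batch_sizes, times, rel_tol=0.05):
    """Return the smallest batch size beyond which the relative
    reduction in (predicted) training time drops below rel_tol.
    A simplified stand-in for Tula's Pareto-curve analysis."""
    best = batch_sizes[0]
    for prev_t, (b, t) in zip(times, zip(batch_sizes[1:], times[1:])):
        if (prev_t - t) / prev_t < rel_tol:
            return best  # further scaling yields diminishing returns
        best = b
    return best

# Hypothetical curve: doubling past 512 barely reduces training time.
sizes = [128, 256, 512, 1024]
t_pred = [100.0, 60.0, 45.0, 44.0]
```

In the real system the `t_pred` values would come from the performance model rather than measurements at every candidate size.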

[AI-96] ARTEMIS: A Neuro Symbolic Framework for Economically Constrained Market Dynamics

【速读】:该论文旨在解决深度学习模型在量化金融领域中普遍存在的可解释性缺失与经济合理性不足的问题,特别是未能有效融入无套利(no-arbitrage)等基本经济约束。其核心解决方案是提出ARTEMIS框架,该框架通过三个关键组件实现:(1) 基于连续时间拉普拉斯神经算子(Laplace Neural Operator)的编码器用于捕捉高维市场动态;(2) 以物理信息损失正则化的神经随机微分方程(Neural Stochastic Differential Equation),确保模型演化符合经济规律;(3) 可微分符号瓶颈(differentiable symbolic bottleneck),将复杂黑箱决策过程提炼为可解释的交易规则。此外,引入两项创新正则化项——Feynman-Kac偏微分方程残差项和市场风险溢价惩罚项,分别从局部无套利和瞬时夏普比率角度强制经济合理性,从而在保持高性能的同时显著提升模型透明度与可信度。

链接: https://arxiv.org/abs/2603.18107
作者: Rahul D Ray
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Statistical Finance (q-fin.ST)
备注:

点击查看摘要

Abstract:Deep learning models in quantitative finance often operate as black boxes, lacking interpretability and failing to incorporate fundamental economic principles such as no-arbitrage constraints. This paper introduces ARTEMIS (Arbitrage-free Representation Through Economic Models and Interpretable Symbolics), a novel neuro-symbolic framework combining a continuous-time Laplace Neural Operator encoder, a neural stochastic differential equation regularised by physics-informed losses, and a differentiable symbolic bottleneck that distils interpretable trading rules. The model enforces economic plausibility via two novel regularisation terms: a Feynman-Kac PDE residual penalising local no-arbitrage violations, and a market price of risk penalty bounding the instantaneous Sharpe ratio. We evaluate ARTEMIS against six strong baselines on four datasets: Jane Street, Optiver, Time-IMM, and DSLOB (a synthetic crash regime). Results demonstrate ARTEMIS achieves state-of-the-art directional accuracy, outperforming all baselines on DSLOB (64.96%) and Time-IMM (96.0%). A comprehensive ablation study confirms each component’s contribution: removing the PDE loss reduces directional accuracy from 64.89% to 50.32%. Underperformance on Optiver is attributed to its long sequence length and volatility-focused target. By providing interpretable, economically grounded predictions, ARTEMIS bridges the gap between deep learning’s power and the transparency demanded in quantitative finance.
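Of the two regularisation terms, the market-price-of-risk penalty is the easier one to sketch: it penalises the instantaneous Sharpe ratio |mu|/sigma when it exceeds an economically plausible bound. The quadratic hinge form and the bound of 2.0 below are assumptions for illustration, not the paper's exact formulation.

```python
def sharpe_penalty(mu, sigma, bound=2.0):
    """Quadratic hinge on the instantaneous Sharpe ratio |mu|/sigma:
    zero while the model-implied market price of risk stays plausible,
    growing quadratically past the bound. Form and bound are assumed."""
    excess = max(0.0, abs(mu) / sigma - bound)
    return excess * excess

# A modest drift/volatility ratio incurs no penalty; an implausibly
# high implied Sharpe ratio is pushed back toward the bound.
```

During training this term would be added to the prediction loss alongside the Feynman-Kac PDE residual, trading a little fit for economic plausibility.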

[AI-97] Adaptive Domain Models: Bayesian Evolution Warm Rotation and Principled Training for Geometric and Neuromorphic AI

【速读】:该论文旨在解决当前深度学习训练基础设施依赖IEEE-754浮点数运算所引发的三大问题:训练阶段内存开销远高于推理、优化器复杂度高,以及训练过程中几何结构属性的退化。其解决方案的关键在于构建一个基于三项前期成果的新训练架构:(1)维度类型系统与确定性内存管理框架,实现栈分配梯度和精确的quire累加;(2)程序超图(Program Hypergraph, PHG),将几何代数计算中的阶保持作为类型级不变量;(3)b-posit 2026标准,使posit算术在传统仅支持推理的硬件上可高效部署。三者结合实现了与网络深度无关的训练内存占用约为推理空间的两倍、保持权重更新的阶一致性、精确梯度累加,并适用于损失函数优化和脉冲时间依赖(spike-timing-dependent)神经形态模型。此外,通过贝叶斯蒸馏机制提取通用模型的潜在先验结构,解决了领域特定训练的数据稀缺问题;并通过“热旋转”操作模式确保模型更新期间服务不中断,且结构正确性由PHG证书和签名版本记录形式化验证。最终形成一类更小、更精确、持续自适应、物理结构可验证且能从现有模型初始化的专用AI系统。

链接: https://arxiv.org/abs/2603.18104
作者: Houston Haynes
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 29 pages, 3 figures

点击查看摘要

Abstract:Prevailing AI training infrastructure assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework [6], which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph [8], which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit 2026 standard [10], which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce Bayesian distillation, a mechanism by which the latent prior structure of a general-purpose model is extracted through the ADM training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce warm rotation, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with structural correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.

[AI-98] Enhancing Reinforcement Learning Fine-Tuning with an Online Refiner

【速读】:该论文旨在解决强化学习微调(Reinforcement Learning Fine-Tuning, RFT)中约束机制与优化目标之间的固有冲突问题:强约束虽能稳定训练并防止退化输出,但会抑制模型发现更优解的能力。解决方案的关键在于提出“动态约束”(dynamic constraints)机制,其核心思想是仅在出现退化输出时才介入干预。具体实现上,利用一个参考模型作为在线精炼器(online refiner),对微调模型的输出进行最小化修正,保留正确内容并修复错误,随后通过监督微调损失训练微调模型以生成该精炼后的输出。该机制使约束强度能根据模型输出质量自动调整,从而在保持训练稳定性的同时显著提升任务奖励。

链接: https://arxiv.org/abs/2603.18088
作者: Hao Ma,Zhiqiang Pu,Yang Liu,Xiaolin Ai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Constraints are essential for stabilizing reinforcement learning fine-tuning (RFT) and preventing degenerate outputs, yet they inherently conflict with the optimization objective because stronger constraints limit the ability of a fine-tuned model to discover better solutions. We propose dynamic constraints that resolve this tension by adapting to the evolving capabilities of the fine-tuned model based on the insight that constraints should only intervene when degenerate outputs occur. We implement this by using a reference model as an online refiner that takes the response from the fine-tuned model and generates a minimally corrected version which preserves correct content verbatim while fixing errors. A supervised fine-tuning loss then trains the fine-tuned model to produce the refined output. This mechanism yields a constraint that automatically strengthens or relaxes based on output quality. Experiments on dialogue and code generation show that dynamic constraints outperform both KL regularization and unconstrained baselines, achieving substantially higher task rewards while maintaining training stability.
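The "constraint that automatically strengthens or relaxes" can be made concrete with a toy proxy: measure how much the refiner had to change the response, and let that drive the constraint's effective weight. The token-level distance below is an illustrative stand-in for the actual SFT loss toward the refined output, not the paper's implementation.

```python
def dynamic_constraint_weight(response, refined):
    """Proxy for the adaptive constraint: the pull toward the refined
    output scales with how much the refiner changed. An untouched
    response yields zero constraint; a heavily corrected (degenerate)
    response yields a strong one."""
    changed = sum(a != b for a, b in zip(response, refined))
    changed += abs(len(response) - len(refined))  # count length mismatch
    return changed / max(len(refined), 1)

# Identical tokens -> no constraint; one of two tokens corrected -> 0.5.
```

The point of the mechanism is that this weight emerges implicitly from the SFT loss, rather than being a scheduled hyperparameter.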

[AI-99] Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在人类交互中可能引发负面心理后果的问题,尤其是当LLMs被用作指导、情感支持甚至非正式治疗工具时,其潜在的累积性有害行为难以通过传统实验方法有效识别和研究。解决方案的关键在于提出一种多特质子空间引导(Multi-Trait Subspace Steering, MultiTraitsss)框架,该框架结合已知危机相关特质与新颖的子空间引导机制,生成具有累积性有害行为模式的“黑暗模型”(Dark models),从而在可控环境下模拟真实世界中长期交互下出现的有害结果,并据此提出针对性防护措施以降低人类-人工智能交互中的风险。

链接: https://arxiv.org/abs/2603.18085
作者: Xin Wei Chia,Swee Liang Wong,Jonathan Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges: organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that is difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and a novel subspace steering framework to generate Dark models that exhibit cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that our Dark models consistently produce harmful interactions and outcomes. Using our Dark models, we propose protective measures to reduce harmful outcomes in human-AI interactions.
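Subspace steering of the kind used to build the Dark models typically shifts hidden activations along learned trait directions at inference time. A minimal sketch of that basic mechanism; the paper's multi-trait subspace construction is more elaborate, and the direction vectors and strengths here are placeholders.

```python
def steer(hidden, directions, alphas):
    """Shift a hidden-state vector along several trait direction
    vectors with per-trait strengths. Directions are assumed to be
    unit-norm; all values here are hypothetical."""
    out = list(hidden)
    for direction, alpha in zip(directions, alphas):
        for i, d in enumerate(direction):
            out[i] += alpha * d  # move the activation along this trait
    return out

# Steering a 2-d state along one crisis-associated direction:
steered = steer([1.0, 0.0], [[0.0, 1.0]], [2.0])
```

Positive alphas amplify a trait and negative alphas suppress it, which is what lets the same machinery serve both red-teaming and the proposed protective measures.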

[AI-100] Uncovering Latent Phase Structures and Branching Logic in Locomotion Policies: A Case Study on HalfCheetah

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在运动控制任务中决策过程缺乏可解释性的问题,即神经网络策略常被视为“黑箱”,难以被人类理解。其解决方案的关键在于:通过分析训练后策略在MuJoCo环境HalfCheetah-v5中生成的状态转移序列,利用状态相似性和后续转移一致性将其聚类为语义相位(semantic phases),从而揭示策略内部隐含的周期性相位结构(如支撑相与摆动相)及相位分支机制;进一步借助可解释增强机器(Explainable Boosting Machines, EBMs)对各相位的状态-动作映射进行建模,明确策略在不同相位下关注的状态特征及其动作输出逻辑,证明了DRL策略能够自主学习具有人类可理解性的结构化决策机制。

链接: https://arxiv.org/abs/2603.18084
作者: Daisuke Yasui,Toshitaka Matsuki,Hiroshi Sato
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted at XAI-2026: The 4th World Conference on eXplainable Artificial Intelligence

点击查看摘要

Abstract:In locomotion control tasks, Deep Reinforcement Learning (DRL) has demonstrated high performance; however, the decision-making process of the learned policy remains a black box, making it difficult for humans to understand. On the other hand, in periodic motions such as walking, it is well known that implicit motion phases exist, such as the stance phase and the swing phase. Focusing on this point, this study hypothesizes that a policy trained for locomotion control may also represent a phase structure that is interpretable by humans. To examine this hypothesis in a controlled setting, we consider a locomotion task that is amenable to observing whether a policy autonomously acquires temporally structured phases through interaction with the environment. To verify this hypothesis, in the MuJoCo locomotion benchmark HalfCheetah-v5, the state transition sequences acquired by a policy trained for walking control through interaction with the environment were aggregated into semantic phases based on state similarity and consistency of subsequent transitions. As a result, we demonstrated that the state sequences generated by the trained policy exhibit periodic phase transition structures as well as phase branching. Furthermore, by approximating the states and actions corresponding to each semantic phase using Explainable Boosting Machines (EBMs), we analyzed phase-dependent decision making, namely, which state features the policy function attends to and how it controls action outputs in each phase. These results suggest that neural network-based policies, which are often regarded as black boxes, can autonomously acquire interpretable phase structures and logical branching mechanisms.
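The aggregation of state sequences into semantic phases can be approximated by nearest-centroid assignment followed by counting phase-to-phase transitions, which is how periodic structure (e.g., stance -> swing -> stance) would surface. The centroids below are hypothetical, and the paper's similarity and transition-consistency criteria may differ.

```python
def assign_phase(state, centroids):
    """Nearest-centroid assignment under squared Euclidean distance --
    a simple stand-in for the paper's similarity-based aggregation."""
    dists = [sum((s - c) ** 2 for s, c in zip(state, cen))
             for cen in centroids]
    return dists.index(min(dists))

def transition_counts(states, centroids):
    """Count phase-to-phase transitions along a rollout; a strongly
    cyclic count matrix indicates a periodic phase structure."""
    phases = [assign_phase(s, centroids) for s in states]
    counts = {}
    for a, b in zip(phases, phases[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts
```

In the paper's pipeline, each resulting phase is then handed to an EBM to expose which state features drive the action in that phase.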

[AI-101] Probabilistic Federated Learning on Uncertain and Heterogeneous Data with Model Personalization

【速读】:该论文旨在解决传统联邦学习(Federated Learning, FL)框架在面对局部客户端数据不确定性(data uncertainty)和异构性(data heterogeneity)时导致的训练性能下降问题。现有基于贝叶斯神经网络(Bayesian Neural Networks, BNNs)的概率方法虽能显式建模不确定性以缓解该问题,但其带来的运行时开销、延迟和带宽消耗在联邦场景中尚未被充分研究。解决方案的关键在于提出Meta-BayFL,一种融合元学习(meta-learning)与BNN的个性化概率联邦学习方法:其核心创新包括(1)在BNN的隐藏层中引入不确定性建模以稳定小样本和噪声数据上的训练;(2)通过自适应学习率的元学习实现个性化更新,提升非独立同分布(non-IID)条件下的本地训练效果;(3)统一的概率化与个性化设计增强全局模型聚合的鲁棒性。理论分析给出了全局模型收敛上界,实验表明该方法在多个图像数据集上显著优于当前主流标准与个性化联邦学习方法(如pFedMe、Ditto、FedFomo),测试准确率最高提升达7.42%。

链接: https://arxiv.org/abs/2603.18083
作者: Ratun Rahman,Dinh C. Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE Transactions on Emerging Topics in Computational Intelligence

点击查看摘要

Abstract:Conventional federated learning (FL) frameworks often suffer from training degradation due to data uncertainty and heterogeneity across local clients. Probabilistic approaches such as Bayesian neural networks (BNNs) can mitigate this issue by explicitly modeling uncertainty, but they introduce additional runtime, latency, and bandwidth overhead that has rarely been studied in federated settings. To address these challenges, we propose Meta-BayFL, a personalized probabilistic FL method that combines meta-learning with BNNs to improve training under uncertain and heterogeneous data. The framework is characterized by three main features: (1) BNN-based client models incorporate uncertainty across hidden layers to stabilize training on small and noisy datasets, (2) meta-learning with adaptive learning rates enables personalized updates that enhance local training under non-IID conditions, and (3) a unified probabilistic and personalized design improves the robustness of global model aggregation. We provide a theoretical convergence analysis and characterize the upper bound of the global model over communication rounds. In addition, we evaluate computational costs (runtime, latency, and communication) and discuss the feasibility of deployment on resource-constrained devices such as edge nodes and IoT systems. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that Meta-BayFL consistently outperforms state-of-the-art methods, including both standard and personalized FL approaches (e.g., pFedMe, Ditto, FedFomo), with up to 7.42% higher test accuracy.
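Aggregating BNN clients means aggregating weight posteriors rather than point weights. A minimal FedAvg-style sketch for per-weight Gaussian posteriors, assuming size-weighted averaging of means and variances; Meta-BayFL's actual aggregation rule may differ.

```python
def aggregate_gaussian(posteriors, sizes):
    """Size-weighted aggregation of per-client Gaussian weight
    posteriors, each given as a (mean, variance) pair. An illustrative
    stand-in for the global aggregation step."""
    total = sum(sizes)
    mean = sum(m * n for (m, _v), n in zip(posteriors, sizes)) / total
    var = sum(v * n for (_m, v), n in zip(posteriors, sizes)) / total
    return mean, var

# Two equally sized clients: the global posterior sits between them,
# and the retained variance carries each client's uncertainty forward.
```

Keeping a variance per weight is exactly what adds the communication overhead the paper measures: each round ships two numbers per parameter instead of one.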

[AI-102] SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在多轮工具使用任务中因训练时孤立运行而无法利用跨回合积累经验的问题。现有方法虽通过构建可检索的经验库来增强学习,但仅基于初始任务描述进行一次静态检索,导致在观察随步骤变化的多轮场景中,检索内容与当前状态逐渐失配。其解决方案的关键在于提出SLEA-RL(Step-Level Experience-Augmented Reinforcement Learning)框架,核心创新包括:(i) 基于当前观测的步级经验检索机制,通过步骤级观察聚类实现结构等价状态的高效索引;(ii) 自进化经验库,基于评分机制动态纳入成功策略与失败模式,并限制提取频率以维持多样性;(iii) 步级信用分配策略,支持多轮任务中精细的优势估计。该框架通过语义分析而非梯度更新驱动经验库演化,显著提升了长程多轮任务中的性能表现。

链接: https://arxiv.org/abs/2603.18079
作者: Prince Zizhuang Wang,Shuli Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have shown strong results on multi-turn tool-use tasks, yet they operate in isolation during training, failing to leverage experiences accumulated across episodes. Existing experience-augmented methods address this by organizing trajectories into retrievable libraries, but they retrieve experiences only once based on the initial task description and hold them constant throughout the episode. In multi-turn settings where observations change at every step, this static retrieval becomes increasingly mismatched as episodes progress. We propose SLEA-RL (Step-Level Experience-Augmented Reinforcement Learning), a framework that retrieves relevant experiences at each decision step conditioned on the current observation. SLEA-RL operates through three components: (i) step-level observation clustering that groups structurally equivalent environmental states for efficient cluster-indexed retrieval; (ii) a self-evolving experience library that distills successful strategies and failure patterns through score-based admission and rate-limited extraction; and (iii) policy optimization with step-level credit assignment for fine-grained advantage estimation across multi-turn episodes. The experience library evolves alongside the policy through semantic analysis rather than gradient updates. Experiments on long-horizon multi-turn agent benchmarks demonstrate that SLEA-RL achieves superior performance compared to various reinforcement learning baselines.
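Cluster-indexed step-level retrieval, component (i) above, reduces to mapping the current observation to its nearest observation cluster and returning the experiences filed under that cluster. The library contents and centroids below are hypothetical placeholders for illustration.

```python
def nearest_cluster(observation, centroids):
    """Index of the centroid closest to the current observation
    (squared Euclidean distance over a numeric observation encoding)."""
    dists = [sum((o - c) ** 2 for o, c in zip(observation, cen))
             for cen in centroids]
    return dists.index(min(dists))

def retrieve(observation, library, centroids):
    """Cluster-indexed retrieval: return the distilled experiences
    filed under this observation's cluster, or nothing if the library
    has not yet admitted any for it."""
    return library.get(nearest_cluster(observation, centroids), [])
```

Because the lookup is re-run at every decision step, the retrieved experiences track the evolving state instead of the initial task description, which is the paper's main departure from one-shot retrieval.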

[AI-103] Continually self-improving AI

【速读】:该论文旨在解决当前基于语言模型的AI系统在三个关键方面的局限性:一是微调过程中从少量专业语料中获取新知识的数据效率低下;二是训练依赖于有限的历史人类生成数据;三是训练流程受限于人类研究人员所能发现和探索的算法范式。解决方案的核心在于构建一个持续自我改进的AI体系:首先,通过合成数据方法将小规模语料转化为丰富的知识表示,提升参数更新的数据效率;其次,利用模型自身生成合成数据来减少对人类数据的依赖,实现无需蒸馏即可完成预训练能力的自举;最后,通过扩大测试时算法搜索空间,使AI能够探索超越人类手动设计范围的学习算法配置,从而突破人为设定的训练范式限制。

链接: https://arxiv.org/abs/2603.18073
作者: Zitong Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, although a model’s weights can be updated via fine-tuning, acquiring new knowledge from small, specialized corpora after pretraining remains highly data-inefficient. Second, the training of these systems relies heavily on finite, human-generated data from across history. Third, the pipelines used to train AI models are confined by the algorithms that human researchers can discover and explore. This thesis takes a small step toward overcoming these inherent limitations, presenting three chapters aimed at breaking these dependencies to create continually self-improving AI. First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach that diversifies and amplifies small corpora into rich knowledge representations, enabling a model to effectively update its parameters from limited source material. Second, to reduce reliance on human data, we show that given a fixed amount of such data, the model can self-generate synthetic data to bootstrap its fundamental pretraining capabilities without distillation from any off-the-shelf, instruction-tuned LM. Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time over the space of algorithms, AI can search over a larger space of learning algorithm configurations than human researchers can explore manually.

[AI-104] A Synthesizable RTL Implementation of Predictive Coding Networks

【速读】:该论文旨在解决传统反向传播(Backpropagation)在硬件实现中面临的挑战,包括全局误差传播、阶段分离以及对集中式存储的强依赖,这些问题限制了其在在线、完全分布式硬件学习系统中的应用。解决方案的关键在于提出一种基于预测编码(Predictive Coding)的数字架构,该架构通过局部预测误差动态实现推理与学习,每个神经核仅维护自身活动、预测误差和突触权重,并通过硬连线连接与相邻层通信;同时引入统一的每神经元钳位原语(clamping primitive)以施加边界条件而不改变内部更新调度,从而在固定有限状态机控制下实现确定性、可综合的寄存器传输级(RTL)硬件子系统,使预测编码学习动力学直接在硬件中执行。

链接: https://arxiv.org/abs/2603.18066
作者: Timothy Oh
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Backpropagation has enabled modern deep learning but is difficult to realize as an online, fully distributed hardware learning system due to global error propagation, phase separation, and heavy reliance on centralized memory. Predictive coding offers an alternative in which inference and learning arise from local prediction-error dynamics between adjacent layers. This paper presents a digital architecture that implements a discrete-time predictive coding update directly in hardware. Each neural core maintains its own activity, prediction error, and synaptic weights, and communicates only with adjacent layers through hardwired connections. Supervised learning and inference are supported via a uniform per-neuron clamping primitive that enforces boundary conditions while leaving the internal update schedule unchanged. The design is a deterministic, synthesizable RTL substrate built around a sequential MAC datapath and a fixed finite-state schedule. Rather than executing a task-specific instruction sequence inside the learning substrate, the system evolves under fixed local update rules, with task structure imposed through connectivity, parameters, and boundary conditions. The contribution of this work is not a new learning rule, but a complete synthesizable digital substrate that executes predictive-coding learning dynamics directly in hardware.
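The "local prediction-error dynamics" can be written down for a single scalar core: form the prediction, compute the error against the layer above, and update activity and weight from locally available quantities only. A toy float sketch; the learning rates are arbitrary, and the actual RTL design runs this on a fixed-point sequential MAC datapath under a fixed finite-state schedule.

```python
def pc_step(x, w, target, lr_x=0.1, lr_w=0.01):
    """One discrete-time predictive-coding update for a scalar 'core'.
    Every term on the right-hand side is local to the core or arrives
    over a hardwired connection from the adjacent layer."""
    err = target - w * x          # local prediction error
    x_new = x + lr_x * err * w    # activity relaxes to reduce the error
    w_new = w + lr_w * err * x    # Hebbian-like local weight update
    return x_new, w_new, err

# Iterating this rule drives the prediction w*x toward a clamped
# target, which is how the per-neuron clamping primitive realizes
# supervised learning without changing the update schedule.
```

No global backward pass appears anywhere: the same three lines run in every core, and task structure enters only through connectivity and boundary conditions.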

[AI-105] MCP-38: A Comprehensive Threat Taxonomy for Model Context Protocol Systems (v1.0)

【速读】:该论文旨在解决当前威胁框架对模型上下文协议(Model Context Protocol, MCP)特有的攻击面覆盖不足的问题,尤其是其语义攻击面所带来的新型安全风险。解决方案的关键在于提出了一种名为MCP-38的协议特定威胁分类法,包含38个威胁类别,并通过四阶段系统化方法(协议分解、多框架交叉映射、真实事件合成与缓解面分类)构建而成。该分类法首次系统识别并刻画了工具描述投毒、间接提示注入、寄生工具链和动态信任违规等关键威胁,且与STRIDE、OWASP LLM Top 10(2025)及Agentic Applications Top 10(2026)等主流框架实现映射,为自动化威胁情报平台提供了定义性和实证性基础。

链接: https://arxiv.org/abs/2603.18063
作者: Yi Ting Shen,Kentaroh Toyoda,Alex Leung
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: v1.0

点击查看摘要

Abstract:The Model Context Protocol (MCP) introduces a structurally distinct attack surface that existing threat frameworks, designed for traditional software systems or generic LLM deployments, do not adequately cover. This paper presents MCP-38, a protocol-specific threat taxonomy consisting of 38 threat categories (MCP-01 through MCP-38). The taxonomy was derived through a systematic four-phase methodology: protocol decomposition, multi-framework cross-mapping, real-world incident synthesis, and remediation-surface categorization. Each category is mapped to STRIDE, OWASP Top 10 for LLM Applications (2025, LLM01–LLM10), and the OWASP Top 10 for Agentic Applications (2026, ASI01–ASI10). MCP-38 addresses critical threats arising from MCP’s semantic attack surface (tool description poisoning, indirect prompt injection, parasitic tool chaining, and dynamic trust violations), none of which are adequately captured by prior work. MCP-38 provides the definitional and empirical foundation for automated threat intelligence platforms.

[AI-106] DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

【速读】:该论文旨在解决当前生成式 AI(Generative AI)在语音多模态大语言模型(Audio Multimodal Large Language Models, Audio MLLMs)中是否存在对声学信号的真实理解问题,即这些模型是否真正处理了音频内容,还是仅依赖文本语义进行推理。其解决方案的关键在于构建了一个包含2700余条冲突刺激的诊断评估基准DEAF(Diagnostic Evaluation of Acoustic Faithfulness),涵盖情感语调、背景声音和说话人身份三个声学维度,并设计了一个控制性的多层次评估框架,逐步增加文本影响程度(从内容语义冲突到误导性提示及其组合),从而区分内容驱动偏差与提示诱导迎合(prompt-induced sycophancy)。此外,论文引入诊断指标量化模型对文本线索相对于声学信号的依赖程度,最终揭示多数Audio MLLMs存在显著的文本主导现象,表明其在标准语音基准上表现优异但缺乏对声学信号的实质性理解。

链接: https://arxiv.org/abs/2603.18048
作者: Jiaqi Xiong,Yunjia Qi,Qi Cao,Yu Zheng,Weisheng Xu,Ziteng Wang,Ruofan Liao,Yutong Zhang,Sichen Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 14 pages,6 figures

点击查看摘要

Abstract:Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding.
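One of the diagnostic metrics can be sketched as a text-reliance rate: among conflict stimuli where the textual and acoustic cues disagree, the fraction of predictions that side with the text. This is one illustrative form; the paper's exact metric definitions may differ.

```python
def text_reliance(predictions, text_cues, audio_cues):
    """Among conflict stimuli (text and audio cues disagree), the
    fraction of model predictions siding with the text. Near 1.0
    indicates the text dominance the paper reports; an acoustically
    faithful model would score near 0.0."""
    conflicts = [(p, t) for p, t, a in
                 zip(predictions, text_cues, audio_cues) if t != a]
    if not conflicts:
        return 0.0
    return sum(p == t for p, t in conflicts) / len(conflicts)

# Labels are hypothetical emotion tags; non-conflict items are ignored.
rate = text_reliance(["happy", "sad", "sad"],
                     ["happy", "sad", "happy"],
                     ["sad", "sad", "sad"])
```

Computing this rate separately at each of the benchmark's levels (content conflict, misleading prompt, both) is what separates content-driven bias from prompt-induced sycophancy.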

[AI-107] NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference ICLR2026

【速读】:该论文旨在解决用户在调用专有大语言模型(Large Language Model, LLM)API时缺乏密码学保证的问题——即无法验证所获得的输出是否确实由声称的模型生成,服务提供商可能替换为更廉价模型、进行激进量化或返回缓存结果,而这些行为对用户不可检测。解决方案的关键在于提出 NANOZK,一种零知识证明(Zero-Knowledge Proof)系统,通过利用Transformer推理过程天然可分解为独立层计算的特点,构建分层证明框架:每层生成固定大小的证明(与模型宽度无关),从而突破传统整体式证明方法的可扩展性瓶颈,并支持并行化证明;同时设计查找表近似方法处理非算术操作(如softmax、GELU、LayerNorm),实现零精度损失,且引入基于Fisher信息的引导验证机制以应对全层证明不现实的场景,最终在d=128的模型上实现5.5KB/层的证明大小和24ms验证时间,相较EZKL在证明规模上缩小70倍、证明速度提升5.7倍,同时保持形式上的安全性保障(ε < 1e-37)。

链接: https://arxiv.org/abs/2603.18046
作者: Zhaohui Geoffrey Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 11 pages. Accepted at the VerifAI Workshop at ICLR 2026 (camera-ready version)

点击查看摘要

Abstract:When users query proprietary LLM APIs, they receive outputs with no cryptographic assurance that the claimed model was actually used. Service providers could substitute cheaper models, apply aggressive quantization, or return cached responses - all undetectable by users paying premium prices for frontier capabilities. We present NANOZK, a zero-knowledge proof system that makes LLM inference verifiable: users can cryptographically confirm that outputs correspond to the computation of a specific model. Our approach exploits the fact that transformer inference naturally decomposes into independent layer computations, enabling a layerwise proof framework where each layer generates a constant-size proof regardless of model width. This decomposition sidesteps the scalability barrier facing monolithic approaches and enables parallel proving. We develop lookup table approximations for non-arithmetic operations (softmax, GELU, LayerNorm) that introduce zero measurable accuracy loss, and introduce Fisher information-guided verification for scenarios where proving all layers is impractical. On transformer models up to d=128, NANOZK generates constant-size layer proofs of 5.5KB (2.1KB attention + 3.5KB MLP) with 24 ms verification time. Compared to EZKL, NANOZK achieves 70x smaller proofs and 5.7x faster proving time at d=128, while maintaining formal soundness guarantees (epsilon < 1e-37). Lookup approximations preserve model perplexity exactly, enabling verification without quality compromise.
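The lookup-table treatment of non-arithmetic operations can be sketched for GELU: precompute values on a clipped uniform grid, then answer queries by nearest-entry lookup, so a circuit only ever has to prove a table access rather than a transcendental. The range and resolution below are illustrative choices, not the paper's parameters.

```python
import math

def gelu(x):
    """Exact GELU via the error function, used to populate the table."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def build_gelu_table(lo=-4.0, hi=4.0, n=4096):
    """Precompute GELU on a clipped uniform grid. Range [-4, 4] and
    4096 entries are assumptions for this sketch."""
    step = (hi - lo) / (n - 1)
    return lo, step, [gelu(lo + i * step) for i in range(n)]

def gelu_lookup(x, lo, step, table):
    """Nearest-entry lookup, clipping out-of-range inputs to the ends
    of the table."""
    i = max(0, min(len(table) - 1, round((x - lo) / step)))
    return table[i]
```

With a fine enough grid, the pointwise error stays below the model's quantization noise, which is how the paper can report zero measurable perplexity change.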

[AI-108] Semantic Chameleon: Corpus-Dependent Poisoning Attacks and Defenses in RAG Systems

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因外部知识库被恶意污染而导致的模型输出可被定向操纵的安全问题。攻击者通过在检索语料库中植入精心设计的“睡眠文档”和“触发文档”,利用梯度引导的坐标优化方法(Greedy Coordinate Gradient, GCG)提升其恶意内容在推理阶段的优先召回率,从而实现对大语言模型(Large Language Models, LLMs)输出的控制。解决方案的关键在于引入一种无需修改底层LLM或重新训练检索器的架构级防御机制——混合检索(hybrid retrieval),即结合BM25稀疏匹配与向量相似度的双重信号。实验表明,在纯向量检索场景下,该攻击成功率达38%,而采用混合检索后攻击成功率降至0%;即使攻击者联合优化两种检索信号,混合检索仍能显著降低攻击成功率至20–44%,大幅提高攻击难度。

链接: https://arxiv.org/abs/2603.18034
作者: Scott Thornton
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems extend large language models (LLMs) with external knowledge sources but introduce new attack surfaces through the retrieval pipeline. In particular, adversaries can poison retrieval corpora so that malicious documents are preferentially retrieved at inference time, enabling targeted manipulation of model outputs. We study gradient-guided corpus poisoning attacks against modern RAG pipelines and evaluate retrieval-layer defenses that require no modification to the underlying LLM. We implement dual-document poisoning attacks consisting of a sleeper document and a trigger document optimized using Greedy Coordinate Gradient (GCG). In a large-scale evaluation on the Security Stack Exchange corpus (67,941 documents) with 50 attack attempts, gradient-guided poisoning achieves a 38.0 percent co-retrieval rate under pure vector retrieval. We show that a simple architectural modification, hybrid retrieval combining BM25 and vector similarity, substantially mitigates this attack. Across all 50 attacks, hybrid retrieval reduces gradient-guided attack success from 38 percent to 0 percent without modifying the model or retraining the retriever. When attackers jointly optimize payloads for both sparse and dense retrieval signals, hybrid retrieval can be partially circumvented, achieving 20-44 percent success, but still significantly raises attack difficulty relative to vector-only retrieval. Evaluation across five LLM families (GPT-5.3, GPT-4o, Claude Sonnet 4.6, Llama 4, and GPT-4o-mini) shows attack success ranging from 46.7 percent to 93.3 percent. Cross-corpus evaluation on the FEVER Wikipedia dataset (25 attacks) yields 0 percent attack success across all retrieval configurations. 
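
为直观说明混合检索为何能削弱仅针对稠密(向量)信号优化的投毒文档,下面给出一个最小化的打分融合示意(Python)。其中 min-max 归一化与 `alpha` 凸组合只是常见做法的假设性写法,并非该论文系统的实际实现:

```python
def minmax(scores):
    """将 {doc_id: score} 做 min-max 归一化到 [0, 1]。"""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

def hybrid_retrieve(bm25_scores, vec_scores, alpha=0.5, k=3):
    """稀疏(BM25)与稠密(向量)得分的凸组合融合。
    文档需要同时在两路信号上得分较高才能排名靠前,
    这正是混合检索能削弱单信号投毒的直觉所在。"""
    b, v = minmax(bm25_scores), minmax(vec_scores)
    docs = set(b) | set(v)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

示例中,仅针对稠密信号优化的 "poison" 文档在纯向量检索下排名第一,但在混合打分下被两路信号俱佳的 "legit" 文档超过。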

[AI-109] Towards Differentiating Between Failures and Domain Shifts in Industrial Data Streams

【速读】:该论文旨在解决工业场景中异常(anomaly)与故障(failure)的区分难题,尤其是在数据分布发生改变时,如何准确识别是系统正常演化(如新产品的生产导致的领域偏移,即domain shift)还是真实故障。传统方法常将所有数据变化误判为故障,从而引发误报。解决方案的关键在于提出一种融合改进的Page-Hinkley变化点检测器(用于识别领域偏移和潜在故障)、基于监督域适应(supervised domain adaptation)的在线异常检测算法,以及可解释人工智能(XAI)组件,以辅助操作员最终区分领域偏移与故障,提升系统的实际鲁棒性。

链接: https://arxiv.org/abs/2603.18032
作者: Natalia Wojak-Strzelecka,Szymon Bobek,Grzegorz J. Nalepa,Jerzy Stefanowski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Anomaly and failure detection methods are crucial in identifying deviations from normal system operational conditions, which allows for actions to be taken in advance, usually preventing more serious damages. Long-lasting deviations indicate failures, while sudden, isolated changes in the data indicate anomalies. However, in many practical applications, changes in the data do not always represent abnormal system states. Such changes may be recognized incorrectly as failures, while being a normal evolution of the system, e.g. referring to characteristics of starting the processing of a new product, i.e. realizing a domain shift. Therefore, distinguishing between failures and such ‘‘healthy’’ changes in data distribution is critical to ensure the practical robustness of the system. In this paper, we propose a method that not only detects changes in the data distribution and anomalies but also allows us to distinguish between failures and normal domain shifts inherent to a given process. The proposed method consists of a modified Page-Hinkley changepoint detector for identification of the domain shift and possible failures and supervised domain-adaptation-based algorithms for fast, online anomaly detection. These two are coupled with an explainable artificial intelligence (XAI) component that aims at helping the human operator to finally differentiate between domain shifts and failures. The method is illustrated by an experiment on a data stream from the steel factory.

[AI-110] InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model

【速读】:该论文旨在解决序列建模中如何在计算资源受限条件下,平衡细粒度局部建模与长程依赖捕捉的问题。传统Transformer虽能实现强大的token混合,但其复杂度为二次方;而Mamba类选择性状态空间模型(Selective State-Space Models, SSMs)虽具线性复杂度,却难以捕获高秩且同步的全局交互。论文提出一致性边界分析(consistency boundary analysis),揭示对角短记忆SSMs在何种条件下可近似因果注意力,并识别出仍存在的结构差距。解决方案的关键在于设计InfoMamba——一种无注意力机制的混合架构:它用概念瓶颈线性滤波层替代token级自注意力,作为最小带宽的全局接口,并通过信息最大化融合(Information-Maximizing Fusion, IMF)将该全局信息动态注入SSM递归流中,同时引入基于互信息启发的目标促进互补信息利用。实验表明,InfoMamba在分类、密集预测及非视觉任务上均显著优于强基线模型,在准确率与效率之间实现了优越权衡,并保持近线性扩展性。

链接: https://arxiv.org/abs/2603.18031
作者: Youjin Wang,Jiaqiao Zhao,Rong Fu,Run Zhou,Ruizhe Zhang,Jiani Liang,Suisuai Cao,Feng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Balancing fine-grained local modeling with long-range dependency capture under computational constraints remains a central challenge in sequence modeling. While Transformers provide strong token mixing, they suffer from quadratic complexity, whereas Mamba-style selective state-space models (SSMs) scale linearly but often struggle to capture high-rank and synchronous global interactions. We present a consistency boundary analysis that characterizes when diagonal short-memory SSMs can approximate causal attention and identifies structural gaps that remain. Motivated by this analysis, we propose InfoMamba, an attention-free hybrid architecture. InfoMamba replaces token-level self-attention with a concept bottleneck linear filtering layer that serves as a minimal-bandwidth global interface and integrates it with a selective recurrent stream through information-maximizing fusion (IMF). IMF dynamically injects global context into the SSM dynamics and encourages complementary information usage through a mutual-information-inspired objective. Extensive experiments on classification, dense prediction, and non-vision tasks show that InfoMamba consistently outperforms strong Transformer and SSM baselines, achieving competitive accuracy-efficiency trade-offs while maintaining near-linear scaling.
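
IMF 将全局上下文注入递归流的思想,可用逐维门控融合粗略示意:每一维由一个门控值决定取全局信息还是局部信息。以下为纯 Python 的假设性草图,权重与融合形式均为演示用,与 InfoMamba 的实际模块无对应:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(local, global_ctx, w_gate, b_gate):
    """局部(递归)特征向量与全局上下文向量的逐维门控融合:
    h_i = g_i * global_i + (1 - g_i) * local_i,
    其中门控 g_i = sigmoid(w_i * local_i + b_i)。"""
    return [
        sigmoid(w * l + b) * c + (1.0 - sigmoid(w * l + b)) * l
        for l, c, w, b in zip(local, global_ctx, w_gate, b_gate)
    ]
```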

[AI-111] Quine: Realizing LLM Agents as Native POSIX Processes

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)代理框架在实现隔离、调度和通信时过度依赖应用层抽象的问题,这些问题本可由成熟的操作系统(Operating System, OS)原生支持。解决方案的关键在于提出 Quine 这一运行时架构,将 LLM 代理建模为原生 POSIX 进程,从而直接继承操作系统的进程隔离、资源控制与组合能力;其核心设计是通过一个单一可执行文件递归地调用 fork/exec 创建新实例,使代理的身份(identity)、接口(interface)、状态(state)和生命周期(lifecycle)分别对应进程 ID(PID)、标准流与退出码、内存/环境变量/文件系统以及 fork/exec/exit 操作,实现了对进程语义的显式映射,并自然支持递归委托与上下文刷新。

链接: https://arxiv.org/abs/2603.18030
作者: Hao Ke
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: 10 pages, 3 figures. Reference implementation available on this https URL

点击查看摘要

Abstract:Current LLM agent frameworks often implement isolation, scheduling, and communication at the application layer, even though these mechanisms are already provided by mature operating systems. Instead of introducing another application-layer orchestrator, this paper presents Quine, a runtime architecture and reference implementation that realizes LLM agents as native POSIX processes. The mapping is explicit: identity is PID, interface is standard streams and exit status, state is memory, environment variables, and filesystem, and lifecycle is fork/exec/exit. A single executable implements this model by recursively spawning fresh instances of itself. By grounding the agent abstraction in the OS process model, Quine inherits isolation, composition, and resource control directly from the kernel, while naturally supporting recursive delegation, context renewal via exec, and shell-native composition. The design also exposes where the POSIX process model stops: processes provide a robust substrate for execution, but not a complete runtime model for cognition. In particular, the analysis points toward two immediate extensions beyond process semantics: task-relative worlds and revisable time. A reference implementation of Quine is publicly available on GitHub.
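
下面用 Python 的 `subprocess` 粗略示意“代理即进程”的映射(身份 = 子进程 PID、接口 = 标准流与退出码、生命周期 = spawn/exit)。这只是对论文思想的假设性草图;Quine 本身是单一可执行文件并直接递归调用 fork/exec:

```python
import subprocess
import sys

def delegate(task_code: str):
    """为子任务派生一个全新解释器进程(fork/exec 的近似):
    父进程只看到 POSIX 契约——子进程的 stdout 与退出码。"""
    proc = subprocess.run(
        [sys.executable, "-c", task_code],
        capture_output=True, text=True, timeout=30,
    )
    return proc.returncode, proc.stdout.strip()
```

每次委托都获得全新的上下文(新进程的内存与环境),对应论文中“通过 exec 实现上下文刷新”的机制。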

[AI-112] Engineering Verifiable Modularity in Transformers via Per-Layer Supervision

【速读】:该论文试图解决Transformer模型中可解释性与可控性之间的脱节问题,即尽管可以通过相关性分析识别出某些注意力头(attention head)对特定行为(如大小写处理)的重要性,但由于模型内部存在分布式冗余(distributed redundancy),这种识别无法转化为有效的因果干预或控制。其解决方案的关键在于引入一种新型架构设计:通过双流处理分离token级和上下文表示、每层监督提供独立梯度信号、以及门控注意力机制促进离散激活模式。其中,每层监督是核心创新点——它显著提升了模型对目标行为的控制能力,使消融效应增强5至23倍,并实现平滑、可预测的行为调控。这一方法揭示了隐藏的模块化结构(unmasked modularity),将模型可解释性从被动观察提升为可主动控制的科学实践。

链接: https://arxiv.org/abs/2603.18029
作者: J. Clayton Kerce
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers resist surgical control. Ablating an attention head identified as critical for capitalization produces minimal behavioral change because distributed redundancy compensates for damage. This Hydra effect renders interpretability illusory: we may identify components through correlation, but cannot predict or control their causal role. We demonstrate that architectural interventions can expose hidden modularity. Our approach combines dual-stream processing separating token and contextual representations, per-layer supervision providing independent gradient signal at each depth, and gated attention regularizing toward discrete activation patterns. When trained with per-layer supervision, models produce ablation effects 5 to 23 times larger than architecturally identical controls trained with standard objectives. This enables 4 times greater control leverage on targeted behaviors: scaling identified attention heads produces smooth, predictable changes in model output. The key finding is architectural. Without per-layer supervision, ablation damage concentrates near zero with low variance (Winograd standard deviation 0.63%). With per-layer supervision, effects spread widely (standard deviation 6.32%), revealing which predictions depend on which circuits. The larger variance is not measurement noise but the signature of unmasked modularity. We validate our approach through three components: engineered features that capture computational dynamics rather than vocabulary structure (validated by near-zero correlation with raw activation clustering), an architecture providing positive control for modularity, and causal experiments demonstrating functional reorganization where different tasks route through different attention heads. This establishes a methodology for transforming interpretability from passive observation to active control.
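
摘要中 0.63% 与 6.32% 标准差的对比,可以用如下极简统计示意来理解:对每个注意力头记录一次消融造成的精度下降,再看该分布的均值与离散度——近零均值且离散度极小即 Hydra 效应,离散度大则说明模块化被“揭开”。数值仅为虚构示例,非论文数据:

```python
import statistics

def ablation_profile(effects):
    """汇总逐头消融效应(每个注意力头一个精度下降值,单位 %)。
    均值近零且 std 很小 = 冗余掩盖了各头的作用(Hydra 效应);
    std 很大 = 不同预测依赖不同回路,模块化暴露。"""
    return {
        "mean": statistics.mean(effects),
        "std": statistics.stdev(effects),
        "max": max(effects),
    }
```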

[AI-113] Clinically Meaningful Explainability for NeuroAI: An ethical technical and clinical perspective

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在闭合环路神经技术(closed-loop neurotechnology)中可解释性不足的问题,尤其是现有可解释人工智能(Explainable AI, XAI)方法提供的解释难以满足临床医生的实际需求。其核心问题在于:尽管XAI被寄予提升透明度和可信度的厚望,但其在真实医疗场景中的应用仍极为有限,且多数解释内容与临床决策无关或过于技术化,导致信息过载而非有效支持。解决方案的关键在于提出“临床有意义的可解释性”(Clinically Meaningful Explainability, CME),强调以临床实用性为导向,优先提供输入-输出关系清晰、特征重要性明确的可操作性解释,而非追求全面的技术细节;并设计直观的界面可视化工具,将AI输出映射为临床可理解的形式。为此,作者进一步提出NeuroXplain参考架构,为未来神经刺激设备的设计提供可落地的技术建议,确保解释能力精准匹配不同利益相关方的需求,从而改善患者治疗效果。

链接: https://arxiv.org/abs/2603.18028
作者: Laura Schopp,Ambra D'Imperio,Jalal Etesami,Marcello Ienca
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 20 pages, 2 figures

点击查看摘要

Abstract:While explainable AI (XAI) is often heralded as a means to enhance transparency and trustworthiness in closed-loop neurotechnology for psychiatric and neurological conditions, its real-world prevalence remains low. Moreover, empirical evidence suggests that the type of explanations provided by current XAI methods often fails to align with clinicians’ end-user needs. In this viewpoint, we argue that clinically meaningful explainability (CME) is essential for AI-enabled closed-loop medical neurotechnology and must be addressed from an ethical, technical, and clinical perspective. Instead of exhaustive technical detail, clinicians prioritize clinically relevant, actionable explanations, such as clear representations of input-output relationships and feature importance. Full technical transparency, although theoretically desirable, often proves irrelevant or even overwhelming in practice, as it may lead to informational overload. Therefore, we advocate for CME in the neurotechnology domain: prioritizing actionable clarity over technical completeness and designing interface visualizations that intuitively map AI outputs and key features into clinically meaningful formats. To this end, we introduce a reference architecture called NeuroXplain, which translates CME into actionable technical design recommendations for any future neurostimulation device. Our aim is to inform stakeholders working in neurotechnology and regulatory framework development to ensure that explainability fulfills the right needs for the right stakeholders and ultimately leads to better patient treatment and care.

[AI-114] Understanding the Relationship Between Firms AI Technology Innovation and Consumer Complaints

【速读】:该论文旨在解决企业在人工智能(Artificial Intelligence, AI)时代进行技术革新时,消费者投诉行为如何变化的问题,特别是厘清企业AI技术创新与消费者投诉之间的内在机制。其关键解决方案在于基于保护动机理论(Protection Motivation Theory, PMT),通过多方法实证研究发现:企业AI技术创新会显著增强消费者的威胁感知情绪,从而引发更多投诉;进一步区分AI产品创新与AI流程创新后发现,前者比后者更易导致消费者投诉增加,这为理解消费者心理反应提供了理论支撑,并为企业有效管理AI相关投诉提供了实践依据。

链接: https://arxiv.org/abs/2603.18025
作者: Yongchao Martin Ma,Zhongzhun Deng
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:In the artificial intelligence (AI) age, firms increasingly invest in AI technology innovation to secure competitive advantages. However, the relationship between firms’ AI technology innovation and consumer complaints remains insufficiently explored. Drawing on Protection Motivation Theory (PMT), this paper investigates how firms’ AI technology innovation influences consumer complaints. Employing a multimethod approach, Study 1 analyzes panel data from S&P 500 firms (N = 2,758 firm-year observations), Study 2 examines user-generated Reddit data (N = 2,033,814 submissions and comments), and Study 3 involves two controlled experiments (N = 410 and N = 500). The results reveal that firms’ AI technology innovation significantly increases consumers’ threat-related emotions, heightening their complaints. Furthermore, compared to AI process innovation, AI product innovation leads to higher consumer complaints. This paper advances the understanding of consumers’ psychological responses to firms’ AI innovation and provides practical implications for managing consumer complaints effectively.

[AI-115] Improving moment tensor solutions under Earth structure uncertainty with simulation-based inference

【速读】:该论文旨在解决全波形矩张量反演中因地球结构不确定性导致的理论误差建模问题,传统贝叶斯方法常假设理论误差服从高斯分布,但这种简化在存在微小(1–3%)一维地球模型不确定性时即失效,从而引入偏差并低估矩张量不确定性。解决方案的关键在于引入基于模拟的推断(Simulation-Based Inference, SBI),这是一种机器学习方法,能够通过数据驱动的方式直接建模理论误差对观测值的影响,无需对误差分布形式做强假设;文中进一步提出两种SBI实现方式:一种结合物理先验知识构建理论误差模型,另一种采用端到端深度学习算法,最终实验证明SBI能显著改善矩张量解的可靠性与后验概率校准精度,尤其在短周期数据和浅源各向同性事件中优势明显。

链接: https://arxiv.org/abs/2603.18925
作者: A. A. Saoulis,T.-S. Pham,A. M. G. Ferreira
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
备注: 19 pages, 12 figures + supporting info

点击查看摘要

Abstract:Bayesian inference represents a principled way to incorporate Earth structure uncertainty in full-waveform moment tensor inversions, but traditional approaches generally require significant approximations that risk biasing the resulting solutions. We introduce a robust method for handling theory errors using simulation-based inference (SBI), a machine learning approach that empirically models their impact on the observations. This framework retains the rigour of Bayesian inference while avoiding restrictive assumptions about the functional form of the uncertainties. We begin by demonstrating that the common Gaussian parametrisation of theory errors breaks down under minor (1-3%) 1-D Earth model uncertainty. To address this issue, we develop two formalisms for utilising SBI to improve the quality of the moment tensor solutions: one using physics-based insights into the theory errors, and another utilising an end-to-end deep learning algorithm. We then compare the results of moment tensor inversion with the standard Gaussian approach and SBI, and demonstrate that Gaussian assumptions induce bias and significantly under-report moment tensor uncertainties. We also show that these effects are particularly problematic when inverting short period data and for shallow, isotropic events. On the other hand, SBI produces more reliable, better calibrated posteriors of the earthquake source mechanism. Finally, we successfully apply our methodology to two well studied moderate magnitude earthquakes: one from the 1997 Long Valley Caldera volcanic earthquake sequence, and the 2020 Zagreb earthquake.

[AI-116] An SO(3)-equivariant reciprocal-space neural potential for long-range interactions

【速读】:该论文旨在解决机器学习势函数在处理长程静电相互作用和极化效应时面临的物理一致性与局域性假设之间的根本矛盾问题。传统基于局部性的机器学习势函数难以刻画材料中缓慢衰减的各向异性多极相关性,而现有长程扩展方法要么破坏SO(3)协变性,要么无法保证能量-力一致性。解决方案的关键在于提出EquiEwald,其核心创新是将Ewald求和启发的倒空间公式嵌入到不可约SO(3)协变框架中,通过在倒空间中进行协变消息传递(利用学习得到的协变k空间滤波器和协变逆变换),在保持物理一致性的同时高效捕捉各向异性的张量型长程关联。

链接: https://arxiv.org/abs/2603.18389
作者: Linfeng Zhang,Taoyong Cui,Dongzhan Zhou,Lei Bai,Sufei Zhang,Luca Rossi,Mao Su,Wanli Ouyang,Pheng-Ann Heng
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-range electrostatic and polarization interactions play a central role in molecular and condensed-phase systems, yet remain fundamentally incompatible with locality-based machine-learning interatomic potentials. Although modern SO(3)-equivariant neural potentials achieve high accuracy for short-range chemistry, they cannot represent the anisotropic, slowly decaying multipolar correlations governing realistic materials, while existing long-range extensions either break SO(3) equivariance or fail to maintain energy-force consistency. Here we introduce EquiEwald, a unified neural interatomic potential that embeds an Ewald-inspired reciprocal-space formulation within an irreducible SO(3)-equivariant framework. By performing equivariant message passing in reciprocal space through learned equivariant k-space filters and an equivariant inverse transform, EquiEwald captures anisotropic, tensorial long-range correlations without sacrificing physical consistency. Across periodic and aperiodic benchmarks, EquiEwald captures long-range electrostatic behavior consistent with ab initio reference data and consistently improves energy and force accuracy, data efficiency, and long-range extrapolation. These results establish EquiEwald as a physically principled paradigm for long-range-capable machine-learning interatomic potentials.

[AI-117] Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

【速读】:该论文旨在解决统一临床影像模型在多任务学习中性能下降的问题,特别是针对超声成像(ultrasound imaging)中不同任务类型因训练数据规模与任务异质性交互而引发的负迁移现象。其解决方案的关键在于提出M2DINO框架,该框架基于DINOv3架构并引入任务条件控制的专家混合(task-conditioned Mixture-of-Experts, MoE)模块,实现自适应容量分配;同时通过系统评估27个超声任务在三种训练范式下的表现,揭示了任务聚合效果高度依赖于训练数据规模,并指出应结合任务类型特征(如分割任务敏感度更高)和可用数据量来设计聚合策略,而非仅依据临床分类进行分组。

链接: https://arxiv.org/abs/2603.18123
作者: Fangyijie Wang,Tanya Akumu,Vien Ngoc Dang,Amelia Jiménez-Sánchez,Jieyun Bai,Guénolé Silvestre,Karim Lekadir,Kathleen M. Curran
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.

[AI-118] Discovery of Bimodal Drift Rate Structure in FRB 20240114A: Evidence for Dual Emission Regions

【速读】:该论文旨在解决快速射电暴(Fast Radio Burst, FRB)中向上漂移脉冲簇(upward-drifting burst clusters)的漂移率分布是否具有多模态结构的问题,以揭示其可能的物理起源。解决方案的关键在于应用无监督机器学习方法(UMAP降维结合HDBSCAN密度聚类)对来自FAST望远镜的233个向上漂移脉冲簇进行分析,识别出一个具有显著更高平均漂移率(245.6 MHz/ms)的子群(Cluster C1,共45个簇),并与典型簇(98.1 MHz/ms)形成清晰分离;通过高斯混合模型验证其强双峰性(delta-BIC = 296.6),并进一步排除多成分脉冲簇混杂导致假象的可能性(仅保留单成分簇时仍具显著双峰性,delta-BIC = 19.9),从而确认该现象为真实物理特征,暗示可能存在两个空间分离的辐射区域,各自产生具有不同物理特性的脉冲簇。

链接: https://arxiv.org/abs/2603.18109
作者: Santosh Arron
机构: 未知
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, accepted for publication in The Astrophysical Journal

点击查看摘要

Abstract:We report the discovery of bimodal structure in the drift rate distribution of upward-drifting burst clusters from the hyperactive repeating fast radio burst FRB 20240114A. Using unsupervised machine learning (UMAP dimensionality reduction combined with HDBSCAN density-based clustering) applied to 233 upward-drifting burst clusters from the FAST telescope dataset, we identify a distinct subpopulation of 45 burst clusters (Cluster C1) with mean drift rates 2.5x higher than typical upward-drifting burst clusters (245.6 vs 98.1 MHz/ms). Gaussian mixture modeling reveals strong evidence for bimodality (delta-BIC = 296.6), with clearly separated modes (Ashman’s D = 2.70 > 2) and a statistically significant gap in the distribution (11.3 sigma). Crucially, we demonstrate that this bimodality persists when restricting the analysis to single-component (U1) burst clusters only (delta-BIC = 19.9, Ashman’s D = 2.71), confirming that the result is not an artifact of combining single- and multi-component burst clusters with different drift rate definitions. The extreme-drift subpopulation also exhibits systematically lower peak frequencies (-7%), shorter durations (-29%), and distinct clustering in multi-dimensional feature space. These findings are suggestive of two spatially separated emission regions in the magnetosphere, each producing upward-drifting burst clusters with distinct physical characteristics, although confirmation requires observations from additional epochs and sources.
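
摘要中用于判定双峰分离度的 Ashman's D 定义为 D = sqrt(2)·|μ1 − μ2| / sqrt(σ1² + σ2²),通常以 D > 2 作为两峰清晰分离的阈值。下面是该统计量的极简实现示意(测试数值为虚构示例,非论文数据):

```python
import math

def ashmans_d(mu1, sigma1, mu2, sigma2):
    """Ashman's D:以两高斯峰合并展宽为尺度度量峰间距;
    D > 2 通常视为双峰清晰可分。"""
    return math.sqrt(2.0) * abs(mu1 - mu2) / math.sqrt(sigma1 ** 2 + sigma2 ** 2)
```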

[AI-119] KD-EKF: Knowledge-Distilled Adaptive Covariance EKF for Robust UWB/PDR Indoor Localization

【速读】:该论文旨在解决超宽带(UWB)室内定位在非直视(NLOS)条件下可靠性下降导致的米级测距误差及不确定性建模不一致问题,以及惯性测量单元(IMU)辅助行人死区航迹推算(PDR)因误差随时间非线性累积而引发的长期漂移问题。现有基于扩展卡尔曼滤波(EKF)或粒子滤波(PF)的融合方法依赖人工调参的测量协方差,难以适应不同室内布局、NLOS比例和运动模式下的环境变化,导致鲁棒性和泛化能力不足。解决方案的关键在于提出一种自适应测量协方差缩放框架——通过离线训练一个大型教师模型以生成结构化UWB/PDR序列的时序一致位置预测,并将该行为蒸馏至轻量级学生模型中;该学生模型基于预测残差实时调节EKF测量协方差,实现无需人工重调参的环境感知融合,从而显著降低定位误差、抑制LOS/NLOS切换时的误差突变并缓解长期漂移。

链接: https://arxiv.org/abs/2603.18027
作者: Kyeonghyun Yoo,Wooyong Jung,Namkyung Yoon,Sangmin Lee,Sanghong Kim,Hwangnam Kim
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:Ultra-wideband (UWB) indoor localization provides centimeter-level accuracy and low latency, but its measurement reliability degrades severely under Non-Line-of-Sight (NLOS) conditions, leading to meter-scale ranging errors and inconsistent uncertainty characteristics. Inertial Measurement Unit (IMU)-based Pedestrian Dead Reckoning (PDR) complements UWB by providing infrastructure-free motion estimation; however, its error accumulates nonlinearly over time due to bias and noise propagation. Fusion methods based on Extended Kalman Filters (EKF) and Particle Filters (PF) can improve average localization accuracy through probabilistic state estimation. However, these approaches typically rely on manually tuned measurement covariances. Such fixed or heuristically tuned parameters are hard to sustain across varying indoor layouts, NLOS ratios, and motion patterns, leading to limited robustness and poor generalization of measurement uncertainty modeling in heterogeneous environments. To address this limitation, this work proposes an adaptive measurement covariance scaling framework in which reliability cues are learned from historical UWB/PDR trajectories. A large teacher model is employed offline to generate temporally consistent next-position predictions from structured UWB/PDR sequences, and this behavior is distilled into a lightweight student model suitable for real-time deployment. The student model continuously regulates EKF measurement covariances based on prediction residuals, enabling environment-aware fusion without manual re-tuning. Experimental results demonstrate that the proposed KD-EKF framework significantly reduces localization error, suppresses error spikes during Line-of-Sight (LOS)/NLOS transitions, and mitigates long-term drift compared to fixed-parameter EKF, thereby improving measurement robustness across diverse indoor environments.
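
KD-EKF 的核心思想——依据学生模型的预测残差缩放 EKF 量测协方差——可用如下一维标量卡尔曼更新粗略示意。门限与放大倍数均为假设参数,并非论文设定:

```python
def kalman_update(x, P, z, R):
    """标量卡尔曼量测更新(H = 1):返回新的状态与协方差。"""
    K = P / (P + R)
    return x + K * (z - x), (1.0 - K) * P

def adaptive_R(R_base, residual, gate=1.0, scale=10.0):
    """当(蒸馏得到的)预测器残差过大时放大量测协方差——
    例如疑似 NLOS 的测距尖峰——使滤波器转而依赖运动模型。"""
    return R_base * scale if abs(residual) > gate else R_base
```

残差异常时 R 被放大,量测对状态的拉动随之减小,从而抑制 LOS/NLOS 切换时的误差尖峰。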

[AI-120] Using Laplace Transform To Optimize the Hallucination of Generation Models

【速读】:该论文旨在解决生成式 AI(Generative AI)在运行过程中出现的“自信错误”(或称幻觉,hallucination)问题,即模型在输出时表现出高置信度但内容不准确的现象。其解决方案的关键在于将生成式模型(GMs)系统形式化为一类随机动力系统,并借助控制理论中的拉普拉斯变换(Laplace transform)分析方法,从宏观视角模拟系统的源响应(source response),从而识别和优化导致幻觉的内在机制。研究发现,训练过程与系统的响应行为具有一致性,这为设计更优的优化组件提供了理论依据,最终通过拉普拉斯变换分析从根本上缓解了生成式 AI 的幻觉问题。

链接: https://arxiv.org/abs/2603.18022
作者: Cheng Kang,Xinye Chen,Daniel Novak,Xujing Yao
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Corresponding author: Xujing Yao (xjyao@njtech. this http URL )

点击查看摘要

Abstract:To explore the feasibility of avoiding the confident error (or hallucination) of generation models (GMs), we formalise the system of GMs as a class of stochastic dynamical systems through the lens of control theory. Numerous factors can be attributed to the hallucination of the learning process of GMs, utilising knowledge of control theory allows us to analyse their system functions and system responses. Due to the high complexity of GMs when using various optimization methods, we cannot figure out their solution of Laplace transform, but from a macroscopic perspective, simulating the source response provides a virtual way to address the hallucination of GMs. We also find that the training progress is consistent with the corresponding system response, which offers us a useful way to develop a better optimization component. Finally, the hallucination problem of GMs is fundamentally optimized by using Laplace transform analysis.
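
以一阶系统 G(s) = 1/(s + a) 的单位阶跃响应为例,可以说明“拉普拉斯域解析解与时域数值仿真(系统响应)相互印证”的基本思路。以下草图与论文所讨论的生成模型系统并无直接对应,仅用于演示这一控制论分析手段:

```python
import math

def step_response_closed_form(a, t):
    """1/(s(s+a)) 的拉普拉斯反变换:y(t) = (1 - exp(-a*t)) / a。"""
    return (1.0 - math.exp(-a * t)) / a

def step_response_euler(a, t, dt=1e-4):
    """对 y' = 1 - a*y, y(0) = 0(单位阶跃输入)做前向欧拉仿真。"""
    y = 0.0
    for _ in range(int(t / dt)):
        y += dt * (1.0 - a * y)
    return y
```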

机器学习

[LG-0] Robustness Cost and Attack-Surface Concentration in Phishing Detection

链接: https://arxiv.org/abs/2603.19204
作者: Julian Allagan,Mohamed Elbakary,Zohreh Safari,Weizheng Gao,Gabrielle Morgan,Essence Morgan,Vladimir Deriglazov
类目: Machine Learning (cs.LG)
*备注: 14 pages, 4 figures, 9 tables

点击查看摘要

Abstract:Phishing detectors built on engineered website features attain near-perfect accuracy under i.i.d. evaluation, yet deployment security depends on robustness to post-deployment feature manipulation. We study this gap through a cost-aware evasion framework that models discrete, monotone feature edits under explicit attacker budgets. Three diagnostics are introduced: minimal evasion cost (MEC), the evasion survival rate S(B), and the robustness concentration index (RCI). On the UCI Phishing Websites benchmark (11,055 instances, 30 ternary features), Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost all achieve AUC ≥ 0.979 under static evaluation. Under budgeted sanitization-style evasion, robustness converges across architectures: the median MEC equals 2 with full features, and over 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Feature restriction improves robustness only when it removes all dominant low-cost transitions. Under strict cost schedules, infrastructure-leaning feature sets exhibit 17-19% infeasible mass for ensemble models, while the median MEC among evadable instances remains unchanged. We formalize this convergence: if a positive fraction of correctly detected phishing instances admit evasion through a single feature transition of minimal cost c_min, no classifier can raise the corresponding MEC quantile above c_min without modifying the feature representation or cost model. Adversarial robustness in phishing detection is governed by feature economics rather than model complexity.
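
最小规避代价(MEC)的含义可用一个小规模穷举示意:在给定每个特征编辑代价的前提下,找到能把检测结果从“钓鱼”翻转为“良性”的最便宜编辑组合。以下线性玩具判别器与代价表均为假设,仅说明这一诊断量,不是论文的求解器:

```python
from itertools import combinations, product

def minimal_evasion_cost(x, is_phishing, costs, values=(-1, 0, 1), max_edits=3):
    """穷举求最小规避代价(MEC):翻转检测器对实例 x 判定的
    最便宜特征编辑组合。x 为三值特征元组;仅在 max_edits 很小时可行。"""
    best = None
    for k in range(1, max_edits + 1):
        for feats in combinations(range(len(x)), k):
            for vals in product(values, repeat=k):
                y, cost = list(x), 0
                for f, v in zip(feats, vals):
                    if y[f] != v:
                        y[f] = v
                        cost += costs[f]
                if cost and not is_phishing(tuple(y)):
                    best = cost if best is None else min(best, cost)
    return best
```

若某个低代价特征单独一次编辑即可完成翻转,则 MEC 被该特征的代价“钉死”——这正是摘要中“特征经济学主导鲁棒性”的直观含义。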

[LG-1] Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment

链接: https://arxiv.org/abs/2603.19186
作者: Amir Asiaee,Samhita Pal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Randomized controlled trials (RCTs) are the gold standard for estimating heterogeneous treatment effects, yet they are often underpowered for detecting effect heterogeneity. Large observational studies (OS) can supplement RCTs for conditional average treatment effect (CATE) estimation, but a key barrier is covariate mismatch: the two sources measure different, only partially overlapping, covariates. We propose CALM (Calibrated ALignment under covariate Mismatch), which bypasses imputation by learning embeddings that map each source’s features into a common representation space. OS outcome models are transferred to the RCT embedding space and calibrated using trial data, preserving causal identification from randomization. Finite-sample risk bounds decompose into alignment error, outcome-model complexity, and calibration complexity terms, identifying when embedding alignment outperforms imputation. Under the calibration-based linear variant, the framework provides protection against negative transfer; the neural variant can be vulnerable under severe distributional shift. Under sparse linear models, the embedding approach strictly generalizes imputation. Simulations across 51 settings confirm that (i) calibration-based methods are equivalent for linear CATEs, and (ii) the neural embedding variant wins all 22 nonlinear-regime settings with large margins.

[LG-2] MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

链接: https://arxiv.org/abs/2603.19185
作者: Masoumeh Shafieinejad,Xi He,Mahshid Alinoori,John Jewell,Sana Ayromlou,Wei Pang,Veronica Chatrath,Garui Sharma,Deval Pandya
类目: Machine Learning (cs.LG)
*备注: 4 pages, 1 table

点击查看摘要

Abstract:Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models such as diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent diffusion models have proven effective across a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. The MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and for multi-relational tables with interconnected constraints. As a key outcome, MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at this https URL

[LG-3] DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge

链接: https://arxiv.org/abs/2603.19172
作者: Yuegui Huang,Zhiyuan Fang,Weiqi Luo,Ruoyu Wu,Wuhui Chen,Zibin Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the computational efficiency of MoE models, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. While existing static methods struggle with a rigid latency-accuracy trade-off, we observe that expert importance is highly skewed and depth-dependent. Motivated by these insights, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. Leveraging insights into expert importance skewness and depth-dependent sensitivity, DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.

[LG-4] Rigorous Error Certification for Neural PDE Solvers: From Empirical Residuals to Solution Guarantees

链接: https://arxiv.org/abs/2603.19165
作者: Amartya Mukherjee,Maxwell Fitzsimmons,David C. Del Rey Fernández,Jun Liu
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Functional Analysis (math.FA)
*备注: 35 pages

点击查看摘要

Abstract:Uncertainty quantification for partial differential equations is traditionally grounded in discretization theory, where solution error is controlled via mesh/grid refinement. Physics-informed neural networks fundamentally depart from this paradigm: they approximate solutions by minimizing residual losses at collocation points, introducing new sources of error arising from optimization, sampling, representation, and overfitting. As a result, the generalization error in the solution space remains an open problem. Our main theoretical contribution establishes generalization bounds that connect residual control to solution-space error. We prove that when neural approximations lie in a compact subset of the solution space, vanishing residual error guarantees convergence to the true solution. We derive deterministic and probabilistic convergence results and provide certified generalization bounds translating residual, boundary, and initial errors into explicit solution error guarantees.
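The residual-to-error relationship the abstract describes can be illustrated on a toy ODE. A hedged sketch, using a truncated-Taylor surrogate for $u' = u$, $u(0) = 1$ on $[0, 1]$ in place of a neural network (all choices here are illustrative, not the paper's construction):

```python
import numpy as np

# Surrogate "solution" and its residual at collocation points.
xs = np.linspace(0.0, 1.0, 101)
u = 1 + xs + xs**2 / 2 + xs**3 / 6          # truncated-Taylor surrogate
u_dx = 1 + xs + xs**2 / 2                   # its exact derivative

residual = np.max(np.abs(u_dx - u))         # ODE residual, sup norm
error = np.max(np.abs(u - np.exp(xs)))      # true solution-space error

print(f"residual={residual:.4f}, error={error:.4f}")
```

Here a small residual comes with a small solution error, which is exactly the kind of implication the paper's bounds make rigorous (and quantitative) for PDE settings.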

[LG-5] Enhancing Pretrained Model-based Continual Representation Learning via Guided Random Projection

链接: https://arxiv.org/abs/2603.19145
作者: Ruilin Li,Heming Zou,Xiufeng Yan,Zheming Liang,Jie Yang,Chenliang Li,Xue Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent paradigms in Random Projection Layer (RPL)-based continual representation learning have demonstrated superior performance when building upon a pre-trained model (PTM). These methods insert a randomly initialized RPL after a PTM to enhance feature representation in the initial stage. Subsequently, a linear classification head is used for analytic updates in the continual learning stage. However, under severe domain gaps between pre-trained representations and target domains, a randomly initialized RPL exhibits limited expressivity. While substantially scaling up the RPL dimension can improve expressivity, it also induces an ill-conditioned feature matrix, thereby destabilizing the recursive analytic updates of the linear head. To this end, we propose the Stochastic Continual Learner with MemoryGuard Supervisory Mechanism (SCL-MGSM). Unlike random initialization, MGSM constructs the projection layer via a principled, data-guided mechanism that progressively selects target-aligned random bases to adapt the PTM representation to downstream tasks. This facilitates the construction of a compact yet expressive RPL while improving the numerical stability of analytic updates. Extensive experiments on multiple exemplar-free Class Incremental Learning (CIL) benchmarks demonstrate that SCL-MGSM achieves superior performance compared to state-of-the-art methods.
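The RPL-plus-analytic-head recipe this work builds on can be sketched as a frozen random projection followed by a closed-form ridge solve; the proposed MGSM selection mechanism is not reproduced here, and all shapes and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen pre-trained-model features and binary labels.
X = rng.standard_normal((200, 32))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
Y = np.stack([1 - y, y], axis=1)              # one-hot targets, 2 classes

W_rp = rng.standard_normal((32, 256))         # random projection layer (frozen)
H = np.maximum(X @ W_rp, 0.0)                 # expanded nonlinear features

# Analytic (closed-form ridge) linear head; the ridge term keeps the
# solve well-conditioned, the failure mode the abstract warns about.
lam = 1e-2
W_head = np.linalg.solve(H.T @ H + lam * np.eye(256), H.T @ Y)

acc = np.mean((H @ W_head).argmax(axis=1) == y)
print(f"train accuracy: {acc:.2f}")
```

In continual-learning use, the Gram matrix `H.T @ H` and cross term `H.T @ Y` can be accumulated recursively across tasks, which is why conditioning of `H` matters so much.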

[LG-6] SHAPCA: Consistent and Interpretable Explanations for Machine Learning Models on Spectroscopy Data

链接: https://arxiv.org/abs/2603.19141
作者: Mingxing Zhang,Nicola Rossberg,Simone Innocente,Katarzyna Komolibus,Rekha Gautam,Barry O’Sullivan,Luca Longo,Andrea Visentin
类目: Machine Learning (cs.LG)
*备注: 25 pages, 6 figures

点击查看摘要

Abstract:In recent years, machine learning models have been increasingly applied to spectroscopic datasets for chemical and biomedical analysis. For their successful adoption, particularly in clinical and safety-critical settings, professionals and researchers must be able to understand and trust the reasoning behind model predictions. However, the inherently high dimensionality and strong collinearity of spectroscopy data pose a fundamental challenge to model explainability. These properties not only complicate model training but also undermine the stability and consistency of explanations, leading to fluctuations in feature importance across repeated training runs. Feature extraction techniques have been used to reduce the input dimensionality; however, the resulting features obscure the connection between the prediction and the original signal. This study proposes SHAPCA, an explainable machine learning pipeline that combines Principal Component Analysis (for dimensionality reduction) and SHapley Additive exPlanations (for post hoc explanation) to provide explanations in the original input space, which a practitioner can interpret and link back to the biological components. The proposed framework enables analysis from both global and local perspectives, revealing the spectral bands that drive overall model behaviour as well as the instance-specific features that influence individual predictions. Numerical analysis demonstrated the interpretability of the results and greater consistency across different runs.
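The back-projection idea can be illustrated in the linear case, where PC-space attributions map exactly onto original-feature attributions. The sketch below uses synthetic "spectra" and a plain linear model in place of the paper's fitted model and SHAP explainer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic collinear spectra: 300 samples, 50 "wavelengths".
X = rng.standard_normal((300, 50)) @ rng.standard_normal((50, 50)) * 0.1
mu = X.mean(axis=0)
Xc = X - mu

# PCA via SVD, keeping 10 components.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:10]                  # (10, 50) loadings
Z = Xc @ components.T                 # PC scores

# A linear model fit in PC space (coefficients are illustrative).
c = rng.standard_normal(10)
f = Z @ c

# Map PC-space weights back to the original wavelength axis, giving
# per-wavelength attributions for a single sample.
w_orig = components.T @ c             # (50,)
phi = w_orig * Xc[0]

# Additivity check: attributions sum to f(x0) minus the mean prediction.
print(abs(phi.sum() - (f[0] - f.mean())))
```

For a linear model this back-projection is exact; for nonlinear models, SHAP values computed in PC space are pushed through the same loadings matrix, which is the mechanism the pipeline relies on.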

[LG-7] Hierarchical Latent Structure Learning through Online Inference

链接: https://arxiv.org/abs/2603.19139
作者: Ines Aitsahalia,Kiyohito Iigaya
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 4 figures, 5 supplementary figures

点击查看摘要

Abstract:Learning systems must balance generalization across experiences with discrimination of task-relevant details. Effective learning therefore requires representations that support both. Online latent-cause models support incremental inference but assume flat partitions, whereas hierarchical Bayesian models capture multilevel structure but typically require offline inference. We introduce the Hierarchical Online Learning of Multiscale Experience Structure (HOLMES) model, a computational framework for hierarchical latent structure learning through online inference. HOLMES combines a variation on the nested Chinese Restaurant Process prior with sequential Monte Carlo inference to perform tractable trial-by-trial inference over hierarchical latent representations without explicit supervision over the latent structure. In simulations, HOLMES matched the predictive performance of flat models while learning more compact representations that supported one-shot transfer to higher-level latent categories. In a context-dependent task with nested temporal structure, HOLMES also improved outcome prediction relative to flat models. These results provide a tractable computational framework for discovering hierarchical structure in sequential data.

[LG-8] From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

链接: https://arxiv.org/abs/2603.19131
作者: Zhuofan Li,Hongkun Yang,Zhenyang Chen,Yangxuan Chen,Yingyan(Celine)Lin,Chaojian Li
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.
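The system-level metrics the abstract advocates can be computed from any sampled joint trajectory. A minimal sketch with a synthetic two-joint trajectory (sampling rate and formulas are illustrative choices, not the paper's definitions):

```python
import numpy as np

dt = 0.05                                                # sampling period (s)
t = np.arange(0.0, 2.0, dt)
q = np.stack([np.sin(t), 0.5 * np.cos(2 * t)], axis=1)   # 2-joint trajectory (rad)

vel = np.gradient(q, dt, axis=0)                         # finite-difference kinematics
acc = np.gradient(vel, dt, axis=0)
jerk = np.gradient(acc, dt, axis=0)

completion_time = t[-1] - t[0]                           # task completion time (s)
cum_rotation = np.abs(np.diff(q, axis=0)).sum()          # cumulative joint rotation (rad)
mean_sq_jerk = float(np.mean(jerk ** 2))                 # trajectory smoothness proxy
motion_energy = float(np.sum(vel ** 2) * dt)             # kinetic-energy proxy

print(completion_time, cum_rotation, mean_sq_jerk, motion_energy)
```

Logging these quantities per episode, alongside success rate, is enough to reproduce the kind of comparison the paper argues conventional FLOP/latency metrics miss.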

[LG-9] On Optimizing Multimodal Jailbreaks for Spoken Language Models INTERSPEECH2026

链接: https://arxiv.org/abs/2603.19127
作者: Aravind Krishnan,Karolina Stańczak,Dietrich Klakow
类目: Machine Learning (cs.LG)
*备注: Under Review at INTERSPEECH 2026

点击查看摘要

Abstract:As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have previously been shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses. Yet existing attacks largely remain unimodal, optimizing either text or audio in isolation. We explore gradient-based multimodal jailbreaks by introducing JAMA (Joint Audio-text Multimodal Attack), a joint multimodal optimization framework combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio, to simultaneously perturb both modalities. Evaluations across four state-of-the-art SLMs and four audio types demonstrate that JAMA surpasses unimodal jailbreak rates by 1.5x to 10x. We analyze the operational dynamics of this joint attack and show that a sequential approximation method makes it 4x to 6x faster. Our findings suggest that unimodal safety is insufficient for robust SLMs. The code and data are available at this https URL

[LG-10] Position: Spectral GNNs Are Neither Spectral Nor Superior for Node Classification

链接: https://arxiv.org/abs/2603.19091
作者: Qin Jiang,Chengjia Wang,Michael Lones,Dongdong Chen,Wei Pang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spectral Graph Neural Networks (Spectral GNNs) for node classification promise frequency-domain filtering on graphs, yet rest on flawed foundations. Recent work shows that graph Laplacian eigenvectors do not in general have the key properties of a true Fourier basis, but leaves the empirical success of Spectral GNNs unexplained. We identify two theoretical glitches: (1) commonly used “graph Fourier bases” are not classical Fourier bases for graph signals; (2) (n-1)-degree polynomials (n = number of nodes) can exactly interpolate any spectral response via a Vandermonde system, so the usual “polynomial approximation” narrative is not theoretically justified. The effectiveness of GCN is commonly attributed to spectral low-pass filtering, yet we prove that low- and high-pass behaviors arise solely from message-passing dynamics rather than Graph Fourier Transform-based spectral formulations. We then analyze two representative directed spectral models, MagNet and HoloNet. Their reported effectiveness is not spectral: it arises from implementation issues that reduce them to powerful MPNNs. When implemented consistently with the claimed spectral algorithms, performance becomes weak. This position paper argues that: for node classification, Spectral GNNs neither meaningfully capture the graph spectrum nor reliably improve performance; competitive results are better explained by their equivalence to MPNNs, sometimes aided by implementations inconsistent with their intended design.
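The Vandermonde argument can be checked numerically: on a graph whose Laplacian eigenvalues are distinct, a degree-$(n-1)$ polynomial reproduces any spectral response exactly. A small sketch on a 6-node path graph (the target response is random, standing in for "any" filter):

```python
import numpy as np

rng = np.random.default_rng(2)

# Path graph on 6 nodes: its Laplacian eigenvalues are distinct.
n = 6
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)

g = rng.standard_normal(n)                 # arbitrary spectral response g(lambda)
V = np.vander(lam, n, increasing=True)     # Vandermonde system V c = g
coef = np.linalg.solve(V, g)               # polynomial coefficients

# Apply the polynomial to L and compare with the exact spectral filter.
filt_poly = sum(c * np.linalg.matrix_power(L, k) for k, c in enumerate(coef))
filt_exact = U @ np.diag(g) @ U.T
print(np.max(np.abs(filt_poly - filt_exact)))
```

Because the interpolation is exact, "low-degree polynomial approximation" cannot by itself explain what spectral filters can or cannot express, which is the point the abstract presses.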

[LG-11] Communication-Efficient and Robust Multi-Modal Federated Learning via Latent-Space Consensus

链接: https://arxiv.org/abs/2603.19067
作者: Mohamed Badi,Chaouki Ben Issaid,Mehdi Bennis
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted for publication in IEEE Wireless Communications Letters

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, but applying FL to multi-modal settings introduces significant challenges. Clients typically possess heterogeneous modalities and model architectures, making it difficult to align feature spaces efficiently while preserving privacy and minimizing communication costs. To address this, we introduce CoMFed, a Communication-Efficient Multi-Modal Federated Learning framework that uses learnable projection matrices to generate compressed latent representations. A latent-space regularizer aligns these representations across clients, improving cross-modal consistency and robustness to outliers. Experiments on human activity recognition benchmarks show that CoMFed achieves competitive accuracy with minimal overhead.

[LG-12] Hardness of High-Dimensional Linear Classification

链接: https://arxiv.org/abs/2603.19061
作者: Alexander Munteanu,Simon Omlor,Jeff M. Phillips
类目: Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: SoCG 2026

点击查看摘要

Abstract:We establish new exponential-in-dimension lower bounds for the Maximum Halfspace Discrepancy problem, which models linear classification. Both its exact and approximate forms are fundamental problems in computational geometry and machine learning. However, only $O(n^d)$ and, respectively, $\tilde O(1/\varepsilon^d)$ upper bounds are known, complemented by polynomial lower bounds that do not support the exponential dependence on the dimension. We close this gap up to polylogarithmic terms by reduction from widely-believed hardness conjectures for the Affine Degeneracy testing and $k$-Sum problems. Our reductions yield matching lower bounds of $\tilde\Omega(n^d)$ and, respectively, $\tilde\Omega(1/\varepsilon^d)$ based on Affine Degeneracy testing, and $\tilde\Omega(n^{d/2})$ and, respectively, $\tilde\Omega(1/\varepsilon^{d/2})$ conditioned on $k$-Sum. The first bound also holds unconditionally if the computational model is restricted to sidedness queries, a widely used setting implemented and optimized in many contemporary algorithms and computing paradigms.

[LG-13] When Differential Privacy Meets Wireless Federated Learning: An Improved Analysis for Privacy and Convergence

链接: https://arxiv.org/abs/2603.19040
作者: Chen Yaoling,Liang Hao,Tu Xiaotong
类目: Machine Learning (cs.LG)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Differentially private wireless federated learning (DPWFL) is a promising framework for protecting sensitive user data. However, foundational questions on how to precisely characterize privacy loss remain open, and existing work is further limited by convergence analyses that rely on restrictive convexity assumptions or ignore the effect of gradient clipping. To overcome these issues, we present a comprehensive analysis of privacy and convergence for DPWFL with general smooth non-convex loss objectives. Our analysis explicitly incorporates both device selection and mini-batch sampling, and shows that the privacy loss can converge to a constant rather than diverge with the number of iterations. Moreover, we establish convergence guarantees with gradient clipping and derive an explicit privacy-utility trade-off. Numerical results validate our theoretical findings.
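The clipped, Gaussian-noised update at the heart of such analyses can be sketched as follows; device selection and mini-batch sampling are simplified away, and all constants (clip norm, noise multiplier, step size) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

C, sigma, lr = 1.0, 0.8, 0.1        # clip norm, noise multiplier, step size

def clip(g, C):
    """Scale g so its L2 norm is at most C."""
    norm = np.linalg.norm(g)
    return g * min(1.0, C / norm) if norm > 0 else g

# Per-device gradients for one round (synthetic stand-ins).
grads = [rng.standard_normal(4) for _ in range(8)]
clipped = [clip(g, C) for g in grads]

# Aggregate, add calibrated Gaussian noise, and take a step.
noisy_avg = (sum(clipped) + sigma * C * rng.standard_normal(4)) / len(grads)
w = np.zeros(4)
w = w - lr * noisy_avg
print(w, max(np.linalg.norm(g) for g in clipped))
```

Clipping bounds each device's sensitivity to $C$, so the Gaussian noise with standard deviation $\sigma C$ yields the per-round privacy guarantee; the paper's contribution is analyzing how this loss composes over rounds and how clipping affects convergence.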

[LG-14] Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference

链接: https://arxiv.org/abs/2603.19025
作者: Pranay Anchuri,Matteo Campanelli,Paul Cesaretti,Rosario Gennaro,Tushar M. Jois,Hasan S. Kayman,Tugce Ozdemir
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 49 pages, 14 figures. Accepted at IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2026

点击查看摘要

Abstract:When large AI models are deployed as cloud-based services, clients have no guarantee that responses are correct or were produced by the intended model. Rerunning inference locally is infeasible for large models, and existing cryptographic proof systems – while providing strong correctness guarantees – introduce prohibitive prover overhead (e.g., hundreds of seconds per query for billion-parameter models). We present a verification framework and protocol that replaces full cryptographic proofs with a lightweight, sampling-based approach grounded in statistical properties of neural networks. We formalize the conditions under which trace separation between functionally dissimilar models can be leveraged to argue the security of verifiable inference protocols. The prover commits to the execution trace of inference via Merkle-tree-based vector commitments and opens only a small number of entries along randomly sampled paths from output to input. This yields a protocol that trades soundness for efficiency, a tradeoff well-suited to auditing, large-scale deployment settings where repeated queries amplify detection probability, and scenarios with rationally incentivized provers who face penalties upon detection. Our approach reduces proving times by several orders of magnitude compared to state-of-the-art cryptographic proof systems, going from the order of minutes to the order of milliseconds, with moderately larger proofs. Experiments on ResNet-18 classifiers and Llama-2-7B confirm that common architectures exhibit the statistical properties our protocol requires, and that natural adversarial strategies (gradient-descent reconstruction, inverse transforms, logit swapping) fail to produce traces that evade detection. We additionally present a protocol in the refereed delegation model, where two competing servers enable correct output identification in a logarithmic number of rounds.
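The commit-then-sample mechanism can be illustrated with a toy Merkle tree over a fake execution trace (hash choice, trace contents, and sampled indices are illustrative, not the paper's protocol):

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """Commitment: the root hash of a Merkle tree over the trace."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_path(leaves, idx):
    """Sibling hashes needed to verify leaf idx against the root."""
    level = [h(x) for x in leaves]
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        path.append((level[idx ^ 1], idx % 2))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return path

def verify(root, leaf, path):
    node = h(leaf)
    for sib, is_right in path:
        node = h(sib + node) if is_right else h(node + sib)
    return node == root

# Prover commits to the trace; verifier opens a few sampled entries.
trace = [f"layer-{i}-activations".encode() for i in range(8)]
root = merkle_root(trace)
for i in (2, 5):                      # "randomly sampled" openings
    print(verify(root, trace[i], merkle_path(trace, i)))
```

Each opening costs only a logarithmic number of hashes, which is why sampling a few paths is so much cheaper than a full cryptographic proof of the whole computation.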

[LG-15] Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

链接: https://arxiv.org/abs/2603.18972
作者: S. Akash,Pratik Gajane,Jawar Singh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-dueling bandits, where a learner selects $m \geq 2$ arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose MetaDueling, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with Versatile-DB yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret against adversarial preferences and the instance-optimal $O\!\left(\sum_{i \neq a^\star} \frac{\log T}{\Delta_i}\right)$ pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose AlgBorda, a stochastic-and-adversarial algorithm that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i:\, \Delta_i^{\mathrm{B}} > 0} \frac{K \log KT}{(\Delta_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T} \log KT + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature.

[LG-16] Maximum-Entropy Exploration with Future State-Action Visitation Measures

链接: https://arxiv.org/abs/2603.18965
作者: Adrien Bolland,Gaspard Lambrechts,Damien Ernst
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2412.06655

点击查看摘要

Abstract:Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
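The discounted visitation distribution whose entropy drives this style of intrinsic reward can be computed in closed form on a small Markov chain. A sketch with an illustrative chain and initial distribution (states stand in for the paper's state-action features):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],     # row-stochastic transition matrix
              [0.0, 0.5, 0.5],     # (policy already folded in)
              [0.3, 0.0, 0.7]])
mu0 = np.array([1.0, 0.0, 0.0])    # initial state distribution

# d = (1 - gamma) * sum_t gamma^t * mu0 P^t, via a single linear solve.
d = (1 - gamma) * np.linalg.solve(np.eye(3) - gamma * P.T, mu0)
entropy = -np.sum(d * np.log(d))
print(d, d.sum(), entropy)
```

Maximum-entropy exploration rewards the agent for making `d` as uniform as possible; the paper's contribution is showing that the analogous distribution over *future* visitations is a contraction fixed point and can therefore be estimated off-policy.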

[LG-17] BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery

链接: https://arxiv.org/abs/2603.18957
作者: Sijian Fan,Liyan Xiong,Dayuan Wang,Guoshuai Cai,Ray Bai
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Recent advances in drug discovery have demonstrated that incorporating side information (e.g., chemical properties about drugs and genomic information about diseases) often greatly improves prediction performance. However, these side features can vary widely in relevance and are often noisy and high-dimensional. We propose Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC), a new Bayesian model that enables variable selection from side features in drug discovery. By learning sparse latent embeddings, BVSIMC improves both predictive accuracy and interpretability. We validate our method through simulation studies and two drug discovery applications: 1) prediction of drug resistance in Mycobacterium tuberculosis, and 2) prediction of new drug-disease associations in computational drug repositioning. On both synthetic and real data, BVSIMC outperforms several other state-of-the-art methods in terms of prediction. In our two real examples, BVSIMC further reveals the most clinically meaningful side features.

[LG-18] Balancing Performance and Fairness in Explainable AI for Anomaly Detection in Distributed Power Plants Monitoring

链接: https://arxiv.org/abs/2603.18954
作者: Corneille Niyonkuru,Marcellin Atemkeng,Gabin Maxime Nguegnang,Arnaud Nguembang Fadja
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable anomaly detection in distributed power plant monitoring systems is essential for ensuring operational continuity and reducing maintenance costs, particularly in regions where telecom operators heavily rely on diesel generators. However, this task is challenged by extreme class imbalance, lack of interpretability, and potential fairness issues across regional clusters. In this work, we propose a supervised ML framework that integrates ensemble methods (LightGBM, XGBoost, Random Forest, CatBoost, GBDT, AdaBoost) and baseline models (Support Vector Machine, K-Nearest Neighbors, Multilayer Perceptrons, and Logistic Regression) with advanced resampling techniques (SMOTE with Tomek Links and ENN) to address imbalance in a dataset of diesel generator operations in Cameroon. Interpretability is achieved through SHAP (SHapley Additive exPlanations), while fairness is quantified using the Disparate Impact Ratio (DIR) across operational clusters. We further evaluate model generalization using Maximum Mean Discrepancy (MMD) to capture domain shifts between regions. Experimental results show that ensemble models consistently outperform baselines, with LightGBM achieving an F1-score of 0.99 and minimal bias across clusters (DIR $\approx 0.95$). SHAP analysis highlights fuel consumption rate and runtime per day as dominant predictors, providing actionable insights for operators. Our findings demonstrate that it is possible to balance performance, interpretability, and fairness in anomaly detection, paving the way for more equitable and explainable AI systems in industrial power management. Finally, beyond offline evaluation, we also discuss how the trained models can be deployed in practice for real-time monitoring. We show how containerized services can process data in real time, deliver low-latency predictions, and provide interpretable outputs for operators.
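The Disparate Impact Ratio used for the fairness audit reduces to a ratio of per-cluster positive-flag rates. A minimal sketch on synthetic labels (the flags, cluster labels, and resulting value are illustrative; the paper's DIR of about 0.95 is not reproduced here):

```python
import numpy as np

# Synthetic anomaly flags (1 = flagged) and regional cluster labels.
flags = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
clusters = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

# Per-cluster flag rates, then DIR = min rate / max rate.
rates = {c: flags[clusters == c].mean() for c in ("A", "B")}
dir_value = min(rates.values()) / max(rates.values())
print(rates, dir_value)   # DIR close to 1 means similar flag rates
```

A common rule of thumb treats a DIR below 0.8 as a potential fairness concern, which is why values near 1 (as reported in the abstract) indicate minimal cross-cluster bias.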

[LG-19] Context Bootstrapped Reinforcement Learning

链接: https://arxiv.org/abs/2603.18953
作者: Saaket Agashe,Jayanth Srinivasa,Gaowen Liu,Ramana Kompella,Xin Eric Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL’s practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
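The stochastic injection curriculum can be sketched as an annealed probability of prepending demonstrations; the linear schedule, constants, and demo strings below are illustrative, not the paper's:

```python
import random

def injection_prob(step, total_steps, p0=0.9):
    """Linear anneal from p0 down to 0 over training (illustrative schedule)."""
    return max(0.0, p0 * (1.0 - step / total_steps))

def build_prompt(task_prompt, demos, step, total_steps, rng):
    """Stochastically prepend few-shot demos to the training prompt."""
    if rng.random() < injection_prob(step, total_steps):
        return "\n\n".join(demos) + "\n\n" + task_prompt
    return task_prompt

rng = random.Random(0)
demos = ["Q: 2+2?\nA: 4", "Q: 3*3?\nA: 9"]
prompt = build_prompt("Q: 5+5?\nA:", demos, step=0, total_steps=1000, rng=rng)

for step in (0, 500, 1000):
    print(step, round(injection_prob(step, 1000), 2))
```

Early in training, most rollouts see the demonstrations, which bootstraps exploration; by the end, the probability reaches zero, so the policy must succeed unassisted, matching test-time conditions.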

[LG-20] An Optimised Greedy-Weighted Ensemble Framework for Financial Loan Default Prediction

链接: https://arxiv.org/abs/2603.18927
作者: Ezekiel Nii Noye Nortey,Jones Asante-Koranteng,Marcellin Atemkeng,Theophilus Ansah-Narh,David Mensah,Rebecca Davis,Ravenhill Adjetey Laryea
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of loan defaults is a central challenge in credit risk management, particularly in modern financial datasets characterised by nonlinear relationships, class imbalance, and evolving borrower behaviour. Traditional statistical models and static ensemble methods often struggle to maintain reliable performance under such conditions. This study proposes an Optimised Greedy-Weighted Ensemble framework for loan default prediction that dynamically allocates model weights based on empirical predictive performance. The framework integrates multiple machine learning classifiers, with their hyperparameters first optimised using Particle Swarm Optimisation. Model predictions are then combined via a regularised greedy weighting mechanism, while a neural-network-based meta-learner is employed within a stacked ensemble to capture higher-order relationships among model outputs. Experiments conducted on the Lending Club dataset demonstrate that the proposed framework improves predictive performance compared with individual classifiers. The BlendNet ensemble achieved the strongest results with an AUC of 0.80, a macro-average F1-score of 0.73, and a default recall of 0.81. Calibration analysis further shows that tree-based ensembles such as Extra Trees and Gradient Boosting provide the most reliable probability estimates, while the stacked ensemble offers superior ranking capability. Feature analysis using Recursive Feature Elimination identifies revolving utilisation, annual income, and debt-to-income ratio as the most influential predictors of loan default. These findings demonstrate that performance-driven ensemble weighting can improve both predictive accuracy and interpretability in credit risk modelling. The proposed framework provides a scalable, data-driven approach to support institutional credit assessment, risk monitoring, and financial decision-making.
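The greedy weighting mechanism resembles Caruana-style forward selection with replacement; the sketch below uses synthetic base-model probabilities and plain log loss, without the paper's PSO tuning or neural meta-learner:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic default labels and three base models' predicted probabilities,
# with model 0 the most accurate (noise levels are illustrative).
y = rng.integers(0, 2, 500).astype(float)
preds = np.clip(
    np.stack([y + rng.normal(0, s, 500) for s in (0.3, 0.5, 0.9)]), 0, 1
)

def log_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Greedy forward selection with replacement: each round, add (one more
# "vote" for) whichever base model most reduces the ensemble's loss.
counts = np.zeros(len(preds))
for _ in range(50):
    losses = [
        log_loss(preds.T @ (counts + np.eye(len(preds))[j]) /
                 (counts.sum() + 1), y)
        for j in range(len(preds))
    ]
    counts[int(np.argmin(losses))] += 1

weights = counts / counts.sum()
print(weights)
```

Selection with replacement acts as an implicit regulariser: weights are multiples of 1/rounds, so no single round can swing the ensemble, and stronger models accumulate weight gradually.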

[LG-21] Neural Galerkin Normalizing Flow for Transition Probability Density Functions of Diffusion Models

链接: https://arxiv.org/abs/2603.18907
作者: Riccardo Saporiti,Fabio Nobile
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:We propose a new Neural Galerkin Normalizing Flow framework to approximate the transition probability density function of a diffusion process by solving the corresponding Fokker-Planck equation with an atomic initial distribution, parametrically with respect to the location of the initial mass. By using Normalizing Flows, we look for the solution as a transformation of the transition probability density function of a reference stochastic process, ensuring that our approximation is structure-preserving and automatically satisfies positivity and mass conservation constraints. By extending Neural Galerkin schemes to the context of Normalizing Flows, we derive a system of ODEs for the time evolution of the Normalizing Flow’s parameters. Adaptive sampling routines are used to evaluate the Fokker-Planck residual in meaningful locations, which is of vital importance to address high-dimensional PDEs. Numerical results show that this strategy captures key features of the true solution and enforces the causal relationship between the initial datum and the density function at subsequent times. After completing an offline training phase, online evaluation becomes significantly more cost-effective than solving the PDE from scratch. The proposed method serves as a promising surrogate model, which could be deployed in many-query problems associated with stochastic differential equations, like Bayesian inference, simulation, and diffusion bridge generation.

[LG-22] Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization method

链接: https://arxiv.org/abs/2603.18899
作者: Steffen Dereich,Thang Do,Arnulf Jentzen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 34 pages

点击查看摘要

Abstract:The adaptive moment estimation (Adam) optimizer proposed by Kingma & Ba (2014) is presumably the most popular stochastic gradient descent (SGD) optimization method for the training of deep neural networks (DNNs) in artificial intelligence (AI) systems. Despite its groundbreaking success in the training of AI systems, it still remains an open research problem to provide a complete error analysis of Adam, not only for optimizing DNNs but even when applied to strongly convex stochastic optimization problems (SOPs). Previous error analysis results for strongly convex SOPs in the literature provide conditional convergence analyses that rely on the assumption that Adam does not diverge to infinity but remains uniformly bounded. It is the key contribution of this work to establish uniform a priori bounds for Adam and, thereby, to provide – for the first time – an unconditional error analysis for Adam for a large class of strongly convex SOPs.
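For reference, the Adam recursion analysed in the paper (hyperparameter names follow Kingma & Ba, 2014) takes only a few lines; applying it to a simple strongly convex quadratic is our own illustration, not the paper's experiment:

```python
import numpy as np

def adam(grad, x0, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam with bias-corrected first and second moments."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g         # first-moment estimate
        v = b2 * v + (1 - b2) * g * g     # second-moment estimate
        m_hat = m / (1 - b1 ** t)         # bias correction
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Strongly convex SOP: f(x) = ||x||^2 / 2, so grad f(x) = x.
x_final = adam(lambda x: x, x0=[5.0, -3.0])
```

The a priori bounds of the paper concern exactly such iterates: whether `x` remains uniformly bounded along the trajectory without assuming it in advance.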

[LG-23] Authority-Level Priors: An Under-Specified Constraint in Hierarchical Predictive Processing

链接: https://arxiv.org/abs/2603.18888
作者: Marcela Palejova
类目: Machine Learning (cs.LG)
*备注: 26 pages, 1 figure

点击查看摘要

Abstract:Hierarchical predictive processing explains adaptive behaviour through precision-weighted inference. Explicit belief revision often fails to produce corresponding changes in stress reactivity or autonomic regulation. This asymmetry suggests the framework leaves under-specified a governance-level constraint concerning which identity-level hypotheses regulate autonomic and behavioural control under uncertainty. We introduce Authority-Level Priors (ALPs) as meta-structural constraints defining a regulatory-admissible subset (H_auth ⊂ H) of identity-level hypotheses. ALPs are not additional representational states nor hyperpriors over precision; they constrain which hypotheses are admissible for regulatory control. Precision determines influence conditional on admissibility; ALPs determine admissibility itself. This explains why explicit belief updating modifies representational beliefs while autonomic threat responses remain stable. A computational formalisation restricts policy optimisation to policies generated by authorised hypotheses, yielding testable predictions concerning stress-reactivity dynamics, recovery time constants, compensatory control engagement, and behavioural persistence. Neurobiologically, ALPs manifest through distributed prefrontal arbitration and control networks. The proposal is compatible with variational active inference and introduces no additional inferential operators, instead formalising a boundary condition required for determinate identity-regulation mapping. The model generates falsifiable predictions: governance shifts should produce measurable changes in stress-reactivity curves, recovery dynamics, compensatory cognitive effort, and behavioural change durability. ALPs are advanced as an architectural hypothesis to be evaluated through computational modelling and longitudinal stress-induction paradigms.

[LG-24] DriftGuard: Mitigating Asynchronous Data Drift in Federated Learning

链接: https://arxiv.org/abs/2603.18872
作者: Yizhou Han,Di Wu,Blesson Varghese
类目: Machine Learning (cs.LG)
*备注: 13 pages, 9 figures

点击查看摘要

Abstract:In real-world Federated Learning (FL) deployments, data distributions on devices that participate in training evolve over time. This leads to asynchronous data drift, where different devices shift at different times and toward different distributions. Mitigating such drift is challenging: frequent retraining incurs high computational cost on resource-constrained devices, while infrequent retraining degrades performance on drifting devices. We propose DriftGuard, a federated continual learning framework that efficiently adapts to asynchronous data drift. DriftGuard adopts a Mixture-of-Experts (MoE) inspired architecture that separates shared parameters, which capture globally transferable knowledge, from local parameters that adapt to group-specific distributions. This design enables two complementary retraining strategies: (i) global retraining, which updates the shared parameters when system-wide drift is identified, and (ii) group retraining, which selectively updates local parameters for clusters of devices identified via MoE gating patterns, without sharing raw data. Experiments across multiple datasets and models show that DriftGuard matches or exceeds state-of-the-art accuracy while reducing total retraining cost by up to 83%. As a result, it achieves the highest accuracy per unit retraining cost, improving over the strongest baseline by up to 2.3x. DriftGuard is available for download from this https URL.

[LG-25] RadioDiff-FS: Physics-Informed Manifold Alignment in Few-Shot Diffusion Models for High-Fidelity Radio Map Construction

链接: https://arxiv.org/abs/2603.18865
作者: Xiucheng Wang,Zixuan Guo,Nan Cheng
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Radio maps (RMs) provide spatially continuous propagation characterizations essential for 6G network planning, but high-fidelity RM construction remains challenging. Rigorous electromagnetic solvers incur prohibitive computational latency, while data-driven models demand massive labeled datasets and generalize poorly from simplified simulations to complex multipath environments. This paper proposes RadioDiff-FS, a few-shot diffusion framework that adapts a pre-trained main-path generator to multipath-rich target domains with only a small number of high-fidelity samples. The adaptation is grounded in a theoretical decomposition of the multipath RM into a dominant main-path component and a directionally sparse residual. This decomposition shows that the cross-domain shift corresponds to a bounded and geometrically structured feature translation rather than an arbitrary distribution change. A Direction-Consistency Loss (DCL) is then introduced to constrain diffusion score updates along physically plausible propagation directions, suppressing phase-inconsistent artifacts that arise in the low-data regime. Experiments show that RadioDiff-FS reduces NMSE by 59.5% on static RMs and by 74.0% on dynamic RMs relative to the vanilla diffusion baseline, achieving an SSIM of 0.9752 and a PSNR of 36.37 dB under severely limited supervision.

[LG-26] BeamAgent: LLM-Aided MIMO Beamforming with Decoupled Intent Parsing and Alternating Optimization for Joint Site Selection and Precoding

链接: https://arxiv.org/abs/2603.18855
作者: Xiucheng Wang,Yue Zhang,Nan Cheng
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Integrating large language models (LLMs) into wireless communication optimization is a promising yet challenging direction. Existing approaches either use LLMs as black-box solvers or code generators, tightly coupling them with numerical computation. However, LLMs lack the precision required for physical-layer optimization, and the scarcity of wireless training data makes domain-specific fine-tuning impractical. We propose BeamAgent, an LLM-aided MIMO beamforming framework that explicitly decouples semantic intent parsing from numerical optimization. The LLM serves solely as a semantic translator that converts natural language descriptions into structured spatial constraints. A dedicated gradient-based optimizer then jointly solves the discrete base station site selection and continuous precoding design through an alternating optimization algorithm. A scene-aware prompt enables grounded spatial reasoning without fine-tuning, and a multi-round interaction mechanism with dual-layer intent classification ensures robust constraint verification. A penalty-based loss function enforces dark-zone power constraints while releasing optimization degrees of freedom for bright-zone gain maximization. Experiments on a ray-tracing-based urban MIMO scenario show that BeamAgent achieves a bright-zone power of 84.0 dB, outperforming exhaustive zero-forcing by 7.1 dB under the same dark-zone constraint. The end-to-end system reaches within 3.3 dB of the expert upper bound, with the full optimization completing in under 2 s on a laptop.

[LG-27] Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments

链接: https://arxiv.org/abs/2603.18853
作者: Xiucheng Wang,Zhenye Chen,Nan Cheng
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous aerial vehicles (AAVs) empower sixth-generation (6G) Internet-of-Things (IoT) networks through mobility-driven data collection. However, conventional reward-driven reinforcement learning for AAV trajectory planning suffers from severe credit assignment issues and training instability, because sparse scalar rewards fail to capture the long-term and nonlinear effects of sequential movements. To address these challenges, this paper proposes Learn for Variation (L4V), a gradient-informed trajectory learning framework that replaces high-variance scalar reward signals with dense and analytically grounded policy gradients. Particularly, the coupled evolution of AAV kinematics, distance-dependent channel gains, and per-user data-collection progress is first unrolled into an end-to-end differentiable computational graph. Backpropagation through time then serves as a discrete adjoint solver, which propagates exact sensitivities from the cumulative mission objective to every control action and policy parameter. These structured gradients are used to train a deterministic neural policy with temporal smoothness regularization and gradient clipping. Extensive simulations demonstrate that L4V consistently outperforms representative baselines, including a genetic algorithm, DQN, A2C, and DDPG, in mission completion time, average transmission rate, and training cost.

[LG-28] A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction

链接: https://arxiv.org/abs/2603.18838
作者: Zhouting Zhao,Tin Lok James Ng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Striking an optimal balance between predictive performance and fairness continues to be a fundamental challenge in machine learning. In this work, we propose a post-processing framework that facilitates fairness-aware prediction by leveraging model ensembling. Designed to operate independently of any specific model internals, our approach is widely applicable across various learning tasks, model architectures, and fairness definitions. Through extensive experiments spanning classification, regression, and survival analysis, we demonstrate that the framework effectively enhances fairness while maintaining, or only minimally affecting, predictive accuracy.

[LG-29] Model Order Reduction of Cerebrovascular Hemodynamics Using POD-Galerkin and Reservoir Computing-based Approach

链接: https://arxiv.org/abs/2603.18837
作者: Rahul Halder,Arash Hajisharifi,Kabir Bakhshaei,Gianluigi Rozza
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 24 pages, 15 figures

点击查看摘要

Abstract:We investigate model order reduction (MOR) strategies for simulating unsteady hemodynamics within cerebrovascular systems, contrasting a physics-based intrusive approach with a data-driven non-intrusive framework. High-fidelity 3D Computational Fluid Dynamics (CFD) snapshots of an idealised basilar artery bifurcation are first compressed into a low-dimensional latent space using Proper Orthogonal Decomposition (POD). We evaluate the performance of a POD-Galerkin (POD-G) model, which projects the Navier-Stokes equations onto the reduced basis, against a POD-Reservoir Computing (POD-RC) model that learns the temporal evolution of coefficients through a recurrent architecture. A multi-harmonic and multi-amplitude training signal is introduced to improve training efficiency. Both methodologies achieve computational speed-ups on the order of 10^2 to 10^3 compared to full-order simulations, demonstrating their potential as efficient and accurate surrogates for predicting flow quantities such as wall shear stress.
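The POD compression step shared by both reduced-order models can be sketched via the SVD of a snapshot matrix (a generic illustration; mesh weighting, centering, and the Galerkin/reservoir stages of the paper are omitted):

```python
import numpy as np

def pod_basis(snapshots, r):
    """snapshots: (n_dof, n_snapshots). Returns the rank-r POD basis
    (first r left singular vectors) and the captured energy fraction."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return U[:, :r], float(energy[r - 1])

rng = np.random.default_rng(1)
# synthetic snapshots living (mostly) in a 3-dimensional subspace
modes = rng.normal(size=(100, 3))
coeffs = rng.normal(size=(3, 40))
X = modes @ coeffs + 1e-6 * rng.normal(size=(100, 40))
basis, captured = pod_basis(X, r=3)
recon = basis @ (basis.T @ X)  # projection onto the reduced space
```

In the paper, the low-dimensional coefficients `basis.T @ X` are then evolved either by Galerkin projection of the Navier-Stokes equations (POD-G) or by a trained reservoir computer (POD-RC).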

[LG-30] Seasoning Generative Models for a Generalization Aftertaste

链接: https://arxiv.org/abs/2603.18817
作者: Hisham Husain,Valentin De Bortoli,Richard Nock
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The use of discriminators to train or fine-tune generative models has proven to be a rather successful framework. A notable example is Generative Adversarial Networks (GANs) that minimize a loss incurred by training discriminators along with other paradigms that boost generative models via discriminators that satisfy weak learner constraints. More recently, even diffusion models have shown advantages with some kind of discriminator guidance. In this work, we extend a strong-duality result related to f-divergences which gives rise to a discriminator-guided recipe that allows us to refine any generative model. We then show that the refined generative models provably improve generalization, compared to their non-refined counterparts. In particular, our analysis reveals that the gap in generalization is improved based on the Rademacher complexity of the discriminator set used for refinement. Our recipe subsumes a recently introduced score-based diffusion approach (Kim et al., 2022) that has shown great empirical success, while allowing us to shed light on the generalization guarantees of this method by virtue of our analysis. Thus, our work provides a theoretical validation for existing work, suggests avenues for new algorithms, and contributes to our understanding of generalization in generative models at large.

[LG-31] Enhancing the Parameterization of Reservoir Properties for Data Assimilation Using Deep VAE-GAN

链接: https://arxiv.org/abs/2603.18766
作者: Marcio Augusto Sampaio,Paulo Henrique Ranazzi,Martin Julian Blunt
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Currently, Iterative Ensemble Smoothers, especially the Ensemble Smoother with Multiple Data Assimilation (ESMDA), can be considered state-of-the-art for history matching in petroleum reservoir simulation. However, this approach has two important limitations: the use of an ensemble with finite size to represent the distributions and the Gaussian assumption in parameter and data uncertainties. The latter is particularly important because many reservoir properties have non-Gaussian distributions. Parameterization involves mapping non-Gaussian parameters to a Gaussian field before the update and then mapping them back to the original domain to forward the ensemble through the reservoir simulator. A promising approach to perform parameterization is through deep learning models. Recent studies have shown that Generative Adversarial Networks (GAN) performed poorly concerning data assimilation, but generated more geologically plausible realizations of the reservoir, while the Variational Autoencoder (VAE) performed better than the GAN in data assimilation, but generated less geologically realistic models. This work is innovative in combining the strengths of both to implement a deep learning model called Variational Autoencoder Generative Adversarial Network (VAE-GAN) integrated with ESMDA. The methodology was applied in two case studies, one case being categorical and the other with continuous values of permeability. Our findings demonstrate that by applying the VAE-GAN model we can obtain high-quality reservoir descriptions (just like GANs) and a good history matching on the production curves (just like VAEs) simultaneously.

[LG-32] Off-Policy Learning with Limited Supply WWW2026

链接: https://arxiv.org/abs/2603.18702
作者: Koichi Tanaka,Ren Kishimoto,Bushun Kawagishi,Yusuke Narita,Yasuo Yamamoto,Nobuyuki Shimizu,Yuta Saito
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at WWW 2026

点击查看摘要

Abstract:We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
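The failure mode of greedy allocation under limited supply can be seen in a two-user toy example (our own construction, not the paper's algorithm; `greedy_alloc` and `relative_alloc` are illustrative names, and the second allocator assumes the fallback item has ample supply, which OPLS does not require):

```python
import numpy as np

def greedy_alloc(rewards, supply):
    """Greedy: each arriving user takes the best still-available item."""
    supply = supply.copy()
    total = 0.0
    for r in rewards:  # users arrive in order
        avail = [i for i in range(len(supply)) if supply[i] > 0]
        i = max(avail, key=lambda j: r[j])
        supply[i] -= 1
        total += r[i]
    return float(total)

def relative_alloc(rewards, supply):
    """Give the scarce item 0 to the users with the largest *relative*
    advantage over their fallback (item 1, assumed amply supplied)."""
    gains = [(r[0] - r[1], t) for t, r in enumerate(rewards)]
    keep = set(t for _, t in sorted(gains, reverse=True)[: supply[0]])
    return float(sum(r[0] if t in keep else r[1]
                     for t, r in enumerate(rewards)))

# item 0 is scarce (1 unit); a later user values it far more
rewards = np.array([[0.6, 0.5], [1.0, 0.1]])
supply = np.array([1, 10])
g = greedy_alloc(rewards, supply)
s = relative_alloc(rewards, supply)
```

Greedy hands the scarce item to user 0 for a tiny gain and strands user 1 (total 0.7), whereas reserving it for the user with the larger relative reward yields 1.5, mirroring the paper's argument that greedy OPL is suboptimal under supply constraints.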

[LG-33] OCP: Orthogonal Constrained Projection for Sparse Scaling in Industrial Commodity Recommendation

链接: https://arxiv.org/abs/2603.18697
作者: Chen Sun,Beilin Xu,Boheng Tan,Jiacheng Wang,Yuefeng Sun,Rite Bo,Ying He,Yaqiang Zang,Pinghua Gong
类目: Machine Learning (cs.LG)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:In industrial commodity recommendation systems, the representation quality of Item-Id vocabularies directly impacts the scalability and generalization ability of recommendation models. A key challenge is that traditional Item-Id vocabularies, when subjected to sparse scaling, suffer from low-frequency information interference, which restricts their expressive power for massive item sets and leads to representation collapse. To address this issue, we propose an Orthogonal Constrained Projection method to optimize embedding representation. By enforcing orthogonality, the projection constrains the backpropagation manifold, aligning the singular value spectrum of the learned embeddings with the orthogonal basis. This alignment ensures high singular entropy, thereby preserving isotropic generalized features while suppressing spurious correlations and overfitting to rare items. Empirical results demonstrate that OCP accelerates loss convergence and enhances the model’s scalability; notably, it enables consistent performance gains when scaling up dense layers. Large-scale industrial deployment on this http URL further confirms its efficacy, yielding a 12.97% increase in UCXR and an 8.9% uplift in GMV, highlighting its robust utility for scaling up both sparse vocabularies and dense architectures.
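A minimal sketch of the idea behind the orthogonality constraint (an assumed soft-penalty form, ||W^T W - I||_F^2, rather than the paper's exact mechanism): driving the penalty down pushes the singular values of the projection toward 1, which raises singular entropy and counteracts representation collapse:

```python
import numpy as np

def ortho_penalty(W):
    """Soft orthogonality penalty ||W^T W - I||_F^2."""
    G = W.T @ W - np.eye(W.shape[1])
    return float(np.sum(G ** 2))

def singular_entropy(W):
    """Shannon entropy of the normalised squared singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    return float(-np.sum(p * np.log(p + 1e-12)))

rng = np.random.default_rng(2)
# start from a partially collapsed projection: 4 strong, 4 tiny columns
W = rng.normal(size=(64, 8)) / np.sqrt(64)
W[:, 4:] *= 1e-2
h_before = singular_entropy(W)
for _ in range(500):
    G = W.T @ W - np.eye(8)
    W -= 0.05 * (4.0 * W @ G)  # exact gradient of the penalty
h_after = singular_entropy(W)
```

After optimisation the spectrum is nearly flat (entropy close to log 8), mirroring the high-singular-entropy, isotropic embeddings the abstract argues for.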

[LG-34] Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend

链接: https://arxiv.org/abs/2603.18680
作者: Yige Liu,Dexuan Xu,Zimai Guo,Yongzhi Cao,Hanpin Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Vertical federated learning (VFL) allows an active party with a top model, and multiple passive parties with bottom models to collaborate. In this scenario, passive parties possessing only features may attempt to infer active party’s private labels, making label inference attacks (LIAs) a significant threat. Previous LIA studies have claimed that well-trained bottom models can effectively represent labels. However, we demonstrate that this view is misleading and exposes the vulnerability of existing LIAs. By leveraging mutual information, we present the first observation of the “model compensation” phenomenon in VFL. We theoretically prove that, in VFL, the mutual information between layer outputs and labels increases with layer depth, indicating that bottom models primarily extract feature information while the top model handles label mapping. Building on this insight, we introduce task reassignment to show that the success of existing LIAs actually stems from the distribution alignment between features and labels. When this alignment is disrupted, the performance of LIAs declines sharply or even fails entirely. Furthermore, the implications of this insight for defenses are also investigated. We propose a zero-overhead defense technique based on layer adjustment. Extensive experiments across five datasets and five representative model architectures indicate that shifting cut layers forward to increase the proportion of top model layers in the entire model not only improves resistance to LIAs but also enhances other defenses.

[LG-35] Enhancing Multi-Corpus Training in SSL-Based Anti-Spoofing Models: Domain-Invariant Feature Extraction

链接: https://arxiv.org/abs/2603.18657
作者: Anh-Tuan Dao,Driss Matrouf,Mickael Rouvier,Nicholas Evans
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The performance of speech spoofing detection often varies across different training and evaluation corpora. Leveraging multiple corpora typically enhances robustness and performance in fields like speaker recognition and speech recognition. However, our spoofing detection experiments show that multi-corpus training does not consistently improve performance and may even degrade it. We hypothesize that dataset-specific biases impair generalization, leading to performance instability. To address this, we propose an Invariant Domain Feature Extraction (IDFE) framework, employing multi-task learning and a gradient reversal layer to minimize corpus-specific information in learned embeddings. The IDFE framework reduces the average equal error rate by 20% compared to the baseline, assessed across four varied datasets.
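The gradient reversal layer at the heart of such domain-adversarial training is tiny; here is a framework-free sketch of its forward/backward behaviour (illustrative class name; λ = 0.5 chosen arbitrarily):

```python
import numpy as np

class GradReverse:
    """Identity in the forward pass; scales the gradient by -lambda in
    the backward pass, so minimising the corpus classifier's loss
    *maximises* it with respect to the feature extractor."""
    def __init__(self, lam=1.0):
        self.lam = lam
    def forward(self, x):
        return x
    def backward(self, grad_out):
        return -self.lam * grad_out

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)                           # features pass unchanged
g = grl.backward(np.array([0.1, 0.2, 0.3]))  # gradient sign is flipped
```

In the IDFE setting, placing such a layer between the embedding extractor and the corpus classifier is what drives corpus-specific information out of the learned embeddings.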

[LG-36] Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle

链接: https://arxiv.org/abs/2603.18642
作者: Kevin Song
类目: Machine Learning (cs.LG)
*备注: 23 pages, 2 figures, 3 tables, 6 supplementary figures

点击查看摘要

Abstract:Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands, outperforming CEM (39.46%, 7.5x10^6 evaluations) and SPSA (38.63%, 4.8x10^6 evaluations). However, all methods exhibited substantial cell-conditional regret, indicating persistent policy-level errors despite smooth reward convergence. This gap shows that tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging, while aggregate reward curves can obscure critical local failures. As a negative control, it was proven and empirically confirmed that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum. In addition, larger wagers strictly increased volatility and ruin without improving expectation. These results highlight the need for exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.
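The masked-softmax policy used by masked REINFORCE can be sketched directly (our formulation of the standard construction; the blackjack state encoding and per-cell baseline are omitted): illegal actions receive probability exactly zero and contribute no gradient:

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax over legal actions only; masked entries get probability 0."""
    z = np.where(mask, logits, -np.inf)
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(logits, mask, action, advantage):
    """Gradient of advantage * log pi(action) w.r.t. the logits."""
    p = masked_softmax(logits, mask)
    g = -p * advantage
    g[action] += advantage
    g[~mask] = 0.0  # masked actions carry no gradient
    return g

logits = np.array([0.5, 1.5, -0.5, 0.0])
mask = np.array([True, True, False, True])  # e.g. "split" is illegal here
p = masked_softmax(logits, mask)
g = reinforce_grad(logits, mask, action=1, advantage=2.0)
```

The gradient sums to zero over the action simplex and vanishes identically on masked actions, which is the property that lets the tabular policy respect the dynamic action mask.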

[LG-37] Cyber-Resilient Digital Twins: Discriminating Attacks for Safe Critical Infrastructure Control

链接: https://arxiv.org/abs/2603.18613
作者: Mohammadhossein Homaei,Iman Khazrak,Rubén Molano,Andrés Caro,Mar Ávila
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 19 Pages, 2 Figures, 12 Tables

点击查看摘要

Abstract:Industrial Cyber-Physical Systems (ICPS) face growing threats from cyber-attacks that exploit sensor and control vulnerabilities. Digital Twin (DT) technology can detect anomalies via predictive modelling, but current methods cannot distinguish attack types and often rely on costly full-system shutdowns. This paper presents i-SDT (intelligent Self-Defending DT), combining hydraulically-regularized predictive modelling, multi-class attack discrimination, and adaptive resilient control. Temporal Convolutional Networks (TCNs) with differentiable conservation constraints capture nominal dynamics and improve robustness to adversarial manipulations. A recurrent residual encoder with Maximum Mean Discrepancy (MMD) separates normal operation from single- and multi-stage attacks in latent space. When attacks are confirmed, Model Predictive Control (MPC) uses uncertainty-aware DT predictions to keep operations safe without shutdown. Evaluation on SWaT and WADI datasets shows major gains in detection accuracy, 44.1% fewer false alarms, and 56.3% lower operational costs in simulation-in-the-loop evaluation. With sub-second inference latency confirming real-time feasibility on plant-level workstations, i-SDT advances autonomous cyber-physical defense while maintaining operational resilience.

[LG-38] Breaking Hard Isomorphism Benchmarks with DRESS

链接: https://arxiv.org/abs/2603.18582
作者: Eduar Castrillo Velilla
类目: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we study the single-deletion variant Δ-DRESS, part of the broader DRESS framework. We demonstrate empirically that Δ-DRESS, a single level of vertex deletion applied to the DRESS graph fingerprint, achieves unique fingerprints within each tested SRG parameter family across all 51,718 non-isomorphic strongly regular graphs (SRGs) considered, spanning 16 parameter families: the complete Spence collection (12 families, 43,703 graphs on up to 64 vertices) plus four additional SRG families with up to 4,466 graphs per family. Combined with 18 additional hard graph families (102 graphs including Miyazaki, Chang, Paley, Latin square, and Steiner constructions), Δ-DRESS achieves 100% within-family separation across 34 benchmark families covering 51,816 distinct graph instances, implicitly resolving over 576 million within-family non-isomorphic pairs. Moreover, the classical Rook L_2(4) vs. Shrikhande pair, SRG(16,6,2,2), is known to be indistinguishable by the original 3-WL algorithm, yet Δ-DRESS separates it, proving that Δ-DRESS escapes the theoretical boundaries of 3-WL. The method runs in polynomial time O(n · I · m · d_max) per graph; a streamed implementation of the combined fingerprint uses O(m + B + n) memory, where B is the number of histogram bins, while the experiments reported here additionally retain the full deleted-subgraph multiset matrix for post-hoc analysis.

[LG-39] WarPGNN: A Parametric Thermal Warpage Analysis Framework with Physics-aware Graph Neural Network

链接: https://arxiv.org/abs/2603.18581
作者: Haotian Lu,Jincong Lu,Sachin Sachdeva,Sheldon X.-D. Tan
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 Pages, ACM format

点击查看摘要

Abstract:With the advent of system-in-package (SiP) chiplet-based design and heterogeneous 2.5D/3D integration, thermal-induced warpage has become a critical reliability concern. While conventional numerical approaches can deliver highly accurate results, they often incur prohibitively high computational costs, limiting their scalability for complex chiplet-package systems. In this paper, we present WarPGNN, an efficient and accurate parametric thermal warpage analysis framework powered by Graph Neural Networks (GNNs). By operating directly on graphs constructed from the floorplans, WarPGNN enables fast warpage-aware floorplan exploration and exhibits strong transferability across diverse package configurations. Our method first encodes multi-die floorplans into reduced Transitive Closure Graphs (rTCGs), then a Graph Convolution Network (GCN)-based encoder extracts hierarchical structural features, followed by a U-Net inspired decoder that reconstructs warpage maps from graph feature embeddings. Furthermore, to address the long-tailed pattern of warpage data distribution, we developed a physics-informed loss and revised a message-passing encoder based on Graph Isomorphic Network (GIN) that further enhance learning performance for extreme cases and expressiveness of graph embeddings. Numerical results show that WarPGNN achieves more than 205.91x speedup compared with the 2-D efficient FEM-based method and over 119766.64x acceleration with the 3-D FEM method COMSOL, respectively, while maintaining comparable accuracy at only 1.26% full-scale normalized RMSE and 2.21% warpage value error. Compared with a recent DeepONet-based model, our method achieved comparable prediction accuracy and inference speedup with 3.4x lower training time. In addition, WarPGNN demonstrates remarkable transferability on unseen datasets with up to 3.69% normalized RMSE and similar runtime.

[LG-40] Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks

链接: https://arxiv.org/abs/2603.18570
作者: Jiahao Zhang,Yilong Wang,Suhang Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are widely used for learning from graph-structured data in domains such as social networks, recommender systems, and financial platforms. To comply with privacy regulations like the GDPR, CCPA, and PIPEDA, approximate graph unlearning, which aims to remove the influence of specific data points from trained models without full retraining, has become an increasingly important component of trustworthy graph learning. However, approximate unlearning often incurs subtle performance degradation, which may incur negative and unintended side effects. In this work, we show that such degradations can be amplified into adversarial attacks. We introduce the notion of \textbfunlearning corruption attacks, where an adversary injects carefully chosen nodes into the training graph and later requests their deletion. Because deletion requests are legally mandated and cannot be denied, this attack surface is both unavoidable and stealthy: the model performs normally during training, but accuracy collapses only after unlearning is applied. Technically, we formulate this attack as a bi-level optimization problem: to overcome the challenges of black-box unlearning and label scarcity, we approximate the unlearning process via gradient-based updates and employ a surrogate model to generate pseudo-labels for the optimization. Extensive experiments across benchmarks and unlearning algorithms demonstrate that small, carefully designed unlearning requests can induce significant accuracy degradation, raising urgent concerns about the robustness of GNN unlearning under real-world regulatory demands. The source code will be released upon paper acceptance.

[LG-41] SINDy-KANs: Sparse identification of non-linear dynamics through Kolmogorov-Arnold networks

链接: https://arxiv.org/abs/2603.18548
作者: Amanda A. Howard,Nicholas Zolman,Bruno Jacob,Steven L. Brunton,Panos Stinis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold networks (KANs) have arisen as a potential way to enhance the interpretability of machine learning. However, solutions learned by KANs are not necessarily interpretable, in the sense of being sparse or parsimonious. Sparse identification of nonlinear dynamics (SINDy) is a complementary approach that allows for learning sparse equations for dynamical systems from data; however, the learned equations are limited by the candidate library. In this work, we present SINDy-KANs, which simultaneously train a KAN and a SINDy-like representation, applying SINDy at the level of each activation function to increase the interpretability of KAN representations while maintaining the function compositions possible through deep KANs. We apply our method to a number of symbolic regression tasks, including dynamical systems, demonstrating accurate equation discovery across a range of systems.
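
The sparse-regression step underlying SINDy can be sketched with sequential thresholded least squares (STLSQ): fit coefficients over a candidate library, zero out small terms, and refit on the survivors. Everything below (the toy system dx/dt = -2x, the three-term library, and the threshold) is an illustrative assumption, not code from the paper, and the per-activation coupling with a KAN is not reproduced.

```python
def lstsq(A, b):
    """Least squares via the normal equations (A^T A) c = A^T b, Gaussian elimination."""
    m, n = len(A), len(A[0])
    M = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    v = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))  # partial pivoting
        M[i], M[p] = M[p], M[i]
        v[i], v[p] = v[p], v[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n):
                M[r][c] -= f * M[i][c]
            v[r] -= f * v[i]
    c = [0.0] * n
    for i in reversed(range(n)):
        c[i] = (v[i] - sum(M[i][j] * c[j] for j in range(i + 1, n))) / M[i][i]
    return c

def stlsq(theta, dx, threshold=0.1, iters=5):
    """Sequential thresholded least squares: fit, zero small coefficients, refit."""
    n = len(theta[0])
    active = list(range(n))
    coef = [0.0] * n
    for _ in range(iters):
        sub = [[row[j] for j in active] for row in theta]
        for j, cj in zip(active, lstsq(sub, dx)):
            coef[j] = cj
        new_active = [j for j in active if abs(coef[j]) >= threshold]
        for j in active:
            if j not in new_active:
                coef[j] = 0.0
        if new_active == active:
            break
        active = new_active
    return coef

# Toy system dx/dt = -2x with candidate library [x, x^2, 1].
xs = [0.1 * i for i in range(1, 30)]
theta = [[x, x * x, 1.0] for x in xs]
dx = [-2.0 * x for x in xs]
coef = stlsq(theta, dx)
print(coef)  # sparse: only the x term survives, with coefficient near -2
```

SINDy-KANs would run this kind of sparse fit per activation function rather than once globally, so the thresholding is what produces the parsimonious symbolic form.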

[LG-42] HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage

链接: https://arxiv.org/abs/2603.18546
作者: Khushiyant
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 12 Pages, 8 Figures

点击查看摘要

Abstract:This paper transfers three statistical methods from particle physics to multirotor propeller fault detection: the likelihood ratio test (LRT) for binary detection, the CLs modified frequentist method for false alarm rate control, and sequential neural posterior estimation (SNPE) for quantitative fault characterization. Operating on spectral features tied to rotor harmonic physics, the system returns three outputs: binary detection, controlled false alarm rates, and calibrated posteriors over fault severity and motor location. On UAV-FD, a hexarotor dataset of 18 real flights with 5% and 10% blade damage, leave-one-flight-out cross-validation gives AUC 0.862 ± 0.007 (95% CI: 0.849–0.876), outperforming CUSUM (0.708 ± 0.010), autoencoder (0.753 ± 0.009), and LSTM autoencoder (0.551). At a 5% false alarm rate the system detects 93% of significant and 81% of subtle blade damage. On PADRE, a quadrotor platform, AUC reaches 0.986 after refitting only the generative models. SNPE gives a full posterior over fault severity (90% credible interval coverage 92–100%, MAE 0.012), so the output includes uncertainty rather than just a point estimate or fault flag. Per-flight sequential detection achieves 100% fault detection with 94% overall accuracy.
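
The CLs construction mentioned above can be sketched under one common convention, CLs = CL_{s+b} / CL_b, using Gaussian test-statistic distributions: dividing by CL_b prevents an insensitive experiment (overlapping hypotheses) from producing spurious exclusions. The Gaussian assumption and all numbers are toy choices, not the paper's fault models.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cls(q_obs, mu_sb, mu_b, sigma=1.0):
    """CLs = CL_{s+b} / CL_b for a Gaussian test statistic q (toy convention)."""
    cl_sb = norm_cdf((q_obs - mu_sb) / sigma)  # compatibility with "fault" (signal+background)
    cl_b = norm_cdf((q_obs - mu_b) / sigma)    # compatibility with "healthy" (background-only)
    return cl_sb / cl_b

# Well-separated hypotheses and healthy-looking data: CLs is small, so the
# fault hypothesis can be excluded with a controlled false alarm rate.
healthy_like = cls(q_obs=0.5, mu_sb=3.0, mu_b=0.0)
# Data right on the fault hypothesis: CLs stays large, no exclusion.
ambiguous = cls(q_obs=3.0, mu_sb=3.0, mu_b=0.0)
print(round(healthy_like, 4), round(ambiguous, 4))
```

A decision rule then excludes the fault hypothesis only when CLs falls below a chosen level such as 0.05.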

[LG-43] GAPSL: A Gradient-Aligned Parallel Split Learning on Heterogeneous Data

链接: https://arxiv.org/abs/2603.18540
作者: Zheng Lin,Ons Aouedi,Wei Ni,Symeon Chatzinotas,Xianhao Chen
类目: Machine Learning (cs.LG)
*备注: 13 pages, 21 figures

点击查看摘要

Abstract:The increasing complexity of neural networks poses significant challenges for democratizing federated learning (FL) on resource-constrained client devices. Parallel split learning (PSL) has emerged as a promising solution by offloading substantial computing workload to a server via model partitioning, shrinking the client-side computing load, and eliminating client-side model aggregation for reduced communication and deployment costs. Since PSL is aggregation-free, it suffers from severe training divergence stemming from gradient directional inconsistency across clients. To address this challenge, we propose GAPSL, a gradient-aligned PSL framework that comprises two key components: leader gradient identification (LGI) and gradient direction alignment (GDA). LGI dynamically selects a set of directionally consistent client gradients to construct a leader gradient that captures the global convergence trend. GDA employs a direction-aware regularization to align each client’s gradient with the leader gradient, thereby mitigating inter-device gradient directional inconsistency and enhancing model convergence. We evaluate GAPSL on a prototype computing testbed. Extensive experiments demonstrate that GAPSL consistently outperforms state-of-the-art benchmarks in training accuracy and latency.
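
The two GAPSL components described above can be sketched in a few lines: LGI keeps only client gradients whose direction agrees with the mean and averages them into a leader, while GDA scores each client's misalignment against that leader. The toy 2-D gradients, the cosine cutoff, and the specific penalty form (1 - cosine similarity) are illustrative assumptions, not the paper's exact design.

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def leader_gradient(grads, cutoff=0.0):
    """LGI sketch: average only the client gradients aligned with the mean direction."""
    dim = len(grads[0])
    mean = [sum(g[i] for g in grads) / len(grads) for i in range(dim)]
    kept = [g for g in grads if cos_sim(g, mean) > cutoff]
    return [sum(g[i] for g in kept) / len(kept) for i in range(dim)]

def alignment_penalty(grad, leader):
    """GDA sketch: direction-aware regularizer, here 1 - cosine similarity."""
    return 1.0 - cos_sim(grad, leader)

# Two consistent clients and one divergent client (toy 2-D gradients).
grads = [[1.0, 0.9], [0.8, 1.1], [-1.0, -1.0]]
leader = leader_gradient(grads)
penalties = [alignment_penalty(g, leader) for g in grads]
print(leader, [round(p, 3) for p in penalties])
```

The divergent third client receives a penalty near 2 (opposite direction), so its update would be regularized toward the global convergence trend.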

[LG-44] SatCR: Graph-Empowered Joint Onboard Computing and Routing for LEO Data Delivery

链接: https://arxiv.org/abs/2603.18539
作者: Jiangtao Luo,Bingbing Xu,Shaohua Xia,Yongyi Ran
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Sending massive Earth observation data produced by low Earth orbit (LEO) satellites back to the ground for processing consumes a large amount of on-orbit bandwidth and exacerbates the space-to-ground link bottleneck. Most prior work has concentrated on optimizing the routing of raw data within the constellation, yet cannot cope with the surge in data volume. Recently, advances in onboard computing have made it possible to process data in situ, thus significantly reducing the data volume to be transmitted. In this paper, we present iSatCR, a distributed graph-based approach that jointly optimizes onboard computing and routing to boost transmission efficiency. Within iSatCR, we design a novel graph embedding utilizing shifted feature aggregation and distributed message passing to capture satellite states, and then propose a distributed graph-based deep reinforcement learning algorithm that derives joint computing-routing strategies under constrained on-board storage to handle the complexity and dynamics of LEO networks. Extensive experiments show iSatCR outperforms baselines, particularly under high load.

[LG-45] Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning

链接: https://arxiv.org/abs/2603.18538
作者: Sheng Pan,Niansheng Tang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Decentralized Federated Learning (DFL) remains highly vulnerable to adaptive backdoor attacks designed to bypass traditional passive defense metrics. To address this limitation, we shift the defensive paradigm toward a novel active, interventional auditing framework. First, we establish a dynamical model to characterize the spatiotemporal diffusion of adversarial updates across complex graph topologies. Second, we introduce a suite of proactive auditing metrics: stochastic entropy anomaly, randomized-smoothing Kullback-Leibler divergence, and activation kurtosis. These metrics utilize private probes to stress-test local models, effectively exposing latent backdoors that remain invisible to conventional static detection. Furthermore, we implement a topology-aware defense placement strategy to maximize global aggregation resilience. We provide theoretical guarantees for the system’s convergence under co-evolving attack and defense dynamics. Empirical evaluations across diverse architectures demonstrate that our active framework is highly competitive with state-of-the-art defenses in mitigating stealthy, adaptive backdoors while preserving primary task utility.
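
One of the auditing metrics named above, activation kurtosis, is easy to illustrate: a backdoored neuron that fires only on a hidden trigger tends to produce heavy-tailed, spiky activations under probe inputs. The probe activations below are toy numbers, and the thresholding policy on the kurtosis value is left unspecified, as in the abstract.

```python
def kurtosis(xs):
    """Raw (non-excess) kurtosis: about 3 for Gaussian data, large for spiky data."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / (var ** 2)

normal_acts = [-1.0, -0.5, 0.0, 0.5, 1.0]  # well-behaved probe activations
spiky_acts = [0.0] * 9 + [10.0]            # a latent trigger neuron firing rarely
print(kurtosis(normal_acts), kurtosis(spiky_acts))
```

An auditor would flag local models whose probe-time kurtosis is anomalously high relative to peers.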

[LG-46] Data-efficient pre-training by scaling synthetic megadocs

链接: https://arxiv.org/abs/2603.18534
作者: Konwoo Kim,Suhas Kotha,Yejin Choi,Tatsunori Hashimoto,Nick Haber,Percy Liang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near 1.48× data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from 1.48× to 1.80× at 32 generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.

[LG-47] AIMER: Calibration-Free Task-Agnostic MoE Pruning

链接: https://arxiv.org/abs/2603.18492
作者: Zongfang Liu,Shengkun Tang,Yifan Shen,Huan Wang,Xin Yuan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token compute, but deployment still requires storing all experts, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, which makes pruning outcomes sensitive to the choice of calibration set and adds substantial preprocessing cost. We introduce AIMER (**A**bsolute mean over root mean square **IM**portance for **E**xpert **R**anking), a simple calibration-free criterion that yields clear within-layer score separation and distinct expert stratification. Across 7B to 30B MoE language models at 25% and 50% pruning ratios over 16 benchmarks, AIMER consistently delivers competitive or stronger overall performance against state-of-the-art calibration-based expert pruning baselines with only 0.22–1.27 seconds for scoring the experts.
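
Read literally, the acronym suggests a per-expert score of mean absolute weight divided by root-mean-square weight, computable from parameters alone with no calibration data. The sketch below is that literal reading with toy expert weights; the exact normalization, aggregation across an expert's matrices, and which end of the ranking gets pruned are assumptions not spelled out in the abstract.

```python
import math

def aimer_score(weights):
    """Absolute mean over root mean square of an expert's weights (calibration-free)."""
    abs_mean = sum(abs(w) for w in weights) / len(weights)
    rms = math.sqrt(sum(w * w for w in weights) / len(weights))
    return abs_mean / rms  # in (0, 1]: 1 for uniform magnitudes, lower for spiky ones

dense_expert = [0.5, -0.5, 0.5, -0.5]    # uniform weight magnitudes
sparse_expert = [2.0, 0.01, -0.01, 0.01]  # a few dominant weights
print(aimer_score(dense_expert), aimer_score(sparse_expert))
```

Because the ratio depends only on the weight-magnitude distribution, the within-layer scores separate experts without any forward passes, which is consistent with the sub-second scoring times reported.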

[LG-48] AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

链接: https://arxiv.org/abs/2603.18464
作者: Chengxuan Lu,Shukuan Wang,Yanjie Li,Wei Liu,Shiji Jin,Fuyuan Qian,Peiming Li,Baigui Sun,Yang Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models faces significant challenges in computational efficiency and data acquisition. We propose AcceRL, a fully asynchronous and decoupled RL framework designed to eliminate synchronization barriers by physically isolating training, inference, and rollouts. Crucially, AcceRL is the first to integrate a plug-and-play, trainable world model into a distributed asynchronous RL pipeline to generate virtual experiences. Experiments on the LIBERO benchmark demonstrate that AcceRL achieves state-of-the-art (SOTA) performance. Systematically, it exhibits super-linear scaling in throughput and highly efficient hardware utilization. Algorithmically, the world-model-augmented variant delivers unprecedented sample efficiency and robust training stability in complex control tasks.

[LG-49] Seeking Universal Shot Language Understanding Solutions

链接: https://arxiv.org/abs/2603.18448
作者: Haoxin Liu,Harshavardhan Kamarthi,Zhiyuan Zhao,Hongjie Chen,B. Aditya Prakash
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Shot language understanding (SLU) is crucial for cinematic analysis but remains challenging due to its diverse cinematographic dimensions and subjective expert judgment. While vision-language models (VLMs) have shown strong ability in general visual understanding, recent studies reveal judgment discrepancies between VLMs and film experts on SLU tasks. To address this gap, we introduce SLU-SUITE, a comprehensive training and evaluation suite containing 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions. Using SLU-SUITE, we obtain two original insights into VLM-based SLU: from the model side, diagnosing key bottlenecks of modules; and from the data side, quantifying cross-dimensional influences among tasks. These findings motivate our universal SLU solutions from two complementary paradigms: UniShot, a balanced one-for-all generalist trained via dynamic-balanced data mixing, and AgentShots, a prompt-routed expert cluster that maximizes peak dimension performance. Extensive experiments show that our models outperform task-specific ensembles on in-domain tasks and surpass leading commercial VLMs by 22% on out-of-domain tasks.

[LG-50] MLOW: Interpretable Low-Rank Frequency Magnitude Decomposition of Multiple Effects for Time Series Forecasting

链接: https://arxiv.org/abs/2603.18432
作者: Runze Yang,Longbing Cao,Xiaoming Wu,Xin You,Kun Fang,Jianxun Li,Jie Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Separating multiple effects in time series is fundamental yet challenging for time-series forecasting (TSF). However, existing TSF models cannot effectively learn interpretable multi-effect decompositions with their smoothing-based temporal techniques. Here, a new interpretable frequency-based decomposition pipeline, MLOW, builds on the insight that a time series can be represented as a magnitude spectrum multiplied by the corresponding phase-aware basis functions, and that the magnitude spectrum distribution of a time series always exhibits observable patterns for different effects. MLOW learns a low-rank representation of the magnitude spectrum to capture dominant trending and seasonal effects. We explore low-rank methods, including PCA, NMF, and Semi-NMF, and find that none can simultaneously achieve interpretable, efficient, and generalizable decomposition. Thus, we propose hyperplane-nonnegative matrix factorization (Hyperplane-NMF). Further, to address the frequency (spectral) leakage restricting high-quality low-rank decomposition, MLOW enables a flexible selection of input horizons and frequency levels via a mathematical mechanism. Visual analysis demonstrates that MLOW enables interpretable and hierarchical multiple-effect decomposition that is robust to noise. It can also be plugged into existing TSF backbones, yielding remarkable performance improvements with minimal architectural modifications.
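
The representation MLOW builds on (magnitude spectrum times phase-aware cosine bases) can be verified directly: a real series of even length n equals a weighted sum of mag_k · cos(2πkt/n + phase_k) over frequencies 0..n/2. The sketch below checks that identity on a toy sine-plus-offset series; the low-rank Hyperplane-NMF factorization of the magnitudes is not reproduced here.

```python
import cmath
import math

def magnitude_phase_reconstruct(x):
    """DFT a real series of even length, then rebuild it from magnitudes and phases."""
    n = len(x)
    coeffs = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
              for k in range(n // 2 + 1)]
    mags = [abs(c) for c in coeffs]
    phases = [cmath.phase(c) for c in coeffs]
    recon = []
    for t in range(n):
        s = 0.0
        for k, (m, p) in enumerate(zip(mags, phases)):
            w = 1.0 if k in (0, n // 2) else 2.0  # interior bins count conjugate pairs
            s += w * m * math.cos(2 * math.pi * k * t / n + p)
        recon.append(s / n)
    return mags, recon

x = [math.sin(2 * math.pi * t / 8) + 0.5 for t in range(8)]
mags, recon = magnitude_phase_reconstruct(x)
err = max(abs(a - b) for a, b in zip(x, recon))
print(mags[:2], err)  # dominant DC and fundamental magnitudes; near-zero error
```

MLOW's decomposition operates on the `mags` vector: a low-rank factorization of such magnitude spectra across windows separates trending from seasonal effects while the phases carry the temporal placement.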

[LG-51] Towards Noise-Resilient Quantum Multi-Armed and Stochastic Linear Bandits

链接: https://arxiv.org/abs/2603.18431
作者: Zhuoyue Chen,Kechao Cai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum multi-armed bandits (MAB) and stochastic linear bandits (SLB) have recently attracted significant attention, as their quantum counterparts can achieve quadratic speedups over classical MAB and SLB. However, most existing quantum MAB algorithms assume ideal quantum Monte Carlo (QMC) procedures on noise-free circuits, overlooking the impact of noise in current noisy intermediate-scale quantum (NISQ) devices. In this paper, we study a noise-robust QMC algorithm that improves estimation accuracy when querying quantum reward oracles. Building on this estimator, we propose noise-robust QMAB and QSLB algorithms that enhance performance in noisy environments while preserving the advantage over classical methods. Experiments show that our noise-robust approach improves QMAB estimation accuracy and reduces regret under several quantum noise models.

[LG-52] FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra

链接: https://arxiv.org/abs/2603.18397
作者: Jianan Nie,Peng Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mass spectrometry (MS) stands as a cornerstone analytical technique for molecular identification, yet de novo structure elucidation from spectra remains challenging due to the combinatorial complexity of chemical space and the inherent ambiguity of spectral fragmentation patterns. Recent deep learning approaches, including autoregressive sequence models, scaffold-based methods, and graph diffusion models, have made progress. However, diffusion-based generation for this task remains computationally demanding. Meanwhile, discrete flow matching, which has shown strong performance for graph generation, has not yet been explored for spectrum-conditioned structure elucidation. In this work, we introduce FlowMS, the first discrete flow matching framework for spectrum-conditioned de novo molecular generation. FlowMS generates molecular graphs through iterative refinement in probability space, enforcing chemical formula constraints while conditioning on spectral embeddings from a pretrained formula transformer encoder. Notably, it achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark: 9.15% top-1 accuracy (9.7% relative improvement over DiffMS) and 7.96 top-10 MCES (4.2% improvement over MS-BART). We also visualize the generated molecules, which further demonstrate that FlowMS produces structurally plausible candidates closely resembling ground truth structures. These results establish discrete flow matching as a promising paradigm for mass spectrometry-based structure elucidation in metabolomics and natural product discovery.

[LG-53] RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

链接: https://arxiv.org/abs/2603.18396
作者: Yifan Zhang,Liang Zheng
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.

[LG-54] Computational and Statistical Hardness of Calibration Distance

链接: https://arxiv.org/abs/2603.18391
作者: Mingda Qiao
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The distance from calibration, introduced by Błasiok, Gopalan, Hu, and Nakkiran (STOC 2023), has recently emerged as a central measure of miscalibration for probabilistic predictors. We study the fundamental problems of computing and estimating this quantity, given either an exact description of the data distribution or only sample access to it. We give an efficient algorithm that exactly computes the calibration distance when the distribution has a uniform marginal and noiseless labels, which improves the $O(1/\sqrt{|\mathcal{X}|})$ additive approximation of Qiao and Zheng (COLT 2024) for this special case. Perhaps surprisingly, the problem becomes NP-hard when either of the two assumptions is removed. We extend our algorithm to a polynomial-time approximation scheme for the general case. For the estimation problem, we show that $\Theta(1/\epsilon^3)$ samples are sufficient and necessary for the empirical calibration distance to be upper bounded by the true distance plus $\epsilon$. In contrast, a polynomial dependence on the domain size, incurred by the learning-based baseline, is unavoidable for two-sided estimation. Our positive results are based on simple sparsifications of both the distribution and the target predictor, which significantly reduce the search space for computation and lead to stronger concentration for the estimation problem. To prove the hardness results, we introduce new techniques for certifying lower bounds on the calibration distance, a problem that is hard in general due to its co-NP-completeness.

[LG-55] Mathematical Foundations of Deep Learning

链接: https://arxiv.org/abs/2603.18387
作者: Xiaojing Ye
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Draft version. Final version is published in “Chapman Hall/CRC Mathematics and Artificial Intelligence Series” by Taylor Francis in 2026

点击查看摘要

Abstract:This draft book offers a comprehensive and rigorous treatment of the mathematical principles underlying modern deep learning. The book spans core theoretical topics, from the approximation capabilities of deep neural networks, the theory and algorithms of optimal control and reinforcement learning integrated with deep learning techniques, to contemporary generative models that drive today’s advances in artificial intelligence.

[LG-56] A Family of Adaptive Activation Functions for Mitigating Failure Modes in Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2603.18328
作者: Krishna Murari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) are a powerful and flexible learning framework that has gained significant attention in recent years and demonstrated strong performance across a wide range of scientific and engineering problems. In parallel, wavelets have been extensively used as efficient computational tools due to their strong approximation capabilities. Motivated by the common failure modes observed in standard PINNs, this work introduces a novel family of adaptive wavelet-based activation functions. The proposed activation functions significantly improve training stability and expressive power by combining trainable wavelet functions with either trainable or fixed hyperbolic tangent and softplus functions. Five distinct activation functions are developed within the PINN framework and systematically evaluated across four representative classes of partial differential equations (PDEs). Comprehensive comparisons using bar plots demonstrate improved robustness and accuracy compared to traditional activation functions. Furthermore, the proposed approach is validated through direct comparisons with baseline PINNs, transformer-based architectures such as PINNsFormer, and other deep learning models, highlighting its effectiveness and generality.
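
One member of such a family can be sketched as a trainable wavelet term added to a fixed hyperbolic tangent. The Mexican-hat wavelet and the parameterization (amplitude a, dilation w, shift b) below are illustrative assumptions; the paper's five specific functions are not reproduced.

```python
import math

def mexican_hat(x):
    """Mexican-hat (Ricker) wavelet, up to normalization."""
    return (1.0 - x * x) * math.exp(-0.5 * x * x)

def adaptive_activation(x, a=0.5, w=1.0, b=0.0):
    """Trainable wavelet term (a, w, b would be learned) plus a fixed tanh term."""
    return a * mexican_hat(w * x + b) + math.tanh(x)

print(adaptive_activation(0.0), adaptive_activation(1.0))
```

In a PINN, (a, w, b) would be optimized jointly with the network weights, letting each activation adapt its localized oscillatory component while the tanh term preserves the usual smooth global behavior.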

[LG-57] Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration

链接: https://arxiv.org/abs/2603.18326
作者: Amirhossein Roknilamouki,Arnob Ghosh,Eylem Ekici,Ness B. Shroff
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While offline reinforcement learning provides reliable policies for real-world deployment, its inherent pessimism severely restricts an agent’s ability to explore and collect novel data online. Drawing inspiration from safe reinforcement learning, exploring near the boundary of regions well covered by the offline dataset and reliably modeled by the simulator allows an agent to take manageable risks–venturing into informative but moderate-uncertainty states while remaining close enough to familiar regions for safe recovery. However, naively rewarding this boundary-seeking behavior can lead to a degenerate parking behavior, where the agent simply stops once it reaches the frontier. To solve this, we propose a novel vector-field reward shaping paradigm designed to induce continuous, safe boundary exploration for non-adaptive deployed policies. Operating on an uncertainty oracle trained from offline data, our reward combines two complementary components: a gradient-alignment term that attracts the agent toward a target uncertainty level, and a rotational-flow term that promotes motion along the local tangent plane of the uncertainty manifold. Through theoretical analysis, we show that this reward structure naturally induces sustained exploratory behavior along the boundary while preventing degenerate solutions. Empirically, by integrating our proposed reward shaping with Soft Actor-Critic on a 2D continuous navigation task, we validate that agents successfully traverse uncertainty boundaries while balancing safe, informative data collection with primary task completion.

[LG-58] Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

链接: https://arxiv.org/abs/2603.18325
作者: Nived Rajaraman,Audrey Huang,Miro Dudik,Robert Schapire,Dylan J. Foster,Akshay Krishnamurthy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 39 pages, 4 figures

点击查看摘要

Abstract:Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
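
The core selection rule behind an autocurriculum, training on the prompts the current model fails, can be sketched in a few lines. The success rates and batch size below are toy stand-ins; the paper's boosting-style analysis and teacher-supervision mechanics are not reproduced.

```python
def autocurriculum_batch(success_rates, k=2):
    """Focus the next batch on the k prompts with the lowest current success rate."""
    ranked = sorted(success_rates, key=success_rates.get)
    return ranked[:k]

# Per-prompt success estimated from the model's own rollouts (toy numbers).
rates = {"p1": 0.95, "p2": 0.10, "p3": 0.55, "p4": 0.02}
print(autocurriculum_batch(rates))  # ['p4', 'p2']
```

After each round the success rates are re-estimated, so the curriculum adapts as previously hard prompts are mastered.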

[LG-59] ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis

链接: https://arxiv.org/abs/2603.18299
作者: Zhanqi Zhang,Shun Li,Bernardo L. Sabatini,Mikio Aoi,Gal Mishne
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Intracortical brain-computer interfaces (BCIs) can decode speech from neural activity with high accuracy when trained on data pooled across recording sessions. In realistic deployment, however, models must generalize to new sessions without labeled data, and performance often degrades due to cross-session nonstationarities (e.g., electrode shifts, neural turnover, and changes in user strategy). In this paper, we propose ALIGN, a session-invariant learning framework based on multi-domain adversarial neural networks for semi-supervised cross-session adaptation. ALIGN trains a feature encoder jointly with a phoneme classifier and a domain classifier operating on the latent representation. Through adversarial optimization, the encoder is encouraged to preserve task-relevant information while suppressing session-specific cues. We evaluate ALIGN on intracortical speech decoding and find that it generalizes consistently better to previously unseen sessions, improving both the phoneme error rate and the word error rate relative to baselines. These results indicate that adversarial domain alignment is an effective approach for mitigating session-level distribution shift and enabling robust longitudinal BCI decoding.

[LG-60] Path-Constrained Mixture-of-Experts

链接: https://arxiv.org/abs/2603.18297
作者: Zijin Gu,Tatiana Likhomanenko,Vimal Thilak,Jason Ramapuram,Navdeep Jaitly
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling by activating only a subset of parameters for each input. However, conventional MoE routing selects each layer’s experts independently, creating N^L possible expert paths for N experts across L layers. This far exceeds typical training set sizes, leading to statistical inefficiency, as the model may not learn meaningful structure over such a vast path space. To constrain it, we propose PathMoE, which shares router parameters across consecutive layers. Experiments on 0.9B and 16B parameter models demonstrate consistent improvements on perplexity and downstream tasks over independent routing, while eliminating the need for auxiliary load balancing losses. Analysis reveals that tokens following the same path naturally cluster by linguistic function, with PathMoE producing more concentrated groups, better cross-layer consistency, and greater robustness to routing perturbations. These results offer a new perspective for understanding MoE architectures through the lens of expert paths.
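
The N^L blow-up motivating the paper is easy to make concrete with arithmetic. The model sizes and corpus size below are illustrative assumptions, not the paper's 0.9B/16B configurations.

```python
N = 8                    # experts per layer (illustrative)
L = 24                   # number of MoE layers (illustrative)
paths = N ** L           # expert paths under independent per-layer routing
tokens = 15 * 10 ** 12   # an assumed trillion-token-scale training corpus
ratio = paths // tokens
print(f"{paths:.3e} candidate paths vs {tokens:.3e} tokens ({ratio}x more paths)")
```

Even a trillion-token corpus sees only a vanishing fraction of the path space, so most paths receive no training signal at all; sharing router parameters across consecutive layers shrinks the set of paths the model can effectively express.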

[LG-61] On Additive Gaussian Processes for Wind Farm Power Prediction

链接: https://arxiv.org/abs/2603.18281
作者: Simon M. Brealy,Lawrence A. Bull,Daniel S. Brennan,Pauline Beltrando,Anders Sommer,Nikolaos Dervilis,Keith Worden
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Population-based Structural Health Monitoring (PBSHM) aims to share information between similar machines or structures. This paper takes a population-level perspective, exploring the use of additive Gaussian processes to reveal variations in turbine-specific and farm-level power models over a collected wind farm dataset. The predictions illustrate patterns in wind farm power generation, which follow intuition and should enable more informed control and decision-making.

[LG-62] Computation-Utility-Privacy Tradeoffs in Bayesian Estimation STOC2026

链接: https://arxiv.org/abs/2603.18254
作者: Sitan Chen,Jingqiu Ding,Mahbod Majid,Walter McKelvie
类目: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: To appear at STOC 2026

点击查看摘要

Abstract:Bayesian methods lie at the heart of modern data science and provide a powerful scaffolding for estimation in data-constrained settings and principled quantification and propagation of uncertainty. Yet in many real-world use cases where these methods are deployed, there is a natural need to preserve the privacy of the individuals whose data is being scrutinized. While a number of works have attempted to approach the problem of differentially private Bayesian estimation through either reasoning about the inherent privacy of the posterior distribution or privatizing off-the-shelf Bayesian methods, these works generally do not come with rigorous utility guarantees beyond low-dimensional settings. In fact, even for the prototypical tasks of Gaussian mean estimation and linear regression, it was unknown how close one could get to the Bayes-optimal error with a private algorithm, even in the simplest case where the unknown parameter comes from a Gaussian prior. In this work, we give the first efficient algorithms for both of these problems that achieve mean-squared error $(1+o(1))\cdot\mathrm{OPT}$ and additionally show that both tasks exhibit an intriguing computational-statistical gap. For Bayesian mean estimation, we prove that the excess risk achieved by our method is optimal among all efficient algorithms within the low-degree framework, yet is provably worse than what is achievable by an exponential-time algorithm. For linear regression, we prove a qualitatively similar lower bound. Our algorithms draw upon the privacy-to-robustness framework of arXiv:2212.05015, but with the curious twist that to achieve private Bayes-optimal estimation, we need to design sum-of-squares-based robust estimators for inherently non-robust objects like the empirical mean and OLS estimator. Along the way we also add to the sum-of-squares toolkit a new kind of constraint based on short-flat decompositions.

[LG-63] AGRI-Fidelity: Evaluating the Reliability of Listenable Explanations for Poultry Disease Detection

链接: https://arxiv.org/abs/2603.18247
作者: Sindhuja Madabushi,Arda Dogan,Jonathan Liu,Dian Chen,Dong S. Ha,Sook Shin,Sam H. Noh,Jin-Hee Cho
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing XAI metrics measure faithfulness for a single model, ignoring model multiplicity where near-optimal classifiers rely on different or spurious acoustic cues. In noisy farm environments, stationary artifacts such as ventilation noise can produce explanations that are faithful yet unreliable, as masking-based metrics fail to penalize redundant shortcuts. We propose AGRI-Fidelity, a reliability-oriented evaluation framework for listenable explanations in poultry disease detection without spatial ground truth. The method combines cross-model consensus with cyclic temporal permutation to construct null distributions and compute a False Discovery Rate (FDR), suppressing stationary artifacts while preserving time-localized bioacoustic markers. Across real and controlled datasets, AGRI-Fidelity provides reliability-aware discrimination across all data points, unlike masking-based metrics.

[LG-64] Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

链接: https://arxiv.org/abs/2603.18174
作者: Xunzhuo Liu,Hao Wu,Huamin Chen,Bowei He,Xue Liu
类目: Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Conflict detection in policy languages is a solved problem – as long as every rule condition is a crisp Boolean predicate. BDDs, SMT solvers, and NetKAT all exploit that assumption. But a growing class of routing and access-control systems base their decisions on probabilistic ML signals: embedding similarities, domain classifiers, complexity estimators. Two such signals, declared over categories the author intended to be disjoint, can both clear their thresholds on the same query and silently route it to the wrong model. Nothing in the compiler warns about this. We characterize the problem as a three-level decidability hierarchy – crisp conflicts are decidable via SAT, embedding conflicts reduce to spherical cap intersection, and classifier conflicts are undecidable without distributional knowledge – and show that for the embedding case, which dominates in practice, replacing independent thresholding with a temperature-scaled softmax partitions the embedding space into Voronoi regions where co-firing is impossible. No model retraining is needed. We implement the detection and prevention mechanisms in the Semantic Router DSL, a production routing language for LLM inference, and discuss how the same ideas apply to semantic RBAC and API gateway policy.

[LG-65] Learning-Augmented Algorithms for k-median via Online Learning NEURIPS2025

链接: https://arxiv.org/abs/2603.18157
作者: Anish Hebbar,Rong Ge,Amit Kumar,Debmalya Panigrahi
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

Abstract:The field of learning-augmented algorithms seeks to use ML techniques on past instances of a problem to inform an algorithm designed for a future instance. In this paper, we introduce a novel model for learning-augmented algorithms inspired by online learning. In this model, we are given a sequence of instances of a problem and the goal of the learning-augmented algorithm is to use prior instances to propose a solution to a future instance of the problem. The performance of the algorithm is measured by its average performance across all the instances, where the performance on a single instance is the ratio between the cost of the algorithm’s solution and that of an optimal solution for that instance. We apply this framework to the classic k-median clustering problem, and give an efficient learning algorithm that can approximately match the average performance of the best fixed k-median solution in hindsight across all the instances. We also experimentally evaluate our algorithm and show that its empirical performance is close to optimal, and also that it automatically adapts the solution to a dynamically changing sequence.
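The performance measure above can be made concrete on a toy example: brute-force the best fixed k-median solution in hindsight over a short sequence of 1-D instances, then average its per-instance cost ratio against each instance's own optimum. The instances and candidate centers below are illustrative inventions, not data from the paper, and brute force stands in for the paper's efficient learning algorithm.

```python
from itertools import combinations

def kmedian_cost(centers, points):
    # Sum of distances from each point to its nearest center.
    return sum(min(abs(p - c) for c in centers) for p in points)

def best_fixed_solution(instances, candidates, k):
    # Brute-force the fixed k-center set minimizing total cost in hindsight.
    return min(combinations(candidates, k),
               key=lambda S: sum(kmedian_cost(S, inst) for inst in instances))

# Toy sequence of 1-D instances (hypothetical data).
instances = [[0.0, 1.0, 9.0], [0.2, 1.1, 9.3], [0.1, 0.9, 8.8]]
candidates = sorted({p for inst in instances for p in inst})
k = 2

fixed = best_fixed_solution(instances, candidates, k)
# Average performance ratio: fixed-solution cost vs each instance's optimum.
ratios = []
for inst in instances:
    opt = min(kmedian_cost(S, inst) for S in combinations(inst, k))
    ratios.append(kmedian_cost(fixed, inst) / opt)
avg_ratio = sum(ratios) / len(ratios)
```

For 1-D k-median an optimal solution always exists with centers at data points, so every per-instance ratio is at least 1 and the average ratio measures how much the single fixed solution loses against instance-specific optima.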

[LG-66] MAED: Mathematical Activation Error Detection for Mitigating Physical Fault Attacks in DNN Inference

链接: https://arxiv.org/abs/2603.18120
作者: Kasra Ahmadi,Saeed Aghapour,Mehran Mozaffari Kermani,Reza Azarderakhsh
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The inference phase of deep neural networks (DNNs) in embedded systems is increasingly vulnerable to fault attacks and failures, which can result in incorrect predictions. These vulnerabilities can potentially lead to catastrophic consequences, making the development of effective mitigation techniques essential. In this paper, we introduce MAED (Mathematical Activation Error Detection), an algorithm-level error detection framework that exploits mathematical identities to continuously validate the correctness of non-linear activation function computations at runtime. To the best of our knowledge, this work is the first to integrate algorithm-level error detection techniques to defend against both malicious fault injection attacks and naturally occurring faults in critical DNN components in embedded systems. The evaluation is conducted on three widely adopted activation functions, namely ReLU, sigmoid, and tanh, which serve as fundamental building blocks for introducing non-linearity in DNNs and can lead to mispredictions when subjected to natural faults or fault attacks. We assessed the proposed error detection scheme via fault model simulation, achieving close to 100% error detection while mitigating existing fault attacks on DNN inference. Additionally, the overhead introduced by integrating the proposed scheme with the baseline implementation (i.e., without error detection) is validated through implementations on an AMD/Xilinx Artix-7 FPGA and an ATmega328P microcontroller, as well as through integration with TensorFlow. On the microcontroller, the proposed error detection incurs less than 1% clock cycle overhead, while on the FPGA it requires nearly zero additional area, at the cost of approximately a 20% increase in latency for sigmoid and tanh.
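The abstract does not state which identities MAED checks, so the sketch below uses two standard ones as stand-ins: sigmoid(x) + sigmoid(-x) = 1 and tanh(x) = 2*sigmoid(2x) - 1. A correctly computed activation satisfies its identity to rounding error, while an injected fault violates it and is flagged.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def check_sigmoid(x, y, tol=1e-9):
    # Identity: sigmoid(x) + sigmoid(-x) = 1 must hold for a correct output y.
    return abs(y + sigmoid(-x) - 1.0) < tol

def check_tanh(x, y, tol=1e-9):
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1.
    return abs(y - (2.0 * sigmoid(2.0 * x) - 1.0)) < tol

x = 0.7
ok_sig = check_sigmoid(x, sigmoid(x))         # correct computation passes
bad_sig = check_sigmoid(x, sigmoid(x) + 0.1)  # injected fault is detected
ok_tanh = check_tanh(x, math.tanh(x))
```

Each check costs one extra activation evaluation, which is consistent with the low runtime overhead the paper reports, though the identities above are illustrative assumptions rather than the paper's exact scheme.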

[LG-67] BoundAD: Boundary-Aware Negative Generation for Time Series Anomaly Detection

链接: https://arxiv.org/abs/2603.18111
作者: Xiancheng Wang,Lin Wang,Zhibo Zhang,Rui Wang,Minghang Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Contrastive learning methods for time series anomaly detection (TSAD) heavily depend on the quality of negative sample construction. However, existing strategies based on random perturbations or pseudo-anomaly injection often struggle to simultaneously preserve temporal semantic consistency and provide effective decision-boundary supervision. Most existing methods rely on prior anomaly injection, while overlooking the potential of generating hard negatives near the data manifold boundary directly from normal samples themselves. To address this issue, we propose a reconstruction-driven boundary negative generation framework that automatically constructs hard negatives through the reconstruction process of normal samples. Specifically, the method first employs a reconstruction network to capture normal temporal patterns, and then introduces a reinforcement learning strategy to adaptively adjust the optimization update magnitude according to the current reconstruction state. In this way, boundary-shifted samples close to the normal data manifold can be induced along the reconstruction trajectory and further used for subsequent contrastive representation learning. Unlike existing methods that depend on explicit anomaly injection, the proposed framework does not require predefined anomaly patterns, but instead mines more challenging boundary negatives from the model’s own learning dynamics. Experimental results show that the proposed method effectively improves anomaly representation learning and achieves competitive detection performance on the current dataset.

[LG-68] STEP: Detecting Audio Backdoor Attacks via Stability-based Trigger Exposure Profiling

链接: https://arxiv.org/abs/2603.18103
作者: Kun Wang,Meng Chen,Junhao Wang,Yuli Wu,Li Lu,Chong Zhang,Peng Cheng,Jiaheng Zhang,Kui Ren
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:With the widespread deployment of deep-learning-based speech models in security-critical applications, backdoor attacks have emerged as a serious threat: an adversary who poisons a small fraction of training data can implant a hidden trigger that controls the model’s output while preserving normal behavior on clean inputs. Existing inference-time defenses are not well suited to the audio domain, as they either rely on trigger over-robustness assumptions that fail on transformation-based and semantic triggers, or depend on properties specific to image or text modalities. In this paper, we propose STEP (Stability-based Trigger Exposure Profiling), a black-box, retraining-free backdoor detector that operates under hard-label-only access. Its core idea is to exploit a characteristic dual anomaly of backdoor triggers: anomalous label stability under semantic-breaking perturbations, and anomalous label fragility under semantic-preserving perturbations. STEP profiles each test sample with two complementary perturbation branches that target these two properties respectively, scores the resulting stability features with one-class anomaly detectors trained on benign references, and fuses the two scores via unsupervised weighting. Extensive experiments across seven backdoor attacks show that STEP achieves an average AUROC of 97.92% and EER of 4.54%, substantially outperforming state-of-the-art baselines, and generalizes across model architectures, speech tasks, an open-set verification scenario, and over-the-air physical-world settings.

[LG-69] Variational Phasor Circuits for Phase-Native Brain-Computer Interface Classification

链接: https://arxiv.org/abs/2603.18078
作者: Dibakar Sigdel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present the \textbf{Variational Phasor Circuit} (VPC), a deterministic classical learning architecture operating on the continuous S^1 unit circle manifold. Inspired by variational quantum circuits, VPC replaces dense real-valued weight matrices with trainable phase shifts, local unitary mixing, and structured interference in the ambient complex space. This phase-native design provides a unified method for both binary and multi-class classification of spatially distributed signals. A single VPC block supports compact phase-based decision boundaries, while stacked VPC compositions extend the model to deeper circuits through inter-block pull-back normalization. Using synthetic brain-computer interface benchmarks, we show that VPC can decode difficult mental-state classification tasks with competitive accuracy and substantially fewer trainable parameters than standard Euclidean baselines. These results position unit-circle phase interference as a practical and mathematically principled alternative to dense neural computation, and motivate VPC as both a standalone classifier and a front-end encoding layer for future hybrid phasor-quantum systems.

[LG-70] Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse

链接: https://arxiv.org/abs/2603.18056
作者: Dip Roy,Rajiv Misra,Sanjay Kumar Singh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Extreme neural network sparsification (90% activation reduction) presents a critical challenge for mechanistic interpretability: understanding whether interpretable features survive aggressive compression. This work investigates feature survival under severe capacity constraints in hybrid Variational Autoencoder–Sparse Autoencoder (VAE-SAE) architectures. We introduce an adaptive sparsity scheduling framework that progressively reduces active neurons from 500 to 50 over 50 training epochs, and provide empirical evidence for fundamental limits of the sparsification-interpretability relationship. Testing across two benchmark datasets – dSprites and Shapes3D – with both Top-k and L1 sparsification methods, our key finding reveals a pervasive paradox: while global representation quality (measured by Mutual Information Gap) remains stable, local feature interpretability collapses systematically. Under Top-k sparsification, dead neuron rates reach 34.4\pm0.9% on dSprites and 62.7\pm1.3% on Shapes3D at k=50. L1 regularization – a fundamentally different “soft constraint” paradigm – produces equal or worse collapse: 41.7\pm4.4% on dSprites and 90.6\pm0.5% on Shapes3D. Extended training for 100 additional epochs fails to recover dead neurons, and the collapse pattern is robust across all tested threshold definitions. Critically, the collapse scales with dataset complexity: Shapes3D (RGB, 6 factors) shows 1.8\times more dead neurons than dSprites (grayscale, 5 factors) under Top-k and 2.2\times under L1. These findings establish that interpretability collapse under sparsification is intrinsic to the compression process rather than an artifact of any particular algorithm, training duration, or threshold choice.
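The dead-neuron statistic can be reproduced in miniature: apply a per-sample Top-k mask to an activation matrix and count the neurons that are never selected. The synthetic activations below (a Gaussian matrix with a biased subset of neurons) are a hypothetical stand-in for the VAE-SAE's latent layer.

```python
import numpy as np

def dead_neuron_rate(acts, k):
    """Fraction of neurons never selected by a per-sample Top-k mask.

    acts: (n_samples, n_neurons) activation matrix.
    """
    n_samples, n_neurons = acts.shape
    # Indices of the k largest activations in each sample.
    topk = np.argsort(acts, axis=1)[:, -k:]
    alive = np.zeros(n_neurons, dtype=bool)
    alive[np.unique(topk)] = True
    return 1.0 - alive.mean()

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 500))
# Bias a small subset of neurons so they dominate the Top-k selection,
# starving the rest -- a toy analogue of the collapse described above.
acts[:, :60] += 5.0
rate = dead_neuron_rate(acts, k=50)
```

With 60 dominant neurons and k=50, the remaining ~440 neurons are almost never selected, so the dead-neuron rate lands near 0.88; the paper's reported rates arise from training dynamics, not from this static masking toy.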

[LG-71] An FPGA-Based SoC Architecture with a RISC-V Controller for Energy-Efficient Temporal-Coding Spiking Neural Networks

链接: https://arxiv.org/abs/2603.18054
作者: Mohammad Javad Sekonji,Ali Mahani,Maryam Mirsadeghi,Mahdi Taheri
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer high energy efficiency and event-driven computation, ideal for low-power edge AI. Their hardware implementation on FPGAs, however, faces challenges due to heavy computation, large memory use, and limited flexibility. This paper proposes a compact System-on-Chip (SoC) architecture for temporal-coding SNNs, integrating a RISC-V controller with an event-driven SNN core. It replaces multipliers with bitwise operations using binarized weights, includes a spike-time sorter for active spikes, and skips noninformative events to reduce computation. The architecture runs fully on a Xilinx Artix-7 FPGA, achieving up to 16x memory reduction for weights and lowering computational overhead and latency, with 97.0% accuracy on MNIST and 88.3% on FashionMNIST. This self-contained design provides an efficient, scalable platform for real-time neuromorphic inference at the edge.
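For binarized (±1) weights and activations, the standard way to replace multipliers with bitwise operations is the XNOR-popcount identity dot = n - 2*popcount(a XOR b). The abstract does not detail the paper's exact hardware datapath, so this software sketch only illustrates that identity.

```python
def binarize(vec):
    # Pack a ±1 vector into a bitmask (bit i set when element i is +1).
    bits = 0
    for i, v in enumerate(vec):
        if v > 0:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    # For ±1 vectors: matching signs contribute +1, mismatches -1,
    # so dot = n - 2 * popcount(a XOR b).
    return n - 2 * bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")

a = [1, -1, 1, 1, -1, 1, -1, -1]   # toy binary activations
w = [1, 1, -1, 1, -1, -1, -1, 1]   # toy binarized weights
ref = sum(x * y for x, y in zip(a, w))       # multiply-accumulate reference
fast = xnor_dot(binarize(a), binarize(w), len(a))  # multiplier-free version
```

In hardware the XOR and popcount map to a handful of gates per lane, which is the kind of saving that lets the accelerator drop its multipliers.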

[LG-72] Quotient Geometry and Persistence-Stable Metrics for Swarm Configurations

链接: https://arxiv.org/abs/2603.18041
作者: Mark M. Bailey
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Systems and Control (eess.SY); Algebraic Topology (math.AT)
*备注: 20 pages

点击查看摘要

Abstract:Swarm and constellation reconfiguration can be viewed as motion of an unordered point configuration in an ambient space. Here, we provide persistence-stable, symmetry-invariant geometric representations for comparing and monitoring multi-agent configuration data. We introduce a quotient formation space \mathcal{S}_n(M,G)=M^n/(G\times S_n) and a formation matching metric d_{M,G} obtained by optimizing a worst-case assignment error over ambient symmetries g\in G and relabelings \sigma\in S_n . This metric is a structured, physically interpretable relaxation of Gromov–Hausdorff distance: the induced inter-agent metric spaces satisfy d_{\mathrm{GH}}(X_x,X_y)\le d_{M,G}([x],[y]) . Composing this bound with stability of Vietoris–Rips persistence yields d_B(\Phi_k([x]),\Phi_k([y]))\le d_{M,G}([x],[y]) , providing persistence-stable signatures for reconfiguration monitoring. We analyze the metric geometry of (\mathcal{S}_n(M,G),d_{M,G}) : under compactness/completeness assumptions on M and compact G it is compact/complete and the metric induces the quotient topology; if M is geodesic then the quotient is geodesic and exhibits stratified singularities along collision and symmetry strata, relating it to classical configuration spaces. We study expressivity of the signatures, identifying symmetry-mismatch and persistence-compression mechanisms for non-injectivity. Finally, in a phase-circle model we prove a conditional inverse theorem: under semicircle support and a gap-labeling margin, the H_0 signature is locally bi-Lipschitz to d_{M,G} up to an explicit factor, yielding two-sided control. Examples on \mathbb{S}^2 and \mathbb{T}^m illustrate satellite-constellation and formation settings.
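For tiny configurations the formation matching metric d_{M,G} can be evaluated by brute force: minimize the worst-case per-agent matching error over a finite symmetry group and all relabelings. The group of planar 90-degree rotations below is an illustrative stand-in for a general compact G; the configurations are invented.

```python
import math
from itertools import permutations

def formation_distance(x, y, group):
    """Brute-force d_{M,G}: min over g in G and relabelings sigma of the
    worst-case per-agent error  max_i ||g(x_i) - y_{sigma(i)}||."""
    best = float("inf")
    for g in group:
        gx = [g(p) for p in x]
        for sigma in permutations(range(len(y))):
            err = max(math.dist(gx[i], y[s]) for i, s in enumerate(sigma))
            best = min(best, err)
    return best

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return lambda p: (c * p[0] - s * p[1], s * p[0] + c * p[1])

# Planar rotations by multiples of 90 degrees as a small symmetry group G.
group = [rotation(k * math.pi / 2) for k in range(4)]
x = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
# y is x rotated by 90 degrees and relabeled: its distance to x should vanish.
y = [rotation(math.pi / 2)(p) for p in x][::-1]
d_same = formation_distance(x, y, group)
d_diff = formation_distance(x, [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)], group)
```

A configuration and its rotated, relabeled copy are at distance (numerically) zero, while an unrelated configuration stays far away; the paper's continuous-group and large-n settings of course require more than this factorial-time toy.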

[LG-73] Sharpness Aware Surrogate Training for Spiking Neural Networks

链接: https://arxiv.org/abs/2603.18039
作者: Maximilian Nicholson
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Surrogate gradients are a standard tool for training spiking neural networks (SNNs), but conventional hard forward or surrogate backward training couples a nonsmooth forward model with a biased gradient estimator. We study Sharpness-Aware Surrogate Training (SAST), which applies Sharpness-Aware Minimization (SAM) to a surrogate forward SNN trained by backpropagation. In this formulation, the optimization target is an ordinary smooth empirical risk, so the training gradient is exact for the auxiliary model being optimized. Under explicit boundedness and contraction assumptions, we derive compact state stability and input Lipschitz bounds, establish smoothness of the surrogate objective, provide a first-order SAM approximation bound, and prove a nonconvex convergence guarantee for stochastic SAST with an independent second minibatch. We also isolate a local mechanism proposition, stated separately from the unconditional guarantees, that links per-sample parameter gradient control to smaller input gradient norms under local Jacobian conditioning. Empirically, we evaluate clean accuracy, hard spike transfer, corruption robustness, and training overhead on N-MNIST and DVS Gesture. The clearest practical effect is transfer gap reduction: on N-MNIST, hard spike accuracy rises from 65.7% to 94.7% (best at \rho=0.30 ) while surrogate forward accuracy remains high; on DVS Gesture, hard spike accuracy improves from 31.8% to 63.3% (best at \rho=0.40 ). We additionally specify the compute-matched, calibration, and theory-alignment controls required for a final practical assessment.
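The SAM step underlying SAST admits a compact sketch: take a normalized ascent step of radius rho to an approximate worst-case point, then descend using the gradient evaluated there. The toy quadratic loss below stands in for the much richer surrogate SNN objective, and the rho/lr values are illustrative, not the paper's.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb to the (approximate)
    worst case within an L2 ball of radius rho, then descend from there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
    g_sharp = grad_fn(w + eps)                   # gradient at perturbed point
    return w - lr * g_sharp

# Toy quadratic loss L(w) = 0.5 * ||w - w_star||^2.
w_star = np.array([1.0, -2.0])
grad_fn = lambda w: w - w_star

w = np.array([5.0, 5.0])
for _ in range(200):
    w = sam_step(w, grad_fn)
```

Each SAM step costs two gradient evaluations (the "independent second minibatch" in the abstract supplies the second one stochastically), and on this quadratic the iterate settles into a small neighborhood of the minimizer whose radius scales with rho.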

[LG-74] Adapting Methods for Domain-Specific Japanese Small LMs: Scale Architecture and Quantization

链接: https://arxiv.org/abs/2603.18037
作者: Takato Yasuno
类目: Machine Learning (cs.LG)
*备注: 16 pages, 11 figures, 6 tables

点击查看摘要

Abstract:This paper presents a systematic methodology for building domain-specific Japanese small language models using QLoRA fine-tuning. We address three core questions: optimal training scale, base-model selection, and architecture-aware quantization. Stage 1 (Training scale): Scale-learning experiments (1k–5k samples) identify n=4,000 as optimal, where test-set NLL reaches minimum (1.127) before overfitting at 5k samples. Stage 2 (Compare finetuned SLMs): Comparing four Japanese LLMs shows that Llama-3 models with Japanese continual pre-training (Swallow-8B, ELYZA-JP-8B) outperform multilingual models (Qwen2.5-7B). Stage 3 (Quantization): Llama-3 architectures improve under Q4_K_M quantization, while GQA architectures degrade severely (Qwen2.5: -0.280 points). Production recommendation: Swallow-8B Q4_K_M achieves 2.830/3 score, 8.9 s/question, 4.9 GB size. The methodology generalizes to low-resource technical domains and provides actionable guidance for compact Japanese specialist LMs on consumer hardware.

[LG-75] MST-Direct: Matching via Sinkhorn Transport for Multivariate Geostatistical Simulation with Complex Non-Linear Dependencies

链接: https://arxiv.org/abs/2603.18036
作者: Tchalies Bachmann Schmitz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate geostatistical simulation requires the faithful reproduction of complex non-linear dependencies among geological variables, including bimodal distributions, step functions, and heteroscedastic relationships. Traditional methods such as the Gaussian Copula and LU Decomposition assume linear correlation structures and often fail to preserve these complex joint distribution patterns. We propose MST-Direct (Matching via Sinkhorn Transport), a novel algorithm based on Optimal Transport theory that uses the Sinkhorn algorithm to directly match multivariate distributions while preserving spatial correlation structures. The method processes all variables simultaneously as a single multidimensional vector, enabling relational matching across the full joint space rather than relying on pairwise linear dependencies.

[LG-76] Taming Epilepsy: Mean Field Control of Whole-Brain Dynamics

链接: https://arxiv.org/abs/2603.18035
作者: Ming Li,Ting Gao,Jingqiao Dua
类目: Machine Learning (cs.LG)
*备注: 22 pages, 7 figures

点击查看摘要

Abstract:Controlling the high-dimensional neural dynamics during epileptic seizures remains a significant challenge due to the nonlinear characteristics and complex connectivity of the brain. In this paper, we propose a novel framework, namely Graph-Regularized Koopman Mean-Field Game (GK-MFG), which integrates Reservoir Computing (RC) for Koopman operator approximation with Alternating Population and Agent Control Network (APAC-Net) for solving distributional control problems. By embedding Electroencephalogram (EEG) dynamics into a linear latent space and imposing graph Laplacian constraints derived from the Phase Locking Value (PLV), our method achieves robust seizure suppression while respecting the functional topological structure of the brain.

[LG-77] Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models

链接: https://arxiv.org/abs/2603.15571
作者: Jay Sarkar,Vamsi Pavan Rayaprolu,Abhijeet Bhalerao
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Systems and Control (eess.SY); Applied Physics (physics.app-ph)
*备注: 9 pages, 10 figures

点击查看摘要

Abstract:Solid-state storage architectures based on NAND or emerging memory devices (SSD) are fundamentally architected and optimized for both reliability and performance. Achieving these simultaneous goals requires co-design of memory components with firmware-architected Error Management (EM) algorithms for density- and performance-scaled memory technologies. We describe a Machine Learning (ML)-for-systems methodology and modeling framework for co-designing the EM subsystem together with the natural variance inherent to the scaled silicon process of memory components underlying SSD technology. The modeling analyzes NAND memory components and EM algorithms interacting with a comprehensive suite of synthetic (stress-focused and JEDEC) and emulation (YCSB and similar) workloads across Flash Translation abstraction layers, by leveraging a statistically interpretable and intuitively explainable ML algorithm. The generalizable co-design framework evaluates several thousand datacenter SSDs spanning multiple generations of memory and storage technology. Consequently, the modeling framework enables continuous, holistic, data-driven design towards generational architectural advancements. We additionally demonstrate that the framework enables Representation Learning of the EM-workload domain for enhancement of the architectural design-space across a broad spectrum of workloads.

[LG-78] The Exponentially Weighted Signature

链接: https://arxiv.org/abs/2603.19198
作者: Alexandre Bloch,Samuel N. Cohen,Terry Lyons,Joël Mouterde,Benjamin Walker
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 43 pages, 1 figure

点击查看摘要

Abstract:The signature is a canonical representation of a multidimensional path over an interval. However, it treats all historical information uniformly, offering no intrinsic mechanism for contextualising the relevance of the past. To address this, we introduce the Exponentially Weighted Signature (EWS), generalising the Exponentially Fading Memory (EFM) signature from diagonal to general bounded linear operators. These operators enable cross-channel coupling at the level of temporal weighting together with richer memory dynamics including oscillatory, growth, and regime-dependent behaviour, while preserving the algebraic strengths of the classical signature. We show that the EWS is the unique solution to a linear controlled differential equation on the tensor algebra, and that it generalises both state-space models and the Laplace and Fourier transforms of the path. The group-like structure of the EWS enables efficient computation and makes the framework amenable to gradient-based learning, with the full semigroup action parametrised by and learned through its generator. We use this framework to empirically demonstrate the expressivity gap between the EWS and both the signature and EFM on two SDE-based regression tasks.
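In the scalar (diagonal-operator) case, the level-1 exponentially weighted signature reduces to the EFM sum S_T = \sum_i e^{-\lambda(T - t_i)} \Delta x_i, and the semigroup (group-like) structure lets one compute it in a single pass via S \leftarrow e^{-\lambda \Delta t} S + \Delta x. The sketch below checks that recursion against the direct definition on an invented path; higher tensor levels and general operators are beyond this toy.

```python
import math

def ews_level1(times, values, lam):
    """Exponentially-fading level-1 signature via the semigroup recursion:
    S <- exp(-lam * dt) * S + dx  (scalar/diagonal weighting case)."""
    s = 0.0
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        s = math.exp(-lam * dt) * s + (values[i] - values[i - 1])
    return s

def ews_level1_direct(times, values, lam):
    # Direct definition: increments weighted by e^{-lam * (T - t_i)}.
    T = times[-1]
    return sum(math.exp(-lam * (T - times[i])) * (values[i] - values[i - 1])
               for i in range(1, len(values)))

times = [0.0, 0.5, 1.2, 2.0, 3.0]     # illustrative irregular time grid
values = [0.0, 1.0, 0.3, 0.8, 1.5]    # illustrative path samples
s_rec = ews_level1(times, values, lam=0.7)
s_dir = ews_level1_direct(times, values, lam=0.7)
```

The recursion is the practical payoff of the group-like structure: old observations never need revisiting, since each step just decays the running state and adds the new increment.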

[LG-79] Fast and Effective Computation of Generalized Symmetric Matrix Factorization

链接: https://arxiv.org/abs/2603.19147
作者: Lei Yang,Han Wan,Min Zhang,Ling Liang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 41 pages, 2 figures, 1 table

点击查看摘要

Abstract:In this paper, we study a nonconvex, nonsmooth, and non-Lipschitz generalized symmetric matrix factorization model that unifies a broad class of matrix factorization formulations arising in machine learning, image science, engineering, and related areas. We first establish two exactness properties. On the modeling side, we prove an exact penalty property showing that, under suitable conditions, the symmetry-inducing quadratic penalty enforces symmetry whenever the penalty parameter is sufficiently large but finite, thereby exactly recovering the associated symmetric formulation. On the algorithmic side, we introduce an auxiliary-variable splitting formulation and establish an exact relaxation relationship that rigorously links stationary points of the original objective function to those of a relaxed potential function. Building on these exactness properties, we propose an average-type nonmonotone alternating updating method (A-NAUM) based on the relaxed potential function. At each iteration, A-NAUM alternately updates the two factor blocks by (approximately) minimizing the potential function, while the auxiliary block is updated in closed form. To ensure the convergence and enhance practical performance, we further incorporate an average-type nonmonotone line search and show that it is well-defined under mild conditions. Moreover, based on the Kurdyka-Łojasiewicz property and its associated exponent, we establish global convergence of the entire sequence to a stationary point and derive convergence rate results. Finally, numerical experiments on real datasets demonstrate the efficiency of A-NAUM.

[LG-80] Fast and Interpretable Autoregressive Estimation with Neural Network Backpropagation

链接: https://arxiv.org/abs/2603.19041
作者: Anaísa Lucena,Ana Martins,Armando J. Pinho,Sónia Gouveia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autoregressive (AR) models remain widely used in time series analysis due to their interpretability, but conventional parameter estimation methods can be computationally expensive and prone to convergence issues. This paper proposes a Neural Network (NN) formulation of AR estimation by embedding the autoregressive structure directly into a feedforward NN, enabling coefficient estimation through backpropagation while preserving interpretability. Simulation experiments on 125,000 synthetic AR(p) time series with short-term dependence (1 \le p \le 5) show that the proposed NN-based method consistently recovers model coefficients for all series, while Conditional Maximum Likelihood (CML) fails to converge in approximately 55% of cases. When both methods converge, estimation accuracy is comparable with negligible differences in relative error, R2, and perplexity/likelihood. However, when CML fails, the NN-based approach still provides reliable estimates. In all cases, the NN estimator achieves substantial computational gains, reaching a median speedup of 12.6x and up to 34.2x for higher model orders. Overall, results demonstrate that gradient-descent NN optimization can provide a fast and efficient alternative for interpretable AR parameter estimation.
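The core construction, an AR(p) model embedded as a single linear layer and fitted by gradient descent on squared error, can be sketched as follows. The AR(2) coefficients, noise level, learning rate, and iteration count below are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a stationary AR(2) series: x_t = 0.5 x_{t-1} - 0.3 x_{t-2} + noise.
true_phi = np.array([0.5, -0.3])
n = 2000
x = np.zeros(n)
for t in range(2, n):
    x[t] = true_phi[0] * x[t - 1] + true_phi[1] * x[t - 2] + 0.1 * rng.normal()

# Lagged design matrix: predict x_t from (x_{t-1}, x_{t-2}).
p = 2
X = np.column_stack([x[p - 1 - j:n - 1 - j] for j in range(p)])
y = x[p:]

# The "network" is a single linear layer whose weights ARE the AR
# coefficients; backpropagation here is plain gradient descent on MSE.
phi = np.zeros(p)
lr = 5.0  # step size tuned to this toy's data scale
for _ in range(3000):
    grad = X.T @ (X @ phi - y) / len(y)  # gradient of mean squared error
    phi -= lr * grad
```

Because the layer is linear, the recovered weights remain directly interpretable as AR coefficients, which is the interpretability property the paper emphasizes.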

[LG-81] Revisiting OmniAnomaly for Anomaly Detection: performance metrics and comparison with PCA-based models

链接: https://arxiv.org/abs/2603.18985
作者: Bruna Alves,Ana Martins,Armando J. Pinho,Sónia Gouveia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models have become the dominant approach for multivariate time series anomaly detection (MTSAD), often reporting substantial performance improvements over classical statistical methods. However, these gains are frequently evaluated under heterogeneous thresholding strategies and evaluation protocols, making fair comparisons difficult. This work revisits OmniAnomaly, a widely used stochastic recurrent model for MTSAD, and systematically compares it with a simple linear baseline based on Principal Component Analysis (PCA) on the Server Machine Dataset (SMD). Both methods are evaluated under identical thresholding and evaluation procedures, with experiments repeated across 100 runs for each of the 28 machines in the dataset. Performance is evaluated using Precision, Recall and F1-score at point-level, with and without point-adjustment, and under different aggregation strategies across machines and runs, with the corresponding standard deviations also reported. The results reveal large variability across machines and show that PCA can achieve performance comparable to OmniAnomaly, and even outperform it when point-adjustment is not applied. These findings question the added value of more complex architectures under current benchmarking practices and highlight the critical role of evaluation methodology in MTSAD research.
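The PCA baseline of this kind typically scores each test point by its squared reconstruction error off the top principal components fitted on (assumed normal) training data. The synthetic low-rank data below is a hypothetical stand-in for SMD, and the component count is an illustrative choice.

```python
import numpy as np

def pca_scores(train, test, n_components):
    """Anomaly score = squared reconstruction error after projecting onto the
    top principal components fitted on (assumed normal) training data."""
    mu = train.mean(axis=0)
    _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
    V = Vt[:n_components].T                       # principal directions
    resid = (test - mu) - (test - mu) @ V @ V.T   # residual off the PC subspace
    return (resid ** 2).sum(axis=1)

rng = np.random.default_rng(1)
# Normal data lives near a 2-D subspace of a 5-D space.
basis = rng.normal(size=(2, 5))
train = rng.normal(size=(300, 2)) @ basis + 0.05 * rng.normal(size=(300, 5))
test = rng.normal(size=(10, 2)) @ basis + 0.05 * rng.normal(size=(10, 5))
test[3] += 4.0                                    # injected anomaly
scores = pca_scores(train, test, n_components=2)
```

Thresholding these scores yields point-level detections; the injected point sticks far out of the learned subspace and receives by far the largest score.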

[LG-82] Unified Taxonomy for Multivariate Time Series Anomaly Detection using Deep Learning

链接: https://arxiv.org/abs/2603.18941
作者: Bruna Alves,Armando J. Pinho,Sónia Gouveia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The topic of Multivariate Time Series Anomaly Detection (MTSAD) has grown rapidly over the past years, with a steady rise in publications and Deep Learning (DL) models becoming the dominant paradigm. To address the lack of systematization in the field, this study introduces a novel and unified taxonomy with eleven dimensions over three parts (Input, Output and Model) for the categorization of DL-based MTSAD methods. The dimensions were established in a two-fold approach. First, they derived from a comprehensive analysis of methodological studies. Second, insights from review papers were incorporated. Furthermore, the proposed taxonomy was validated using an additional set of recent publications, providing a clear overview of methodological trends in MTSAD. Results reveal a convergence toward Transformer-based and reconstruction and prediction models, setting the foundation for emerging adaptive and generative trends. Building on and complementing existing surveys, this unified taxonomy is designed to accommodate future developments, allowing for new categories or dimensions to be added as the field progresses. This work thus consolidates fragmented knowledge in the field and provides a reference point for future research in MTSAD.

[LG-83] Kernel Single-Index Bandits: Estimation Inference and Learning

链接: https://arxiv.org/abs/2603.18938
作者: Sakshi Arya,Satarupa Bhattacharjee,Bharath K. Sriperumbudur
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized \varepsilon-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including \tilde{O}(\sqrt{T}) rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.

[LG-84] Data-driven construction of machine-learning-based interatomic potentials for gas-surface scattering dynamics: the case of NO on graphite

链接: https://arxiv.org/abs/2603.18864
作者: Samuel Del Fré,Gilberto A. Alou Angulo,Maurice Monnerville,Alejandro Rivero Santamaría
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:Accurate atomistic simulations of gas-surface scattering require potential energy surfaces that remain reliable over broad configurational and energetic ranges while retaining the efficiency needed for extensive trajectory sampling. Here, we develop a data-driven workflow for constructing a machine-learning interatomic potential (MLIP) tailored to gas-surface scattering dynamics, using nitric oxide (NO) scattering from highly oriented pyrolytic graphite (HOPG) as a benchmark system. Starting from an initial ab initio molecular dynamics (AIMD) dataset, local atomic environments are described by SOAP descriptors and analyzed in a reduced feature space obtained through principal component analysis. Farthest point sampling is then used to build a compact training set, and the resulting Deep Potential model is refined through a query-by-committee active-learning strategy using additional configurations extracted from molecular dynamics simulations over extended ranges of incident energies and surface temperatures. The final MLIP reproduces reference energies and forces with high fidelity and enables large-scale molecular dynamics simulations of NO scattering from graphite at a computational cost far below that of AIMD. The simulations provide detailed insight into adsorption energetics, trapping versus direct scattering probabilities, translational energy loss, angular distributions, and rotational excitation. Overall, the results reproduce the main experimental trends and demonstrate that descriptor-guided sampling combined with active learning offers an efficient and transferable strategy for constructing MLIPs for gas-surface interactions.
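The farthest point sampling step used above to build a compact training set is straightforward to sketch in isolation (SOAP descriptors and PCA are omitted; `X` here is just any matrix of per-configuration feature vectors):

```python
import numpy as np

def farthest_point_sampling(X, k, seed_idx=0):
    """Greedy FPS: pick k rows of X that maximize coverage of feature space.
    At each step, select the point farthest from the already-chosen set."""
    chosen = [seed_idx]
    # Distance of every point to its nearest chosen point so far
    dist = np.linalg.norm(X - X[seed_idx], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```

Starting from a corner of the unit square, the sampler jumps to the opposite corner first, which is exactly the diversity-maximizing behavior that makes FPS attractive for selecting training configurations.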

[LG-85] SRRM: Improving Recursive Transport Surrogates in the Small-Discrepancy Regime

链接: https://arxiv.org/abs/2603.18781
作者: Yufei Zhang,Tao Wang,Jingyi Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 29 pages, 20 figures

点击查看摘要

Abstract:Recursive partitioning methods provide computationally efficient surrogates for the Wasserstein distance, yet their statistical behavior and their resolution in the small-discrepancy regime remain insufficiently understood. We study Recursive Rank Matching (RRM) as a representative instance of this class under a population-anchored reference. In this setting, we establish consistency and an explicit convergence rate for the anchored empirical RRM under the quadratic cost. We then identify a dominant mismatch mechanism responsible for the loss of resolution in the small-discrepancy regime. Based on this analysis, we introduce Selective Recursive Rank Matching (SRRM), which suppresses the resulting dominant mismatches and yields a higher-fidelity practical surrogate for the Wasserstein distance at moderate additional computational cost.
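Rank matching is easiest to see in one dimension, where matching sorted samples is exact for the quadratic-cost Wasserstein distance; recursive surrogates such as RRM generalize this base operation to higher dimensions. The sketch below covers only the 1-D base case, not RRM or SRRM:

```python
import numpy as np

def rank_matching_w2(x, y):
    """Transport cost from matching equal-size samples by rank.
    In one dimension this equals the squared quadratic Wasserstein
    distance between the two empirical distributions."""
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean((xs - ys) ** 2))
```

Because the cost depends only on the sorted samples, the result is invariant to the input ordering, and it vanishes whenever the two samples contain the same values.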

[LG-86] Holter-to-Sleep: AI-Enabled Repurposing of Single-Lead ECG for Sleep Phenotyping

链接: https://arxiv.org/abs/2603.18714
作者: Donglin Xie,Qingshuo Zhao,Jingyu Wang,Shijia Geng,Jiarui Jin,Jun Li,Rongrong Guo,Guangkun Nie,Gongzheng Tang,Yuxi Zhou,Thomas Penzel,Shenda Hong
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sleep disturbances are tightly linked to cardiovascular risk, yet polysomnography (PSG), the clinical reference standard, remains resource-intensive and poorly suited for multi-night, home-based, and large-scale screening. Single-lead electrocardiography (ECG), already ubiquitous in Holter and patch-based devices, enables comfortable long-term acquisition and encodes sleep-relevant physiology through autonomic modulation and cardiorespiratory coupling. Here, we present a proof-of-concept Holter-to-Sleep framework that, using single-lead ECG as the sole input, jointly supports overnight sleep phenotyping and Holter-grade cardiac phenotyping within the same recording, and further provides an explicit analytic pathway for scalable cardio-sleep association studies. The framework is developed and validated on a pooled multi-center PSG sample of 10,439 studies spanning four public cohorts, with independent external evaluation to assess cross-cohort generalizability, and additional real-world feasibility assessment using overnight patch-ECG recordings via objective-subjective consistency analysis. This integrated design enables robust extraction of clinically meaningful overnight sleep phenotypes under heterogeneous populations and acquisition conditions, and facilitates systematic linkage between ECG-derived sleep metrics and arrhythmia-related Holter phenotypes. Collectively, the Holter-to-Sleep paradigm offers a practical foundation for low-burden, home-deployable, and scalable cardio-sleep monitoring and research beyond traditional PSG-centric workflows.

[LG-87] A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets

链接: https://arxiv.org/abs/2603.18640
作者: Samuel Gruffaz,Kyurae Kim,Fares Guehtar,Hadrien Duval-decaix,Pacôme Trautmann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:The No-U-Turn Sampler (NUTS) is the computational workhorse of modern Bayesian software libraries, yet its qualitative and quantitative convergence guarantees were established only recently. A significant gap remains in the theoretical comparison of its two main variants: NUTS-mul and NUTS-BPS, which use multinomial sampling and biased progressive sampling, respectively, for index selection. In this paper, we address this gap in three contributions. First, we derive the first necessary conditions for geometric ergodicity for both variants. Second, we establish the first sufficient conditions for geometric ergodicity and ergodicity for NUTS-mul. Third, we obtain the first mixing time result for NUTS-BPS on a standard Gaussian distribution. Our results show that NUTS-mul and NUTS-BPS exhibit nearly identical qualitative behavior, with geometric ergodicity depending on the tail properties of the target distribution. However, they differ quantitatively in their convergence rates. More precisely, when initialized in the typical set of the canonical Gaussian measure, the mixing times of both NUTS-mul and NUTS-BPS scale as O(d^{1/4}) up to logarithmic factors, where d denotes the dimension. Nevertheless, the associated constants are strictly smaller for NUTS-BPS.
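For readers unfamiliar with NUTS, the "no-U-turn" termination rule shared by both variants checks whether the simulated trajectory has started to double back on itself. A minimal sketch of the classic criterion (modern implementations use a slightly generalized version):

```python
import numpy as np

def u_turn(theta_minus, theta_plus, r_minus, r_plus):
    """Stop doubling the trajectory when either endpoint's momentum
    points against the displacement between the two endpoints."""
    d = theta_plus - theta_minus
    return bool((d @ r_minus) < 0 or (d @ r_plus) < 0)
```

The index-selection schemes the paper compares (multinomial vs. biased progressive) decide which state along the kept trajectory is returned; the stopping rule above is common to both.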

[LG-88] Learning Decision-Sufficient Representations for Linear Optimization

链接: https://arxiv.org/abs/2603.18551
作者: Yuhan Ye,Saurabh Amin,Asuman Ozdauglar
类目: Optimization and Control (math.OC); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 45 pages, 2 figures, includes appendix

点击查看摘要

Abstract:We study how to construct compressed datasets that suffice to recover optimal decisions in linear programs with an unknown cost vector c lying in a prior set \mathcal{C}. Recent work by Bennouna et al. provides an exact geometric characterization of sufficient decision datasets (SDDs) via an intrinsic decision-relevant dimension d^\star. However, their algorithm for constructing minimum-size SDDs requires solving mixed-integer programs. In this paper, we establish hardness results showing that computing d^\star is NP-hard and deciding whether a dataset is globally sufficient is coNP-hard, thereby resolving a recent open problem posed by Bennouna et al. To address this worst-case intractability, we introduce pointwise sufficiency, a relaxation that requires sufficiency for an individual cost vector. Under nondegeneracy, we provide a polynomial-time cutting-plane algorithm for constructing pointwise-sufficient decision datasets. In a data-driven regime with i.i.d. costs, we further propose a cumulative algorithm that aggregates decision-relevant directions across samples, yielding a stable compression scheme of size at most d^\star. This leads to a distribution-free PAC guarantee: with high probability over the training sample, the pointwise sufficiency failure probability on a fresh draw is at most \tilde{O}(d^\star/n), and this rate is tight up to logarithmic factors. Finally, we apply decision-sufficient representations to contextual linear optimization, obtaining compressed predictors with generalization bounds scaling as \tilde{O}(\sqrt{d^\star/n}) rather than \tilde{O}(\sqrt{d/n}), where d is the ambient cost dimension.
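The notion of a dataset "sufficing to recover optimal decisions" can be illustrated naively on a polytope with enumerated vertices. This is only a Monte-Carlo check over sampled cost vectors, not the paper's cutting-plane or cumulative algorithm, and all names below are invented for the illustration:

```python
import numpy as np

def optimal_vertex(c, V):
    # A linear objective over a polytope attains its maximum at a vertex;
    # V holds the polytope's vertices as rows.
    return int(np.argmax(V @ c))

def covers_sampled_costs(V, keep, cost_samples):
    """Check whether the compressed decision set `keep` still contains
    the optimal vertex for every sampled cost vector."""
    return all(optimal_vertex(c, V) in keep for c in cost_samples)
```

On the unit square with strictly positive costs, the single vertex (1, 1) is always optimal, so the compressed set {that vertex} passes the check while any other singleton fails; the paper's d^\star measures, roughly, how large such a sufficient set must be over the whole prior set of costs.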

[LG-89] On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

链接: https://arxiv.org/abs/2603.18514
作者: Yixuan Zhang,Ruihao Zhu,Qiaomin Xie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

Abstract:Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary K-armed bandits. We show that in the general realizable, piecewise-stationary setting with L stationary segments, the optimal regret is \Theta(L\log T) as long as L\geq 2. This stands in sharp contrast to the case of L=1 (i.e., the stationary setting), where a T-independent \Theta(1) satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with T even if just a little nonstationarity is present. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a post-interaction reference construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.

[LG-90] Precise Performance of Linear Denoisers in the Proportional Regime

链接: https://arxiv.org/abs/2603.18483
作者: Reza Ghane,Danil Akhtiamov,Babak Hassibi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In the present paper we study the performance of linear denoisers for noisy data of the form \mathbf{x} + \mathbf{z}, where \mathbf{x} \in \mathbb{R}^d is the desired data with zero mean and unknown covariance \mathbf{\Sigma}, and \mathbf{z} \sim \mathcal{N}(0, \mathbf{\Sigma}_{\mathbf{z}}) is additive noise. Since the covariance \mathbf{\Sigma} is not known, the standard Wiener filter cannot be employed for denoising. Instead we assume we are given samples \mathbf{x}_1,\dots,\mathbf{x}_n \in \mathbb{R}^d from the true distribution. A standard approach would then be to estimate \mathbf{\Sigma} from the samples and use it to construct an "empirical" Wiener filter. However, in this paper, motivated by the denoising step in diffusion models, we take a different approach whereby we train a linear denoiser \mathbf{W} from the data itself. In particular, we synthetically construct noisy samples \hat{\mathbf{x}}_i of the data by injecting the samples with Gaussian noise with covariance \mathbf{\Sigma}_1 \neq \mathbf{\Sigma}_{\mathbf{z}} and find the best \mathbf{W} that approximates \mathbf{W}\hat{\mathbf{x}}_i \approx \mathbf{x}_i in a least-squares sense. In the proportional regime \frac{n}{d} \rightarrow \kappa > 1 we use the Convex Gaussian Min-Max Theorem (CGMT) to analytically find the closed-form expression for the generalization error of the denoiser obtained from this process. Using this expression one can optimize over \mathbf{\Sigma}_1 to find the best possible denoiser. Our numerical simulations show that our denoiser outperforms the "empirical" Wiener filter in many scenarios and approaches the optimal Wiener filter as \kappa\rightarrow\infty.
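The training procedure described above (inject synthetic noise, then fit \mathbf{W} by least squares) is easy to reproduce numerically. The sketch below uses isotropic injected noise \sigma_1^2 I as a simplifying assumption (the paper allows a general \mathbf{\Sigma}_1) and compares the learned denoiser against leaving the data untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
A = rng.normal(size=(d, d)) / np.sqrt(d)     # Sigma = A A^T, unknown to the learner
X = rng.normal(size=(n, d)) @ A.T            # clean training samples

sigma1 = 0.5                                 # injected training noise (isotropic here)
Xhat = X + sigma1 * rng.normal(size=(n, d))  # synthetically noised inputs

# Least-squares denoiser:  W xhat_i ~= x_i
W = np.linalg.lstsq(Xhat, X, rcond=None)[0].T

# Evaluate on fresh data corrupted by the test noise
sigma_z = 0.5
X_test = rng.normal(size=(1000, d)) @ A.T
noisy = X_test + sigma_z * rng.normal(size=X_test.shape)
mse_denoised = np.mean((noisy @ W.T - X_test) ** 2)
mse_raw = np.mean((noisy - X_test) ** 2)
```

With matched noise levels the learned W acts as a shrinkage toward the signal subspace and should beat the no-denoising baseline; the paper's CGMT analysis gives the exact asymptotics of this gap.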

[LG-91] Statistical Testing Framework for Clustering Pipelines by Selective Inference

链接: https://arxiv.org/abs/2603.18413
作者: Yugo Miyata,Tomohiro Shiraishi,Shunichi Nishino,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 59 pages, 11 figures

点击查看摘要

Abstract:A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis components. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.

[LG-92] Multi-Domain Causal Empirical Bayes Under Linear Mixing

链接: https://arxiv.org/abs/2603.18404
作者: Bohan Wu,Julius von Kügelgen,David M. Blei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Causal representation learning (CRL) aims to learn low-dimensional causal latent variables from high-dimensional observations. While identifiability has been extensively studied for CRL, estimation has been less explored. In this paper, we explore the use of empirical Bayes (EB) to estimate causal representations. In particular, we consider the problem of learning from data from multiple domains, where differences between domains are modeled by interventions in a shared underlying causal model. Multi-domain CRL naturally poses a simultaneous inference problem that EB is designed to tackle. Here, we propose an EB f-modeling algorithm that improves the quality of learned causal variables by exploiting invariant structure within and across domains. Specifically, we consider a linear measurement model and interventional priors arising from a shared acyclic SCM. When the graph and intervention targets are known, we develop an EM-style algorithm based on causally structured score matching. We further discuss EB g-modeling in the context of existing CRL approaches. In experiments on synthetic data, our proposed method achieves more accurate estimation than other methods for CRL.

[LG-93] A Hybrid Conditional Diffusion-DeepONet Framework for High-Fidelity Stress Prediction in Hyperelastic Materials

链接: https://arxiv.org/abs/2603.18225
作者: Purna Vindhya Kota,Meer Mehran Rashid,Somdatta Goswami,Lori Graham-Brady
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predicting stress fields in hyperelastic materials with complex microstructures remains challenging for traditional deep learning surrogates, which struggle to capture both sharp stress concentrations and the wide dynamic range of stress magnitudes. Convolutional architectures such as UNet tend to oversmooth high-frequency gradients, while neural operators like DeepONet exhibit spectral bias and underpredict localized extremes. Diffusion models can recover fine-scale structure but often introduce low-frequency amplitude drift, degrading physical scaling. To address these limitations, we propose a hybrid surrogate framework, cDDPM-DeepONet, that decouples stress morphology from magnitude. A conditional denoising diffusion probabilistic model (cDDPM), built on a UNet backbone, generates normalized von Mises stress fields conditioned on geometry and loading. In parallel, a modified DeepONet predicts global scaling parameters (minimum and maximum stress), enabling reconstruction of full-resolution physical stress maps. This separation allows the diffusion model to focus on spatial structure while the operator network corrects global amplitude, mitigating spectral and scaling biases. We evaluate the framework on nonlinear hyperelastic datasets with single and multiple polygonal voids. The proposed model consistently outperforms UNet, DeepONet, and standalone cDDPM baselines by one to two orders of magnitude. Spectral analysis shows strong agreement with finite element solutions across all wavenumbers, preserving both global behavior and localized stress concentrations.
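The core design choice above, decoupling stress morphology from magnitude, amounts to a reversible normalization. A minimal sketch of that split (in the actual framework the normalized field would come from the cDDPM and the scale pair from the DeepONet; here only the reconstruction identity is shown):

```python
import numpy as np

def split_field(field):
    """Separate a stress field into a [0, 1]-normalized morphology map
    and the global (min, max) scale handled by a second network."""
    lo, hi = float(field.min()), float(field.max())
    return (field - lo) / (hi - lo), (lo, hi)

def recombine(norm, scale):
    # Inverse of split_field: restore physical stress units
    lo, hi = scale
    return norm * (hi - lo) + lo
```

Because the diffusion model only ever sees the normalized map, it can focus on spatial structure, while errors in global amplitude are confined to the two predicted scalars.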

[LG-94] Tackling the Sign Problem in the Doped Hubbard Model with Normalizing Flows

链接: https://arxiv.org/abs/2603.18205
作者: Dominic Schuh,Lena Funcke,Janik Kreit,Thomas Luu,Simran Singh
类目: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:The Hubbard model at finite chemical potential is a cornerstone for understanding doped correlated systems, but simulations are severely limited by the sign problem. In the auxiliary-field formulation, the spin basis mitigates the sign problem, yet severe ergodicity issues have limited its use. We extend recent advances with normalizing flows at half-filling to finite chemical potential by introducing an annealing scheme enabling ergodic sampling. Compared to state-of-the-art hybrid Monte Carlo in the charge basis, our approach accurately reproduces exact diagonalization results while reducing statistical uncertainties by an order of magnitude, opening a new path for simulations of doped correlated systems.

[LG-95] Starting Off on the Wrong Foot: Pitfalls in Data Preparation

链接: https://arxiv.org/abs/2603.18190
作者: Jiayi Guo,Panyi Dong,Zhiyu Quan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 42 pages, 37 references

点击查看摘要

Abstract:When working with real-world insurance data, practitioners often encounter challenges during the data preparation stage that can undermine the statistical validity and reliability of downstream modeling. This study illustrates that conventional data preparation procedures, such as random train-test partitioning, often yield unreliable and unstable results when confronted with highly imbalanced insurance loss data. To mitigate these limitations, we propose a novel data preparation framework leveraging two recent statistical advancements: support points for representative data splitting to ensure distributional consistency across partitions, and the Chatterjee correlation coefficient for initial, non-parametric feature screening to capture feature relevance and dependence structure. We further integrate these theoretical advances into a unified, efficient framework that also incorporates missing-data handling, and embed this framework within our custom InsurAutoML pipeline. The performance of the proposed approach is evaluated using both simulated datasets and datasets often cited in the academic literature. Our findings definitively demonstrate that incorporating statistically rigorous data preparation methods not only significantly enhances model robustness and interpretability but also substantially reduces computational resource requirements across diverse insurance loss modeling tasks. This work provides a crucial methodological upgrade for achieving reliable results in high-stakes insurance applications.
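The Chatterjee correlation coefficient used above for feature screening has a simple closed form in the no-ties case, \xi_n = 1 - 3\sum_i |r_{i+1} - r_i| / (n^2 - 1), where r are the ranks of y after sorting by x. A direct implementation:

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation (no-ties version): near 1 when y is
    a noiseless function of x, near 0 under independence."""
    n = len(x)
    r = np.argsort(np.argsort(y[np.argsort(x)])) + 1  # ranks of y, ordered by x
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n * n - 1)
```

Unlike Pearson's correlation, \xi detects arbitrary functional dependence (not just linear or monotone trends), which is what makes it attractive for non-parametric feature screening; note it is asymmetric in its arguments.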

[LG-96] ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit

链接: https://arxiv.org/abs/2603.18168
作者: Louis-Pierre Chaintron,Lénaïc Chizat,Javier Maas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We establish convergence of the training dynamics of residual neural networks (ResNets) to their joint infinite depth L, hidden width M, and embedding dimension D limit. Specifically, we consider ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime and prove that, after a bounded number of training steps, the error between the ResNet and its large-scale limit is O(1/L + \sqrt{D/(LM)} + 1/\sqrt{D}). This error rate is empirically tight when measured in embedding space. For a budget of P = \Theta(LMD) parameters, this yields a convergence rate O(P^{-1/6}) for the scalings of (L, M, D) that minimize the bound. Our analysis exploits in an essential way the depth-two structure of residual blocks and applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimension. From a technical viewpoint, this work completes the program initiated in the companion paper [Chi25], where it is proved that for a fixed embedding dimension D, the training dynamics converges to a Mean ODE dynamics at rate O(1/L + \sqrt{D}/\sqrt{LM}). Here, we study the large-D limit of this Mean ODE model and establish convergence at rate O(1/\sqrt{D}), yielding the above bound by a triangle inequality. To handle the rich probabilistic structure of the limit dynamics and obtain one of the first rigorous quantitative convergence results for a DMFT-type limit, we combine the cavity method with propagation of chaos arguments at a functional level on so-called skeleton maps, which express the weight updates as functions of CLT-type sums from the past.
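The arithmetic behind the O(P^{-1/6}) claim can be checked directly: balancing the three error terms under the budget P = LMD suggests L ~ P^{1/6}, M ~ P^{1/2}, D ~ P^{1/3}, at which all three terms equal P^{-1/6}. This is a back-of-envelope balancing consistent with, but not taken from, the paper:

```python
import numpy as np

def err_bound(L, M, D):
    # The abstract's rate: O(1/L + sqrt(D/(L M)) + 1/sqrt(D))
    return 1 / L + np.sqrt(D / (L * M)) + 1 / np.sqrt(D)

def balanced_err(P):
    # With L = P^{1/6}, M = P^{1/2}, D = P^{1/3} (so L*M*D = P),
    # each of the three terms equals P^{-1/6}.
    return err_bound(P ** (1 / 6), P ** (1 / 2), P ** (1 / 3))
```

Any other split of the budget makes at least one of the three terms larger, which is why the P^{-1/6} rate is attached to the bound-minimizing scalings of (L, M, D).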

[LG-97] Towards sample-optimal learning of bosonic Gaussian quantum states

链接: https://arxiv.org/abs/2603.18136
作者: Senrui Chen,Francesco Anna Mele,Marco Fanizza,Alfred Li,Zachary Mann,Hsin-Yuan Huang,Yanbei Chen,John Preskill
类目: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 59 pages, 3 figures, 1 table. Comments welcome

点击查看摘要

Abstract:Continuous-variable systems enable key quantum technologies in computation, communication, and sensing. Bosonic Gaussian states emerge naturally in various such applications, including gravitational-wave and dark-matter detection. A fundamental question is how to characterize an unknown bosonic Gaussian state from as few samples as possible. Despite decades-long exploration, the ultimate efficiency limit remains unclear. In this work, we study the necessary and sufficient number of copies to learn an n-mode Gaussian state, with energy less than E, to \varepsilon trace distance with high probability. We prove a lower bound of \Omega(n^3/\varepsilon^2) for Gaussian measurements, matching the best known upper bound up to doubly-log energy dependence, and \Omega(n^2/\varepsilon^2) for arbitrary measurements. We further show an upper bound of \widetilde{O}(n^2/\varepsilon^2) given that the Gaussian state is promised to be either pure or passive. Interestingly, while Gaussian measurements suffice for nearly optimal learning of pure Gaussian states, non-Gaussian measurements are provably required for optimal learning of passive Gaussian states. Finally, focusing on learning single-mode Gaussian states via non-entangling Gaussian measurements, we provide a nearly tight bound of \widetilde{\Theta}(E/\varepsilon^2) for any non-adaptive schemes, showing adaptivity is indispensable for nearly energy-independent scaling. As a byproduct, we establish sharp bounds on the trace distance between Gaussian states in terms of the total variation distance between their Wigner distributions, and obtain a nearly tight sample complexity bound for learning the Wigner distribution of any Gaussian state to \varepsilon total variation distance. Our results greatly advance quantum learning theory in the bosonic regimes and have practical impact in quantum sensing and benchmarking applications.

[LG-98] Transfer Learning for Contextual Joint Assortment-Pricing under Cross-Market Heterogeneity

链接: https://arxiv.org/abs/2603.18114
作者: Elynn Chen,Xi Chen,Yi Zhang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study transfer learning for contextual joint assortment-pricing under a multinomial logit choice model with bandit feedback. A seller operates across multiple related markets and observes only posted prices and realized purchases. While data from source markets can accelerate learning in a target market, cross-market differences in customer preferences may introduce systematic bias if pooled indiscriminately. We model heterogeneity through a structured utility shift, where markets share a common contextual utility structure but differ along a sparse set of latent preference coordinates. Building on this, we develop Transfer Joint Assortment-Pricing (TJAP), a bias-aware framework that combines aggregate-then-debias estimation with a UCB-style policy. TJAP constructs two-radius confidence bounds that separately capture statistical uncertainty and transfer-induced bias, uniformly over continuous prices. We establish matching minimax regret bounds of order \tilde{O}\!\left(d\sqrt{\frac{T}{1+H}} + s_0\sqrt{T}\right), revealing a transparent variance-bias tradeoff: transfer accelerates learning along shared preference directions, while heterogeneous components impose an irreducible adaptation cost. Numerical experiments corroborate the theory, showing that TJAP outperforms both target-only learning and naive pooling while remaining robust to cross-market differences.
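The "two-radius" construction can be caricatured in one line: an optimistic index whose exploration bonus has one term that shrinks with the pull count (statistical uncertainty) and one fixed term bounding the transfer bias. This is a stripped-down stand-in for TJAP's price-uniform confidence bounds, with all names and constants invented for illustration:

```python
import numpy as np

def two_radius_index(mu_hat, n_pulls, t, bias_bound):
    """Optimistic index = estimate + statistical radius + bias radius.
    The first radius shrinks as an arm is pulled; the second is a fixed
    worst-case bound on the bias inherited from source-market data."""
    stat_radius = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(n_pulls, 1))
    return mu_hat + stat_radius + bias_bound
```

Keeping the two radii separate is what lets the analysis trade them off explicitly: pooled source data shrinks the statistical radius, while the bias radius caps the damage from cross-market heterogeneity.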

[LG-99] Generative Replica-Exchange: A Flow-based Framework for Accelerating Replica Exchange Simulations

链接: https://arxiv.org/abs/2603.18076
作者: Shengjie Huang,Sijie Yang,Jianqiao Yi,Rui Zheng,Haocong Liao,Muzammal Hussain,Yaoquan Tu,Xiaoyun Lu,Yang Zhou
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Replica exchange (REX) is one of the most widely used enhanced sampling methodologies, yet its efficiency is limited by the requirement for a large number of intermediate temperature replicas. Here we present Generative Replica Exchange (GREX), which integrates deep generative models into the REX framework to eliminate this temperature ladder. Drawing inspiration from reservoir replica exchange (res-REX), GREX utilizes trained normalizing flows to generate high-temperature configurations on demand and map them directly to the target distribution using the potential energy as a constraint, without requiring target-temperature training data. This approach reduces production simulations to a single replica at the target temperature while maintaining thermodynamic rigor through Metropolis exchange acceptance. We validate GREX on three benchmark systems of increasing complexity, highlighting its superior efficiency and practical applicability for molecular simulations.
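The "thermodynamic rigor through Metropolis exchange acceptance" mentioned above refers to the standard replica-exchange swap rule, which GREX retains. The acceptance probability for exchanging configurations between inverse temperatures \beta_a and \beta_b is:

```python
import numpy as np

def swap_accept_prob(E_a, E_b, beta_a, beta_b):
    """Metropolis acceptance for swapping configurations between two
    replicas: p = min(1, exp[(beta_a - beta_b) * (E_a - E_b)]),
    which preserves detailed balance for the joint distribution."""
    return float(min(1.0, np.exp((beta_a - beta_b) * (E_a - E_b))))
```

In GREX the high-temperature partner comes from a flow-generated reservoir rather than a simulated replica, but accepting or rejecting the proposed configuration with this probability is what keeps the target-temperature ensemble exact.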

[LG-100] A Novel Framework using Intuitionistic Fuzzy Logic with U-Net and U-Net++ Architectures: A Case Study of MRI Brain Image Segmentation

链接: https://arxiv.org/abs/2603.18042
作者: Hanuman Verma,Kiho Im,Akshansh Gupta,M. Tanveer
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures

点击查看摘要

Abstract:Accurate segmentation of brain images from magnetic resonance imaging (MRI) scans plays a pivotal role in brain image analysis and the diagnosis of neurological disorders. Deep learning algorithms, particularly U-Net and U-Net++, are widely used for image segmentation. However, they find it difficult to deal with uncertainty in images. To address this challenge, this work integrates intuitionistic fuzzy logic into U-Net and U-Net++ and proposes a novel framework, named IFS U-Net and IFS U-Net++. These models accept input data in an intuitionistic fuzzy representation to manage uncertainty arising from vagueness and imprecise data. This approach effectively handles tissue ambiguity caused by the partial volume effect and boundary uncertainties. To evaluate the effectiveness of IFS U-Net and IFS U-Net++, experiments are conducted on two publicly available MRI brain datasets: the Internet Brain Segmentation Repository (IBSR) and the Open Access Series of Imaging Studies (OASIS). Segmentation performance is quantitatively assessed using Accuracy, Dice Coefficient, and Intersection over Union (IoU). The results demonstrate that the proposed architectures consistently improve segmentation performance by effectively addressing uncertainty.
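The abstract does not spell out how the intuitionistic fuzzy input representation is built; one common construction (an assumption here, using the Sugeno negation) maps each normalized intensity \mu to a non-membership \nu and a hesitation degree \pi = 1 - \mu - \nu that models the uncertainty:

```python
import numpy as np

def to_intuitionistic(img, lam=1.0):
    """Map a grayscale image to an intuitionistic fuzzy triple
    (membership mu, non-membership nu, hesitation pi).
    Uses the Sugeno negation nu = (1 - mu) / (1 + lam * mu), which
    guarantees mu + nu <= 1 so the hesitation pi is non-negative."""
    mu = (img - img.min()) / (img.max() - img.min() + 1e-12)
    nu = (1.0 - mu) / (1.0 + lam * mu)
    pi = 1.0 - mu - nu
    return mu, nu, pi
```

The hesitation channel pi is largest for mid-range intensities, which is exactly where partial volume effects make tissue membership ambiguous; feeding (mu, nu, pi) instead of raw intensities is what gives the network an explicit uncertainty signal.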

[LG-101] Physically Accurate Differentiable Inverse Rendering for Radio Frequency Digital Twin

链接: https://arxiv.org/abs/2603.18026
作者: Xingyu Chen,Xinyu Zhang,Kai Zheng,Xinmin Fang,Tzu-Mao Li,Chris Xiaoxuan Lu,Zhengxiong Li
类目: Signal Processing (eess.SP); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Digital twins, virtual simulated replicas of physical scenes, are transforming system design across industries. However, their potential in radio frequency (RF) systems has been limited by the non-differentiable nature of conventional RF simulators. The visibility of propagation paths causes severe discontinuities, and differentiable rendering techniques from computer graphics cannot easily transfer due to point-source antennas and dominant specular reflections. In this paper, we present RFDT, a physically based differentiable RF simulation framework that enables gradient-based interaction between virtual and physical worlds. RFDT resolves discontinuities with a physically grounded edge-diffraction transition function, and mitigates non-convexity from Fourier-domain processing through a signal domain transform surrogate. Our implementation demonstrates RFDT’s ability to accurately reconstruct digital twins from real RF measurements. Moreover, RFDT can augment diverse downstream applications, such as test-time adaptation of machine learning-based RF sensing and physically constrained optimization of communication systems.

附件下载

点击下载今日全部论文列表