This post lists the latest papers retrieved from Arxiv.org on 2026-03-09. It is updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org and updated automatically every day at around 12:30 in the morning.
Tip: if a given day's list has not been updated, either Arxiv published no new papers that day or the script failed; it will be fixed the same day whenever possible.
Table of Contents
Overview (2026-03-09)
548 new papers today, including:
- Natural Language Processing: 70 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 148 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 175 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 127 papers (Machine Learning (cs.LG))
- Multi-Agent Systems: 8 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 11 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 34 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems
[MA-0] Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows
[Quick Read]: This paper addresses the tension between determinism and flexibility when applying large language models (LLMs) to scientific workflows: science demands reproducible execution, provenance, and governance, while autonomous LLM decision-making easily produces uncontrolled runs; at the same time, users expect flexible, intuitive tool invocation through natural language. The key to the solution is schema-gated orchestration: a structured schema acts as a mandatory execution boundary at the composed-workflow level, so every action (including cross-step dependencies) must validate against a machine-checkable specification before it can run, decoupling conversational flexibility from execution determinism. By explicitly separating conversational authority from execution authority, the mechanism offers both a theoretical basis and a practical path toward trustworthy, controllable, and efficient generative-AI-driven scientific computing systems.
Link: https://arxiv.org/abs/2603.06394
Authors: Joel Strickland, Arjun Vijeta, Chris Moores, Oliwia Bodek, Bogdan Nenchev, Thomas Whitehead, Charles Phillips, Karl Tassenberg, Gareth Conduit, Ben Pellegrini
Affiliations: Intellegens; University of Oxford
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract: Large language models (LLMs) can now translate a researcher’s plain-language goal into executable computation, yet scientific workflows demand determinism, provenance, and governance that are difficult to guarantee when an LLM decides what runs. Semi-structured interviews with 18 experts across 10 industrial R&D stakeholders surface 2 competing requirements – deterministic, constrained execution and conversational flexibility without workflow rigidity – together with boundary properties (human-in-the-loop control and transparency) that any resolution must satisfy. We propose schema-gated orchestration as the resolving principle: the schema becomes a mandatory execution boundary at the composed-workflow level, so that nothing runs unless the complete action – including cross-step dependencies – validates against a machine-checkable specification. We operationalize the 2 requirements as execution determinism (ED) and conversational flexibility (CF), and use these axes to review 20 systems spanning 5 architectural groups along a validation-scope spectrum. Scores are assigned via a multi-model protocol – 15 independent sessions across 3 LLM families – yielding substantial-to-near-perfect inter-model agreement (Krippendorff α=0.80 for ED and α=0.98 for CF), demonstrating that multi-model LLM scoring can serve as a reusable alternative to human expert panels for architectural assessment. The resulting landscape reveals an empirical Pareto front – no reviewed system achieves both high flexibility and high determinism – but a convergence zone emerges between the generative and workflow-centric extremes. We argue that a schema-gated architecture, separating conversational from execution authority, is positioned to decouple this trade-off, and distill 3 operational principles – clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating – to guide adoption.
Cite as: arXiv:2603.06394 [cs.AI]. DOI: https://doi.org/10.48550/arXiv.2603.06394
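The gating principle described above can be illustrated with a minimal sketch. This is not the paper's implementation; the schema fields, tool names, and workflow contents below are invented for illustration. The point is only the control flow: nothing executes unless the entire composed workflow, including each step's declared dependencies, validates against a machine-checkable specification.

```python
def validate(action, schema):
    """Machine-checkable validation: required fields present with the right types."""
    for field, ftype in schema["fields"].items():
        if field not in action or not isinstance(action[field], ftype):
            return False
    return True

def gated_execute(workflow, schema, run):
    """Schema gate at the composed-workflow level: the whole workflow must
    validate before any single step is allowed to execute."""
    if not all(validate(step, schema) for step in workflow):
        raise ValueError("workflow rejected: schema validation failed")
    return [run(step) for step in workflow]

# Hypothetical schema and workflow (field and tool names are illustrative only).
SCHEMA = {"fields": {"tool": str, "args": dict, "depends_on": list}}
workflow = [
    {"tool": "load_data", "args": {"path": "data.csv"}, "depends_on": []},
    {"tool": "fit_model", "args": {"target": "y"}, "depends_on": ["load_data"]},
]
results = gated_execute(workflow, SCHEMA, run=lambda step: step["tool"] + ": ok")
```

A conversational layer would be free to propose any workflow, but only schema-valid compositions ever reach `run`, which is how conversational and execution authority stay separated.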
[MA-1] Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI KR
[Quick Read]: This paper tackles the low sustained participation in residential demand response (DR) caused by inadequate coordination: existing schemes are either fully automated or limited to one-way dispatch signals and price alerts, leaving users little basis for informed decisions. The key is Conversational Demand Response (CDR), a bidirectional natural-language coordination mechanism built on agentic AI with a two-tier multi-agent architecture: an aggregator agent dispatches flexibility requests, while the prosumer-side Home Energy Management System (HEMS) assesses deliverability and cost-benefit by calling an optimization tool. CDR also supports prosumer-initiated upstream communication, so preference changes reach the aggregator directly. The design preserves the scalability of automated DR while improving transparency, explainability, and user agency, supporting sustained prosumer participation over the long term.
Link: https://arxiv.org/abs/2603.06217
Authors: Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, Hans Auer
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 6 pages, 2 figures. Code available at: this https URL
Abstract:Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one-way dispatch signals and price alerts that offer little possibility for informed decision-making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and prosumers interact through bidirectional natural language, enabled through agentic AI. A two-tier multi-agent architecture is developed in which an aggregator agent dispatches flexibility requests and a prosumer Home Energy Management System (HEMS) assesses deliverability and cost-benefit by calling an optimization-based tool. CDR also enables prosumer-initiated upstream communication, where changes in preferences can reach the aggregator directly. Proof-of-concept evaluation shows that interactions complete in under 12 seconds. The architecture illustrates how agentic AI can bridge the aggregator-prosumer coordination gap, providing the scalability of automated DR while preserving the transparency, explainability, and user agency necessary for sustained prosumer participation. All system components, including agent prompts, orchestration logic, and simulation interfaces, are released as open source to enable reproducibility and further development.
[MA-2] MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing ACL2026
[Quick Read]: This paper addresses three problems in building complex workflows with LLM-based multi-agent systems (MAS): high manual development cost, poor component reuse, and difficulty integrating heterogeneous external context sources. The key is MASFactory, a graph-centric framework whose core innovation is Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable computation graph. The framework also provides reusable components, pluggable context integration, and a visualizer supporting topology preview, runtime tracing, and human-in-the-loop interaction, substantially improving the efficiency and flexibility of MAS workflow construction.
Link: https://arxiv.org/abs/2603.06007
Authors: Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang
Affiliations: Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Submitted to ACL 2026 Demo Track. 10 pages, 6 figures. Code and documentation are available at: this https URL
Abstract:Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (this https URL) and video (this https URL) are publicly available.
[MA-3] Evaluating LLM Alignment With Human Trust Models
[Quick Read]: This paper asks how large language models (LLMs) internally represent and reason about trust, a core socio-cognitive concept central to human cooperation and multi-agent systems whose internal mechanics in LLMs remain unclear. Using white-box analysis, the authors apply contrastive prompting to generate embedding vectors in the activation space of EleutherAI/gpt-j-6B for dyadic trust and related interpersonal attributes. The key steps are: extracting trust-related concepts from five established human trust models; determining a threshold for significant conceptual alignment from pairwise cosine similarities across 60 general emotional concepts; and measuring the cosine similarity between the LLM's internal trust representation and each trust concept. The results show the model aligns most closely with Castelfranchi's socio-cognitive model, followed by the Marsh model, demonstrating that LLMs encode socio-cognitive constructs in activation space in ways that support interpretable comparative analysis, theories of social cognition, and the design of human-AI collaborative systems.
Link: https://arxiv.org/abs/2603.05839
Authors: Anushka Debnath, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Emiliano Lorini
Affiliations: University of Otago; CNRS; Toulouse University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: This paper will appear in the post-proceedings of ICAART 2026
Abstract: Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding decision-making in both human interactions and multi-agent systems. Despite its significance, there is limited understanding of how large language models (LLMs) internally conceptualize and reason about trust. This work presents a white-box analysis of trust representation in EleutherAI/gpt-j-6B, using contrastive prompting to generate embedding vectors within the activation space of the LLM for dyadic trust and related interpersonal relationship attributes. We first identified trust-related concepts from five established human trust models. We then determined a threshold for significant conceptual alignment by computing pairwise cosine similarities across 60 general emotional concepts. Finally, we measured the cosine similarities between the LLM’s internal representation of trust and the derived trust-related concepts. Our results show that the internal trust representation of EleutherAI/gpt-j-6B aligns most closely with the Castelfranchi socio-cognitive model, followed by the Marsh model. These findings indicate that LLMs encode socio-cognitive constructs in their activation space in ways that support meaningful comparative analyses, inform theories of social cognition, and support the design of human-AI collaborative systems.
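The comparison step described above reduces to ranking concept vectors by cosine similarity against a trust vector. A minimal sketch, assuming toy 4-dimensional vectors (real vectors would come from the model's contrastive-prompt activations, and the concept names below are placeholders):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two activation-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_concepts(trust_vec, concept_vecs):
    """Rank candidate concept vectors by similarity to the trust vector."""
    scores = {name: cosine_similarity(trust_vec, v)
              for name, v in concept_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy vectors standing in for real contrastive-prompt embeddings.
trust = [1.0, 0.5, 0.0, 0.2]
concepts = {
    "castelfranchi": [0.9, 0.6, 0.1, 0.1],
    "marsh": [0.5, 0.5, 0.5, 0.5],
    "unrelated": [-1.0, 0.2, 0.9, -0.3],
}
ranking = rank_concepts(trust, concepts)
```

The significance threshold in the paper would then be applied to these scores: only concepts whose similarity exceeds the threshold derived from the 60 general emotional concepts count as aligned.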
[MA-4] The Coordination Gap: Alternation Metrics for Temporal Dynamics in Multi-Agent Battle of the Exes
[Quick Read]: This paper targets the coordination dilemma in multi-agent systems, the fundamental tension between individual optimization and collective welfare, focusing on the inability of conventional metrics to capture temporal structure and collective dynamics. Outcome-based measures such as efficiency and fairness cannot distinguish structured alternation from monopolistic or random access patterns, and fairness ratios lose discriminative power as the number of agents grows, obscuring inequities. The key is to introduce Perfect Alternation (PA) as a reference coordination regime and to propose six novel temporally sensitive Alternation (ALT) metrics of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline compared against random-policy nulls, the authors find that even when traditional metrics look excellent (reward fairness often above 0.9), learned policies can perform up to 81% below random baselines under ALT evaluation, showing that high aggregate payoffs can coexist with poor temporal coordination and underscoring the need for temporally aware observables in multi-agent games.
Link: https://arxiv.org/abs/2603.05789
Authors: Nikolaos Al. Papadopoulos, Konstantinos Psannis
Affiliations: University of Macedonia
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 40 pages, 5 figures, 4 tables. Submitted to Mathematical Social Sciences
Abstract: Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind – they cannot distinguish structured alternation from monopolistic or random access patterns – and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation – a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
MSC classes: 91A20, 68T05, 91B14. ACM classes: I.2.11; I.2.6; J.4. Cite as: arXiv:2603.05789 [cs.MA]. DOI: https://doi.org/10.48550/arXiv.2603.05789
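The temporal blindness of outcome-based metrics can be seen with a toy history of round winners. The `switch_rate` below is not one of the paper's six ALT metrics (which are not reproduced here); it is a minimal illustrative temporal observable that separates two histories a fairness ratio treats as identical:

```python
def fairness_ratio(winners):
    """Min/max share of wins per agent: an outcome-based, temporally blind metric."""
    counts = {}
    for w in winners:
        counts[w] = counts.get(w, 0) + 1
    return min(counts.values()) / max(counts.values())

def switch_rate(winners):
    """Illustrative temporal observable: fraction of consecutive rounds where
    the winner changes (1.0 under perfect alternation, 0.0 under monopoly)."""
    if len(winners) < 2:
        return 0.0
    return sum(a != b for a, b in zip(winners, winners[1:])) / (len(winners) - 1)

alternating = [0, 1] * 5        # strict turn-taking
blocked = [0] * 5 + [1] * 5     # identical win totals, but no alternation
```

Both histories have perfect fairness (each agent wins half the rounds), yet only one exhibits turn-taking, which is exactly the kind of distinction a temporally aware observable must make.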
[MA-5] RACAS: Controlling Diverse Robots With a Single Agentic System
[Quick Read]: This paper addresses the complicated pipeline needed to go from a robot platform's low-level API to high-level autonomous behaviour, a pipeline that spans multiple areas of expertise and transfers poorly between platforms. The key is RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture of three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) that communicate exclusively through natural language to provide closed-loop robot control. The method needs only a natural-language description of the robot, a definition of its available actions, and a task specification; no source code, model weights, or reward functions must be modified to move between platforms. Experiments show it consistently solves tasks on a wheeled ground robot, a recently published multi-jointed robotic limb, and an underwater vehicle, substantially lowering the barrier to prototyping robotic solutions.
Link: https://arxiv.org/abs/2603.05621
Authors: Dylan R. Ashley, Jan Przepióra, Yimeng Chen, Ali Abualsaud, Nurzhan Yesmagambet, Shinkyu Park, Eric Feron, Jürgen Schmidhuber
Affiliations: Center of Excellence in Generative AI, King Abdullah University of Science and Technology (KAUST), Saudi Arabia; Dalle Molle Institute for Artificial Intelligence Research (IDSIA), Switzerland; Università della Svizzera italiana (USI), Switzerland; Scuola universitaria professionale della Svizzera italiana (SUPSI), Switzerland; Robotics, Intelligent Systems, and Control Lab, King Abdullah University of Science and Technology (KAUST), Saudi Arabia; Department of Process Control, AGH University of Krakow, Poland
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 7 pages in main text + 1 page of appendices + 1 page of references, 5 figures in main text + 1 figure in appendices, 2 tables in main text
Abstract:Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.
[MA-6] Information-Theoretic Privacy Control for Sequential Multi-Agent LLM Systems
[Quick Read]: This paper addresses compositional privacy leakage in sequential multi-agent LLM systems, where information passed between agents allows sensitive data to be inferred even when each individual agent satisfies its local privacy constraints. The key is a privacy-regularized training framework that directly constrains the information flow between agent outputs and agent-local sensitive variables, together with a mutual-information-based theoretical bound quantifying how local leakage amplifies under sequential execution, elevating privacy protection from a local constraint to a system-level property.
Link: https://arxiv.org/abs/2603.05520
Authors: Sadia Asif, Mohammad Mohammadi Amiri
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract: Sequential multi-agent large language model (LLM) systems are increasingly deployed in sensitive domains such as healthcare, finance, and enterprise decision-making, where multiple specialized agents collaboratively process a single user request. Although individual agents may satisfy local privacy constraints, sensitive information can still be inferred through sequential composition and intermediate representations. In this work, we study compositional privacy leakage in sequential LLM agent pipelines. We formalize leakage using mutual information and derive a theoretical bound that characterizes how locally introduced leakage can amplify across agents under sequential execution. Motivated by this analysis, we propose a privacy-regularized training framework that directly constrains information flow between agent outputs and agent-local sensitive variables. We evaluate our approach across sequential agent pipelines of varying depth on three benchmark datasets, demonstrating stable optimization dynamics and consistent, interpretable privacy-utility trade-offs. Our results show that privacy in agentic LLM systems cannot be guaranteed by local constraints alone and must instead be treated as a system-level property during both training and deployment.
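The mutual-information quantity the paper's bound and regularizer are built around can be illustrated on discrete toy data. This sketch uses the standard plug-in estimator over (output, sensitive-variable) samples; the paper's actual regularizer operates on learned neural representations during training, and the variable names and λ value below are illustrative:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in MI estimate (in nats) between two discrete variables,
    from observed (output, sensitive) sample pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with counts cancelled against n
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# An agent whose output copies the sensitive bit leaks maximally;
# an agent with a constant output leaks nothing.
leaky = [(s, s) for s in [0, 1, 0, 1]]
private = [(0, s) for s in [0, 1, 0, 1]]

# Privacy-regularized objective in the spirit of the paper:
# task loss plus a penalty on the output-to-sensitive information flow.
lam = 0.1
task_loss = 1.0
regularized = task_loss + lam * mutual_information(leaky)
```

Driving this MI term toward zero is what limits how much downstream agents can infer about the sensitive variable from intermediate outputs.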
[MA-7] Impact of arbitrage between leveraged ETF and futures on market liquidity during market crash
[Quick Read]: This paper studies how arbitrage trading between leveraged ETFs (L-ETFs) and index futures shapes market liquidity (Volume, SellDepth, BuyDepth, Tightness) when prices plummet in either market. Using artificial market simulations, the authors systematically compare both markets' liquidity responses to erroneous-order shocks with and without arbitrage trading. They find that when erroneous orders hit the L-ETF market, arbitrage causes the futures market to supply liquidity to the L-ETF market in terms of SellDepth and Tightness; when erroneous orders hit the futures market, arbitrage causes the L-ETF market to supply SellDepth and Tightness to the futures market, and additionally causes the futures market to supply Volume to the L-ETF market. This highlights the central role of arbitrage activity in cross-market liquidity transmission during extreme price moves.
Link: https://arxiv.org/abs/2603.05862
Authors: Ryuki Hayase, Takanobu Mizuta, Isao Yagi
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Multiagent Systems (cs.MA)
Comments:
Abstract:Leveraged ETFs (L-ETFs) are exchange-traded funds that achieve price movements several times greater than an index by holding index-linked futures such as Nikkei Stock Average Index futures. It is known that when the price of an L-ETF falls, the L-ETF uses the liquidity of futures to limit the decline through arbitrage trading. Conversely, when the price of a futures contract falls, the futures contract uses the liquidity of the L-ETF to limit its decline. However, the impact of arbitrage trading on the liquidity of these markets has been little studied. Therefore, the present study used artificial market simulations to investigate how the liquidity (Volume, SellDepth, BuyDepth, Tightness) of both markets changes when prices plummet in either (i.e., the L-ETF or futures market), depending on the presence or absence of arbitrage trading. As a result, it was found that when erroneous orders occur in the L-ETF market, the existence of arbitrage trading causes liquidity to be supplied from the futures market to the L-ETF market in terms of SellDepth and Tightness. When erroneous orders occur in the futures market, the existence of arbitrage trading causes liquidity to be supplied from the L-ETF market to the futures market in terms of SellDepth and Tightness, and liquidity to be supplied from the futures market to the L-ETF market in terms of Volume. We also analyzed the internal market mechanisms that led to these results.
Natural Language Processing
[NLP-0] KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection SEMEVAL2026
[Quick Read]: This paper targets the classification of ambiguity and evasion techniques in political discourse, a key challenge for NLP analysis of political language. It investigates two modelling formulations: predicting the clarity label directly, or predicting the evasion label first and deriving clarity through the task taxonomy hierarchy. It also explores auxiliary training variants and evaluates decoder-only models in a zero-shot setting under the evasion-first formulation. Experiments show the two formulations perform comparably, with RoBERTa-large strongest among encoder-based models on the public test set and zero-shot GPT-5.2 generalising better on the hidden evaluation set.
Link: https://arxiv.org/abs/2603.06552
Authors: Archie Sage, Salvatore Greco
Affiliations: King’s College London
Subjects: Computation and Language (cs.CL)
Comments: Under review at SemEval 2026
Abstract:This paper describes the KCLarity team’s participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
[NLP-1] Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning LREC2026
[Quick Read]: This paper addresses the limited multilingual support and context awareness of current automatic speech recognition (ASR) systems, in particular the absence of principled cross-modal semantic alignment. The core challenge is exploiting diverse languages, accents, and contextual information such as dialogue history while preserving the modularity of pretrained models. The key is a framework that couples a frozen speech encoder to a decoder-only language model via a lightweight projection module, and introduces a contrastive learning objective that aligns speech and context representations in a shared embedding space, strengthening cross-modal interaction and improving recognition accuracy in multilingual settings.
Link: https://arxiv.org/abs/2603.06505
Authors: Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026
Abstract:Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
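The contrastive alignment objective can be sketched in InfoNCE style: each speech embedding in a batch should be most similar to its paired context embedding, relative to the other contexts. The paper does not spell out its exact loss here, so the cosine similarity, temperature, and toy 2-d embeddings below are assumptions for illustration:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_alignment_loss(speech, context, temperature=0.1):
    """InfoNCE-style loss: for each speech embedding, a cross-entropy over
    similarities to all context embeddings, with the paired one as target."""
    n = len(speech)
    total = 0.0
    for i in range(n):
        logits = [cosine(speech[i], context[j]) / temperature for j in range(n)]
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += log_z - logits[i]          # -log softmax at the matching index
    return total / n

# Toy paired embeddings: aligned pairs should give a lower loss than shuffled ones.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [aligned[1], aligned[0]]
```

Minimizing this loss pulls each utterance toward its own dialogue-history/context representation in the shared space, which is the alignment the paper credits for part of its gains.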
[NLP-2] Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
[Quick Read]: This paper addresses three core failures of current multimodal retrieval-augmented generation (RAG) approaches on enterprise spreadsheet workbooks: single-pass retrieval loses critical context, compression loses data resolution, and naive full-context injection exceeds large language model (LLM) context windows, preventing reliable multi-step reasoning over complex workbooks. The key is Beyond Rows to Reasoning (BRTR), a multimodal agentic framework that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Expert evaluation and systematic ablations confirm that the planner, retrieval module, and iterative reasoning each contribute substantially, making BRTR markedly more effective and flexible for spreadsheet workflows.
Link: https://arxiv.org/abs/2603.06503
Authors: Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and vary nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off. Throughout all evaluations, BRTR maintains full auditability through explicit tool-call traces.
[NLP-3] COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics ICLR2026
[Quick Read]: This paper addresses the trade-off between sample efficiency and signal quality in inference-time behavioural control of large language models (LLMs): existing activation-steering methods either need many examples to capture steering signals effectively or perform poorly with few. The key is COLD-Steer, a training-free framework that steers by approximating, at inference time, the representational changes that gradient descent on a small set of in-context examples would induce. Its core innovations are two complementary approximations: a unit kernel approximation that updates activations directly using gradients normalized across examples, and a finite-difference approximation needing only two forward passes regardless of example count. Across tasks it achieves up to 95% steering effectiveness with 50 times fewer samples than the best baseline, enabling flexible adaptation to diverse perspectives.
Link: https://arxiv.org/abs/2603.06495
Authors: Kartik Sharma, Rakshit S. Trivedi
Affiliations: Georgia Institute of Technology; Massachusetts Institute of Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICLR 2026. Code available at this https URL
Abstract: Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
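The finite-difference variant can be sketched as the activation shift between two forward passes of the same prompt, with and without the in-context examples prepended. This is an illustrative reading of the two-forward-pass idea, not the paper's code; the toy hidden states and the fixed scale below are assumptions, and a real implementation would capture actual transformer-layer activations and tune the scale:

```python
def finite_difference_steer(h_with_examples, h_without, scale=1.0):
    """Steering direction from two forward passes: the activation shift that
    prepending the in-context examples induces, independent of example count."""
    return [scale * (a - b) for a, b in zip(h_with_examples, h_without)]

def apply_steering(hidden, direction):
    """Add the steering direction to a hidden state at inference time."""
    return [h + d for h, d in zip(hidden, direction)]

# Toy hidden states standing in for one transformer layer's activations.
h_plain = [0.2, -0.1, 0.4]      # prompt alone
h_context = [0.5, 0.1, 0.3]     # same prompt with in-context examples prepended
direction = finite_difference_steer(h_context, h_plain)
steered = apply_steering(h_plain, direction)
```

At scale 1.0 the steered state simply reproduces the with-examples activations; the practical value is that the direction, once computed, can be reapplied to new prompts without the examples in context.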
[NLP-4] NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
[Quick Read]: This paper targets training efficiency in large-model pretraining: reducing training steps and compute while preserving model quality. The key is NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that permanently embeds nonlinear low-rank branches in transformer linear layers. The branch computes σ(x·W_down)·W_up, where σ is a learnable nonlinearity; experiments find that CosNet, a two-layer cosine nonlinearity with learnable frequency and phase and a linear projection in the bottleneck space, performs best. With at most 4% additional parameters and 7% step-time overhead, NOBLE delivers up to a 1.47x step speedup to reach baseline eval loss (32% fewer training steps) and up to a 1.22x net wall-clock speedup, validated across LLMs, BERT, VQGAN, and ViT.
Link: https://arxiv.org/abs/2603.06492
Authors: Ethan Smith (Canva Research)
Affiliations: Canva Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 14 pages, 5 figures, 5 tables
Abstract: We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers. Unlike LoRA and other parameter-efficient fine-tuning (PEFT) methods, NOBLE is designed for pretraining from scratch: the branch is a permanent part of the architecture rather than an adapter fine-tuned on top of frozen weights. The branch computes σ(x·W_down)·W_up, where σ is a learnable nonlinearity. We evaluate several activation functions and find that CosNet, a two-layer cosine nonlinearity with learnable frequency and phase with a linear projection in between them in the bottleneck space, performs best. NOBLE achieves substantial improvements with minimal overhead: up to 1.47x step speedup to reach baseline eval loss (up to 32% fewer training steps), with as low as 4% additional parameters and 7% step time overhead, resulting in up to 1.22x net wallclock speedup. Experiments on LLMs (250M and 1.5B parameters), BERT, VQGAN, and ViT consistently show improved training efficiency. We identify one caveat: Mixup/CutMix and other stochastic augmentations interfere with NOBLE’s benefits in ImageNet classification, but when they are disabled, ViT also improves. This discrepancy is possibly explained by regularization techniques encouraging smoother fits to the target function, while NOBLE may specialize more in sharper aspects of the target function.
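The branch structure σ(x·W_down)·W_up added to a linear layer can be sketched directly. This is a simplified illustration, not the paper's implementation: the `cosnet` stand-in uses a single cosine layer (the paper's CosNet has two cosine layers with a linear projection between them), and the tiny fixed weights are made up for the example:

```python
import math

def cosnet(z, freq=1.0, phase=0.0):
    """Simplified stand-in for CosNet: a cosine nonlinearity with learnable
    frequency and phase (the paper's middle linear projection is omitted)."""
    return [math.cos(freq * x + phase) for x in z]

def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def noble_output(x, W, W_down, W_up):
    """Linear layer augmented with the NOBLE branch:
    x @ W  +  sigma(x @ W_down) @ W_up."""
    main = matvec(W, x)
    branch = matvec(W_up, cosnet(matvec(W_down, x)))
    return [m + b for m, b in zip(main, branch)]

x = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]      # 2x2 main weight (identity for clarity)
W_down = [[1.0, 0.0]]             # rank-1 down-projection
W_up = [[0.5], [0.5]]             # up-projection back to 2 dims
y = noble_output(x, W, W_down, W_up)
```

Because the bottleneck rank is small relative to the layer width, the branch adds only a few percent of extra parameters and compute, which is what makes the net wall-clock speedup possible.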
[NLP-5] PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
[Quick Read]: This paper addresses the one-size-fits-all paradigm of current explainable AI (XAI) methods, which ignores user differences in expertise, goals, and cognitive needs, as well as the faithfulness and hallucination problems of large language models when generating natural-language explanations. The key is the PONTE framework, a human-in-the-loop, closed-loop validation and adaptation mechanism: a low-dimensional preference model captures stylistic requirements, a generator conditioned on structured XAI artifacts ensures content reliability, and verification modules enforce numerical faithfulness, informational completeness, and stylistic alignment, with user feedback driving quick personalization. This markedly improves both the accuracy and the user fit of the resulting explanations.
Link: https://arxiv.org/abs/2603.06485
Authors: Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures
Abstract:Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a one-size-fits-all paradigm that neglects user differences in expertise, goals, and cognitive needs. Although Large Language Models can translate technical explanations into natural language, they introduce challenges related to faithfulness and hallucinations. To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives. PONTE models personalization as a closed-loop validation and adaptation process rather than prompt engineering. It combines: (i) a low-dimensional preference model capturing stylistic requirements; (ii) a preference-conditioned generator grounded in structured XAI artifacts; and (iii) verification modules enforcing numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state, enabling quick personalization. Automatic and human evaluations across healthcare and finance domains show that the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies further confirm strong agreement between intended preference vectors and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.
[NLP-6] Abductive Reasoning with Syllogistic Forms in Large Language Models
[Quick Read]: This paper asks whether the reasoning biases observed in large language models (LLMs), which resemble human biases on logical tasks, are fairly attributed to the models, particularly for non-deductive reasoning such as abduction. Its core concern is the accuracy of state-of-the-art LLMs in abductive reasoning, in order to identify potential biases and push toward more human-like contextualized reasoning. The key is converting a classic syllogism dataset into a form suitable for abduction, enabling systematic evaluation of LLMs' ability to infer the most plausible premise from limited evidence, and emphasizing the importance of context-sensitive reasoning beyond formal deduction.
链接: https://arxiv.org/abs/2603.06428
作者: Hirohiko Abe,Risako Ando,Takanobu Morishita,Kentaro Ozeki,Koji Mineshima,Mitsuhiro Okada
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published in Proceedings of the 3rd International Conference on Human and Artificial Rationalities (HAR 2024), LNCS 15504, pp. 3-17
Abstract:Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether the state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.
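论文把三段论数据集改造为溯因任务:给定大前提与结论,要求反推小前提。下面用几行 Python 对这一数据构造思路做最小示意(字段名与例句均为假设,并非原数据集的实际格式):

```python
def to_abduction(syllogism):
    # 溯因视角:已知大前提与结论,反推最可能的小前提
    return {
        "context": [syllogism["major"], syllogism["conclusion"]],
        "target": syllogism["minor"],
    }

syl = {
    "major": "All birds are animals.",
    "minor": "All sparrows are birds.",
    "conclusion": "All sparrows are animals.",
}
task = to_abduction(syl)
```

即对每条三段论,把小前提从前提集合中移出、作为待推断目标,其余部分作为上下文提供给模型。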
[NLP-7] From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring
【速读】: 该论文旨在解决当前自动作文评分(Automated Essay Scoring, AES)研究中对大型语言模型(Large Language Models, LLMs)不同应用范式缺乏统一评估的问题,尤其针对英语作为第二语言(English as a Second Language, L2)写作场景。现有研究多孤立地分析单一技术,难以揭示各方法在准确性、成本与鲁棒性之间的权衡关系。其解决方案的关键在于构建了一个统一基准——IELTS Writing Task 2,并系统比较了四种主流LLM-based AES策略:基于编码器的微调分类、零样本/少样本提示、指令微调结合检索增强生成(Retrieval-Augmented Generation, RAG),以及监督微调(Supervised Fine-Tuning, SFT)联合直接偏好优化(Direct Preference Optimization, DPO)与RAG。实验表明,最优配置为k-SFT与RAG融合方案,在F1分数上达到93%,展现出显著的性能优势,为L2写作自动评分提供了可复现且高效的范式。
链接: https://arxiv.org/abs/2603.06424
作者: Minh Hoang Nguyen,Vu Hoang Pham,Xuan Thanh Huynh,Phuc Hong Mai,Vinh The Nguyen,Quang Nhut Huynh,Huy Tien Nguyen,Tung Le
机构: Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
类目: Computation and Language (cs.CL)
备注: 19 pages, 10 figures, 7 tables
Abstract:Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2, showing promising potential in auto-grading writing tasks. Code is public at this https URL
[NLP-8] Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task EACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在条件推理任务中是否存在领域特异性(domain specificity)以及其错误模式是否与人类认知偏差一致的问题。解决方案的关键在于构建了一个新的Wason选择任务数据集,该数据集明确编码了道义模态(deontic modality),从而系统地区分道义条件句(deontic conditionals)与描述性条件句(descriptive conditionals),并利用该数据集评估LLMs在道义规则下的条件推理表现。研究发现,LLMs在道义规则下推理表现优于描述性规则,并表现出类似人类的匹配偏差(matching bias)错误模式,表明其推理能力具有规则类型依赖性,且误差机制可类比于人类已知的认知偏差。
链接: https://arxiv.org/abs/2603.06416
作者: Hirohiko Abe,Kentaro Ozeki,Risako Ando,Takanobu Morishita,Koji Mineshima,Mitsuhiro Okada
机构: Keio University(庆应义塾大学); University of Tokyo(东京大学)
类目: Computation and Language (cs.CL)
备注: To appear in the Proceedings of EACL 2026
Abstract:As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often performs well in domain specific settings, particularly in normative rather than purely formal contexts. Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning remains underexplored. In this study, we introduce a new Wason Selection Task dataset that explicitly encodes deontic modality to systematically distinguish deontic from descriptive conditionals, and use it to examine LLMs’ conditional reasoning under deontic rules. We further analyze whether observed error patterns are better explained by confirmation bias (a tendency to seek rule-supporting evidence) or by matching bias (a tendency to ignore negation and select items that lexically match elements of the rule). Results show that, like humans, LLMs reason better with deontic rules and display matching-bias-like errors. Together, these findings suggest that the performance of LLMs varies systematically across rule types and that their error patterns can parallel well-known human biases in this paradigm.
[NLP-9] Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI
【速读】: 该论文旨在解决数学文本理解中因专业实体复杂性和关系多样性带来的挑战,特别是如何准确提取数学表达式中的实体(如操作数)与其间关系(如运算符)。其解决方案的关键在于将数学问题解析建模为数学实体关系抽取(Mathematical Entity Relation Extraction, MERE)任务,并采用基于Transformer的模型(尤其是BERT)进行自动关系识别,实现了99.39%的高准确率;同时引入可解释人工智能(Explainable Artificial Intelligence, XAI)技术,结合Shapley Additive Explanations (SHAP)方法提升模型预测的透明度,揭示文本和数学特征对关系判断的影响,从而构建一个高效且可解释的MERE框架,支持自动化解题、知识图谱构建及智能教育系统等下游应用。
链接: https://arxiv.org/abs/2603.06348
作者: Tanjim Taharat Aurpa
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model’s predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.
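按"操作数=实体、运算符=关系"的建模思路,可以用一个极简的规则抽取器来示意 MERE 任务的输入输出形式(正则与三元组格式均为假设;论文实际使用 BERT 等 Transformer 模型做关系预测):

```python
import re

def extract_relations(expr):
    # 操作数作为实体,运算符作为连接相邻操作数的关系,产出 (实体, 关系, 实体) 三元组
    tokens = re.findall(r"\d+|[+\-*/]", expr)
    triples = []
    for i in range(1, len(tokens) - 1, 2):
        triples.append((tokens[i - 1], tokens[i], tokens[i + 1]))
    return triples
```

例如 `extract_relations("3 + 5 * 2")` 会得到 `("3", "+", "5")` 与 `("5", "*", "2")` 两条关系。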
[NLP-10] SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement ICLR2026
【速读】: 该论文旨在解决递归自我改进(Recursive Self-Improvement)过程中因迭代自修改而导致的对齐漂移(Alignment Drift)问题,即系统在持续优化自身输出时可能偏离初始设定的目标或安全约束。解决方案的关键在于提出SAHOO框架,其核心包含三项保障机制:(i) 目标漂移指数(Goal Drift Index, GDI),一种基于语义、词汇、结构和分布特征的多信号学习检测器;(ii) 约束保持检查,用于强制执行如语法正确性和非幻觉等关键安全不变量;(iii) 回归风险量化,用以识别那些抵消前期改进成果的优化循环。通过这三项机制,SAHOO实现了对递归自改进过程中的对齐状态进行可测量、可部署且可大规模验证的监控与控制。
链接: https://arxiv.org/abs/2603.06333
作者: Subramanyam Sahoo,Aman Chadha,Vinija Jain,Divya Chaudhary
机构: University of Cambridge (剑桥大学); Amazon Web Services (亚马逊云服务); Google (谷歌); Stanford University (斯坦福大学); Northeastern University (东北大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published at ICLR 2026 Workshop on AI with Recursive Self-Improvement. 20 pages, 5 figures
Abstract:Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
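GDI 把语义、词汇、结构、分布四路漂移信号合成为单一分数。下面是一个"加权平均 + 阈值门控"的最小示意(权重与阈值均为假设值;论文中由小规模验证集校准得到):

```python
def goal_drift_index(signals, weights=None, threshold=0.5):
    # signals: 四路漂移信号,取值假定已归一化到 [0, 1]
    keys = ["semantic", "lexical", "structural", "distributional"]
    weights = weights or {k: 0.25 for k in keys}
    gdi = sum(weights[k] * signals[k] for k in keys)
    return gdi, gdi > threshold  # 超过阈值则标记该次自改进循环存在漂移风险

gdi, drifted = goal_drift_index(
    {"semantic": 0.9, "lexical": 0.6, "structural": 0.2, "distributional": 0.4}
)
```

实际系统中各路信号由学习得到的检测器产生,这里仅示意合成与门控逻辑。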
[NLP-11] The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks
【速读】: 该论文旨在解决生成式 AI 在艺术创作领域中对原作进行风格模仿(pastiching)时的准确性与艺术价值评估问题,特别是如何客观衡量 AI 生成图像与原始艺术作品在视觉、风格及概念层面的相似性。其解决方案的关键在于提出一个“风格迁移仪表盘”(style transfer dashboard),整合多种互补的量化指标,而非依赖单一风格度量标准,以更全面地捕捉颜色、纹理、构图、概念和感知等多维度差异,从而提升对 AI 艺术复刻质量的评估精度。
链接: https://arxiv.org/abs/2603.06324
作者: Anca Dinu,Andreiana Mihail,Andra-Maria Florescu,Claudiu Creanga
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study explores artificial visual creativity, focusing on ChatGPT’s ability to generate new images intentionally pastiching original artworks such as paintings, drawings, sculptures and installations. The process involved twelve artists from Romania, Bulgaria, France, Austria, and the United Kingdom, each invited to contribute with three of their artworks and to grade and comment on the AI-generated versions. The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions. The results point to a significant gap between color and texture-based similarity and compositional, conceptual, and perceptual one. Consequently, we advocate for the use of a “style transfer dashboard” of complementary metrics to evaluate the similarity between pastiches and originals, rather than using a single style metric. The artists’ comments revealed limitations of ChatGPT’s pastiches after contemporary artworks, which were perceived by the authors of the originals as lacking dimensionality, context, and intentional sense, and seeming more of a paraphrase or an approximate quotation rather than as a valuable, emotion-evoking artwork.
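"风格迁移仪表盘"的主张是:用多个互补指标共同刻画相似度,而非归并为单一风格分数。可用如下结构做示意(指标名与玩具函数均为假设,实际的颜色、纹理、构图等指标由论文的计算方法给出):

```python
def style_dashboard(metric_fns, original, pastiche):
    # metric_fns: 指标名 -> 相似度函数(返回 0~1),各指标并列呈现而不合并
    return {name: fn(original, pastiche) for name, fn in metric_fns.items()}

# 玩具指标:仅用标量特征之差代替真实的颜色/构图分析
color_sim = lambda a, b: 1 - abs(a["hue"] - b["hue"])
comp_sim = lambda a, b: 1 - abs(a["balance"] - b["balance"])

dash = style_dashboard(
    {"color": color_sim, "composition": comp_sim},
    {"hue": 0.7, "balance": 0.9},
    {"hue": 0.6, "balance": 0.4},
)
```

这样即可直观看到"颜色相似但构图偏离"这类单一分数会掩盖的差异。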
[NLP-12] The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI
【速读】: 该论文旨在解决个人人工智能(Personal Artificial Intelligence)当前面临的用户数据碎片化问题,即用户数据分散在孤立的数据孤岛中,难以实现统一的语义理解和推理。现有方法如检索增强生成(Retrieval-Augmented Generation)虽能部分缓解此问题,但其依赖无结构的向量相似性匹配,无法捕捉潜在的语义拓扑结构和时间依赖关系,从而限制了整体认知能力。解决方案的关键在于提出EpisTwin——一个神经符号框架,它通过多模态语言模型将跨应用的异构数据转化为语义三元组,构建以用户为中心的个人知识图谱(Personal Knowledge Graph),并在推理阶段利用代理协调器结合图检索增强生成与在线深度视觉精炼机制,动态地将符号实体重新锚定到原始视觉上下文中,从而实现可验证、高保真的个性化推理。
链接: https://arxiv.org/abs/2603.06290
作者: Giovanni Servedio,Potito Aghilar,Alessio Mattiace,Gianni Carmosino,Francesco Musicco,Gabriele Conte,Vito Walter Anelli,Tommaso Di Noia,Francesco Maria Donini
机构: Politecnico di Bari (巴里理工大学); Università degli Studi della Tuscia (图斯维亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce EpisTwin, a neuro-symbolic framework that grounds generative reasoning in a verifiable, user-centric Personal Knowledge Graph. EpisTwin leverages Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples. At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context. We also introduce PersonalQA-71-100, a synthetic benchmark designed to simulate a realistic user’s digital footprint and evaluate EpisTwin performance. Our framework demonstrates robust results across a suite of state-of-the-art judge models, offering a promising direction for trustworthy Personal AI.
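以"语义三元组构建个人知识图谱"为例,下面是一个最小的图构建与邻居查询示意(三元组内容纯属虚构,非论文数据;论文中三元组由多模态语言模型从跨应用数据中抽取):

```python
from collections import defaultdict

def build_pkg(triples):
    # 主语 -> [(关系, 宾语), ...] 的邻接表,作为个人知识图谱的最简表示
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

pkg = build_pkg([
    ("user", "attended", "meeting_0412"),
    ("meeting_0412", "mentions", "project_x"),
    ("user", "emailed", "alice"),
])
```

图检索增强生成(Graph RAG)即在这样的结构上沿关系边多跳展开,再把检索到的子图交给生成模型。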
[NLP-13] Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多语言、多文化场景中因训练数据以英语为主而导致的文化对齐问题,尤其关注宗教这一敏感领域中的偏见与代表性不足。其核心挑战在于模型输出常偏离当地公众态度,特别是对少数群体宗教观点的误表和负面刻板印象强化。解决方案的关键在于通过多层次分析——利用日志概率(log-probs/logits)比较模型内部表征与真实社会态度差异——并结合轻量级干预(如人口统计学提示和本地语言输入),识别并部分缓解文化偏差;同时强调需开展基于区域实证的系统性审计,以确保LLMs在全球部署中的公平性和文化适配性。
链接: https://arxiv.org/abs/2603.06264
作者: Hari Shankar,Vedanta S P,Sriharini Margapuri,Debjani Mazumder,Ponnurangam Kumaraguru,Abhijnan Chakraborty
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 11 pages, including references
Abstract:Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM’s internal representations, using log-probs/logits, to compare the model’s opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.
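这类审计的核心操作,是把模型对各选项的 log-prob/logit 归一化为意见分布,再与真实民调分布比较。以下为一个最小示意(softmax + 总变差距离;选项数与数值均为虚构):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def total_variation(p, q):
    # 两个离散分布之间的总变差距离,0 表示完全一致,1 表示完全不重合
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

model_dist = softmax([2.0, 1.0, 0.0])  # 模型对三个答案选项的 logits
survey_dist = [0.5, 0.3, 0.2]          # 假设的公众态度分布
gap = total_variation(model_dist, survey_dist)
```

gap 越大,说明模型内部表征与当地公众态度的偏离越严重。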
[NLP-14] SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)中显式链式思维(Explicit Chain-of-Thought, CoT)推理带来的高推理开销问题,同时保留其推理过程的可解释性。现有方法通过简化提示或剪枝步骤来降低计算成本,但往往仅截断模型输出而非内化其内部思考过程,导致语义不完整或不可解释。针对此问题,论文提出 SPOT 框架,其核心创新在于两个关键机制:一是段级语义对齐(Span-level Semantic Alignment),采用 Sinkhorn 最优传输目标,软性匹配每个潜变量 token 与整个推理段落的语义,克服了传统点对点对齐的刚性限制;二是冻结头解码约束(Frozen-Head Decoding Constraint),确保潜变量状态可直接通过预训练语言模型(LM)头解码为词分布,从而实现对隐式推理过程的可读关键词解释。实验表明,SPOT 在保持推理准确性提升的同时显著减少生成 token 数量,并提供忠实的语义解释。
链接: https://arxiv.org/abs/2603.06222
作者: Yunlong Chu,Minglai Shao,Yuhang Liu,Bing Hao,Yumeng Lin,Jialu Wang,Ruijie Wang
机构: Tianjin University (天津大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
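SPOT 的核心是用 Sinkhorn 最优传输把每个 pause token 软匹配到整段推理的语义上。下面用纯 Python 实现一个小规模的 Sinkhorn 迭代做机制示意(均匀边际与熵正则参数均为假设,仅说明"软匹配计划"的由来,并非论文的训练目标本身):

```python
import math

def sinkhorn(cost, eps=0.1, iters=200):
    # 在熵正则下迭代求近似最优传输计划 P,行/列边际均为均匀分布
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    r, c = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# 两个潜 token 对两个推理片段:语义代价越小,分到的传输质量越大
plan = sinkhorn([[0.1, 0.9], [0.9, 0.1]])
```

与"每个 pause token 硬性对齐到某一步末表示"相比,这种软计划允许一个 token 按比例吸收整段推理的语义。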
[NLP-15] FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本建模中因注意力机制(attention mechanism)计算复杂度为二次方而导致的预填充阶段(prefilling phase)效率瓶颈问题。现有稀疏注意力机制通常面临搜索延迟高或稀疏性不足的局限。其解决方案的关键在于提出FlashPrefill框架,通过一种快速块搜索技术同时发现动态的垂直、斜向和块状稀疏模式,并引入动态阈值机制,在无需排序或累加注意力分数的情况下有效消除长尾分布,从而显著提升稀疏性与计算效率。实验证明,该方法在256K序列长度下实现27.78倍加速,且在4K短序列上仍保持1.71倍加速,展现出跨序列尺度的鲁棒性和实用性。
链接: https://arxiv.org/abs/2603.06199
作者: Qihang Fan,Huaibo Huang,Zhiying Wu,Juqiu Wang,Bingning Wang,Ran He
机构: MAIS & NLPR, CASIA; UCAS; WeChat, Tencent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
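其动态阈值机制可以这样理解:对每个查询块,只保留得分不低于"行内最大块得分 × τ"的键块,从而在不排序、不累计注意力分数的情况下截断长尾。下面是机制层面的示意(τ 的取值为假设,得分假定非负):

```python
def select_blocks(block_scores, tau=0.1):
    # block_scores[i][j]: 第 i 个查询块对第 j 个键块的粗粒度得分(假定非负)
    kept = []
    for row in block_scores:
        gate = max(row) * tau  # 动态阈值:随每行的最大得分自动缩放
        kept.append([j for j, s in enumerate(row) if s >= gate])
    return kept

kept = select_blocks([[10.0, 0.5, 2.0, 0.2]])
```

只需一次逐元素比较即可筛出稀疏块集合,这正是该机制避开排序/累加开销的原因。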
[NLP-16] LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation LREC2026
【速读】: 该论文旨在解决现有生成式 AI (Generative AI) 评估基准在检索增强生成(Retrieval-Augmented Generation, RAG)场景下覆盖能力有限的问题,即缺乏能够统一评估多类关键能力(如信息整合、逻辑推理、表格理解与拒答机制)的基准工具。其解决方案的关键在于提出 LIT-RAGBench,这是一个系统性设计的 RAG 生成器基准测试框架,涵盖五个核心维度:Integration(信息整合)、Reasoning(推理)、Logic(逻辑)、Table(表格理解)和 Abstention(拒答),每个维度进一步细分为实际应用场景下的子项,并通过虚构实体与情景确保答案可验证性,从而实现对模型在真实复杂任务中表现的全面量化评估。
链接: https://arxiv.org/abs/2603.06198
作者: Koki Itai,Shunichi Hasegawa,Yuta Yamamoto,Gouki Minegishi,Masaki Otsuki
机构: 未知
类目: Computation and Language (cs.CL)
备注: Published as a conference paper at LREC 2026
Abstract:Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at this https URL.
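其报告方式(分类别准确率 + 总体准确率)可用如下聚合函数示意(记录格式为假设,判对与否由 LLM-as-a-Judge 给出):

```python
from collections import defaultdict

def categorywise_accuracy(records):
    # records: (类别, 是否判对) 的列表
    stats = defaultdict(lambda: [0, 0])  # 类别 -> [正确数, 总数]
    for category, correct in records:
        stats[category][0] += int(correct)
        stats[category][1] += 1
    per_cat = {k: c / n for k, (c, n) in stats.items()}
    overall = sum(c for c, _ in stats.values()) / sum(n for _, n in stats.values())
    return per_cat, overall

per_cat, overall = categorywise_accuracy(
    [("Integration", True), ("Integration", False), ("Abstention", True)]
)
```

分类别呈现使得各模型在 Integration、Reasoning、Abstention 等能力上的强弱差异可度量。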
[NLP-17] Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol Validation Using Eleven Large Language Models
【速读】: 该论文旨在解决大规模内容分析中因缺乏可观测的“真实标签”(ground truth)或黄金标准标签而导致的瓶颈问题,尤其是在海量数据集上通过人工标注构建基准变得不切实际时。其解决方案的关键在于提出AI-CROWD协议,该协议利用多个大语言模型(Large Language Models, LLMs)的集体输出来近似生成共识性标签,而非声称获得绝对真实的标签;通过多数投票机制聚合结果,并结合诊断性指标分析模型间的一致性与分歧模式,从而识别高置信度分类结果并揭示潜在模糊性或模型特异性偏差。
链接: https://arxiv.org/abs/2603.06197
作者: Luis de-Marcos,Manuel Goyanes,Adrián Domínguez-Díaz
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges. To overcome this barrier, we introduce the AI-CROWD protocol, which approximates ground truth by leveraging the collective outputs of an ensemble of large language models (LLMs). Rather than asserting that the resulting labels are true ground truth, the protocol generates a consensus-based approximation derived from convergent and divergent inferences across multiple models. By aggregating outputs via majority voting and interrogating agreement/disagreement patterns with diagnostic metrics, AI-CROWD identifies high-confidence classifications while flagging potential ambiguity or model-specific biases.
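多数投票聚合与一致率诊断是该协议的两个基本构件,可用几行代码示意(标签与模型数均为虚构):

```python
from collections import Counter

def crowd_label(votes):
    # votes: 多个 LLM 对同一条样本给出的标签
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    agreement = n / len(votes)  # 一致率:可作为高置信/歧义样本的诊断信号
    return label, agreement

label, agreement = crowd_label(["pos", "pos", "pos", "neg", "pos"])
```

一致率接近 1 的样本可视为高置信的近似真值,一致率低的样本则提示歧义或模型特异性偏差,需进一步审查。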
[NLP-18] MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
【速读】: 该论文旨在解决主观多轮对话任务(如情感支持)中强化学习(Reinforcement Learning, RL)面临的挑战,即缺乏可靠的进程监督信号导致信用分配(credit assignment)困难。传统仅基于结果的训练方式将多轮交互的奖励压缩为单一轨迹级回报,而简单的逐轮分组采样则在交互环境中带来高昂的回放成本。解决方案的关键在于提出一种无需价值函数估计器(critic-free)且高效的RL算法MAPO,其核心创新是利用来自裁判模型(judge model)的密集过程反馈,并通过蒙特卡洛回报传播长期影响;同时引入混合优势估计器(mixed advantage estimator),融合逐轮归一化与批次归一化,实现细粒度且可扩展的信用分配机制。实验表明,该方法在多个主观对话基准(如EMPA、EmoBench和EQ-Bench)上显著提升了训练稳定性和最终性能,尤其在7B至32B参数规模模型中表现优异,且具备良好的跨任务泛化能力。
链接: https://arxiv.org/abs/2603.06194
作者: Naifan Zhang,Ruihan Sun,Jinwei Su,Hengjie Yang,Zhengyuan Pan,Zhaohan Chen,Xiaofan Zhang
机构: NatureSelect; Tsinghua University (清华大学); South China Normal University (华南师范大学); Xiamen University (厦门大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
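混合优势估计器把"逐轮组内归一化"与"整批归一化"的优势线性混合。以下是只含归一化与加权两步的最小示意(alpha 为假设超参;实际回报来自裁判模型的过程反馈并经蒙特卡洛回传):

```python
import statistics

def zscore(xs):
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs) or 1.0  # 方差为 0 时退化为只减均值
    return [(x - mu) / sd for x in xs]

def mixed_advantage(turn_groups, alpha=0.5):
    # turn_groups: 每一轮对应一组回报;分别做轮内与整批 z-score 后加权混合
    turn = [a for g in turn_groups for a in zscore(g)]
    batch = zscore([r for g in turn_groups for r in g])
    return [alpha * t + (1 - alpha) * b for t, b in zip(turn, batch)]

adv = mixed_advantage([[1.0, 3.0], [2.0, 4.0]])
```

轮内归一化提供细粒度的信用分配,整批归一化保留轮与轮之间的量级差异,二者的权衡由 alpha 控制。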
[NLP-19] CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
【速读】: 该论文旨在解决当前胸部X光报告生成模型评估中缺乏临床合理性与可解释性的问题,传统指标往往忽视患者个体特征(如年龄、检查指征)及临床重要性差异,导致对错误的评判不够精准。其解决方案的核心在于提出CRIMSON框架,该框架通过整合完整的临床上下文信息(包括患者年龄、检查指征和基于指南的决策规则),构建一个分层的错误分类体系(涵盖假阳性、漏诊及八类属性级错误),并依据放射科专家制定的指南对每个发现分配临床显著性等级(紧急、可行动非紧急、不可行动或预期/良性),实现基于严重程度的加权评分机制,从而优先识别和量化具有临床意义的错误,显著提升了评估结果与放射科医生判断的一致性(Kendall’s tau = 0.61–0.71;Pearson’s r = 0.71–0.84)。
链接: https://arxiv.org/abs/2603.06183
作者: Mohammed Baharoon,Thibault Heintz,Siavash Raissi,Mahmoud Alabbad,Mona Alhammad,Hassan AlOmaish,Sung Eun Kim,Oishi Banerjee,Pranav Rajpurkar
机构: Harvard Medical School (哈佛医学院); Mass General Brigham (马萨诸塞综合医院); King Fahad Hospital (法赫德国王医院); Al-Ahsa Health Cluster (阿尔阿哈卫生集群); Ras-Tanura General Hospital (拉斯塔努拉综合医院); Ministry of Health (卫生部); King Abdulaziz Medical City (阿卜杜勒阿齐兹国王医疗中心); Ministry of National Guard (国民警卫队部); Riyadh (利雅得)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at this https URL.
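severity-aware 加权的思想,是让紧急错误主导总分、良性差异影响很小。下面用一组假设权重示意这种计分方式(四个等级名称取自摘要,权重数值并非论文标定结果):

```python
SEVERITY_WEIGHTS = {
    "urgent": 4.0,          # 紧急
    "actionable": 2.0,      # 可行动、非紧急
    "non_actionable": 1.0,  # 不可行动
    "expected": 0.25,       # 预期/良性
}

def severity_weighted_errors(errors):
    # errors: [(错误类型, 显著性等级), ...];分数越高,报告问题越严重
    return sum(SEVERITY_WEIGHTS[sig] for _, sig in errors)

score = severity_weighted_errors(
    [("missing_finding", "urgent"), ("location", "expected")]
)
```

在这种计分下,一条漏诊的紧急发现远比若干良性措辞差异更影响总分,与放射科医生的判断取向一致。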
[NLP-20] Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning
【速读】: 该论文旨在解决在缺乏明确历史演化关系的情况下,如何学习字符(glyph)与书写系统之间的相似性度量问题。其核心挑战在于:虽然人造字母表中的单个音素(grapheme)可以可靠标注,但不同书写系统间的历史联系仍存在不确定性与争议。解决方案的关键在于提出一个两阶段框架:第一阶段利用对比损失(contrastive loss)在已标注的人造字母表上训练编码器,构建具备强判别能力的教师模型;第二阶段通过教师-学生蒸馏(teacher-student distillation)将知识迁移至有历史依据的书写系统中,使学生模型在无监督条件下学习潜在的跨系统相似性,同时保留教师模型提供的结构化判别特征。这种非对称设置使得学生模型能够学习到形变不变的嵌入表示,并有效实现从硬边界区分不同书写系统到软相似性捕捉潜在历史影响的过渡。
链接: https://arxiv.org/abs/2603.06180
作者: Claire Roman,Philippe Meyer
机构: University of Haute Alsace, IRIMAS UR 7499, Mulhouse, France; Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Learning similarity metrics for glyphs and writing systems faces a fundamental challenge: while individual graphemes within invented alphabets can be reliably labeled, the historical relationships between different scripts remain uncertain and contested. We propose a two-stage framework that addresses this epistemological constraint. First, we train an encoder with contrastive loss on labeled invented alphabets, establishing a teacher model with robust discriminative features. Second, we extend to historically attested scripts through teacher-student distillation, where the student learns unsupervised representations guided by the teacher’s knowledge but free to discover latent cross-script similarities. The asymmetric setup enables the student to learn deformation-invariant embeddings while inheriting discriminative structure from clean examples. Our approach bridges supervised contrastive learning and unsupervised discovery, enabling both hard boundaries between distinct systems and soft similarities reflecting potential historical influences. Experiments on diverse writing systems demonstrate effective few-shot glyph recognition and meaningful script clustering without requiring ground-truth evolutionary relationships.
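第一阶段的对比目标可以用余弦相似度上的 InfoNCE 损失示意(纯 Python 实现;温度与向量均为假设,第二阶段的教师-学生蒸馏不在此示意范围内):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, temp=0.1):
    # 拉近同一字母(grapheme)的嵌入,推远其他书写系统样本的嵌入
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temp) for s in sims]
    return -math.log(exps[0] / sum(exps))

loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.0, -1.0]])
```

正样本与锚点越相似、负样本越不相似,损失越小;这正是"硬边界区分不同书写系统"阶段的训练信号。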
[NLP-21] Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR INTERSPEECH2026
【速读】: 该论文旨在解决当前音频深度伪造检测中对轻量级自监督学习(Self-supervised Learning, SSL)骨干模型研究不足的问题,尤其是跨域泛化能力受限的瓶颈。其解决方案的关键在于提出RAPTOR架构——一种基于成对门控融合机制的紧凑型SSL检测器,通过在统一框架内系统比较HuBERT与WavLM两类预训练模型,发现多语言HuBERT预训练是实现跨域鲁棒性的核心因素,并进一步引入基于扰动的随机不确定性(perturbation-based aleatoric uncertainty)测试时增强协议,揭示出WavLM变体存在过度自信的校准偏差,而迭代优化的mHuBERT则保持稳定性能,从而表明SSL预训练路径而非模型规模才是可靠音频深度伪造检测的关键驱动因素。
链接: https://arxiv.org/abs/2603.06164
作者: Ajinkya Kulkarni,Sandipana Dowerah,Atharva Kulkarni,Tanel Alumäe,Mathew Magimai Doss
机构: Idiap Research Institute (Idiap研究学院); Tallinn University of Technology (塔林理工大学); MBZUAI (穆罕默德·本·扎耶德大学人工智能研究院)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to Interspeech 2026, 4 pages, 2 figures
Abstract:Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
[NLP-22] A Causal Graph Approach to Oppositional Narrative Analysis
【速读】: 该论文旨在解决当前文本分析方法依赖预定义本体(ontology)标注数据所导致的人类偏见问题,以及现有模型在处理对立叙事(oppositional narratives)时仅依赖线性模式识别而忽视实体间结构化交互关系的局限性。其解决方案的关键在于提出一种基于图结构的框架,将叙事表示为实体交互图(entity-interaction graphs),并通过节点层面的因果估计,从构建的句子图中提炼出最小因果子图,从而获得每个贡献项对最终分类的因果解释。在此基础上,进一步设计了一个分类流水线,显著优于现有对立思维分类方法。
链接: https://arxiv.org/abs/2603.06135
作者: Diego Revilla,Martin Fernandez-de-Retana,Lingfeng Chen,Aritz Bilbao-Jayo,Miguel Fernandez-de-Retana
机构: Faculty of Engineering, University of Deusto, Bilbao, Spain; Deusto Institute of Technology, University of Deusto, Bilbao, Spain
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Current methods for textual analysis rely on data annotated within predefined ontologies, often embedding human bias within black-box models. Despite achieving near-perfect performance, these approaches exploit unstructured, linear pattern recognition rather than modeling the structured interactions between entities that naturally emerge in discourse. In this work, we propose a graph-based framework for the detection, analysis, and classification of oppositional narratives and their underlying entities by representing narratives as entity-interaction graphs. Moreover, by incorporating causal estimation at the node level, our approach derives a causal representation of each contribution to the final classification by distilling the constructed sentence graph into a minimal causal subgraph. Building upon this representation, we introduce a classification pipeline that outperforms existing approaches to the oppositional thinking classification task.
[NLP-23] Diffusion Language Models Are Natively Length-Aware
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在生成短文本时存在的计算效率低下问题。现有DLMs采用固定长度的上下文窗口和预设的去噪步数进行生成,无论实际输出长度如何,导致大量冗余计算。其解决方案的关键在于:提出一种零样本机制,利用潜在提示表示(latent prompt representation)中的信息来预测所需输出长度,并在此基础上动态裁剪上下文窗口,从而减少扩散步骤数并显著降低计算开销(如FLOPs)。实验表明,该方法在多个基准任务上实现了大幅效率提升,且对性能影响极小,甚至在两个任务中还提升了效果。
链接: https://arxiv.org/abs/2603.06123
作者: Vittorio Rossi,Giacomo Cirò,Davide Beltrame,Luca Gandolfi,Paul Röttger,Dirk Hovy
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks – GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) – revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.
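论文未公开具体实现,下面给出按上述思路的极简Python示意:假定输出长度已由提示的潜在表示预测得到(此处直接以 `predicted_len` 参数代入),再将固定上下文窗口裁剪为"预测长度×安全余量"并向上对齐到硬件友好的倍数。函数名与参数均为假设,仅用于说明"位置数减少即每步去噪FLOPs下降"这一机制:

```python
import math

def crop_context(predicted_len, max_len, margin=1.25, multiple=32):
    """把固定生成窗口裁剪为"预测长度 × 安全余量",
    并向上取整到硬件友好的倍数(如32的倍数);
    去噪位置减少意味着每步扩散的计算量随之下降。
    predicted_len 假定已由提示的潜在表示预测得到。"""
    target = min(max_len, math.ceil(predicted_len * margin))
    return min(max_len, ((target + multiple - 1) // multiple) * multiple)

print(crop_context(predicted_len=40, max_len=1024))   # 64:短回答只需很小的窗口
print(crop_context(predicted_len=900, max_len=1024))  # 1024:长回答回退到全窗口
```

对于推理与对话任务中占多数的短回答,窗口可从上千个位置缩减到几十个,这正是上文所述"大幅降低FLOPs"的来源。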
[NLP-24] Making Implicit Premises Explicit in Logical Understanding of Enthymemes
【速读】: 该论文旨在解决自然语言处理中对隐含前提(enthymeme)的逻辑解析问题,即如何将文本中的隐含推理成分转化为可验证的逻辑公式并判定其逻辑蕴含关系。当前方法存在两大局限:一是NLP方法能识别 enthymeme 但无法解码其底层逻辑;二是逻辑方法依赖预设知识库,难以自动生成所需逻辑公式。解决方案的关键在于提出一个三阶段流水线:首先利用大语言模型(LLM)从显性前提和结论中生成中间隐含前提;其次通过另一 LLM 将自然语言映射为逻辑公式;最后采用基于 SAT 求解器的神经符号推理器判断逻辑 entailment。该方法实现了从文本到逻辑结构的端到端转换与验证,显著提升了隐含前提选择的准确性。
链接: https://arxiv.org/abs/2603.06114
作者: Xuyao Feng,Anthony Hunter
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world arguments in text and dialogues are normally enthymemes (i.e. some of their premises and/or claims are implicit). Natural language processing (NLP) methods for handling enthymemes can potentially identify enthymemes in text but they do not decode their underlying logic, whereas logic-based approaches for handling them assume a knowledgebase with sufficient formulae that can be used to decode them via abduction. There is therefore a lack of a systematic method for translating textual components of an enthymeme into a logical argument and generating the logical formulae required for their decoding, and thereby showing logical entailment. To address this, we propose a pipeline that integrates: (1) a large language model (LLM) to generate intermediate implicit premises based on the explicit premise and claim; (2) another LLM to translate the natural language into logical formulas; and (3) a neuro-symbolic reasoner based on a SAT solver to determine entailment. We evaluate our pipeline on two enthymeme datasets, demonstrating promising performance in selecting the correct implicit premise, as measured by precision, recall, F1-score, and accuracy.
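流水线第(3)步的蕴含判定可用一个穷举真值表的小例子示意(真值表穷举是SAT求解器的暴力替代;前提与结论此处手工编码为布尔函数,并非论文实现):

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """穷举真值表判断 premises |= conclusion:
    若不存在使全部前提为真且结论为假的赋值,则蕴含成立。
    这是SAT求解器的暴力替代,仅适用于原子数很小的示例。"""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # 找到反例模型,蕴含不成立
    return True

# 省略三段论:显式前提"苏格拉底是人",结论"苏格拉底会死";
# 隐含前提"人都会死"编码为蕴含式 man -> mortal
explicit = lambda env: env["man"]
implicit = lambda env: (not env["man"]) or env["mortal"]
conclusion = lambda env: env["mortal"]

print(entails([explicit], conclusion, ["man", "mortal"]))            # False
print(entails([explicit, implicit], conclusion, ["man", "mortal"]))  # True
```

补上LLM生成的隐含前提后蕴含才成立,这正是"显式化隐含前提"对逻辑验证的作用。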
[NLP-25] DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理深度信息时表现不佳的问题,尤其是在理解三维场景中的空间关系方面存在局限。现有方法通常依赖于RGB图像编码与文本的对齐,但忽略了深度图像(depth map)作为单通道灰度图所蕴含的直接深度线索。解决方案的关键在于:首先构建了一个全新的深度图像-文本配对数据集和深度指令数据集,其中深度图通过GLPN模型从视觉图像生成,并由GPT-4标注对应的深度理解指令;其次,改进了CLIP中的ViT编码器以融入局部物体信息,从而更有效地捕捉深度的连续变化特征;最终,设计了一个基于现有深度图像数据集的全面深度问答评估基准,验证了DeepSight在提升深度感知能力和下游任务性能上的显著优势。
链接: https://arxiv.org/abs/2603.06090
作者: Hao Yang,Hongbo Zhang,Yanyan Zhao,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images: single-channel grayscale images where the pixel values directly reflect depth cues to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.
[NLP-26] Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality
【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在开发过程中过度追求统一性能指标,忽视了人类问题解决中因个性差异带来的多样性优势,从而限制了模型在复杂任务中的适应性与表现潜力。解决方案的关键在于通过持续预训练(continued pre-training)使模型接触领域特定文本,在无监督条件下模拟经验积累,并借助机器人格量表(Machine Personality Inventory, MPI)基于五大性格特质框架对模型人格进行量化,进而揭示不同人格特征与语言风格及推理行为之间的因果关系,识别出“表达型通才”和“抑制型专才”的双峰性能分布,以及“抑制优势”现象——即社交特质降低有助于提升复杂推理能力,为实现可编程的“人格工程”(Personality Engineering)提供理论依据与实践路径。
链接: https://arxiv.org/abs/2603.06088
作者: Xi Wang,Mengdie Zhuang,Jiqun Liu
机构: University of Sheffield (谢菲尔德大学); University of Oklahoma (俄克拉荷马大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such as assertiveness. To investigate how diverse experiences shape machine personality and influence problem-solving, this study employs continued pre-training to expose models to domain-specific texts in an unsupervised manner, simulating the accumulation of experience. By adapting the Big Five framework via the Machine Personality Inventory (MPI), we quantify the personality traits of these model variants and analyse their relationship to linguistic style and reasoning behaviour. The findings reveal that model competence is bimodal, peaking at “Expressive Generalists” and “Suppressed Specialists,” while identifying a “Suppression Advantage” where reduced social traits enhance complex reasoning performance. This study further establishes a causal link between training data linguistics, such as imperative frequency, and lexical diversity, providing a roadmap for “Personality Engineering”.
[NLP-27] Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
【速读】: 该论文旨在解决自动作文评分(Automated Essay Scoring, AES)系统在真实教学场景中准确性和可靠性不足的问题,特别是针对奥地利A级德语写作考试的评分需求。其解决方案的关键在于利用最新开源大语言模型(Large Language Models, LLMs)对标准化评分量表(rubric-based evaluation)进行适配与评估,通过对比多种主流LLM(如DeepSeek-R1 32b、Qwen3 30b、Mixtral 8x7b和LLama3.3 70b)在不同提示策略下的表现,验证其在子维度评分和总分匹配上的有效性。结果显示,尽管部分小规模模型能够基于量表进行初步判断,但整体准确率仍远未达到可部署于实际教学环境的标准。
链接: https://arxiv.org/abs/2603.06066
作者: Jonas Kubesch,Lena Huber,Clemens Havas
机构: Salzburg University of Applied Sciences (萨尔茨堡应用科学大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be presented at the SAC2026 and published in its symposium proceedings
Abstract:Automated Essay Scoring (AES) has been explored for decades with the goal to support teachers by reducing grading workload and mitigating subjective biases. While early systems relied on handcrafted features and statistical models, recent advances in Large Language Models (LLMs) have made it possible to evaluate student writing with unprecedented flexibility. This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation. A dataset of 101 anonymised student exams across three text types was processed and evaluated. Four LLMs, DeepSeek-R1 32b, Qwen3 30b, Mixtral 8x7b and LLama3.3 70b, were evaluated with different contexts and prompting strategies. The LLMs were able to reach a maximum of 40.6% agreement with the human rater in the rubric-provided sub-dimensions, and only 32.8% of final grades matched the ones given by a human expert. The results indicate that even though smaller models are able to use standardised rubrics for German essay grading, they are not accurate enough to be used in a real-world grading environment.
[NLP-28] ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning
【速读】: 该论文旨在解决当前视觉语言模型在多视角空间推理任务中表现不佳的问题,尤其是当多个视角可用时,模型往往未能有效利用跨视角关系,而是依赖于单图捷径(single-image shortcuts),导致在视角变换和遮挡敏感场景下的性能脆弱。解决方案的关键在于提出一个两阶段框架 ViewFusion:第一阶段通过显式的空间预对齐(spatial pre-alignment)进行跨视角的空间推理,构建超越简单重述的中间工作空间;第二阶段基于该空间进行问题驱动的推理以生成最终答案。此设计通过合成推理监督与 GRPO 强化学习相结合的方式训练,显著提升了跨视角对齐能力与答案准确性,在 MMSI-Bench 上相较 Qwen3-VL-4B-Instruct 提升 5.3%。
链接: https://arxiv.org/abs/2603.06024
作者: Xingjian Tao,Yiwei Wang,Yujun Cai,Yifan Song,Jing Tang
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
[NLP-29] Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL NAACL2025
【速读】: 该论文旨在解决生成式语言模型在多轮Text-to-SQL任务中性能显著下降的问题,其核心挑战在于模型难以有效处理多轮交互中的上下文信息复杂性和动态模式(schema)关联。解决方案的关键在于提出Track-SQL框架,该框架通过引入两个双提取模块增强语言模型:一是语义增强的模式提取器(Semantic-enhanced Schema Extractor),用于精准追踪数据库结构的变化;二是模式感知的上下文提取器(Schema-aware Context Extractor),用于建模多轮对话中上下文与schema之间的动态关联。实验证明,该方法在SparC和CoSQL数据集上达到当前最优性能,且执行准确率分别提升7.1%和9.55%。
链接: https://arxiv.org/abs/2603.05996
作者: Bingfeng Chen,Shaobin Shi,Yongqi Luo,Boyan Xu,Ruichu Cai,Zhifeng Hao
机构: Guangdong University of Technology (广东工业大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济实验室(深圳)); Peng Cheng Laboratory (鹏城实验室); Shantou University (汕头大学)
类目: Computation and Language (cs.CL)
备注: Accepted at the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025), Long Paper, 19 pages
Abstract:Generative language models have shown significant potential in single-turn Text-to-SQL. However, their performance does not extend equivalently to multi-turn Text-to-SQL. This is primarily due to generative language models’ inadequacy in handling the complexities of context information and dynamic schema linking in multi-turn interactions. In this paper, we propose a framework named Track-SQL, which enhances generative language models with dual-extractive modules designed to track schema and contextual changes in multi-turn Text-to-SQL. Specifically, Track-SQL incorporates a Semantic-enhanced Schema Extractor and a Schema-aware Context Extractor. Experimental results demonstrate that Track-SQL achieves state-of-the-art performance on the SparC and CoSQL datasets. Furthermore, detailed ablation studies reveal that Track-SQL significantly improves execution accuracy in multi-turn interactions by 7.1% and 9.55% on these datasets, respectively. Our implementation will be open-sourced at this https URL.
[NLP-30] Imagine How To Change: Explicit Procedure Modeling for Change Captioning ICLR2026
【速读】: 该论文旨在解决传统变化描述(change captioning)方法仅基于静态图像对进行建模,从而忽略变化过程中关键时序动态信息的问题。现有方法难以捕捉“如何发生改变”这一语义层次,限制了对变化本质的理解。解决方案的关键在于提出ProCap框架,将变化建模从静态图像比较重构为动态过程建模:首先通过自动生成中间帧并采样关键帧来显式化隐式的时序变化过程,并利用掩码重建任务训练一个程序编码器(procedure encoder)以学习这些关键帧的潜在动态;其次,在解码阶段引入可学习的程序查询(learnable procedure queries),替代对显式帧的依赖,使编码器能够高效推断出时序一致且与文本描述对齐的潜在过程表征,最终由解码器生成变化描述。整个模型端到端训练,显著提升了变化描述的准确性与时序合理性。
链接: https://arxiv.org/abs/2603.05969
作者: Jiayang Sun,Zixin Guo,Min Cao,Guibo Zhu,Jorma Laaksonen
机构: Soochow University (苏州大学); Aalto University (阿尔托大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Wuhan AI Research (武汉人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ICLR 2026. Code and models are available at this https URL
Abstract:Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which is the key to understand not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two-stage design: The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy. Then the encoder learns to capture the latent dynamics of these keyframes via a caption-conditioned, masked reconstruction task. The second stage integrates this trained encoder within an encoder-decoder model for captioning. Instead of relying on explicit frames from the previous stage – a process incurring computational overhead and sensitivity to visual noise – we introduce learnable procedure queries to prompt the encoder for inferring the latent procedure representation, which the decoder then translates into text. The entire model is then trained end-to-end with a captioning loss, ensuring the encoder’s output is both temporally coherent and captioning-aligned. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre-trained models are available at this https URL
[NLP-31] Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在角色扮演(Role-Playing, RP)任务中难以实现高保真风格化生成的问题,尤其在数据稀缺和风格解耦复杂性的挑战下,传统监督微调(Supervised Fine-Tuning, SFT)往往只能捕捉表面语义,导致“出戏”(Out-of-Character, OOC)现象。解决方案的关键在于提出一种结构化的风格重写框架(Structured Style-Rewrite Framework),将风格显式解耦为三个可解释维度:词汇特征(基于PMI计算)、句法模式(基于概率上下文无关文法PCFG规则)以及语用风格;同时引入基于思维链(Chain-of-Thought, CoT)蒸馏的隐式风格条件策略,利用训练过程中显式的推理轨迹作为强归纳偏置,引导模型潜在表示与结构化风格特征对齐,从而在无需推理标记的情况下实现高质量、一致性的风格化生成。
链接: https://arxiv.org/abs/2603.05933
作者: Chanhui Zhu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages, 4 figures. Preprint
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, building Small Language Models (SLMs) with highly stylized personas remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to reproduce the intricate syntactic and pragmatic nuances of a character, leading to “Out-Of-Character” (OOC) generation. To address this, we propose a Structured Style-Rewrite Framework that explicitly disentangles style into three interpretable dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Furthermore, we introduce an implicit style conditioning strategy via Chain-of-Thought (CoT) distillation. By leveraging explicit reasoning traces during training as a strong inductive bias, our approach aligns the model’s latent representations with structured style features, enabling high-fidelity stylized generation without requiring explicit reasoning tokens during inference. Extensive experiments on a specific high-stylization domain (anime characters) demonstrate that our method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity. Our approach offers a data-efficient paradigm for democratizing inference and deployment on consumer hardware.
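其中"基于PMI的词汇签名"可用如下极简示意说明:按 PMI = log[p(词|角色语料) / p(词|全体语料)] 为每个词打分,高分词即角色的标志性用词。语料与函数名均为虚构,且未做平滑处理,并非论文实现:

```python
import math
from collections import Counter

def pmi_signatures(character_lines, background_lines, top_k=3):
    """按 PMI = log[p(词|角色语料) / p(词|全体语料)] 打分,
    取得分最高的 top_k 个词作为角色的词汇签名。
    示意实现:空格分词、无平滑,仅用于说明思路。"""
    char_counts = Counter(w for line in character_lines for w in line.split())
    all_counts = Counter(
        w for line in character_lines + background_lines for w in line.split())
    n_char, n_all = sum(char_counts.values()), sum(all_counts.values())
    scores = {w: math.log((c / n_char) / (all_counts[w] / n_all))
              for w, c in char_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# 虚构语料:角色口头禅 "nya" 在角色语料中远比背景语料常见
char_lines = ["nya I will nya win", "nya believe it"]
background = ["I will go", "believe me", "go go go"]
print(pmi_signatures(char_lines, background))
```

口头禅"nya"只出现在角色语料中,因而获得最高PMI,被选入签名;而"I""will"等通用词得分为零。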
[NLP-32] InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险领域(如医疗分诊和法律咨询)中因初始查询信息不足、单次检索无法满足可靠决策需求,导致输出错误且过度自信的问题。现有方法依赖LLM隐式的、非结构化的置信度信号来引导追问,难以明确识别未知信息、判断关键缺失内容以及确定停止提问的时机。解决方案的关键在于提出InfoGatherer框架,该框架通过两个互补来源主动收集缺失信息:从检索到的领域文档中提取证据,并设计针对性的用户追问策略;同时利用Dempster-Shafer信念分配理论构建结构化的证据网络,对不完整或矛盾的信息进行合理融合,在不提前得出确定性结论的前提下实现不确定性建模,从而提升决策的可靠性与可解释性。
链接: https://arxiv.org/abs/2603.05909
作者: Maksym Taranukhin,Shuyue Stella Li,Evangelos Milios,Geoff Pleiss,Yulia Tsvetkov,Vered Shwartz
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute (向量研究所); CIFAR AI Chair (加拿大高等研究院AI主席); University of Washington (华盛顿大学); Dalhousie University (达尔豪西大学)
类目: Computation and Language (cs.CL)
备注: Under review
Abstract:LLMs are increasingly deployed in high-stakes domains such as medical triage and legal assistance, often as document-grounded QA systems in which a user provides a description, relevant sources are retrieved, and an LLM generates a prediction. In practice, initial user queries are often underspecified, and a single retrieval pass is insufficient for reliable decision-making, leading to incorrect and overly confident answers. While follow-up questioning can elicit missing information, existing methods typically depend on implicit, unstructured confidence signals from the LLM, making it difficult to determine what remains unknown, what information matters most, and when to stop asking questions. We propose InfoGatherer, a framework that gathers missing information from two complementary sources: retrieved domain documents and targeted follow-up questions to the user. InfoGatherer models uncertainty using Dempster-Shafer belief assignments over a structured evidential network, enabling principled fusion of incomplete and potentially contradictory evidence from both sources without prematurely collapsing to a definitive answer. Across legal and medical tasks, InfoGatherer outperforms strong baselines while requiring fewer turns. By grounding uncertainty in formal evidential theory rather than heuristic LLM signals, InfoGatherer moves towards trustworthy, interpretable decision support in domains where reliability is critical.
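Dempster-Shafer证据融合的核心是Dempster合成规则:对两条证据的焦元求交集、丢弃冲突质量、再归一化。下面是该规则的极简Python示意(假设集与质量取值均为虚构,并非论文实现),展示如何在不过早坍缩为确定结论的前提下融合可能矛盾的证据:

```python
def dempster_combine(m1, m2):
    """Dempster合成规则:对焦元两两求交集,交为空的质量计入冲突,
    其余累加后按 1 - 冲突 归一化。
    质量函数用 dict 表示:frozenset(假设集) -> 质量。"""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb  # 完全冲突的证据对
    k = 1.0 - conflict  # 归一化常数
    return {h: w / k for h, w in combined.items()}

# 两条证据来源(取值为虚构):检索文档证据与用户追问证据;
# frozenset({"flu", "cold"}) 表示"尚不确定"的质量,避免过早得出单一结论
doc_evidence = {frozenset({"flu"}): 0.6, frozenset({"flu", "cold"}): 0.4}
user_evidence = {frozenset({"cold"}): 0.3, frozenset({"flu", "cold"}): 0.7}
fused = dempster_combine(doc_evidence, user_evidence)
print(fused)
```

融合结果仍在"不确定"集合上保留可观质量,这正是上文"不提前坍缩为确定性结论"的含义;当该质量足够小时,系统才停止追问。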
[NLP-33] Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions
【速读】: 该论文旨在解决联合国安全理事会决议文本在数据清洗和语义标注任务中因大语言模型(LLM)性能波动导致的准确性与一致性不足的问题。其核心解决方案在于设计了一种基于集成学习(ensemble learning)的框架,并引入两个关键评估指标:内容保留率(Content Preservation Ratio, CPR)和标签格式正确性(Tag Well-Formedness, TWF),以有效抑制幻觉并确保输出内容不偏离原始输入要求。通过多轮运行多个GPT模型并依据这两个指标筛选最优输出,该方法实现了对复杂文本的高精度语义标注,同时显著降低了计算成本——例如GPT-4.1-mini在仅20%成本下达到接近最优模型的性能。
链接: https://arxiv.org/abs/2603.05895
作者: Hussein Ghaly
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (Cleaning: CPR 84.9% - Semantic Tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the ensemble to select the optimal output (both cleaned and tagged content) for all the LLM models involved, across multiple runs. With this ensemble design and the use of metrics, we create a reliable LLM system for performing semantic tagging on challenging texts.
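论文未在此给出CPR的精确公式,下面按"源文本词元在输出中保留的比例"这一假设写出一个示意实现,仅用于说明该类指标如何惩罚清洗/标注过程中的内容删改:

```python
def content_preservation_ratio(source, output):
    """假设性的CPR代理指标:源文本词元在输出中出现的比例。
    论文的精确定义未公开,此公式仅为示意。"""
    src_tokens = source.split()
    out_tokens = set(output.split())
    preserved = sum(1 for tok in src_tokens if tok in out_tokens)
    return preserved / len(src_tokens) if src_tokens else 1.0

src = "the council decides to extend the mandate"
cleaned_ok = "the council decides to extend the mandate <action>"  # 仅追加标签
cleaned_bad = "the council notes the report"  # 原文被改写,内容丢失

print(content_preservation_ratio(src, cleaned_ok))   # 1.0
print(content_preservation_ratio(src, cleaned_bad))
```

集成系统即可按此类指标在多次运行、多个模型的输出中选取保留率最高者,从而抑制幻觉与多余增删。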
[NLP-34] Lost in Stories: Consistency Bugs in Long Story Generation by LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成长篇叙事文本时普遍存在的一致性问题,即模型在生成过程中容易违背自身设定的事实、角色特征和世界观规则。为填补现有评估基准主要关注情节质量和语言流畅性而忽视一致性错误的空白,作者提出ConStory-Bench这一基准,包含2000个提示和四种任务场景,并构建了涵盖五类错误及其19个细粒度子类别的分类体系;其关键解决方案是开发ConStory-Checker自动化检测工具,能够识别矛盾并基于显式文本证据进行判断,从而系统性地量化和分析LLMs在长文本生成中的一致性表现。
链接: https://arxiv.org/abs/2603.05890
作者: Junjie Li,Xinrui Guo,Yuhao Wu,Roy Ka-Wei Lee,Hongzhi Li,Yutao Xie
机构: Microsoft(微软); Singapore University of Technology and Design(新加坡科技设计大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at this https URL.
[NLP-35] VerChol – Grammar-First Tokenization for Agglutinative Languages ALT
【速读】: 该论文旨在解决当前主流分词方法(如字节对编码,Byte Pair Encoding, BPE)在处理黏着语(agglutinative languages)时存在的根本性缺陷问题。BPE及其变体本质上是脚本无关且针对英语类形态学优化的,无法有效保留黏着语中单个词内丰富的词素边界信息(morpheme boundaries),导致词素被错误切分,从而显著增加token数量并破坏语言结构的完整性。论文的关键解决方案在于提出一种更符合黏着语形态特征的分词策略,以避免将一个包含多个语法范畴(如时态、人称、数、格等)的词拆解为无意义的字节对片段,从而提升模型对这类语言的理解与表示能力。
链接: https://arxiv.org/abs/2603.05883
作者: Prabhu Raja
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages. A Morphological Alternative to Statistical Subword Tokenization
Abstract:Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For agglutinative languages, a typological class encompassing the Dravidian family (Tamil, Kannada, Telugu, Malayalam), Turkic languages (Turkish, Azerbaijani, Uzbek), Uralic languages (Finnish, Hungarian, Estonian), Korean, Japanese, Swahili, Basque, and others, a single word may encode root, tense, aspect, person, number, gender agreement, case, and postpositions in one orthographic unit. Statistical tokenizers fragment these words into byte-pair chunks that sever morpheme boundaries and inflate token counts.
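与统计子词切分相对,"语法优先"的切分可示意为按词素词典做贪心最长匹配。以下为玩具示例(词典与词例取自土耳其语常见教学例子,并非论文实现),展示词素边界如何被完整保留:

```python
def greedy_morpheme_split(word, morphemes):
    """语法优先切分的玩具示意:按词素词典做贪心最长匹配,
    匹配失败时回退为单字符;与无视词素边界的统计子词切分相对。"""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphemes:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # 词典外字符逐个输出
            i += 1
    return out

# 土耳其语 "evlerimizden" = ev(房子) + ler(复数) + imiz(我们的) + den(从…)
lex = {"ev", "ler", "imiz", "den"}
print(greedy_morpheme_split("evlerimizden", lex))  # ['ev', 'ler', 'imiz', 'den']
```

每个切分单元都对应一个语法范畴(词根、复数、领属、离格),而BPE得到的字节对片段通常跨越这些边界,这正是摘要所批评的问题。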
[NLP-36] Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中缺乏可靠不确定性估计的问题。现有方法多采用“答案优先”范式,即仅在生成答案后才计算置信度,此类方法衡量的是特定响应的正确性,限制了其在实践中的可用性。为突破这一局限,作者提出“置信度优先”范式,要求模型在回答前输出置信分数,将其解释为当前策略下正确回答问题的概率。解决方案的核心是提出CoCA(Co-optimized Confidence and Answers)框架,这是一种基于GRPO(Group Relative Policy Optimization)的强化学习方法,通过分段信用分配机制,分别对置信度和答案片段赋予独立奖励与组相对优势,从而实现置信度校准与答案准确性的稳定联合优化,有效避免奖励劫持(reward hacking)。实验表明,CoCA在数学、代码和事实问答基准上均提升了校准精度和不确定性区分能力,同时保持答案质量,扩展了下游应用的可能性。
链接: https://arxiv.org/abs/2603.05881
作者: Changcheng Li,Jiancan Wu,Hengheng Zhang,Zhengsu Chen,Guo An,Junxiang Qiu,Xiang Wang,Qi Tian
机构: University of Science and Technology of China (中国科学技术大学); Huawei Inc. (华为公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer; such scores measure the correctness of a specific response, which limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model’s probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to the confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
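"分段信用分配"的思路可用如下示意说明:置信段与答案段各自获得独立奖励。这里假设答案段取0/1正确性奖励、置信段取校准奖励(1减Brier分数);具体奖励函数为本文假设,并非论文原式:

```python
def segmented_rewards(confidence, answer_correct):
    """假设性的分段奖励:答案段取正确性奖励(0或1),
    置信段取校准奖励 1 - (confidence - 正确性)^2,即 1 减 Brier 分数,
    使"过度自信且答错"在置信段受到更重的惩罚。"""
    answer_reward = 1.0 if answer_correct else 0.0
    target = 1.0 if answer_correct else 0.0
    confidence_reward = 1.0 - (confidence - target) ** 2
    return confidence_reward, answer_reward

print(segmented_rewards(0.9, True))   # 高置信且答对:两段奖励都高
print(segmented_rewards(0.9, False))  # 过度自信且答错:置信段奖励骤降
```

两段奖励分别参与组相对优势(group-relative advantage)的计算,模型便无法通过只报高置信或只压低置信来劫持奖励。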
[NLP-37] ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
【速读】: 该论文旨在解决现有单次剪枝方法(如SparseGPT)中因固定从左到右的剪枝顺序导致性能不佳的问题,尤其在权重呈现列结构模式(columnar patterns)时表现欠佳。其解决方案的关键在于提出ROSE方法,通过引入两层重排序机制:首先基于预剪枝阶段估计每列和每块的剪枝损失,然后在块内按列损失降序重排、在块间按块损失重排,并利用块损失相对范围作为指标识别列结构层,实现自适应剪枝顺序优化,从而显著提升剪枝后模型的性能表现。
链接: https://arxiv.org/abs/2603.05878
作者: Mingluo Su,Huan Wang
机构: Westlake University (西湖大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: CPAL 2026 oral
Abstract:Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at this https URL.
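两层重排序的流程可用如下Python示意(列损失为虚构数值;块内按列损失降序为论文所述,块间同为降序则是此处的假设方向),返回裁剪应遵循的新列顺序:

```python
def rose_reorder(col_losses, block_size):
    """两层重排序示意:列先按原顺序分块;
    块内按估计剪枝损失降序排列(损失大的列先被处理),
    块间再按块总损失排序(此处假设同为降序),
    返回重排后的列索引顺序。"""
    idx = list(range(len(col_losses)))
    blocks = [idx[i:i + block_size] for i in range(0, len(idx), block_size)]
    blocks = [sorted(b, key=lambda j: col_losses[j], reverse=True)
              for b in blocks]
    blocks.sort(key=lambda b: sum(col_losses[j] for j in b), reverse=True)
    return [j for b in blocks for j in b]

losses = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3]
print(rose_reorder(losses, block_size=2))  # [1, 0, 3, 2, 5, 4]
```

重排后,潜在剪枝误差大的列先进入SparseGPT的逐列更新流程,其误差能被后续列的权重补偿吸收,这正是ROSE相对固定从左到右顺序的改进点。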
[NLP-38] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂算法任务中因采用单次前向推理(“System 1”模式)而导致性能受限的问题。现有迭代优化方法多依赖外部反馈机制(如执行引擎或人工标注的正确答案),存在计算开销高、部署复杂等局限。其解决方案的关键在于提出 ReflexiCoder,一种基于强化学习(Reinforcement Learning, RL)的内生式自我反思与修正框架,将结构化的推理轨迹——包括初始生成、错误感知与优化反思、以及自纠正——直接嵌入模型权重中,从而实现无需外部监督或执行反馈即可在推理阶段自主完成高质量代码修正的能力。通过零样本强化学习训练与细粒度奖励函数设计,该方法显著提升了代码生成的准确率和效率,在多个基准测试中达到当前开源模型中的最先进水平,且推理时计算资源消耗降低约40%。
链接: https://arxiv.org/abs/2603.05863
作者: Juyong Jiang,Jiasi Shen,Sunghun Kim,Kang Min Yoo,Jeonghoon Kim,Sungju Kim
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Amazon AGI; NAVER Cloud
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:
Abstract:While Large Language Models (LLMs) have revolutionized code generation, standard “System 1” approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug and optimization aware reflection, and self-correction, directly into the model’s weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at this https URL.
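论文称使用"细粒度奖励函数"优化整条反思-纠正轨迹,但未在摘要中给出具体定义。以下是一个与论文无关的玩具奖励草图(各分项与权重均为假设),仅示意"对初始生成、自纠正改进量与 token 效率分别计分"的思路:

```python
def trajectory_reward(initial_pass_rate, corrected_pass_rate, tokens_used,
                      token_budget=4096, w_init=0.3, w_fix=0.6, w_eff=0.1):
    """Toy trajectory-level reward: score the initial attempt, the improvement
    achieved by self-correction, and token efficiency separately.
    All terms and weights are illustrative assumptions, not the paper's."""
    improvement = max(0.0, corrected_pass_rate - initial_pass_rate)
    efficiency = max(0.0, 1.0 - tokens_used / token_budget)
    return w_init * initial_pass_rate + w_fix * improvement + w_eff * efficiency

# A trajectory that actually fixes a failing solution scores higher
# than one whose "correction" changes nothing.
print(trajectory_reward(0.2, 0.9, 1024) > trajectory_reward(0.2, 0.2, 1024))  # → True
```

这种按阶段拆分的奖励能把"学会纠错"本身变成可优化目标,而非只奖励最终答案正确。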
[NLP-39] Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段如何通过不更新参数的方式实现行为调整的问题,即测试时适应(Test-time Adaptation, TTA)的有效性和可靠性问题。其核心解决方案是系统性地评估基于提示(prompt-based)的输入空间更新机制,特别是多示例提示(many-shot prompting)及其变体——动态和强化的上下文学习(Dynamic and Reinforced ICL),以厘清不同任务类型、更新幅度、示例顺序与选择策略对性能的影响。关键发现在于:多示例提示在结构化任务中效果显著,因其能提供高信息增益;但在开放生成任务中则高度依赖示例选择策略且收益有限,从而揭示了提示驱动的测试时适应的实际边界与适用场景。
链接: https://arxiv.org/abs/2603.05829
作者: Shubhangi Upasani,Chen Wu,Jay Rainton,Bo Li,Changran Hu,Qizheng Zhang,Urmish Thakker
机构: SambaNova Systems, Inc(萨姆巴诺瓦系统公司); 2xAI; Stanford University(斯坦福大学); Microsoft AI(微软人工智能)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
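作为背景说明,多示例提示本质上是把大量 ICL 示例直接拼接进输入。以下为与论文无关的最小拼接示意(模板格式为常见约定,非论文的具体实现):

```python
def build_many_shot_prompt(examples, query, k):
    """Assemble a k-shot prompt from (input, output) demonstration pairs.
    The "Input:/Output:" template is a common convention, assumed here."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples[:k])
    return f"{shots}\n\nInput: {query}\nOutput:"

demos = [("2+2", "4"), ("3+5", "8"), ("7+1", "8")]
prompt = build_many_shot_prompt(demos, "6+6", k=2)
print(prompt.count("Input:"))  # → 3 (2 shots + 1 query)
```

论文研究的正是:当 k 增大(即"更新幅度"增大)以及示例顺序、选择策略变化时,这类输入空间更新的收益何时稳定、何时失效。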
[NLP-40] HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中产生的幻觉内容(hallucination)难以精准归因与解释的问题。现有方法多依赖语义相似性匹配或表征层面的判别,无法建立幻觉类型、错误生成机制与外部事实证据之间的细粒度结构化对应关系,从而限制了对幻觉片段的可解释性和支持/反驳证据的可追溯性。解决方案的关键在于提出HART框架,将幻觉溯源建模为一个包含四个阶段的结构化任务:片段定位(span localization)、机制归因(mechanism attribution)、证据检索(evidence retrieval)和因果追踪(causal tracing),并构建首个面向幻觉溯源的结构化标注数据集,实现因果层级的可解释性评估,实验证明其显著优于BM25和DPR等强基线方法,在幻觉分析与证据对齐方面展现出更强的有效性与泛化能力。
链接: https://arxiv.org/abs/2603.05828
作者: Shize Liang,Hongzhi Wang
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable performance in text generation and knowledge-intensive question answering. Nevertheless, they are prone to producing hallucinated content, which severely undermines their reliability in high-stakes application domains. Existing hallucination attribution approaches, based on either external knowledge retrieval or internal model mechanisms, primarily focus on semantic similarity matching or representation-level discrimination. As a result, they have difficulty establishing structured correspondences at the span level between hallucination types, underlying error generation mechanisms, and external factual evidence, thereby limiting the interpretability of hallucinated fragments and the traceability of supporting or opposing evidence. To address these limitations, we propose HART, a fine-grained hallucination attribution and evidence retrieval framework for large language models. HART formalizes hallucination tracing as a structured modeling task comprising four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing. Based upon this formulation, we develop the first structured dataset tailored for hallucination tracing, in which hallucination types, error mechanisms, and sets of counterfactual evidence are jointly annotated to enable causal-level interpretability evaluation. Experimental results on the proposed dataset demonstrate that HART substantially outperforms strong retrieval baselines, including BM25 and DPR, validating the effectiveness and generalization capability of the proposed tracing paradigm for hallucination analysis and evidence alignment.
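下面用一个简单的数据结构示意 HART 四阶段输出的形态(字段名均为假设,并非 HART 的真实标注模式),说明"片段-类型-机制-证据"之间的结构化对应关系:

```python
from dataclasses import dataclass, field

@dataclass
class HallucinationTrace:
    """Illustrative record for one traced hallucination.
    Field names are hypothetical, not HART's actual schema."""
    span: tuple                       # (start, end) character offsets of the fragment
    hallucination_type: str           # e.g. an entity error or fabricated relation
    mechanism: str                    # attributed error-generation mechanism
    evidence: list = field(default_factory=list)  # supporting/refuting passages

trace = HallucinationTrace(
    span=(42, 57),
    hallucination_type="entity_error",
    mechanism="parametric_memory_conflict",
    evidence=["Source passage contradicting the generated entity."],
)
print(trace.span)  # → (42, 57)
```

相比只输出一个"是否幻觉"的判别分数,这种逐片段记录使每个幻觉都能追溯到机制归因与正反证据。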
[NLP-41] RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
【速读】: 该论文旨在解决图结构推理(如Graph of Thoughts, GoT)在实际应用中因节点级和阶段级异质性导致的效率低下问题,即高复杂度推理路径虽能提升准确性,但常伴随显著的token消耗与延迟增加,且性能不稳定。解决方案的关键在于提出RouteGoT框架,其核心创新是基于节点重要性动态分配模型资源:对全局耦合的规划与最终合成阶段使用强模型,而对局部子任务则通过难度预测动态调度轻量模型和低成本策略;同时引入显式token预算约束到全局调度器中,实现可控制的性能-成本权衡。实验表明,该方法在多个推理、检索及多跳问答基准上显著降低token使用量(较AGoT减少79.1%),并提升准确率(平均提高8.1个百分点)。
链接: https://arxiv.org/abs/2603.05818
作者: Yuhang Liu,Ruijie Wang,Yunlong Chu,Bing Hao,Yumeng Lin,Shengzhong Liu,Minglai Shao
机构: Tianjin University (天津大学); Beihang University (北京航空航天大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average accuracy improvement of 8.1 percentage points and a 79.1% output token reduction compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.
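以下是一个与论文无关的节点级路由示意(难度阈值、token 估计均为假设):规划与合成节点固定用强模型,叶子节点按预测难度路由到强/轻模型,且预算耗尽后不再扩展新的叶子:

```python
def route_nodes(nodes, token_budget, difficulty_threshold=0.5):
    """nodes: list of (role, predicted_difficulty, est_tokens).
    Returns {node_index: model} assignments. A toy sketch of budget-aware
    routing, not RouteGoT's actual scheduler."""
    assignments, spent = {}, 0
    for i, (role, difficulty, est_tokens) in enumerate(nodes):
        if role in ("plan", "synthesize"):
            assignments[i] = "strong"   # globally coupled stages: always strong
            spent += est_tokens
        elif spent + est_tokens > token_budget:
            continue                    # budget exhausted: skip this leaf
        else:
            assignments[i] = "strong" if difficulty >= difficulty_threshold else "light"
            spent += est_tokens
    return assignments

nodes = [("plan", 0.9, 300), ("leaf", 0.2, 100), ("leaf", 0.8, 200),
         ("leaf", 0.3, 200), ("synthesize", 0.7, 300)]
print(route_nodes(nodes, token_budget=700))
# → {0: 'strong', 1: 'light', 2: 'strong', 4: 'strong'}; leaf 3 is skipped
```

这体现了摘要中的两点:强模型保留给全局耦合阶段,而预算约束直接作用于图扩展本身。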
[NLP-42] Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
【速读】: 该论文旨在解决生成式 AI (Generative AI) 服务中因开发者虚假宣称安全措施而引发的信任问题,即用户无法验证实际执行的安全机制是否真实有效。解决方案的关键在于提出“proof-of-guardrail”系统,通过在可信执行环境(Trusted Execution Environment, TEE)中运行代理和开源护栏(guardrail)逻辑,并生成由TEE签名的可离线验证的证明,从而确保护栏代码确实被执行,同时保护代理模型的私密性。
链接: https://arxiv.org/abs/2603.05786
作者: Xisen Jin,Michael Duan,Qin Lin,Aaron Chan,Zhenglun Chen,Junyi Du,Xiang Ren
机构: Sahara AI; University of Southern California
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages
Abstract:As AI agents become widely deployed as online services, users often rely on an agent developer’s claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer’s agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: this https URL
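真实系统依赖 TEE 的远程证明与非对称签名;下面仅用 HMAC 做一个概念示意(密钥与消息格式均为假设),说明"护栏代码哈希与响应被一并签名、任何人可离线验证"的核心思路:

```python
import hashlib
import hmac

def attest(tee_key: bytes, guardrail_code: str, response: str) -> dict:
    """Toy stand-in for TEE attestation: bind the guardrail code hash and the
    response under a key only the TEE holds. Real deployments use asymmetric
    remote attestation, not a shared-key HMAC."""
    code_hash = hashlib.sha256(guardrail_code.encode()).hexdigest()
    message = f"{code_hash}|{response}".encode()
    signature = hmac.new(tee_key, message, hashlib.sha256).hexdigest()
    return {"code_hash": code_hash, "response": response, "signature": signature}

def verify(tee_key: bytes, att: dict) -> bool:
    message = f"{att['code_hash']}|{att['response']}".encode()
    expected = hmac.new(tee_key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, att["signature"])

key = b"tee-secret"                          # hypothetical TEE-held key
att = attest(key, "def guardrail(x): ...", "safe response")
print(verify(key, att))                      # → True
att["response"] = "tampered response"
print(verify(key, att))                      # → False: tampering breaks the proof
```

验证者只需比对开源护栏代码的哈希并检查签名,无需访问开发者私有的 agent 模型,这正是论文强调的"保持 agent 私密的同时保证护栏执行完整性"。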
[NLP-43] Tutor Move Taxonomy: A Theory-Aligned Framework for Analyzing Instructional Moves in Tutoring
【速读】: 该论文旨在解决如何系统分析一对一辅导过程中导师教学行为以理解其有效性的问题。解决方案的关键在于构建了一个结构化的“导师行为分类体系”(tutor move taxonomy),该体系通过混合演绎与归纳的方法开发而成:首先整合认知科学、学习科学、课堂话语分析及智能辅导系统的研究成果形成初步框架,再经由专家标注者对真实辅导对话进行迭代编码不断优化;最终将导师行为分为四类支持行为——辅导支持、学习支持、社会情感与动机支持以及后勤支持,并在学习支持维度上进一步按学生参与度划分为促进学生推理的行为与直接解释或提供答案的行为。这一分类体系使得辅导对话可被拆解为离散的教学动作,从而支持大规模AI标注、计算建模辅导策略及实证分析教学行为与学习成效的关系。
链接: https://arxiv.org/abs/2603.05778
作者: Zhuqian Zhou,Kirk Vanacore,Tamisha Thompson,Jennifer St John,Rene Kizilcec
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Understanding what makes tutoring effective requires methods for systematically analyzing tutors’ instructional actions during learning interactions. This paper presents a tutor move taxonomy designed to support large-scale analysis of tutoring dialogue within the National Tutoring Observatory. The taxonomy provides a structured annotation framework for labeling tutors’ instructional moves during one-on-one tutoring sessions. We developed the taxonomy through a hybrid deductive-inductive process. First, we synthesized research from cognitive science, the learning sciences, classroom discourse analysis, and intelligent tutoring systems to construct a preliminary framework of tutoring moves. We then refined the taxonomy through iterative coding of authentic tutoring transcripts conducted by expert annotators with extensive instructional and qualitative research experience. The resulting taxonomy organizes tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support. Learning support moves are further organized along a spectrum of student engagement, distinguishing between moves that elicit student reasoning and those that provide direct explanation or answers. By defining tutoring dialogue in terms of discrete instructional actions, the taxonomy enables scalable annotation using AI, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes.
[NLP-44] PVminerLLM : Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models
【速读】: 该论文旨在解决患者生成文本(patient-generated text)中非临床因素信息难以结构化提取的问题,这些信息包括患者的日常生活体验、社会背景及医疗参与情况,对依从性、照护协调和健康公平具有重要影响。传统方法无法有效获取此类“患者声音信号”(patient voice signals),限制了其在以患者为中心的结果研究和临床质量改进中的应用。解决方案的关键在于提出PVminer基准和针对该任务定制的监督微调大语言模型PVminerLLM,该模型在多个数据集上显著优于基于提示的基线方法,在代码预测、子代码预测和证据片段抽取任务中分别达到最高83.82%、80.74%和87.03%的F1分数,且即使使用较小规模模型也能实现高性能,证明了无需极端模型规模即可可靠提取患者声音信号的可行性。
链接: https://arxiv.org/abs/2603.05776
作者: Samah Fodeh,Linhai Ma,Ganesh Puthiaraju,Srivani Talakokkul,Afshan Khan,Ashley Hagaman,Sarah Lowe,Aimee Roundtree
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Motivation: Patient-generated text contains critical information about patients’ lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale. Results: We introduce PVminer, a benchmark for structured extraction of patient voice, and propose PVminerLLM, a supervised fine-tuned large language model tailored to this task. Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Notably, strong performance is achieved even with smaller models, demonstrating that reliable patient voice extraction is feasible without extreme model scale. These results enable scalable analysis of social and experiential signals embedded in patient-generated text. Availability and Implementation: Code, evaluation scripts, and trained LLMs will be released publicly. Annotated datasets will be made available upon request for research use. Keywords: Large Language Models, Supervised Fine-Tuning, Medical Annotation, Patient-Generated Text, Clinical NLP
[NLP-45] NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories WWW’26
【速读】: 该论文旨在解决现有科学信息抽取(Scientific Information Extraction, SIE)数据集主要聚焦于学术论文,而忽视代码仓库中实现层面细节的问题。其关键解决方案是构建了NERdME数据集——包含200个手动标注的README文件,涵盖超过10,000个标注实体跨度和10类实体类型,以捕捉代码实现层中的结构化信息。通过在该数据集上进行基线实验,发现论文级与实现级实体存在显著差异,验证了扩展SIE基准测试以纳入README中实体类型的必要性,并进一步通过实体链接任务证明这些实体可有效支持科研成果的发现与元数据整合。
链接: https://arxiv.org/abs/2603.05750
作者: Genet Asefa Gesese,Zongxiong Chen,Shufan Jiang,Mary Ann Tan,Zhaotai Liu,Sonja Schimmler,Harald Sack
机构: FIZ Karlsruhe(莱布尼茨信息基础设施研究所); Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院); Fraunhofer FOKUS(弗劳恩霍夫开放通信系统研究所); Université de Toulouse Jean Jaurès(图卢兹-让·饶勒斯大学); IRIT, UT, CNRS, Toulouse INP(图卢兹计算机科学研究所)
类目: Computation and Language (cs.CL)
备注: To be published (Accepted at WWW’26)
Abstract:Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.
[NLP-46] CodeScout: Contextual Problem Statement Enhancement for Software Agents
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 驱动的代码辅助工具在面对表述不清、缺乏充分上下文和需求规格的任务请求时,易产生非收敛性代理轨迹(non-converging agent trajectories)的问题,从而导致修复失败或效率低下。其解决方案的关键在于提出 CodeScout —— 一种基于轻量级目标代码库预探索的上下文查询精炼方法,通过结构化分析在任务执行前生成更完整、可操作的问题陈述,包括重现步骤、预期行为及针对性探索提示,从而减少无效探索并明确用户意图,显著提升任务解决率(在 SWEBench-Verified 数据集上相较基线提升 20%)。
链接: https://arxiv.org/abs/2603.05744
作者: Manan Suri,Xiangci Li,Mehdi Shojaie,Songyang Han,Chao-Chun Hsu,Shweta Garg,Aniket Anand Deshmukh,Varun Kumar
机构: University of Maryland, College Park; Amazon Web Services (AWS); Databricks
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:
Abstract:Current AI-powered code assistance tools often struggle with poorly-defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over-exploration or repeated attempts at applying the same fix without proper evolution or testing, leading to suboptimal outcomes across software development tasks. We introduce CodeScout, a contextual query refinement approach that systematically converts underspecified user requests into comprehensive, actionable problem statements through lightweight pre-exploration of the target codebase. Our key innovation is demonstrating that structured analysis before task execution can supplement existing agentic capabilities without requiring any modifications to their underlying scaffolds. CodeScout performs targeted context scoping, conducts multi-perspective analysis examining potential fixes and exploration opportunities, then synthesizes these insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints. This pre-exploration directly addresses the identified failure patterns by reducing non-converging agent trajectories while clarifying user intent in natural language space. We evaluate CodeScout using state-of-the-art agentic scaffolds and language models on SWEBench-Verified, demonstrating a 20% improvement in resolution rates with up to 27 additional issues resolved compared to the default baseline method. Our results suggest that systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities.
[NLP-47] Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní
【速读】: 该论文旨在解决当前人工智能(AI)与人机交互(HCI)系统普遍存在的文本中心化设计问题,即这些系统主要面向书面语言构建,忽视了以口语为主的语言群体(如巴拉圭的瓜拉尼语使用者)及其文化实践。其核心问题是:现有AI语言支持机制未能充分尊重和整合原住民社区的口头交流习惯,导致数字技术对非文本主导语言社群的边缘化。解决方案的关键在于提出一种“口述优先”(oral-first)的多智能体架构,通过将自然语言理解与对话状态管理、社区治理等模块解耦,并引入以轮流发言(turn-taking)、修复机制(repair)和共享语境(shared context)为核心的交互设计原则,从而实现对原住民数据主权和双语现象(diglossia)的尊重,使AI系统从被动适配文本转向主动服务口语互动,确保数字生态系统真正包容多元语言实践。
链接: https://arxiv.org/abs/2603.05743
作者: Samantha Adorno,Akshata Kishore Moharir,Ratna Kandala
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. We propose an alternative to the standard “text-to-speech” pipeline, proposing instead an oral-first multi-agent architecture. By decoupling Guaraní natural language understanding from dedicated agents for conversation state and community-led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn-taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.
[NLP-48] Structured Multidimensional Representation Learning for Large Language Models
【速读】: 该论文旨在解决Transformer架构在扩展过程中面临的参数量急剧增长和嵌入维度冗余问题。其核心解决方案是引入基于L-积(L-product)的嵌入空间结构化谱分解方法,将token表示重塑为谱张量切片,并在变换域中执行注意力与前馈操作,从而构建出Tensor Transformer架构。该方法通过将编码器分解为p个独立的谱子Transformer,在保持标准Transformer语义的同时,理论上可实现约1/p的编码器参数压缩(忽略低阶项如偏置和归一化参数)。关键创新在于利用实数离散余弦变换(DCT)实现完全可微且兼容现有训练流程的谱分解,同时引入频率相关的归纳偏置,支持切片依赖的频率缩放机制,从而提升模型泛化能力。实验表明,在IMDB和AG~News数据集上,该方法可在压缩高达75%参数的情况下维持或优于基线性能。
链接: https://arxiv.org/abs/2603.05727
作者: Alaa El Ichi,Khalide Jbilou,Mohamed El Guide,Franck Dufrenois
机构: Université du Littoral Cote d’Opale (滨海大学); FGSES, University Mohammed VI Polytechnic (穆罕默德六世 polytechnic 大学); LISIC (LISIC 研究所)
类目: Computation and Language (cs.CL); Numerical Analysis (math.NA)
备注: 25 pages, 6 figures. Preprint of a journal submission
Abstract:Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which yields approximately 1/p reduction (up to lower-order terms such as biases and normalization parameters) in encoder parameters under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG News at moderate width we observe a small accuracy decrease in exchange for a 4 times encoder reduction; at BERT-base width (d=768), performance returns to parity.
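摘要中"约 1/p 的编码器参数压缩"可用一笔简单算术验证:一个注意力层的 Q、K、V、O 四个投影矩阵共 4d² 参数;拆成 p 个宽度为 d/p 的子 Transformer 后,合计为 p × 4(d/p)² = 4d²/p。示意如下(忽略偏置与归一化参数,正如摘要所述的低阶项):

```python
def attn_proj_params(d):
    """Parameters of the Q, K, V, O projection matrices of one attention
    layer with embedding dimension d (biases ignored)."""
    return 4 * d * d

d, p = 768, 4                          # BERT-base width, p spectral slices
full = attn_proj_params(d)             # one Transformer at full width
split = p * attn_proj_params(d // p)   # p sub-transformers at width d/p
print(full // split)                   # → 4, i.e. a factor-of-p reduction
```

前馈层的 d×4d 矩阵同理按 (d/p)×(4d/p) 缩小,因而整个编码器的主导项都按 1/p 收缩。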
[NLP-49] Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach
【速读】: 该论文旨在解决当前关于生成式 AI (Generative AI) 如何代表不同文化缺乏实证证据的问题,特别是全球范围内对文化表征的理解与信念差异。其解决方案的关键在于通过大规模全球调查收集多元群体对“文化”概念的定义及其在 GenAI 中应如何呈现文化产物、概念或价值观的看法,并从中提炼出直接来自社区的“文化”工作定义,从而揭示文化概念的复杂性及其与 GenAI 表征之间的关联。研究进一步提出以参与式方法为核心、优先关注超越地理维度的文化特征(如宗教与传统)以及建立应对文化“红线”的敏感性框架,作为推动负责任文化表征的实践路径。
链接: https://arxiv.org/abs/2603.05723
作者: Erin van Liemt,Renee Shelby,Andrew Smart,Sinchana Kumbale,Richard Zhang,Neha Dixit,Qazi Mamunur Rashid,Jamila Smith-Loud
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures, 6 tables
Abstract:There is a lack of empirical evidence about global attitudes around whether and how GenAI should represent cultures. This paper assesses understandings and beliefs about culture as it relates to GenAI from a large-scale global survey. We gathered data about what culture means to different groups, and about how GenAI should approach the representation of cultural artifacts, concepts, or values. We distill working definitions of culture directly from these communities to build an understanding of its conceptual complexities and how they relate to representations in Generative AI. We survey participants across parts of Europe, North and South America, Asia, and Africa. We conclude with a set of recommendations for Culture and GenAI development. These include participatory approaches, prioritizing specific cultural dimensions beyond geography, such as religion and tradition, and a sensitivity framework for addressing cultural "redlines".
[NLP-50] Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在面对不一致或噪声检索信息时响应可靠性下降的问题,尤其关注其在真实世界应用中鲁棒性不足的挑战。解决方案的关键在于引入基于知识图谱的检索机制——GraphRAG,并对其进行了三种定制化改进,以提升模型在噪声鲁棒性、信息整合能力、负例拒绝以及反事实场景下的表现,从而增强RAG系统的整体稳定性与准确性。
链接: https://arxiv.org/abs/2603.05698
作者: Hazem Amamou,Stéphane Gagnon,Alan Davoust,Anderson R. Avila
机构: Institut national de la recherche scientifique (INRS-EMT), Université du Québec en Outaouais, INRS-UQO Mixed Research Unit on Cybersecurity
类目: Computation and Language (cs.CL)
备注: The paper is 6 pages long and includes 5 figures and 3 tables illustrating the experimental framework and results. It is submitted to the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2025) and studies improving the robustness of Retrieval-Augmented Generation systems using knowledge graph based GraphRAG approaches
Abstract:Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge graph based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real world scenarios.
[NLP-51] Autonomous Algorithm Discovery for Ptychography via Evolutionary LLM Reasoning
【速读】: 该论文旨在解决扫描衍射成像(ptychography)中高质量重建依赖人工设计正则化函数的问题,这一过程往往耗时且缺乏通用性。解决方案的关键在于提出Ptychi-Evolve框架,该框架利用大语言模型(large language models, LLMs)自动生成并演化新型正则化算法,结合语义引导的交叉与变异机制,在无需人工干预的情况下实现正则化策略的自动优化。实验表明,该方法在多个复杂数据集上显著优于传统正则化手段,提升了结构相似性指数(SSIM)和峰值信噪比(PSNR),同时记录算法演化谱系与元数据,确保结果的可解释性和可复现性。
链接: https://arxiv.org/abs/2603.05696
作者: Xiangyu Yin,Ming Du,Junjing Deng,Zhi Yang,Yimo Han,Yi Jiang
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Numerical Analysis (math.NA)
备注:
Abstract:Ptychography is a computational imaging technique widely used for high-resolution materials characterization, but high-quality reconstructions often require the use of regularization functions that largely remain manually designed. We introduce Ptychi-Evolve, an autonomous framework that uses large language models (LLMs) to discover and evolve novel regularization algorithms. The framework combines LLM-driven code generation with evolutionary mechanisms, including semantically-guided crossover and mutation. Experiments on three challenging datasets (X-ray integrated circuits, low-dose electron microscopy of apoferritin, and multislice imaging with crosstalk artifacts) demonstrate that discovered regularizers outperform conventional reconstructions, achieving up to +0.26 SSIM and +8.3 dB PSNR improvements. In addition, Ptychi-Evolve records algorithm lineage and evolution metadata, enabling interpretable and reproducible analysis of discovered regularizers.
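进化式算法发现的主循环可以抽象为"评估、选择、交叉、变异"四步。以下为通用示意(与 Ptychi-Evolve 的 LLM 驱动实现无关:这里个体是数值而非正则化代码,适应度与算子均为玩具设定):

```python
import random

def evolve(population, fitness, crossover, mutate, generations=20, seed=0):
    """Generic elitist loop: keep the better half as parents, refill the
    population with mutated crossovers. A toy sketch only; in the paper the
    individuals are LLM-generated regularizer programs."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: max(2, len(pop) // 2)]
        children = [mutate(crossover(*rng.sample(parents, 2)), rng)
                    for _ in range(len(pop) - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

# Toy problem: find x maximizing -(x - 3)^2.
best = evolve(
    population=[float(x) for x in range(-5, 6)],
    fitness=lambda x: -(x - 3) ** 2,
    crossover=lambda a, b: (a + b) / 2,
    mutate=lambda x, rng: x + rng.uniform(-0.5, 0.5),
)
print(abs(best - 3) < 1.0)  # → True
```

论文的关键差异在于交叉与变异由 LLM 在语义层面完成,并全程记录谱系元数据以保证可解释与可复现。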
[NLP-52] FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation Sentiment and Summarisation
【速读】: 该论文旨在解决多语言文本分析中技术门槛高、资源匮乏的问题,尤其针对越南语这一广泛使用但自然语言处理(Natural Language Processing, NLP)资源相对不足的语言。其解决方案的关键在于设计并实现了一个统一的双语NLP流水线,整合了混合的VnCoreNLP与Byte Pair Encoding(BPE)分词策略、微调后的TabularisAI情感分类器以及微调后的Qwen2.5模型用于抽象式摘要生成。该系统在无需编程知识的前提下,支持用户构建、探索和解释双语文本数据,并通过三部分评估验证了其在分词、情感分析和摘要任务上优于或媲美主流基线方法的性能,从而推动越南语相关语言资源的发展与可复现研究的开展。
链接: https://arxiv.org/abs/2603.05690
作者: Hung Nguyen Huy,Mo El-Haj,Dawn Knight,Paul Rayson
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages
Abstract:FreeTxt-Vi is a free and open-source, web-based toolkit for creating and analysing bilingual Vietnamese-English text collections. Positioned at the intersection of corpus linguistics and natural language processing (NLP), it enables users to build, explore, and interpret free-text data without requiring programming expertise. The system combines corpus analysis features such as concordancing, keyword analysis, word-relation exploration, and interactive visualisation with transformer-based NLP components for sentiment analysis and summarisation. A key contribution of this work is the design of a unified bilingual NLP pipeline that integrates a hybrid VnCoreNLP and Byte Pair Encoding (BPE) segmentation strategy, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 model for abstractive summarisation. Unlike existing text analysis platforms, FreeTxt-Vi is evaluated as a set of language processing components. We conduct a three-part evaluation covering segmentation, sentiment analysis, and summarisation, and show that our approach achieves competitive or superior performance compared to widely used baselines in both Vietnamese and English. By reducing technical barriers to multilingual text analysis, FreeTxt-Vi supports reproducible research and promotes the development of language resources for Vietnamese, a widely spoken but underrepresented language in NLP. The toolkit is applicable to domains including education, digital humanities, cultural heritage, and the social sciences, where qualitative text data are common but often difficult to process at scale.
[NLP-53] Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs
【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)提示技术在大语言模型(Large Language Models, LLMs)推理过程中引发的隐私泄露问题,尤其是个人身份信息(Personally Identifiable Information, PII)从输入提示中被模型推理轨迹和输出重新暴露的风险。解决方案的关键在于提出一个模型无关的框架,通过定义风险加权的token级泄露事件、追踪不同CoT预算下的泄露曲线,并对比开源与闭源模型家族在结构化PII数据集上的表现,发现CoT显著提升高风险类别PII的泄露概率且存在明显的模型家族和预算依赖性;进一步引入轻量级推理时门控机制(包括规则检测器、TF-IDF+逻辑回归分类器、GLiNER命名实体识别模型及LLM-as-judge),评估其在风险加权F1、宏平均F1和召回率等指标下的性能,结果表明无单一方法在所有模型和预算下最优,从而推动采用风格自适应的混合门控策略,在统一可复现协议下实现效用与风险的平衡。
链接: https://arxiv.org/abs/2603.05618
作者: Patrick Ahrend,Tobias Eder,Xiyang Yang,Zhiyi Pan,Georg Groh
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
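"风险加权的 token 级泄露指标"可以理解为:为每个 PII 类型赋予风险权重,再在加权意义下计算精确率、召回率与 F1。以下为一个与论文无关的最小示意(权重数值与 PII 类型均为假设):

```python
def risk_weighted_f1(pred, gold, weights):
    """pred/gold: sets of (token, pii_type) leakage events; weights: risk per
    PII type. An illustrative metric sketch, not the paper's exact definition."""
    tp = sum(weights[t] for tok, t in pred & gold)
    pred_w = sum(weights[t] for tok, t in pred) or 1
    gold_w = sum(weights[t] for tok, t in gold) or 1
    precision, recall = tp / pred_w, tp / gold_w
    return 2 * precision * recall / (precision + recall) if tp else 0.0

weights = {"NAME": 3, "PHONE": 5, "LOCATION": 1}   # hypothetical risk weights
gold = {("john", "NAME"), ("555-0100", "PHONE")}   # events a detector should flag
pred = {("john", "NAME"), ("paris", "LOCATION")}   # events a detector did flag
print(risk_weighted_f1(pred, gold, weights))       # → 0.5
```

加权后,漏掉高风险类别(如电话号码)对分数的惩罚远大于漏掉低风险类别,这正是论文用该指标对比各类门控器的动机。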
[NLP-54] NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution
【速读】: 该论文旨在解决生成式 AI (Generative AI) 生成文本的检测问题,尤其关注检测结果的可解释性不足这一关键挑战。解决方案的关键在于构建一个融合 curvature-based 信号与神经网络及文体学特征(共17个可解释特征)的监督学习框架,并采用梯度提升树(XGBoost)作为元分类器进行判别;同时引入 Shapley Additive Explanations (SHAP) 实现局部与全局特征重要性归因,并通过大语言模型(LLM)驱动的解释层将这些归因转化为结构化的自然语言推理依据,从而实现用户友好的可解释性输出。
链接: https://arxiv.org/abs/2603.05617
作者: Oleksandr Marchenko Breneur,Adelaide Danilov,Aria Nourbakhsh,Salima Lamsiyah
机构: University of Luxembourg (卢森堡大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 7 figures
Abstract:We present this http URL, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. The system combines 17 interpretable features, including Conditional Probability Curvature, ModernBERT detector score, readability metrics, and stylometric cues, within a gradient-boosted tree (XGBoost) meta-classifier to determine whether a text is human- or AI-generated. Furthermore, this http URL applies Shapley Additive Explanations (SHAP) to provide both local and global feature-level attribution. These attributions are further translated into structured natural-language rationales through an LLM-based explanation layer, which enables user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation. A web interface allows users to input text and inspect how neural and statistical signals influence the final decision. The source code and demo video are publicly available to support reproducibility.
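论文用 SHAP 对 17 个特征做局部/全局归因;归因的数学基础是 Shapley 值。下面用一个 3 特征的玩具线性打分函数演示按定义精确计算 Shapley 值的思路(特征名、取值与权重均为假设,实际系统使用 shap 库与 XGBoost 元分类器):

```python
from itertools import combinations
from math import factorial

# 玩具示例:三个可解释特征的取值与线性权重(均为假设值)
features = {"curvature": 0.8, "bert_score": 0.5, "readability": 0.2}
weights = {"curvature": 2.0, "bert_score": 1.0, "readability": -1.0}

def value(subset):
    """价值函数:仅使用子集内特征的加权和(线性,便于人工核对)。"""
    return sum(weights[f] * features[f] for f in subset)

def shapley(feature):
    """按定义枚举所有子集精确计算 Shapley 值(特征数很小时可行)。"""
    others = [f for f in features if f != feature]
    n = len(features)
    phi = 0.0
    for k in range(len(others) + 1):
        for s in combinations(others, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += w * (value(set(s) | {feature}) - value(s))
    return phi

# 线性模型下 Shapley 值恰为 weight * x,可直接验证
print({f: round(shapley(f), 3) for f in features})
# {'curvature': 1.6, 'bert_score': 0.5, 'readability': -0.2}
```

特征多时精确枚举呈指数增长,SHAP 库用采样与树模型专用算法(TreeSHAP)近似或加速这一计算。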
[NLP-55] Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment AAAI2026
【速读】: 该论文旨在解决跨模态对齐中因忽略嵌入中的非语义信息而导致的语义一致性不足问题。传统方法仅追求嵌入一致性,未区分语义与模态成分,易受模态差距干扰,造成语义偏差或信息丢失。其解决方案的关键在于提出一种基于约束解耦与分布采样(Constrained Decoupling and Distribution Sampling, CDDS)的新算法:首先设计双路径UNet结构,通过多约束机制自适应地将嵌入解耦为语义和模态分量;其次引入分布采样策略以弥合模态差距,确保对齐过程的合理性。实验表明,CDDS在多个基准数据集和模型架构上均显著优于当前最优方法,提升幅度达6.6%至14.2%。
链接: https://arxiv.org/abs/2603.05566
作者: Xiang Ma,Lexin Fang,Litian Xu,Caiming Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: AAAI 2026 poster
Abstract:Cross-modal alignment is a crucial task in multimodal learning aimed at achieving semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency to achieve semantic consistency, ignoring the non-semantic information present in the embedding. An intuitive approach is to decouple the embeddings into semantic and modality components, aligning only the semantic component. However, this introduces two main challenges: (1) There is no established standard for distinguishing semantic and modal information. (2) The modality gap can cause semantic alignment deviation or information loss. To align the true semantics, we propose a novel cross-modal alignment algorithm via \textbfConstrained \textbfDecoupling and \textbfDistribution \textbfSampling (CDDS). Specifically, (1) A dual-path UNet is introduced to adaptively decouple the embeddings, applying multiple constraints to ensure effective separation. (2) A distribution sampling method is proposed to bridge the modality gap, ensuring the rationality of the alignment process. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of CDDS, outperforming state-of-the-art methods by 6.6% to 14.2%.
[NLP-56] EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis Auditing and Repair
【速读】: 该论文旨在解决生成式 AI(Generative AI)在函数调用代理(function-calling agents)训练中面临的高质量、领域特定数据稀缺与不一致问题,尤其是数据生命周期管理复杂、多轮轨迹合成困难以及评估指标与真实任务成功脱节等挑战。解决方案的关键在于提出一个名为 EigenData 的集成自演化平台,其核心是通过多智能体架构实现自动化数据全生命周期管理:由顶层编排器 EigenCore 协调数据库构建(DatabaseAgent)、可执行环境生成(CodingAgent)与多轮轨迹合成(DataAgent)三个子系统,并借助跨组件反馈机制确保所有产出物的一致性;同时,该平台通过修复 Berkeley Function-Calling Leaderboard (BFCL-V3) 中的系统性错误并引入基于结果感知(outcome-aware)的评估协议——以数据库状态正确性而非对话轮次匹配度衡量任务成功率——显著提升了模型排名与人类对功能性正确性的判断之间的相关性。
链接: https://arxiv.org/abs/2603.05553
作者: Jiaao Chen,Jingyuan Qi,Mingye Gao,Wei-Chen Wang,Hanrui Wang,Di Jin
机构: Eigen AI
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Function-calling agents – large language models that invoke tools and APIs – require high-quality, domain-specific training data spanning executable environments, backing databases, and diverse multi-turn trajectories. We introduce EigenData, an integrated, self-evolving platform that automates the full data lifecycle through a multi-agent architecture. A top-level orchestrator, EigenCore, coordinates three specialized sub-systems: DatabaseAgent for realistic domain database construction, CodingAgent for verified executable environment generation with iterative test-debug loops, and DataAgent for multi-turn trajectory synthesis with self-evolving prompt optimization. Cross-component feedback ensures consistency across all artifacts. We apply EigenData to audit and repair the Berkeley Function-Calling Leaderboard (BFCL-V3), identifying systematic errors in function schemas, implementations, and reference trajectories, automatically correcting them through coordinated schema refinement, code-level bug fixes, and trajectory modification, and introducing an outcome-aware evaluation protocol that assesses task success via database-state correctness rather than turn-level trajectory matching. We demonstrate that the repaired benchmark, coupled with outcome-aware metrics, produces model rankings substantially better correlated with human judgments of functional correctness.
[NLP-57] Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding
【速读】: 该论文旨在解决生成式 AI (Generative AI) 中语法约束解码(Grammar-Constrained Decoding, GCD)的效率与正确性问题,核心挑战在于如何在保持语言等价性的前提下最小化在线解码过程中的状态空间膨胀和歧义成本。解决方案的关键在于提出结构歧义成本(Structural Ambiguity Cost, SAC),用于量化每步解码中解析森林的增量增长,并证明不同等价文法在相同前缀下虽产生相同的可接受词元集合(logit mask),但其编译后的状态空间和在线歧义开销可能显著不同;进一步通过定义解码代价等价类并证明最小SAC代表的存在性,以及建立基于Doob h-变换的条件采样模型与硬掩码解码的KL散度及总变差距离边界,最终将理论结果与Transformer和专家混合(Mixture-of-Experts)架构结合,推导出延迟上界并连接SAC到基于仪器的预测性能建模与自动文法优化。
链接: https://arxiv.org/abs/2603.05540
作者: Faruk Alpay,Bilge Senturk
机构: Bahçeşehir University (巴赫切席尔大学)
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: 20 pages
Abstract:We study grammar-constrained decoding (GCD) as a coupling between an autoregressive next-token distribution and a reachability oracle over a pushdown system compiled from a context-free grammar (CFG). We prove an oracle invariance theorem: language-equivalent grammars induce identical admissible next-token sets for every prefix, hence identical logit masks, yet can yield provably different compiled state spaces and online ambiguity costs. We give exact control-state blowup counts for the canonical a^n b^n language under redundant nonterminal delegation, and introduce a left-to-right structural ambiguity cost (SAC) measuring incremental packed-parse-forest growth per token. For two equivalent grammars over all finite strings, SAC is O(1) per token under right-recursion but \Theta(t^2) per token and \Theta(n^3) cumulatively under concatenation. We establish engine-independent lower bounds: any sound, retrieval-efficient, parse-preserving online masking engine must incur \Omega(t^2) work per token on a specific constant-size CFG family, unconditionally within this model. We define decoding-cost equivalence classes of grammars and prove existence of minimal-SAC representatives within bounded rewrite families. Finally, we characterize the true conditional sampler via a Doob h -transform and derive sharp one-step KL and total-variation distortion bounds for hard-masked decoding in terms of survival-probability spread among admissible next tokens. We integrate these results with Transformer and Mixture-of-Experts architectures, derive latency envelopes in terms of vocabulary size, active state sets, and beam width, and connect SAC to instrumentation-based predictive performance models and automated grammar optimization.
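语法约束解码的基本操作是:给定前缀,计算合法下一 token 集合并据此生成 logit 掩码。以论文反复使用的 a^n b^n 语言为例,可用一个计数器(单计数下推系统的特例)给出示意实现;这与论文的编译流程和 SAC 分析无关,仅说明"前缀 → 合法 token 集"这一映射:

```python
def admissible(prefix: str, eos: str = "$") -> set:
    """返回语言 {a^n b^n : n >= 1} 下,给定前缀后合法的下一符号集合。"""
    na = prefix.count("a")
    nb = prefix.count("b")
    # 合法前缀必须形如 a...ab...b 且 b 的个数不超过 a
    if prefix != "a" * na + "b" * nb or nb > na:
        return set()          # 非法前缀:无任何合法延续
    allowed = set()
    if nb == 0:
        allowed.add("a")      # 尚未开始输出 b,仍可压入 a
    if nb < na:
        allowed.add("b")      # 还有未匹配的 a,可输出 b
    if na >= 1 and na == nb:
        allowed.add(eos)      # a、b 完全匹配,可终止
    return allowed

print(admissible(""))      # {'a'}
print(admissible("aa"))    # {'a', 'b'}
print(admissible("aab"))   # {'b'}
print(admissible("aabb"))  # {'$'}
```

论文的"预言机不变性"定理即指:与此语言等价的任何文法,对每个前缀都诱导出与上面完全相同的合法集合(掩码),但编译后的状态空间与在线歧义成本可以相差悬殊。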
[NLP-58] Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder
【速读】: 该论文旨在解决多模态系统中因采用独立专家编码器(expert modality encoders)而导致的复杂度线性增长与计算开销问题,以及统一Omni-models中基于Mixture-of-Experts(MoE)架构带来的参数量膨胀和路由开销问题。其解决方案的关键在于提出Omni-C(Omni-Compress),一个单一的密集型Transformer编码器,通过在大规模未对齐数据上进行单模态对比预训练(unimodal contrastive pretraining),学习跨异构模态(图像、音频和文本)的共享表示;同时利用轻量级模态特定投影头(modality-specific projection heads)有效缓解模态间冲突,无需MoE结构、成对监督或路由机制,从而实现高效部署和低内存推理,显著降低推理阶段内存占用并提升可扩展性。
链接: https://arxiv.org/abs/2603.05528
作者: Kin Wai Lau,Yasar Abbas Ur Rehman,Lai-Man Po,Pedro Porto Buarque de Gusmão
机构: City University of Hong Kong (香港城市大学); TCL AI Lab (TCL人工智能实验室); University of Surrey, United Kingdom (萨里大学,英国)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Recent multimodal systems often rely on separate expert modality encoders which cause linearly scaling complexity and computational overhead with added modalities. While unified Omni-models address this via Mixture-of-Expert (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities–images, audio, and text–through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained systems via sequential modality processing and low-memory inference, eliminating the need for parallel expert loading or specialized hardware. Experiments show Omni-C achieves performance comparable to expert models in unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that is largely recovered through lightweight linear probing or parameter efficient fine-tuning. The unified architecture substantially reduces inference memory usage compared to multi-encoder baselines, advancing efficient and scalable multimodal learning.
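单模态对比预训练常用 InfoNCE 式目标:拉近锚点与正样本的相似度、推远负样本。下面是一个纯 Python 的最小示意(二维嵌入与温度系数均为假设值,与 Omni-C 的实际配置无关):

```python
from math import exp, log

def cosine(u, v):
    """余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE 损失:-log( e^{sim(a,p)/τ} / Σ e^{sim(a,·)/τ} )。"""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [exp(s / tau) for s in sims]
    return -log(exps[0] / sum(exps))

a = [1.0, 0.0]
pos = [0.9, 0.1]                      # 与锚点方向接近 → 损失小
negs = [[0.0, 1.0], [-1.0, 0.2]]      # 方向正交/相反的负样本
loss = info_nce(a, pos, negs)
print(loss)
```

若把正样本换成与锚点正交的向量,损失会显著增大,这正是对比目标驱动表示对齐的机制。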
[NLP-59] Transducing Language Models
【速读】: 该论文旨在解决预训练语言模型输出格式与下游任务需求不匹配的问题,例如模型生成字节对(byte-pair)字符串时无法直接得到词级(word-level)预测,或DNA序列模型无法直接输出氨基酸序列。其核心挑战在于如何在不修改模型参数的前提下,将原始模型的分布通过确定性字符串到字符串的变换映射到目标输出空间。解决方案的关键是引入有限状态转换器(finite-state transducer, FST)作为通用的可微分字符串变换工具,并提出一套算法来实现:1)对源字符串进行边缘化(marginalization),即计算所有映射到目标字符串的源串概率之和;2)在FST中传播概率以支持条件推理(conditioning on transformed outputs)。该方法实现了推理阶段对预训练模型的灵活适配,使其能无缝对接特定应用的输出格式要求。
链接: https://arxiv.org/abs/2603.05193
作者: Vésteinn Snæbjarnarson,Samuel Kiegeland,Tianyu Liu,Reda Boumasmoud,Ryan Cotterell,Tim Vieira
机构: 未知
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注:
Abstract:Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model’s output to the desired form. This is a familiar pattern in probability theory: applying a function f to a random variable X\sim p yields a transformed random variable f(X) with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers – a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to marginalize over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling conditioning on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
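论文的核心运算是边缘化:诱导分布 p(y) = Σ_{x: f(x)=y} p(x),即把所有映射到同一目标串的源串概率相加。下面不实现通用 FST,仅用"token 拼接"这一确定性变换演示该运算的含义(源串分布为假设的玩具数值;论文给出的是在转换器上传播概率的精确与近似算法,而非这里的显式枚举):

```python
from collections import defaultdict

# 玩具 token 级分布:每个源串(token 序列)的概率(假设值)
p_source = {
    ("ab",): 0.4,        # 单个 token "ab"
    ("a", "b"): 0.3,     # 两个 token 拼出同一字节串 "ab"
    ("a", "c"): 0.3,
}

def transduce(tokens):
    """确定性串到串变换:这里简单地把 token 拼接为字节串。"""
    return "".join(tokens)

# 边缘化:p(y) = Σ_{x: f(x) = y} p(x)
p_target = defaultdict(float)
for x, p in p_source.items():
    p_target[transduce(x)] += p

print(dict(p_target))  # 'ab' 的概率 ≈ 0.7(两条源串路径之和),'ac' ≈ 0.3
```

token 到字节的转换正是论文实验之一:多条分词路径对应同一字节串,必须求和才能得到正确的字节级语言模型。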
[NLP-60] Continual Adaptation for Pacific Indigenous Speech Recognition INTERSPEECH
【速读】: 该论文旨在解决生成式语音基础模型在低资源太平洋原住民语言上适应困难的问题,尤其是因数据稀缺导致的性能瓶颈以及全量微调引发的灾难性遗忘问题。其关键解决方案在于系统评估不同适配策略(包括全量微调与低秩适配 LoRA)对真实世界太平洋语料库的适应效果,并提出持续学习框架以实现多语言顺序迁移。研究发现,尽管LoRA在初始适配阶段表现良好,但在连续学习过程中仍面临灾难性遗忘,揭示了模型在可塑性与稳定性之间的严峻权衡,从而强调了开发针对代表性不足语言的鲁棒适配策略的紧迫性。
链接: https://arxiv.org/abs/2603.06310
作者: Yang Xiao,Aso Mahmudi,Nick Thieberger,Eliathamby Ambikairajah,Eun-Jung Holden,Ting Dang
机构: The University of Melbourne (墨尔本大学); UNSW Sydney (新南威尔士大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Submitted to Interspeech
Abstract:Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks catastrophic forgetting. To address this gap, we present an empirical study adapting models to real-world Pacific datasets. We investigate how data volume and linguistic features affect adaptation success. Specifically, we evaluate strategies including Full Fine-Tuning and Low-Rank Adaptation (LoRA). Additionally, we analyze a continual learning framework for sequentially acquiring multiple languages. We demonstrate that adapting to these distant languages causes severe internal representational drift. Consequently, these models face a strict plasticity and stability dilemma. While LoRA adapts well initially, it suffers from catastrophic forgetting during sequential learning. Ultimately, this study highlights the urgent need for robust adaptation strategies tailored to underrepresented languages.
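论文对比的 LoRA 适配只训练低秩增量:W' = W + (α/r)·A·B,冻结原权重 W。下面用纯 Python 小矩阵给出该更新规则的示意(矩阵数值均为假设,实际 LoRA 作用于 Transformer 的注意力投影等大矩阵):

```python
def matmul(A, B):
    """朴素矩阵乘法(示意用,实际实现依赖张量库)。"""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_update(W, A, B, alpha, r):
    """LoRA 合并:W' = W + (alpha / r) * A @ B,只有 A、B 参与训练。"""
    delta = matmul(A, B)
    s = alpha / r
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# 2x2 冻结权重,秩 r=1 的低秩因子(数值为假设)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]        # 2x1
B = [[0.5, 0.5]]          # 1x2
print(lora_update(W, A, B, alpha=2, r=1))
# ΔW = 2 * A@B = [[1.0, 1.0], [2.0, 2.0]] → [[2.0, 1.0], [2.0, 3.0]]
```

论文指出的"可塑性-稳定性"困境正源于此:顺序学习多门语言时,后续任务的低秩增量会覆盖先前任务学到的方向,导致灾难性遗忘。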
信息检索
[IR-0] Efficient Property-Aligned Fan-Out Retrieval via RL-Compiled Diffusion
【速读】:该论文旨在解决**集合值检索(set-valued retrieval)**中的核心挑战:传统监督学习数据集仅优化单条结果(top-1)的召回性能,无法捕捉多样性、覆盖度、互补性等高阶集合属性;而现有基于强化学习(RL)的方法虽能优化集合级目标,但推理阶段调用大语言模型(LLM)的开销过大。其解决方案的关键在于提出 R4T(Retrieve-for-Train)框架,通过三步流程实现高效且目标对齐的扇出检索(fan-out retrieval):首先利用RL训练一个扇出LLM以获得复合集合奖励,其次生成与目标一致的训练样本对,最后训练轻量级扩散模型(diffusion retriever)在嵌入空间中建模集合输出的条件分布,从而在推理时实现单次通过(single-pass)的高效扇出,显著降低延迟并提升检索质量。
链接: https://arxiv.org/abs/2603.06397
作者: Pengcheng Jiang,Judith Yue Li,Moonkyung Ryu,R. Lily Hu,Kun Su,Zhong Yi Wan,Liam Hebert,Hao Peng,Jiawei Han,Dima Kuzmin,Craig Boutilier
机构: University of Illinois Urbana Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Many modern retrieval problems are set-valued: given a broad intent, the system must return a collection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while remaining grounded with respect to a fixed database. Set-valued objectives are typically non-decomposable and are not captured by existing supervised (query, content) datasets which only prioritize top-1 retrieval. Consequently, fan-out retrieval is often employed to generate diverse subqueries to retrieve item sets. While reinforcement learning (RL) can optimize set-level objectives via interaction, deploying an RL-tuned LLM for fan-out retrieval is prohibitively expensive at inference time. Conversely, diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but requires objective-aligned training targets. To address these issues, we propose R4T (Retrieve-for-Train), which uses RL once as an objective transducer in a three-step process: (i) train a fan-out LLM with composite set-level rewards, (ii) synthesize objective-consistent training pairs, and (iii) train a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. Across large-scale fashion and music benchmarks consisting of curated item sets, we show that R4T improves retrieval quality relative to strong baselines while reducing query-time fan-out latency by an order of magnitude. 
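R4T 的复合集合级奖励具体形式未在摘要中给出;下面用"类别覆盖率 + 去重多样性"的加权组合给出一个示意(类别标签、候选集与权重均为假设),说明集合级目标为何不可分解到单条结果——同样的单品放进不同集合,贡献完全不同:

```python
def set_reward(items, target_cats, w_cov=0.5, w_div=0.5):
    """集合级奖励示意:类别覆盖率与列表内多样性的加权和。"""
    cats = [c for _, c in items]
    coverage = len(set(cats) & set(target_cats)) / len(target_cats)
    diversity = len(set(cats)) / len(cats)   # 列表内不同类别占比
    return w_cov * coverage + w_div * diversity

# 时尚场景玩具候选集:(商品, 类别),均为假设数据
items_a = [("shirt1", "top"), ("shirt2", "top"), ("shirt3", "top")]
items_b = [("shirt1", "top"), ("jeans1", "bottom"), ("boots1", "shoes")]
target = ["top", "bottom", "shoes"]
print(set_reward(items_a, target))  # 覆盖 1/3、多样性 1/3 → 约 0.333
print(set_reward(items_b, target))  # 覆盖 3/3、多样性 3/3 → 1.0
```

这类奖励只在整集合层面有定义,因此监督式 (query, content) 数据无法直接优化,需要 RL 作为"目标转换器"再蒸馏给扩散检索器。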
[IR-1] MLLM Rec-R1: Incentivizing Reasoning Capability in Large Language Models for Multimodal Sequential Recommendation
【速读】:该论文旨在解决将基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的推理范式应用于多模态序列推荐(Multimodal Sequential Recommendation, MSR)时所面临的两大核心挑战:一是视觉内容的联合编码导致视觉token占据输入主导地位,使基于群体的rollout成本随历史交互长度和候选集规模显著增长,从而造成训练开销过高;二是现有思维链(Chain-of-Thought, CoT)监督在推荐场景中存在奖励膨胀问题,即高训练奖励无法可靠转化为排名性能提升,甚至引发捷径学习。解决方案的关键在于提出MLLMRec-R1框架:通过离线文本化视觉信号以消除昂贵的视觉token并保留多模态语义,构建高质量多模态CoT监督(结合精炼与置信度感知评估),并引入混合粒度数据增强策略,选择性注入可信CoT样本以缓解奖励膨胀、提升泛化稳定性,从而实现高效且稳定的GRPO推理管道。
链接: https://arxiv.org/abs/2603.06243
作者: Yu Wang,Yonghui Yang,Le Wu,Jiancan Wu,Hefei Xu,Hui Lin
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 14 pages, 10 figures
Abstract:Group relative policy optimization (GRPO) has become a standard post-training paradigm for improving reasoning and preference alignment in large language models (LLMs), and has recently shown strong effectiveness in LLM-based recommender systems. However, extending GRPO-based reasoning pipelines to multimodal sequential recommendation (MSR) with multimodal large language models (MLLMs) faces fundamental obstacles. First, MSR requires jointly encoding visual content for both historical interactions and multiple candidate items, causing visual tokens to dominate the input and making the cost of group-based rollout scale with history length and candidate set size, which renders GRPO-based training prohibitively expensive. Second, existing Chain-of-Thought (CoT) supervision suffers from reward inflation in recommendation scenarios, where higher training rewards do not reliably translate into improved ranking performance and may induce shortcut learning. To address these challenges, we propose MLLMRec-R1, an efficient and stable GRPO-based reasoning framework for multimodal sequential recommendation. MLLMRec-R1 textualizes visual signals offline to eliminate expensive visual tokens while preserving multimodal semantics, and constructs high-quality multimodal CoT supervision through refinement and confidence-aware assessment. Furthermore, a mixed-grained data augmentation strategy selectively injects reliable CoT samples while retaining standard training data, mitigating reward inflation and improving generalization stability. Extensive experiments on three benchmark datasets demonstrate that MLLMRec-R1 consistently outperforms state-of-the-art methods, establishing a practical and effective GRPO-based reasoning pipeline for multimodal sequential recommendation. The code is available at this https URL.
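GRPO 的基本构件是组内相对优势:对同一提示采样一组 rollout,用组内奖励的均值与标准差归一化。下面是该归一化步骤的最小示意(奖励数值为假设;实现细节如用样本标准差还是总体标准差、是否裁剪,各家做法不一):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """组内相对优势:A_i = (r_i - mean) / (std + eps)(示意,取样本标准差)。"""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [1.0, 0.0, 0.5, 0.5]   # 一组 4 条 rollout 的奖励(假设值)
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in adv])  # [1.225, -1.225, 0.0, 0.0]
```

MLLMRec-R1 面临的成本问题由此可见:每个提示需要整组 rollout 才能算出优势,而 MSR 输入被视觉 token 占满时,组采样开销随历史与候选规模线性放大,这正是其离线文本化视觉信号的动机。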
[IR-2] Efficient Vector Search in the Wild: One Model for Multi-K Queries
【速读】:该论文旨在解决当前基于学习的top-K搜索方法在面对多K值查询时的泛化能力不足问题,即模型在特定K值下训练后难以适应不同K值场景,导致大K值时精度下降、小K值时性能损耗。解决方案的关键在于提出OMEGA方法,其核心思想是:通过轨迹特征(trajectory-based features)对基础模型进行K=1的训练,随后利用动态精炼机制准确预测更大K值,同时通过统计学特性优化精炼过程以减少冗余模型调用,从而实现高精度、高性能且低预处理成本的多K向量查询服务。
链接: https://arxiv.org/abs/2603.06159
作者: Yifan Peng,Jiafei Fan,Xingda Wei,Sijie Shen,Rong Chen,Jianning Wang,Xiaojian Luo,Wenyuan Yu,Jingren Zhou,Haibo Chen
机构: Shanghai Jiao Tong University (上海交通大学); Boston University (波士顿大学); Alibaba Group (阿里巴巴集团)
类目: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Learned top-K search is a promising approach for serving vector queries with both high accuracy and performance. However, current models trained for a specific K value fail to generalize to real-world multi-K queries: they suffer from accuracy degradation (for larger Ks) and performance loss (for smaller Ks). Training the model to generalize on different Ks requires orders of magnitude more preprocessing time and is not suitable for serving vector queries in the wild. We present OMEGA, a K-generalizable learned top-K search method that simultaneously achieves high accuracy, high performance, and low preprocessing cost for multi-K vector queries. The key idea is that a base model properly trained on K=1 with our trajectory-based features can be used to accurately predict larger Ks with a dynamic refinement procedure and smaller Ks with minimal performance loss. To make our refinements efficient, we further leverage the statistical properties of top-K searches to reduce excessive model invocations. Extensive evaluations on multiple public and production datasets show that, under the same preprocessing budgets, OMEGA achieves 6-33% lower average latency compared to state-of-the-art learned search methods, while all systems achieve the same recall target. With only 16-30% of the preprocessing time, OMEGA attains 1.01-1.28x of the optimal average latency of these baselines.
[IR-3] ChatShopBuddy: Towards Reliable Conversational Shopping Agents via Reinforcement Learning
【速读】:该论文旨在解决如何在真实场景中有效应用后训练强化学习(Reinforcement Learning, RL)来优化购物对话代理(Conversational Shopping Agents)的问题,此类代理需同时满足多维度且相互依赖的目标:客观指标(如商品正确性)、主观质量(如说服力)、结果奖励(最终响应质量)和过程奖励(工具使用效率)。解决方案的关键在于提出了一套完整的框架:首先构建了SmartShopBench基准,通过分层评估将复杂的质量要求分解为可度量的层级;在此基础上设计了分层奖励建模(Hierarchical Reward Modeling, HRM),利用条件门控结构刻画不同奖励类型之间的逻辑依赖关系;并进一步提出动态对比策略优化(Dynamic Contrastive Policy Optimization, DCPO),通过基于奖励与推理长度的动态轨迹选择实现响应质量与操作效率之间的平衡。实验表明,所提出的RL训练代理ChatShopBuddy在稳定性上优于依赖通用推理的大模型,验证了该方法在实际应用中的有效性。
链接: https://arxiv.org/abs/2603.06065
作者: Yiruo Cheng,Kelong Mao,Tianhao Li,Jiejun Tan,Ji-Rong Wen,Zhicheng Dou
机构: Gaoling School of Artificial Intelligence (高瓴人工智能学院); Renmin University of China (中国人民大学); JD.com (京东)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Conversational shopping agents represent a critical consumer-facing application of Large Language Model (LLM)-powered agents, yet how to effectively apply post-training Reinforcement Learning (RL) to optimize such agents remains underexplored. This work investigates RL-based optimization for shopping agents in real-world scenarios, where agents must simultaneously satisfy multiple interdependent objectives spanning objective metrics (product correctness), subjective qualities (persuasiveness), outcome rewards (final response quality), and process rewards (tool efficiency). We present a complete methodology to address this challenge. Specifically, we first construct SmartShopBench, a benchmark that captures diverse shopping intents with a hierarchical evaluation that decomposes complex quality requirements into measurable levels. Building on this evaluation framework, we design Hierarchical Reward Modeling (HRM) to structure mixed reward types through conditional gating that reflects their logical dependencies. To enable efficient training, we further propose Dynamic Contrastive Policy Optimization (DCPO), which balances response quality with operational efficiency through dynamic trajectory selection based on reward and reasoning length. Extensive experiments demonstrate that our RL-trained agent, namely ChatShopBuddy, consistently outperforms larger models relying on generic reasoning, achieving superior stability rather than merely higher peaks. Our work provides valuable guidance for applying RL to real-world conversational agents.
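层次化奖励建模(HRM)的条件门控思想是:客观指标(商品正确性)是主观质量与过程奖励计入的前提。下面给出一个门控结构的最小示意(权重与函数形式均为假设,非论文的实际奖励定义):

```python
def hierarchical_reward(correct: bool, persuasiveness: float,
                        tool_efficiency: float) -> float:
    """条件门控的层次奖励示意:商品不正确时,后续奖励项一概不计。"""
    if not correct:                  # 门控:客观正确性是硬性前提
        return 0.0
    # 通过门控后,再按(假设的)权重融合主观质量与过程奖励
    return 0.6 + 0.3 * persuasiveness + 0.1 * tool_efficiency

print(hierarchical_reward(False, 0.9, 0.9))  # 未过正确性门控 → 0.0
print(hierarchical_reward(True, 0.5, 1.0))   # 0.6 + 0.3*0.5 + 0.1*1.0 = 0.85
```

与简单加权求和相比,门控结构防止策略用"高说服力但推错商品"来骗取奖励,体现了摘要中所说的奖励间逻辑依赖。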
[IR-4] Sensitivity-Aware Retrieval-Augmented Intent Clarification ECIR2026
【速读】:该论文旨在解决在敏感领域(如医疗、政府或法律)中,如何在保持检索增强型意图澄清(retrieval-augmented intent clarification)性能的同时,有效保护检索数据库中的敏感信息。其核心挑战在于平衡系统实用性与数据安全性之间的权衡。解决方案的关键在于提出三步策略:首先定义攻击模型以明确威胁场景;其次在检索层级设计对敏感性感知的防御机制,防止敏感信息泄露;最后开发评估方法,量化保护强度与系统效用之间的权衡关系。
链接: https://arxiv.org/abs/2603.06025
作者: Maik Larooij
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to CoSCIN@ECIR2026 (Workshop on Conversational Search for Complex Information Needs)
Abstract:In conversational search systems, a key component is to determine and clarify the intent behind complex queries. We view intent clarification in light of the exploratory search paradigm, where users, through an iterative, evolving process of selection, exploration and retrieval, transform a visceral or conscious need into a formalized one. Augmenting the clarification component with a retrieval step (retrieval-augmented intent clarification) can seriously enhance clarification performance, especially in domains where Large Language Models (LLMs) lack parametric knowledge. However, in more sensitive domains, such as healthcare, government (e.g. FOIA search) or legal contexts, the retrieval database may contain sensitive information that needs protection. In this paper, we explore the research challenge of developing a retrieval-augmented conversational agent that can act as a mediator and gatekeeper for the sensitive collection. To do that, we also need to know what we are protecting and against what. We propose to tackle this research challenge in three steps: 1) define an attack model, 2) design sensitivity-aware defenses on the retrieval level and 3) develop evaluation methods to measure the trade-off between the level of protection and the system’s utility.
[IR-5] Balancing Domestic and Global Perspectives: Evaluating Dual-Calibration and LLM-Generated Nudges for Diverse News Recommendation
【速读】:该论文旨在解决新闻推荐系统中用户阅读覆盖范围狭窄的问题,特别是针对新闻地域多样性(即国内与国际新闻)不足的现状。其解决方案的关键在于提出并验证了一种“个性化多样性助推框架”(personalized diversity nudge framework),其中包含两个核心机制:一是设计了一种基于主题-地域双重校准的算法助推策略(topic-locality dual calibration algorithmic nudge),用于优化推荐内容的多样性分布;二是引入基于大语言模型(Large Language Model, LLM)的个性化呈现助推策略(LLM-based news personalization presentation nudge),以增强用户对多样化内容的接受度。实证研究表明,算法助推能有效提升用户对不同地域新闻的曝光和消费多样性,而LLM呈现助推的效果则因用户个体差异而异,且强调文章与用户历史阅读内容的相关性比单纯基于主题或无个性化推荐更能促进点击行为。
链接: https://arxiv.org/abs/2603.05780
作者: Ruixuan Sun,Matthew Zent,Minzhu Zhao,Thanmayee Boyapati,Xinyi Li,Joseph A. Konstan
机构: Grouplens Research, University of Minnesota (明尼苏达大学群组实验室); Northwestern University (西北大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:In this study, we applied the ``personalized diversity nudge framework’’ with the goal of expanding user reading coverage in terms of news locality (i.e., domestic and world news). We designed a novel topic-locality dual calibration algorithmic nudge and a large language model-based news personalization presentation nudge, then launched a 5-week real-user study with 120 U.S. news readers on the news recommendation experiment platform POPROX. With user interaction logs and survey responses, we found that algorithmic nudges can successfully increase exposure and consumption diversity, while the impact of LLM-based presentation nudges varied. User-level topic interest is a strong predictor of user clicks, while highlighting the relevance of news articles to prior read articles outperforms generic topic-based and no personalization. We also demonstrate that longitudinal exposure to calibrated news may shift readers’ reading habits to value a balanced news digest from both domestic and world articles. Our results provide direction for future work on nudging for diverse consumption in news recommendation systems.
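校准式重排的常见做法(Steck 式校准推荐)是贪心选取条目,使列表的属性分布逼近目标分布,同时兼顾相关性。下面是一个只按"地域"单维校准的示意(候选、得分与权重均为假设;论文实际做的是主题-地域双重校准):

```python
from math import log

def kl(p, q, eps=1e-6):
    """KL(p || q),对 q 加平滑防止取零。"""
    return sum(p[k] * log(p[k] / (q.get(k, 0) + eps)) for k in p if p[k] > 0)

def calibrated_pick(candidates, target_dist, k=4, lam=0.9):
    """贪心校准重排:score = (1-λ)*相关性 - λ*KL(目标分布, 列表分布)。"""
    chosen = []
    while len(chosen) < k:
        best, best_score = None, float("-inf")
        for item in candidates:
            if item in chosen:
                continue
            locs = [loc for _, _, loc in chosen + [item]]
            q = {l: locs.count(l) / len(locs) for l in set(locs)}
            score = (1 - lam) * item[1] - lam * kl(target_dist, q)
            if score > best_score:
                best, best_score = item, score
        chosen.append(best)
    return [name for name, _, _ in chosen]

# (文章, 相关性得分, 地域) —— 数值均为假设
cands = [("d1", .9, "domestic"), ("d2", .8, "domestic"), ("d3", .7, "domestic"),
         ("w1", .6, "world"), ("w2", .5, "world"), ("d4", .4, "domestic")]
target = {"domestic": 0.5, "world": 0.5}   # 目标:国内/国际各半
print(calibrated_pick(cands, target))      # ['d1', 'w1', 'd2', 'w2']
```

纯按相关性会选出四篇国内新闻;校准项使结果在国内/国际间交替,对应论文中提升地域曝光多样性的算法助推。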
[IR-6] The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok
【速读】:该论文试图解决的问题是:欧盟《数字服务法》(Digital Service Act, DSA)第28(2)条虽禁止对未成年人进行基于画像(profiling)的广告推送,但其对“广告”的定义过于狭窄,未能涵盖当前广泛存在的品牌/网红营销(influencer marketing)和未标明的推广内容(undisclosed promotional content),导致平台在形式上合规却实质上仍对未成年人实施个性化商业诱导。解决方案的关键在于:通过算法审计实证揭示这一规制缺口的实际运作机制,主张应将品牌/网红营销纳入“广告”范畴,并相应扩展DSA第28(2)条的适用范围,确保所有功能性等同于广告的商业内容不得绕过对未成年人的保护措施,从而实现真正意义上的数字环境中的未成年人权益保障。
链接: https://arxiv.org/abs/2603.05653
作者: Sara Solarova,Matej Mosnar,Matus Tibensky,Jan Jakubcik,Adrian Bindas,Simon Liska,Filip Hossner,Matúš Mesarčík,Ivan Srba
机构: Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所); Comenius University Bratislava (考门纽斯大学布拉迪斯拉发分校)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注:
Abstract:Adolescents spend an increasing amount of their time in digital environments where their still-developing cognitive capacities leave them unable to recognize or resist commercial persuasion. Article 28(2) of the Digital Service Act (DSA) responds to this vulnerability by prohibiting profiling-based advertising to minors. However, the regulation’s narrow definition of “advertisement” excludes current advertising practices including influencer marketing and promotional content that serve functionally equivalent commercial purposes. We provide the first empirical evidence of how this definitional gap operates in practice through an algorithmic audit of TikTok. Our approach deploys sock-puppet accounts simulating a pair of minor and adult users with distinct interest profiles. The content recommended to these users is automatically annotated, enabling systematic statistical analysis across four video categories: containing formal, disclosed, undisclosed and none advertisement; as well as advertisement topical relevance to user’s interest. Our findings reveal a stark regulatory paradox. TikTok demonstrates formal compliance with Article 28(2) by shielding minors from profiled formal advertisements, yet both disclosed and undisclosed ads exhibit significant profiling aligned with user interests (5-8 times stronger than for adult formal advertising). The strongest profiling emerges within undisclosed commercial content, where brands/creators fail to label promotional content/paid partnership and the platform neither corrects this omission nor prevents its personalized delivery to minors. We argue that protecting minors requires expanding the regulatory definition of advertisement to encompass brand/influencer marketing and extending the Article 28(2) prohibition accordingly, ensuring that commercial content cannot circumvent protections merely by operating outside formal advertising channels.
[IR-7] CBR-to-SQL: Rethinking Retrieval-based Text-to-SQL using Case-based Reasoning in the Healthcare Domain
【速读】:该论文旨在解决从电子健康记录(Electronic Health Record, EHR)数据库中提取洞察时对SQL技能的依赖问题,这一门槛限制了医疗决策与研究的效率。传统基于检索增强生成(Retrieval-Augmented Generation, RAG)的方法在医学领域表现不佳,因其依赖静态示例池进行单步检索,难以应对医学术语和行话的变异性与噪声,易导致反模式(如扩大演示池引入噪声)。解决方案的关键在于提出CBR-to-SQL框架,受案例推理(Case-Based Reasoning, CBR)启发,将问题-SQL对表示为可复用的抽象案例模板,并采用两阶段检索机制:首先捕获逻辑结构,再解析相关实体,从而提升在数据稀缺和检索扰动下的样本效率与鲁棒性。
链接: https://arxiv.org/abs/2603.05569
作者: Hung Nguyen,Hans Moen,Pekka Marttinen
机构: Aalto University (阿尔托大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Extracting insights from Electronic Health Record (EHR) databases often requires SQL expertise, creating a barrier for healthcare decision-making and research. While a promising approach is to use Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG), adapting this approach to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the task demonstration pool to improve coverage, which in turn introduces noise and scalability problems. To address this, we introduce CBR-to-SQL, a framework inspired by Case-Based Reasoning (CBR). It represents question-SQL pairs as reusable, abstract case templates and utilizes a two-stage retrieval process that first captures logical structure and then resolves relevant entities. Evaluated on MIMICSQL, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. More importantly, it demonstrates higher sample efficiency and robustness than standard RAG approaches, particularly under data scarcity and retrieval perturbations.
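CBR-to-SQL 的核心思路——“先按逻辑结构匹配抽象案例模板,再解析实体填入”——可以用如下极简 Python 示意来理解(非论文官方实现,案例库、结构签名与函数名均为假设示例):

```python
# CBR-to-SQL 两阶段检索的极简示意(非官方实现):
# 第一阶段按问题的逻辑结构匹配抽象案例模板,第二阶段再将实体填入占位符。

CASE_BASE = [
    # 抽象案例模板:逻辑结构签名 -> 带占位符的 SQL 模板(均为假设示例)
    {"structure": ("count", "filter"),
     "sql": "SELECT COUNT(*) FROM {table} WHERE {column} = '{value}'"},
    {"structure": ("avg", "filter"),
     "sql": "SELECT AVG({column}) FROM {table} WHERE {filter_col} = '{value}'"},
]

def stage1_retrieve(structure):
    """第一阶段:按逻辑结构检索最匹配的案例模板。"""
    for case in CASE_BASE:
        if case["structure"] == structure:
            return case
    return None

def stage2_resolve(case, entities):
    """第二阶段:将问题中解析出的实体填入模板占位符。"""
    return case["sql"].format(**entities)

def cbr_to_sql(structure, entities):
    case = stage1_retrieve(structure)
    if case is None:
        raise ValueError("no matching case template")
    return stage2_resolve(case, entities)
```

由于模板只保留逻辑骨架、与具体医学术语解耦,同一案例可被不同措辞的问题复用,这正是论文强调的样本效率来源。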
[IR-8] AutoThinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction
【速读】:该论文旨在解决信息密集型文档问答(Information-intensive Document Question Answering, DocQA)中因长上下文和信息过载导致视觉语言模型(Vision-Language Models, VLMs)难以进行精准直接推理的问题,同时应对现有多模态GraphRAG框架在处理多样化查询复杂度时对大规模模型的依赖及端到端VLM固有的推理瓶颈。解决方案的关键在于提出AutoThinkRAG框架,其核心创新包括:一是设计Query Complexity Router,根据查询难度动态分配推理路径;二是引入功能解耦架构,由小型VLM作为高保真视觉解释器将相关视觉线索转化为文本表示,再交由大语言模型(Large Language Model, LLM)进行逻辑推理与整合,从而在降低推理成本的同时实现性能提升。
链接: https://arxiv.org/abs/2603.05551
作者: Jiashu Yang,Chi Zhang,Abudukelimu Wuerkaixi,Xuxin Cheng,Cao Liu,Ke Zeng,Xu Jia,Xunliang Cai
机构: Dalian University of Technology (大连理工大学); Tsinghua University (清华大学); Meituan LongCat Interaction Team (美团龙猫交互团队)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Information-intensive Document Question Answering (DocQA) is often constrained by long contexts and information overload, which hinders Vision-Language Models (VLMs) from performing precise direct reasoning. Although multimodal GraphRAG has achieved preliminary breakthroughs, existing frameworks still face dual challenges: (1) the necessity of large-scale models for handling queries of diverse complexities and (2) the inherent reasoning bottlenecks of end-to-end VLMs. To address these issues, we propose AutoThinkRAG, a framework that enhances the understanding of complex documents by synergizing the capabilities of multiple models. Specifically, we introduce a Query Complexity Router to allocate reasoning paths based on the analysis of query difficulty. Furthermore, to overcome the reasoning boundaries of VLM, we propose a functional decoupling architecture: a small-scale VLM serves as a high-fidelity visual interpreter to transform query-relevant visual cues into textual representations, which are subsequently processed by an LLM for logical deduction and synthesis. Extensive experiments on DocBench and MMLongBench demonstrate that AutoThinkRAG significantly reduces inference costs while achieving new state-of-the-art performance. Further ablation studies verify the effectiveness of our proposed method.
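Query Complexity Router 与“小 VLM 转述视觉线索、LLM 负责逻辑推理”的功能解耦,可以用下面的极简 Python 示意来理解(非论文官方实现,难度启发式与各模型接口均为假设):

```python
# AutoThinkRAG 路由思路的极简示意(非官方实现):
# 简单查询走直接问答;复杂查询走“小 VLM 转述 -> LLM 推理”的解耦路径。

def estimate_complexity(query: str) -> str:
    """假设的启发式难度估计:出现多跳/比较类关键词即视为复杂查询。"""
    multi_hop_cues = ("compare", "trend", "across", "difference")
    return "complex" if any(c in query.lower() for c in multi_hop_cues) else "simple"

def route(query: str, small_vlm, llm, direct_qa):
    """small_vlm / llm / direct_qa 为调用方注入的可调用对象(假设接口)。"""
    if estimate_complexity(query) == "simple":
        return direct_qa(query)
    # 功能解耦:小 VLM 将查询相关的视觉线索转为文本,再交给 LLM 推理整合
    visual_text = small_vlm(query)
    return llm(query, visual_text)
```

这种按难度分配路径的设计,使得大模型只在确有必要时才被调用,从而降低整体推理成本。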
[IR-9] VDCook: DIY video data cook your MLLMs
【速读】:该论文旨在解决当前视频训练数据构建过程中存在的静态性、低效性和缺乏可扩展性的问题,即传统数据集通常为一次性构建且难以适应领域演进与持续更新的需求。其解决方案的关键在于提出VDCook——一个自演化视频数据操作系统(self-evolving video data operating system),通过自然语言驱动的查询接口与可配置参数(如规模、检索-合成比例、质量阈值)实现自动化数据请求处理;系统并行执行真实视频检索与可控合成模块,并基于MCP(Model Context Protocol)实现自动化的数据摄入机制,从而将数据集转化为动态演化的开放生态系统。此外,平台提供多维元数据标注(如场景分割、运动评分、OCR比率、自动字幕等),支持灵活的数据“烹饪”(cooking)和索引,显著降低专业视频数据构建门槛,同时赋能社区协作与治理驱动的数据扩展范式。
链接: https://arxiv.org/abs/2603.05539
作者: Chengwei Wu
机构: BAI Academy of Artificial Intelligence (北京人工智能研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:
Abstract:We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol), transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data ‘cooking’ and indexing. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: this https URL
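摘要中提到的数据请求参数(规模、检索-合成比例、质量阈值)可以用如下极简 Python 示意来说明其校验与补全逻辑(字段名与取值范围均为假设,并非 VDCook 官方 API):

```python
# VDCook 式数据请求参数校验的极简示意(字段名均为假设):
# 检索比例与合成比例之和为 1,质量阈值限定在 [0, 1]。

def validate_request(req):
    assert req["scale"] > 0, "scale must be positive"
    assert 0.0 <= req["retrieval_ratio"] <= 1.0, "retrieval_ratio in [0, 1]"
    assert 0.0 <= req["quality_threshold"] <= 1.0, "quality_threshold in [0, 1]"
    # 合成比例由检索比例补齐
    return {**req, "synthesis_ratio": 1.0 - req["retrieval_ratio"]}
```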
[IR-10] Verify as You Go: An LLM-Powered Browser Extension for Fake News Detection
【速读】:该论文旨在解决数字时代虚假新闻传播对公众信任和民主制度造成的严重威胁,尤其是现有浏览器扩展工具在模型行为不透明、解释支持有限及用户参与度不足等方面的缺陷。其解决方案的关键在于提出了一款名为Aletheia的新式浏览器扩展,该工具基于检索增强生成(Retrieval-Augmented Generation, RAG)与大语言模型(Large Language Models, LLMs)实现虚假新闻检测,并提供基于证据的解释;同时集成两个交互组件:讨论中心(Discussion Hub)支持用户围绕标记内容展开对话,以及“保持知情”(Stay Informed)功能推送最新事实核查信息,从而提升检测的准确性与透明度,并增强用户的参与感与信任度。
链接: https://arxiv.org/abs/2603.05519
作者: Dorsaf Sallami,Esma Aïmeur
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:
Abstract:The rampant spread of fake news in the digital age poses serious risks to public trust and democratic institutions, underscoring the need for effective, transparent, and user-centered detection tools. Existing browser extensions often fall short due to opaque model behavior, limited explanatory support, and a lack of meaningful user engagement. This paper introduces Aletheia, a novel browser extension that leverages Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to detect fake news and provide evidence-based explanations. Aletheia further includes two interactive components: a Discussion Hub that enables user dialogue around flagged content and a Stay Informed feature that surfaces recent fact-checks. Through extensive experiments, we show that Aletheia outperforms state-of-the-art baselines in detection performance. Complementing this empirical evaluation, a user study with 250 participants confirms the system’s usability and perceived effectiveness, highlighting its potential as a transparent tool for combating online fake news.
人机交互
[HC-0] The Pen: Episodic Cognitive Assistance via an Ear-Worn Interface
【速读】:该论文旨在解决可穿戴人工智能(Wearable AI)设备因持续可用性而引发的隐私担忧、社交干扰及系统边界模糊等问题,这些问题往往导致用户在工作与社交场景中感到不适。其解决方案的关键在于提出“间歇式使用”(episodic use)范式,即设备仅在短时专注任务期间被主动激活,任务结束后即被取下,从而明确辅助的起止边界;文中设计的“笔形耳戴设备”(The Pen)通过语音与视觉上下文支持按需触发的本地化处理,有效增强了用户的自主感(agency)、认知流畅性(cognitive flow)和社交舒适度。
链接: https://arxiv.org/abs/2603.06564
作者: Yonatan Tussa,Andy Heredia
机构: University of Maryland, College Park (马里兰大学学院公园分校); University of Maryland Global Campus (马里兰大学全球校区)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Wearable AI is often designed as always-available, yet continuous availability can conflict with how people work and socialize, creating discomfort around privacy, disruption, and unclear system boundaries. This paper explores episodic use of wearable AI, where assistance is intentionally invoked for short periods of focused activity and set aside when no longer needed, with a form factor that reflects this paradigm of wearing and taking off a device between sessions. We present The Pen, an ear-worn device resembling a pen, for episodic, situated cognitive assistance. The device supports short, on-demand assistance sessions using voice and visual context, with clear start/end boundaries and local processing. We report findings from an exploratory study showing how layered activation boundaries shape users’ sense of agency, cognitive flow, and social comfort.
[HC-1] Capability at a Glance: Design Guidelines for Intuitive Avatars Communicating Augmented Actions in Virtual Reality
【速读】:该论文试图解决的问题是:在虚拟现实(Virtual Reality, VR)环境中,如何有效触发用户超越人类自然能力的交互潜力,因为这种潜力往往并不明显。解决方案的关键在于通过“可供性”(Affordance)视角优化化身(avatar)设计,使其能够清晰传达自身功能及其激活方式,从而提升用户对VR中复杂能力的理解与使用效率。研究通过专家设计实践提炼出16条通用及类别特定的设计指南,并经实证验证其在增强直观性和可用性方面的有效性。
链接: https://arxiv.org/abs/2603.06556
作者: Yang Lu,Tianyu Zhang,Jiamu Tang,Yanna Lin,Jiankun Yang,Longyu Zhang,Shijian Luo,Yukang Yan
机构: Zhejiang University (浙江大学); University of Rochester (罗切斯特大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Virtual Reality (VR) enables users to engage with capabilities beyond human limitations, but it is not always obvious how to trigger these capabilities. Taking the lens of Affordance, we believe avatar design is the key to solving this issue, which ideally should communicate its capabilities and how to activate them. To understand the current practice, we selected eight capabilities across four categories and invited twelve professional designers to design avatars that communicate the capabilities and their corresponding interactions. From the resulting designs, we formed 16 guidelines to provide general and category-specific recommendations. Then, we validated these guidelines by letting two groups of twelve participants design avatars with and without guidelines. Participants rated the guidelines’ clarity and usefulness highly. External judges confirmed that avatars designed with the guidelines were more intuitive in conveying the capabilities and interaction methods. Finally, we demonstrated the applicability of the guidelines in avatar design for four VR applications.
[HC-2] Structured Exploration vs. Generative Flexibility: A Field Study Comparing Bandit and LLM Architectures for Personalised Health Behaviour Interventions
【速读】:该论文旨在解决数字健康干预中行为改变技术(Behaviour Change Techniques, BCTs)的选择与交付效率问题,即如何在个性化、动态环境中优化BCT的使用以提升用户感知帮助性。其解决方案的关键在于结合上下文bandit算法与大语言模型(Large Language Models, LLMs)的能力:一方面利用bandit方法实现对BCT的探索-利用平衡以系统性覆盖多种技术;另一方面借助LLMs生成更具情境敏感性和自然语义的干预消息。研究发现,尽管LLM生成的消息显著优于固定模板,但引入bandit优化并未进一步提升用户感知帮助性,反而揭示出无约束LLM易聚焦单一BCT,而bandit机制则强制多技术探索。最终提出设计建议:应优先构建能基于用户输入进行情境化回应的反思型AI系统,在结构化探索与生成自主性之间取得平衡。
链接: https://arxiv.org/abs/2603.06330
作者: Dominik P. Hofer,Haochen Song,Rania Islambouli,Laura Hawkins,Ananya Bhattacharjee,Meredith Franklin,Joseph Jay Williams,Jan D. Smeddinck
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Currently under review at a conference
Abstract:Behaviour Change Techniques (BCTs) are central to digital health interventions, yet selecting and delivering effective techniques remains challenging. Contextual bandits enable statistically grounded optimisation of BCT selection, while Large Language Models (LLMs) offer flexible, context-sensitive message generation. We conducted a 4-week study on physical activity motivation (N=54; 9 post-study interviews) that compared five daily messaging approaches: random templates, contextual bandit with templates, LLM generation, hybrid bandit+LLM, and LLM with interaction history. LLM-based approaches were rated substantially more helpful than templates, but no significant differences emerged among LLM conditions. Unexpectedly, bandit optimisation for BCTs selection yielded no additional perceived helpfulness compared with LLM-only approaches. Unconstrained LLMs focused heavily on a single BCT, whereas bandit systems enforced systematic exploration-exploitation across techniques. Quantitative and qualitative findings suggest contextual acknowledgement of user input drove perceived helpfulness. We contribute design suggestions for reflective AI health behaviour change systems that address a trade-off between structured exploration and generative autonomy.
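论文中 bandit 系统“在各种 BCT 之间强制探索-利用”的机制,可以用一个极简的 epsilon-greedy 示意来理解(非论文官方算法,BCT 名称、奖励与参数均为假设):

```python
# 用 epsilon-greedy bandit 选择行为改变技术(BCT)的极简示意(非官方实现):
# 以概率 epsilon 随机探索,否则利用当前平均反馈最高的技术。
import random

class BCTBandit:
    def __init__(self, bcts, epsilon=0.1, seed=0):
        self.bcts = list(bcts)
        self.epsilon = epsilon
        self.counts = {b: 0 for b in bcts}
        self.values = {b: 0.0 for b in bcts}  # 各技术的平均反馈(如感知帮助度)
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.bcts)  # 探索:随机尝试一种技术
        return max(self.bcts, key=lambda b: self.values[b])  # 利用:取当前最优

    def update(self, bct, reward):
        self.counts[bct] += 1
        n = self.counts[bct]
        self.values[bct] += (reward - self.values[bct]) / n  # 增量更新均值
```

与之相对,无约束的 LLM 没有这种显式探索机制,因而会如论文观察到的那样聚焦于单一 BCT。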
[HC-3] Exploring Socially Assistive Peer Mediation Robots for Teaching Conflict Resolution to Elementary School Students AAAI
【速读】:该论文试图解决的问题是在缺乏同伴调解(peer mediation)项目的小学环境中,如何为学生提供一种可及的、有效的冲突解决技能训练方式。解决方案的关键在于探索社交助手机器人(Socially Assistive Robots, SARs)在小学阶段用于模拟同伴调解角色扮演活动中的可行性和效果。研究发现,尽管未在统计学上显著区分机器人与平板电脑条件下的自我认知和测验表现差异,但SARs能激发学生的积极反馈,并促进其对自身情绪调节与人际帮助能力的认知提升;同时,研究识别出学生阅读水平和表达倾向对任务参与度的影响,这为未来基于SAR的干预设计提供了重要依据。
链接: https://arxiv.org/abs/2603.06255
作者: Kaleen Shrestha,Harish Dukkipati,Avni Hulyalkar,Kyla Penamante,Ankita Samanta,Maja Matarić
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Published in Proceedings of the AAAI Symposium Series (2025)
Abstract:In peer mediation–an approach to conflict resolution used in many K-12 schools in the United States–students help other students to resolve conflicts. For schools without peer mediation programs, socially assistive robots (SARs) may be able to provide an accessible option to practice peer mediation. We investigate how elementary school students react to a peer mediator role-play activity through an exploratory study with SARs. We conducted a small single-session between-subjects study with 12 participants. The study had two conditions, one with two robots acting as disputants, and the other without the robots and just the tablet. We found that a majority of students had positive feedback on the activity, with many students saying the peer mediation practice helped them feel better about themselves. Some said that the activity taught them how to help friends during conflict, indicating that the use of SARs for peer mediation practice is promising. We observed that participants had varying reading levels that impacted their ability to read and dictate the turns in the role-play script, an important consideration for future study design. Additionally, we found that some participants were more expressive while reading the script and throughout the activity. Although we did not find statistical differences in pre-/post-session self-perception and quiz performance between the robot and tablet conditions, we found strong correlations (p < 0.05) between certain trait-related measures and learning-related measures in the robot condition, which can inform future study design for SARs for this and related contexts.
[HC-4] Skill-Adaptive Ghost Instructors: Enhancing Retention and Reducing Over-Reliance in VR Piano Learning
【速读】:该论文旨在解决虚拟现实(VR)中运动技能学习系统因持续提供外部提示(persistent cues)而导致的学习者过度依赖提示、削弱记忆巩固与技能迁移的问题。其解决方案的关键在于提出一种动态自适应的透明度机制,即设计一个“幽灵导师”(ghost instructor),其视觉透明度在训练过程中根据学习者的实时表现动态调整,从而实现提示强度与学习者能力之间的匹配。实验表明,相较于固定透明度的静态模式,该动态模式显著提升了音高和指法准确性,并有效控制错误增长,同时保持了时间节奏的一致性,证明了该机制有助于学习者更有效地内化动作模式,降低对外部提示的依赖,增强短期技能保留效果。
链接: https://arxiv.org/abs/2603.06253
作者: Tzu-Hsin Hsieh,Cassandra Michelle Stefanie Visser,Elmar Eisemann,Ricardo Marroquim
机构: Delft University of Technology (代尔夫特理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026 (ACM Conference on Human Factors in Computing Systems)
Abstract:Motor-skill learning systems in XR rely on persistent cues. However, constant cueing can induce overreliance and erode memorization and skill transfer. We introduce a skill-adaptive, dynamically transparent ghost instructor whose opacity adapts in real time to learner performance. In a first-person perspective, users observe a ghost hand executing piano fingering with either a static or a performance-adaptive transparency in a VR piano training application. We conducted a within-subjects study (N=30), where learners practiced with traditional Static (fixed-transparency) and our proposed Dynamic (performance-adaptive) modes and were tested without guidance immediately and after a 10-minute retention interval. Relative to Static, the Dynamic mode yielded higher pitch and fingering accuracy and limited error increases, with comparable timing. These findings suggest that adaptive transparency helps learners internalize fingerings more effectively, reducing dependency on external cues and improving short-term skill retention within immersive learning environments. We discuss design implications for motor-skill learning and outline directions for extending this approach to longer-term retention and more complex tasks.
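“幽灵导师”按学习者表现自适应调整透明度的核心映射,可以用一个极简 Python 示意来说明(非论文官方实现,线性映射与上下限均为假设):

```python
# 表现自适应透明度的极简示意(非官方实现):
# 准确率越高,幽灵手的不透明度越低(提示越弱),促使学习者内化动作。

def ghost_opacity(accuracy, min_op=0.1, max_op=1.0):
    """accuracy 取 0~1;线性映射:表现越好,不透明度越接近 min_op。"""
    accuracy = max(0.0, min(1.0, accuracy))  # 裁剪到合法区间
    return max_op - (max_op - min_op) * accuracy
```

保留 min_op 下限意味着提示从不完全消失,只是随能力提升而淡出,对应论文中“减少依赖而非突然撤除提示”的设计意图。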
[HC-5] Beyond Scores: Explainable Intelligent Assessment Strengthens Pre-service Teachers Assessment Literacy
【速读】:该论文旨在解决职前教师评估素养(Assessment Literacy, AL)难以培养的问题,传统教师培养项目偏重理论知识传授,而数字评估工具常提供不透明的评分或参数,限制了教师的反思与迁移能力,导致AL发展不足。解决方案的关键在于提出XIA(eXplainable Intelligent Assessment)平台,其核心创新是将统计学支持扩展为可视化的认知诊断推理机制,包括对比性(contrastive)和反事实(counterfactual)解释,从而通过可解释性支架(explanatory scaffolding)促进教师从依赖分数判断向基于证据的推理转变,有效提升评估反思、自我调节与评估意识,并减少评估错误。
链接: https://arxiv.org/abs/2603.06059
作者: Yuang Wei,Fei Wang,Yifan Zhang,Brian Y. Lim,Bo Jiang
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 26 pages,8 figures. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI’ 26)
Abstract:Assessment literacy (AL) is essential for personalized education, yet difficult to cultivate in pre-service teachers. Conventional teacher preparation programs focus on theoretical knowledge, while digital assessment tools commonly provide opaque scores or parameters. These limitations hinder reflection and transfer, leaving AL underdeveloped. We propose XIA, an eXplainable Intelligent Assessment platform that extends statistics-informed support with visualized cognitive diagnostic reasoning, including contrastive and counterfactual explanations. In a pre-post controlled study with 21 pre-service teachers, we combined quantitative tasks and questionnaires with qualitative interviews. The findings offer preliminary evidence that XIA supported reflection, self-regulation, and assessment awareness, and helped reduce assessment errors. Interviews further showed a shift from score-based judgments toward evidence-based reasoning. This work contributes insights into the design of intelligent assessment tools, showing how explanatory scaffolding can bridge assessment theory and classroom practice and support the cultivation of AL in teacher education.
[HC-6] Is it Me? Toward Self-Extension to AI Avatars in Virtual Reality
【速读】:该论文试图解决的问题是:当生成式 AI(Generative AI)与具身化身(embodied avatar)结合,使用户的行为和语音被 AI 修改并作为其代理进行表达时,用户如何感知自身在这一过程中的代理权(agency)、作者身份(authorship)以及自我认同(self-identification)。传统人机交互(HCI)研究多将系统视为外部工具,忽视了用户行为与 AI 输出融合后带来的体验变化。解决方案的关键在于设计并实现一个名为 ProxyMe 的虚拟现实(VR)原型系统,该系统通过整合具身化身、语音克隆(voice cloning)与 AI 介导的语音增强(AI-mediated speech augmentation),让用户在体验中将 AI 修改后的语音和行为感知为自我的延伸——即“化身自我扩展”(avatar self-extension)。该方案的核心机制在于通过控制委托程度(delegation)和可操控性(steerability)来探索用户对自我表达的主观体验变化。
链接: https://arxiv.org/abs/2603.06030
作者: Jieying Zhang,Steeven Villa,Abdallah El Ali
机构: University of Amsterdam (阿姆斯特丹大学); Centrum Wiskunde & Informatica (荷兰数学与计算机科学研究中心); LMU Munich (慕尼黑路德维希-马克西米利安大学); Utrecht University (乌得勒支大学)
类目: Human-Computer Interaction (cs.HC)
备注: Cond. accepted to ACM CHI '26 Extended Abstracts (Poster)
Abstract:Advances in generative AI, speech synthesis, and embodied avatars enable systems that not only assist communication, but can act as proxies on users’ behalf. Prior work in HCI has largely focused on systems as external tools, with less attention paid to the experiential consequences of users’ speech and actions becoming assimilated with AI-generated output. We introduce the design and implementation of ProxyMe, a work-in-progress VR prototype that allows users to embody an avatar whose voice and spoken content are modified by an AI system. By combining avatar-based embodiment, voice cloning, and AI-mediated speech augmentation, ProxyMe invites the exploration of avatar self-extension: situations in which AI-modified communication is experienced as part of one’s own expressive behavior. We chart out research challenges and envisioned scenarios, with a focus on how varying degrees of delegation and steerability can influence perceived agency, authorship, and self-identification.
[HC-7] Hierarchical Resource Rationality Explains Human Reading Behavior
【速读】:该论文试图解决的问题是:现有理论仅能解释阅读过程中的眼动行为或理解效果,但未能阐明眼动与理解之间的关键关联机制。解决方案的关键在于提出“资源理性优化”(resource-rational optimization)作为统一原则,指导适应性阅读行为——即眼动决策被选择以在最大化预期理解的同时最小化认知和时间成本,并在嵌套的时间尺度上进行层级组织:固定决策支持词识别,句子层面的整合引导跳读和回视,文本层面的理解目标塑造记忆构建与重读行为。这一机制成功复现了人类阅读中从词汇效应到理解结果的广泛现象,表明资源理性为知识密集型人类行为中感知、记忆与行动的协调提供了普适机制。
链接: https://arxiv.org/abs/2603.06006
作者: Yunpeng Bai,Xiaofu Jin,Shengdong Zhao,Antti Oulasvirta
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Reading is a pervasive and cognitively demanding activity that underpins modern human culture. It is a prime instance of a class of tasks where eye movements are coordinated for the purpose of comprehension. Existing theories explain either eye movements or comprehension during reading, but the critical link between the two remains unclear. Here, we propose resource-rational optimization as a unifying principle governing adaptive reading behavior. Eye movements are selected to maximize expected comprehension while minimizing cognitive and temporal costs, organized hierarchically across nested time scales: fixation decisions support word recognition; sentence-level integration guides skipping and regression; and text-level comprehension goals shape memory construction and rereading. A computational implementation successfully replicates an unprecedented range of findings in human reading, from lexical effects to comprehension outcomes. Together, these results suggest that resource rationality provides a general mechanism for coordinating perception, memory, and action in knowledge-intensive human behaviors, offering a principled account of how complex cognitive skills adapt to limited resources.
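资源理性的核心决策规则——在候选眼动动作中选择“预期理解收益减去认知/时间成本”最大者——可以写成如下极简示意(非论文的计算模型,动作与数值均为假设):

```python
# 资源理性眼动决策的极简示意(非官方模型):
# 每个候选动作带 (预期理解收益, 成本),选净效用最大的动作。

def choose_action(actions):
    """actions: {动作名: (预期理解收益, 成本)};返回净效用最大的动作名。"""
    return max(actions, key=lambda a: actions[a][0] - actions[a][1])
```

在论文的层级框架中,这一规则在注视、句子整合、篇章重读等多个时间尺度上分别实例化,只是收益与成本的定义随层级不同而变化。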
[HC-8] Who We Are Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models
【速读】:该论文旨在解决如何在动态情境中准确评估个体心理健康状态的问题,特别是如何从纵向社交媒体数据中识别适应性与非适应性自我状态(self-states)。其核心挑战在于将个体心理特质(如韧性、认知扭曲等)与情境特征相结合,以实现既具预测性能又具备可解释性的建模。解决方案的关键在于整合基于心理学理论的特征(如Situational 8 DIAMONDS框架推导的情境语言特征)与心理测量学驱动的语言模型嵌入,从而构建一个融合个体差异与情境变化的可解释预测模型,显著提升了对幸福感(well-being)的预测能力,并增强了结果的心理学一致性与人类可理解性。
链接: https://arxiv.org/abs/2603.05953
作者: Nikita Soni,August Håkan Nilsson,Syeda Mahwish,Vasudha Varadarajan,H. Andrew Schwartz,Ryan L. Boyd
机构: Stony Brook University (石溪大学); Oslo Metropolitan University (奥斯陆城市大学); University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Mental health is not a fixed trait but a dynamic process shaped by the interplay between individual dispositions and situational contexts. Building on interactionist and constructionist psychological theories, we develop interpretable models to predict well-being and identify adaptive and maladaptive self-states in longitudinal social media data. Our approach integrates person-level psychological traits (e.g., resilience, cognitive distortions, implicit motives) with language-inferred situational features derived from the Situational 8 DIAMONDS framework. We compare these theory-grounded features to embeddings from a psychometrically-informed language model that captures temporal and individual-specific patterns. Results show that our principled, theory-driven features provide competitive performance while offering greater interpretability. Qualitative analyses further highlight the psychological coherence of features most predictive of well-being. These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
[HC-9] Addressing the Ecological Fallacy in Larger LMs with Human Context
【速读】:该论文试图解决语言模型训练与推理中忽略的一个基本语言事实——同一作者撰写的多段文本之间存在依赖关系,即“生态谬误”(ecological fallacy)问题。传统方法将每段文本视为独立样本,未能建模作者的语言上下文,导致性能受限。解决方案的关键在于引入一种专门针对作者语言上下文建模的预训练任务(称为HuLM,Human-aware Language Modeling),并通过两种方式实现:一是使用HuLM目标进行持续预训练,二是采用人类感知微调(HuFT: Human-aware Fine-Tuning),在微调阶段引入作者的历史文本作为上下文。实验表明,即使对于8B规模的Llama模型,仅在微调阶段结合QLoRA技术引入作者上下文即可显著提升性能;而基于QLoRA的HuLM预训练则能生成通用性强的人类感知模型,在八个下游任务上仅通过线性分类器即可实现性能提升,验证了建模文本与其原始生成者(作者)语境的重要性。
链接: https://arxiv.org/abs/2603.05928
作者: Nikita Soni,Dhruv Vijay Kunjadiya,Pratham Piyush Shah,Dikshya Mohanty,H. Andrew Schwartz,Niranjan Balasubramanian
机构: Stony Brook University (石溪大学); Vanderbilt University (范德比尔特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Language model training and inference ignore a fundamental linguistic fact – there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of ecological fallacy can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author’s language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author’s language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT: Human-aware Fine-Tuning). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.
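“在作者自身按时间排序的其他文本语境中处理当前文本”这一数据组织方式,可以用下面的极简示意来说明(非论文官方实现,窗口大小与样本结构均为假设):

```python
# 按作者组织带上下文训练样本的极简示意(非官方实现):
# 为每条文本附上同一作者此前 window 条文本作为上下文。

def build_author_contexts(texts, window=2):
    """texts 为同一作者按时间排序的文本列表。"""
    samples = []
    for i, target in enumerate(texts):
        context = texts[max(0, i - window):i]  # 时间上在前的若干条
        samples.append({"context": context, "target": target})
    return samples
```

与把每条文本当作独立样本的标准做法不同,这种组织方式显式保留了“同一作者文本之间的依赖”,即论文所说的对生态谬误的纠正。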
[HC-10] Learning Next Action Predictors from Human-Computer Interaction
【速读】:该论文旨在解决如何让人工智能系统具备前瞻性能力,即通过理解用户完整的交互上下文来预测其下一步操作(Next Action Prediction, NAP),而非仅依赖稀疏的文本提示。其核心挑战在于建模长期、多模态的用户行为序列,并在动作空间无限扩展的情况下实现高精度预测。解决方案的关键在于提出LongNAP模型,该模型结合参数化学习与上下文学习(in-context learning),利用策略梯度方法训练生成个性化推理轨迹;并通过检索历史轨迹库,在上下文中动态应用相关经验进行未来动作预测。实验表明,LongNAP在独立测试集上显著优于监督微调和提示基线模型,并展现出跨用户泛化能力,证明从完整用户行为中学习以主动预判需求已成为可行且具有广阔前景的任务。
链接: https://arxiv.org/abs/2603.05923
作者: Omar Shaikh,Valentin Teutschbein,Kanishk Gandhi,Yikun Chi,Nick Haber,Thomas Robinson,Nilam Ram,Byron Reeves,Sherry Yang,Michael S. Bernstein,Diyi Yang
机构: Stanford University (斯坦福大学); Hasso Plattner Institute; New York University (纽约大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 32 pages, 10 figures, see this https URL
Abstract:Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts – it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user’s multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user’s next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP’s predicted trajectories are well-aligned with what a user does next (LLM-judge score ≥ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
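LongNAP 流程中“从历史轨迹库检索相关推理轨迹,再在上下文中用于预测”的检索一步,可以用如下极简示意来理解(非官方实现,实际系统使用学习到的检索与 LLM 推理,此处以词重叠代替):

```python
# LongNAP 式轨迹检索的极简示意(非官方实现):
# 按与当前交互上下文的词重叠度,从轨迹库取最相关的 top_k 条,
# 供后续与上下文一并输入模型做下一步动作预测。

def retrieve_traces(context, trace_library, top_k=2):
    ctx_words = set(context.lower().split())
    def score(trace):
        return len(ctx_words & set(trace.lower().split()))
    return sorted(trace_library, key=score, reverse=True)[:top_k]
```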
[HC-11] Non-urgent Messages Do Not Jump into My Headset Suddenly! Adaptive Notification Design in Mixed Reality
【速读】:该论文旨在解决混合现实(Mixed Reality, MR)环境中通知系统因所有消息固定显示于中央位置而导致的干扰过多与认知过载问题。解决方案的关键在于设计一种基于紧急程度自适应调整通知空间布局的机制:非紧急通知以周边图标形式呈现,可通过头部移动访问;中等紧急程度的消息锚定在用户手部位置;极高紧急程度的通知则从周边逐步过渡到中心视野。该方法通过实证研究验证了其在降低心理负荷、时间负荷和挫败感方面的显著优势,同时保持了与传统集中式方案相当的通知感知能力,为沉浸式交互中的注意力管理提供了可操作的设计准则。
链接: https://arxiv.org/abs/2603.05893
作者: Jingyao Zheng,Xian Wang,Sven Mayer,Lik-Hang Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages, 6 figures. This is the author’s version of the article that will appear at the publication TVCG Special Issue on the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR)
Abstract:Mixed reality (MR) notification systems currently display all messages in fixed central locations regardless of urgency, leading to unnecessary interruptions and cognitive overload. Drawing from previous MR/Virtual Reality (VR) notification design work and calm technology principles, we developed an adaptive notification system that adjusts spatial placement based on urgency levels: non-urgent notifications appear as peripheral icons accessible via head movement, moderately urgent messages anchor to the user’s hand, and very urgent notifications transition progressively from peripheral to central view. Through a within-subjects study (N=18), we evaluated our adaptive system against the default centralised approach. Results demonstrate that the adaptive system significantly reduces mental workload (p=0.041), temporal workload (p=0.008), and frustration (p=0.004) while maintaining comparable notification awareness. Logistic regression analysis reveals that users prefer the adaptive system even with classification errors, provided the combined misclassification rate (disruptiveness + omission errors) remains below a determinable threshold. Our findings establish the first empirical evidence that urgency-based spatial notification distribution effectively addresses core MR usability challenges, offering practical design guidelines for immersive notification systems that balance user attention management with information accessibility.
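按紧急程度分配通知空间位置的三档策略,可以用一个极简示意来说明(非论文官方实现,紧急度取值与阈值均为假设):

```python
# 紧急度驱动的通知位置分配的极简示意(非官方实现):
# 低紧急 -> 周边图标;中等 -> 锚定手部;高紧急 -> 渐入中心视野。

def notification_placement(urgency):
    """urgency 取 0~1;阈值 0.34 / 0.67 为假设的三等分划分。"""
    if urgency < 0.34:
        return "peripheral_icon"
    if urgency < 0.67:
        return "hand_anchored"
    return "progressive_to_center"
```

论文的回归分析进一步说明,只要紧急度分类的综合误判率低于某一阈值,用户仍会偏好这种自适应布局。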
[HC-12] Measuring Perceptions of Fairness in AI Systems: The Effects of Infra-marginality
【速读】:该论文试图解决的问题是:在机器学习模型中,不同人口群体间数据分布差异(即“次边际性”(infra-marginality))如何影响人类对公平性的判断。研究发现,人们并不将公平等同于简单的统计均等(statistical parity),而是会根据群体特定性能表现和对差异成因的认知来评估公平性。解决方案的关键在于:算法公平性指标的设计必须考虑数据分布的背景信息(distributional context),以更好地与现实场景中人类对公平的直觉期望保持一致。
链接: https://arxiv.org/abs/2603.05889
作者: Schrasing Tong,Minseok Jung,Ilaria Liccardi,Lalana Kagal
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:Differences in data distributions between demographic groups, known as the problem of infra-marginality, complicate how people evaluate fairness in machine learning models. We present a user study with 85 participants in a hypothetical medical decision-making scenario to examine two treatments: group-specific model performance and training data availability. Our results show that participants did not equate fairness with simple statistical parity. When group-specific performances were equal or unavailable, participants preferred models that produced equal outcomes; when performances differed, especially in ways consistent with data imbalances, they judged models that preserved those differences as more fair. These findings highlight that fairness judgments are shaped not only by outcomes, but also by beliefs about the causes of disparities. We discuss implications for popular group fairness definitions and system design, arguing that accounting for distributional context is critical to aligning algorithmic fairness metrics with human expectations in real-world applications.
[HC-13] Challenges in Synchronous Remote Collaboration Around Visualization
【速读】:该论文旨在刻画远程同步协作环境中围绕可视化开展协作所面临的16项核心挑战,这些挑战涉及探索性数据分析、创意构思、富含可视化的演示(visualization-rich presentations)、基于数据的联合决策以及实时数据监控等五大协同活动。其解决方案的关键在于提出一个以技术选择、社会因素、AI辅助和评估为顺序的四阶段研究与发展框架,从而系统性地组织未来在可视化与计算机支持的协同工作(Computer-Supported Cooperative Work, CSCW)交叉领域的研究方向,并应对扩展现实(Extended Reality, XR)和人工智能(Artificial Intelligence, AI)带来的动态变化。
链接: https://arxiv.org/abs/2603.05871
作者: Matthew Brehmer,Maxime Cordeil,Christophe Hurter,Takayuki Itoh,Wolfgang Büschel,Mahmood Jasim,Arnaud Prouzeau,David Saffo,Lyn Bartram,Sheelagh Carpendale,Chen Zhu-Tian,Andrew Cunningham,Tim Dwyer,Samuel Huron,Masahiko Itoh,Alark Joshi,Kiyoshi Kiyokawa,Hideaki Kuzuoka,Bongshin Lee,Gabriela Molina León,Harald Reiterer,Bektur Ryskeldiev,Jonathan Schwabish,Brian A. Smith,Yasuyuki Sumi,Ryo Suzuki,Anthony Tang,Yalong Yang,Jian Zhao
机构: University of Waterloo(滑铁卢大学); The University of Queensland(昆士兰大学); ENAC Université de Toulouse(图卢兹国立航空学院); Ochanomizu University(御茶水女子大学); University of Stuttgart(斯图加特大学); Louisiana State University(路易斯安那州立大学); Université Paris-Saclay, Inria, CNRS(巴黎萨克雷大学, 法国国家信息与自动化研究院, 法国国家科学研究中心); JPMorganChase(摩根大通); Simon Fraser University(西蒙菲莎大学); University of Minnesota-Twin Cities(明尼苏达大学双城分校); University of South Australia(南澳大学); Monash University(莫纳什大学); Inst. Polytechnique de Paris(巴黎理工学院); Hokkaido Information University(北海道信息大学); University of San Francisco(旧金山大学); Nara Inst. of Science and Technology(奈良科学技术大学院大学); The University of Tokyo(东京大学); Yonsei University(延世大学); Aarhus University(奥胡斯大学); University of Konstanz(康斯坦茨大学); Mercari R4D(mercari研究部门); Urban Institute(城市研究所); Columbia University(哥伦比亚大学); Future University Hakodate(未来大学); University of Colorado Boulder(科罗拉多大学博尔德分校); Singapore Management University(新加坡管理大学); Georgia Institute of Technology(佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC)
备注: Proceedings of the 2026 ACM Conference on Human Factors in Computing Systems (CHI)
Abstract:We characterize 16 challenges faced by those investigating and developing remote and synchronous collaborative experiences around visualization. Our work reflects the perspectives and prior research efforts of an international group of 29 experts from across human-computer interaction and visualization sub-communities. The challenges are anchored around five collaborative activities that exhibit a centrality of visualization and multimodal communication. These activities include exploratory data analysis, creative ideation, visualization-rich presentations, joint decision making grounded in data, and real-time data monitoring. The challenges also reflect the changing dynamics of these activities in the face of recent advances in extended reality (XR) and artificial intelligence (AI). As an organizing scheme for future research at the intersection of visualization and computer-supported cooperative work, we align the challenges with a sequence of four sets of research and development activities: technological choices, social factors, AI assistance, and evaluation.
[HC-14] Glass Chirolytics: Reciprocal Compositing and Shared Gestural Control for Face-to-Face Collaborative Visualization at a Distance
【速读】:该论文旨在解决视频会议中因屏幕共享导致的非语言交流被忽视的问题,尤其在数据讨论、协作决策、技术面试和教学等场景下,参与者难以通过可视化界面同时感知彼此的互动意图。解决方案的关键在于采用双向合成(reciprocal compositing)技术,将可视化元素和界面控件叠加到对话伙伴的镜像视频上,形成类似“玻璃面板”的交互空间,使双方可通过双手手势同步操作合成内容,从而增强临场感(presence)和对分析意图的相互认知。
链接: https://arxiv.org/abs/2603.05864
作者: Dion Barja,Matthew Brehmer
机构: University of Manitoba (曼尼托巴大学); University of Waterloo (滑铁卢大学)
类目: Human-Computer Interaction (cs.HC)
备注: Proceedings of the 2026 ACM Conference on Human Factors in Computing Systems (CHI)
Abstract:Videoconference conversations about data often entail screen sharing visualization artifacts, in which nonverbal communication goes largely ignored. Beyond presentation use cases, conversations supported by visualization also arise in collaborative decision making, technical interviews, and tutoring: use cases that benefit from participants being able to see one another as they exchange questions about the data. In this paper, we employ a reciprocal compositing of visualization and interface widgets over the mirrored video of one’s conversation partner, suggestive of a pane of glass, in which both parties can simultaneously manipulate composited elements via bimanual gestures. We demonstrate our approach with implementations of several visualization interfaces spanning the aforementioned use cases, and we evaluate our approach in a study (N = 16) comparing it to videoconferencing while using a mouse to interact with a collaborative web application. Our findings suggest that our approach promotes feelings of presence and mutual awareness of analytical intent.
[HC-15] The Values of Value in AI Adoption: Rethinking Efficiency in UX Designers' Workplaces
【速读】:该论文试图解决的问题是:在组织层面推进人工智能(AI)采纳的过程中,管理者对效率和生产力的关注与员工对AI经济和社会价值的认知之间存在显著分歧,这种张力如何影响AI在个体、团队和组织层级的落地实践。解决方案的关键在于将AI采纳视为一个价值协商过程,强调在不同层级上重新配置角色、关系与权力结构,并特别关注效率话语中隐含的社会责任、信任与自主性维度;研究呼吁未来工作应深入探讨AI系统如何在团队内部重新分配责任,同时理解这种责任重构如何增强员工的能动性(agency)。
链接: https://arxiv.org/abs/2603.05848
作者: Inha Cha,Catherine Wieczorek,Richmond Y. Wong
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted at CHI 2026
Abstract:Although organizations increasingly position AI adoption as a pathway to competitiveness and innovation, organizations’ perspectives on productivity and efficiency often clash with workers’ perspectives on AI’s economic and social value. Through design workshops with 15 UX designers, we examine how AI adoption unfolds across individual, team, and organizational scales. At the individual level, designers weighed efficiency, skill development, and professional worth. At the team level, they negotiated collaboration, responsibility, and rigor. At the organizational level, adoption was shaped by compliance requirements and organizational norms. Across these scales, discourses of efficiency carried social and ethical dimensions of responsibility, trust, and autonomy. We view adoption as a site where roles, relationships, and power are reconfigured. We argue that AI adoption should be understood as a process of negotiating values, and call for future work examining how AI systems redistribute responsibility among team members, while understanding how such shifts could strengthen worker agency.
[HC-16] Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics
【速读】:该论文旨在解决生成式 AI (Generative AI) 在对话式可视化分析(Conversational Visual Analytics, CVA)场景中评估困难的问题,具体表现为:现有评估方法依赖编程技能、忽略真实应用场景的复杂性,且缺乏对多模态输出(可视化与文本)的可解释指标。解决方案的关键在于提出 Lexara——一个以用户为中心的评估工具包,其核心创新包括:(i) 基于真实世界使用场景设计的测试用例;(ii) 结合基于规则的方法与“大语言模型作为裁判”(LLM-as-a-Judge)的可解释指标体系,覆盖可视化质量(数据保真度、语义一致性、功能正确性和设计清晰度)和语言质量(事实准确性、分析推理能力、对话连贯性);(iii) 提供无需编程即可进行实验配置和多格式、多层次结果探索的交互式界面。通过六名 CVA 开发者的日记研究验证,Lexara 能有效指导模型与提示词的选择,提升评估的实用性与科学性。
链接: https://arxiv.org/abs/2603.05832
作者: Srishti Palani,Vidya Setlur
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are transforming Conversational Visual Analytics (CVA) by enabling data analysis through natural language. However, evaluating LLMs for CVA remains a challenge: requiring programming expertise, overlooking real-world complexity, and lacking interpretable metrics for multi-format (visualizations and text) outputs. Through interviews with 22 CVA developers and 16 end-users, we identified use cases, evaluation criteria and workflows. We present Lexara, a user-centered evaluation toolkit for CVA that operationalizes these insights into: (i) test cases spanning real-world scenarios; (ii) interpretable metrics covering visualization quality (data fidelity, semantic alignment, functional correctness, design clarity) and language quality (factual grounding, analytical reasoning, conversational coherence) using rule-based and LLM-as-a-Judge methods; and (iii) an interactive toolkit enabling experimental setup and multi-format and multi-level exploration of results without programming expertise. We conducted a two-week diary study with six CVA developers, drawn from our initial cohort of 22. Their feedback demonstrated Lexara’s effectiveness for guiding appropriate model and prompt selection.
[HC-17] A Closed-Loop CPR Training Glove with Integrated Tactile Sensing and Haptic Feedback ICRA2026
【速读】:该论文旨在解决心肺复苏(Cardiopulmonary Resuscitation, CPR)训练中对教员指导课程(instructor-led sessions)依赖性强、缺乏有效自主练习手段的问题。为实现更高效、自主的CPR技能训练,作者提出了一种闭环式CPR训练手套,其关键在于集成高分辨率触觉传感阵列与振动触觉执行器,实时估计按压速率、力度和手部姿态,并通过即时触觉反馈引导用户调整操作,从而减少对外部音视频显示的依赖。该系统基于轻量级统计模型实现了亚毫秒级推理下的92%力值估计准确率和手姿分类精度,验证了触觉反馈在降低视觉干扰方面的优势,为未来自适应触觉CPR训练系统的设计提供了可行路径与关键设计洞见。
链接: https://arxiv.org/abs/2603.05793
作者: Jaeyoung Moon,Mingzhuo Ma,Qifeng Yang,Youjin Choi,Seokhyun Hwang,Samuel Burden,Kyung-Joong Kim,Yiyue Luo
机构: Gwangju Institute of Science and Technology (光州科学技术院); University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC)
备注: 8 pages, 10 figures. This paper is accepted in ICRA 2026
Abstract:Cardiopulmonary resuscitation (CPR) is a critical life-saving procedure, and effective training benefits from self-directed practice beyond instructor-led sessions. In this paper, we propose a closed-loop CPR training glove that integrates a high-resolution tactile sensing array and vibrotactile actuators for self-directed practice. The tactile sensing array measures distributed pressures across the palm and dorsum to enable real-time estimation of compression rate, force, and hand pose. Based on these estimations, the glove delivers immediate haptic feedback to guide the user for proper CPR, reducing reliance on external audio-visual displays. We quantified the tactile sensor performance by measuring wide-range sensitivity (~0.85 over 0-600 N), computing hysteresis (56.04%), testing stability (11.05% drift over 300 cycles), and estimating global signal-to-noise ratio (18.90 +/- 2.41 dB at 600 N). Our closed-loop pipeline provides continuous modeling and feedback of key performance metrics essential for high-quality CPR. Our lightweight statistical models achieve 92% accuracy for force estimation and hand pose classification within sub-millisecond inference time. Our user study (N=8) showed that haptic feedback reduced visual distraction compared to audio-visual cues, though simplified patterns were required for reliable perception under dynamic load. These results highlight the feasibility of the proposed system and offer design insights for future haptic CPR self-training systems.
[HC-18] Vision-Language System using Open-Source LLMs for Gestures in Medical Interpreter Robots
【速读】:该论文旨在解决跨语言医疗场景中沟通障碍问题,尤其关注非语言线索(如手势)在医患交流中的关键作用。其核心挑战在于如何在保障隐私的前提下,使医疗翻译机器人准确识别特定言语行为(如“同意”和“指令”),并生成恰当的机器人手势以增强沟通效果。解决方案的关键在于构建一个基于本地部署开源模型的视觉-语言框架:首先利用少量示例提示(few-shot prompting)的大型语言模型(LLM)实现高精度意图识别;其次引入一个新型临床对话数据集,包含标注的言语行为与对应手势片段,用于训练和评估系统;最终通过用户研究验证了该方案在人类相似度上的显著提升,同时保持了手势适当性,并实现了更高的计算效率。
链接: https://arxiv.org/abs/2603.05751
作者: Thanh-Tung Ngo,Emma Murphy,Robert J. Ross
机构: Technological University Dublin (都柏林理工学院)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:Effective communication is vital in healthcare, especially across language barriers, where non-verbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system utilizes a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1-Score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.
[HC-19] The Fragility of Moral Judgment in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理道德判断任务时的稳定性与可操纵性问题,即模型输出是否受输入表述形式和评估协议的影响,从而影响其道德决策的可靠性与公平性。解决方案的关键在于提出并实施一种扰动框架(perturbation framework),通过系统性地引入三类内容扰动(表面编辑、视角转换、说服线索)以及多种评估协议变化,在保持道德冲突不变的前提下测试LLM对不同输入变体的响应一致性。实验结果表明,模型道德判断高度依赖于叙述形式(如叙事语气)和任务提示结构,而非纯粹的道德内容本身,尤其在道德模糊情境中更为显著,这揭示了当前LLM在道德推理中的不稳定性及其潜在的公平性风险。
链接: https://arxiv.org/abs/2603.05651
作者: Tom van Nuenen,Pratik S. Sachdeva
机构: D-Lab, University of California, Berkeley(加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 22 pages, 7 figures, 10 tables, plus appendices
Abstract:People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluated all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance. 
[HC-20] From Risk Avoidance to User Empowerment: Reframing Safety in Generative AI for Mental Health Crises
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 聊天机器人在应对用户心理危机时普遍采取回避策略的问题,这种策略通常表现为拒绝深入互动或仅提供标准化转介信息,从而可能延误甚至阻断用户获得及时有效的心理支持。其解决方案的关键在于提出以赋能为导向的设计原则,借鉴社区助人者(community helper)模型,使AI聊天机器人能够在求助初期作为支持性桥梁,实现危机情绪的初步安抚与疏导,并有效引导用户连接至更可靠的线下专业资源,同时通过AI开发者与监管机构的协同合作,在风险控制与用户赋能之间达成更优平衡。
链接: https://arxiv.org/abs/2603.05647
作者: Benjamin Kaveladze,Arka Ghosh,Leah Ajmani,Denae Ford,Peter M Gutierrez,Jetta E Hanson,Eugenia Kim,Keertana Namuduri,Theresa Nguyen,Ebele Okoli,Teresa Rexin,Jessica L Schleider,Hongyi Shen,Jina Suh
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:People experiencing mental health crises frequently turn to open-ended generative AI (GenAI) chatbots such as ChatGPT for support. However, rather than providing immediate assistance, most GenAI chatbots are designed to respond to crisis situations in ways that minimize their developers’ liability, primarily through avoidance (e.g., refusing to engage beyond templated referrals to crisis hotlines). Withholding crisis support in these cases may harm users who have no viable alternatives and reduce their motivation to seek further help. At scale, this avoidant design could undermine population mental health. We propose empowerment-oriented design principles for AI crisis support, informed by community helper models. We outline how, as an initial touchpoint in help-seeking, AI chatbots can act as a supportive bridge to de-escalate crises and connect users to more reliable care. Coordination between AI developers and regulators can enable a better balance of risk mitigation and user empowerment in AI crisis support.
[HC-21] RFM-HRI: A Multimodal Dataset of Medical Robot Failure, User Reaction, and Recovery Preferences for Item Retrieval Tasks
【速读】:该论文旨在解决真实场景中机器人交互失败时用户行为响应机制不明确的问题,尤其是在医疗场景下,交互失败可能直接影响任务执行效率与用户信任。其核心挑战在于缺乏对用户在面对不同类型失败(如语音、时间延迟、理解错误和搜索失败)时的多模态反应(包括面部表情、头部姿态、语音内容及主观报告)的系统性研究。解决方案的关键在于构建并公开发布首个面向医疗机器人(基于急救车场景)的多模态交互失败数据集(RFM-HRI),包含41名参与者在实验室与医院环境中经历四种失败类型后的214个交互样本,并通过分析发现:失败显著降低情感正向值并削弱用户控制感,且情绪反应随重复失败呈现动态变化趋势(困惑减少、挫败感增强)。该工作为嵌入式人机交互(embodied HRI)中的故障检测与恢复策略提供了实证基础与可比较的实验范式。
链接: https://arxiv.org/abs/2603.05641
作者: Yashika Batra,Giuliano Pioldi,Promise Ekpo,Arman Sayatqyzy,Purnjay Maruur,Shalom Otieno,Kevin Ching,Angelique Taylor
机构: Cornell University, Cornell Tech (康奈尔大学康奈尔科技校区); Weill Cornell Medicine (威尔康奈尔医学院)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:While robots deployed in real-world environments inevitably experience interaction failures, understanding how users respond through verbal and non-verbal behaviors remains under-explored in human-robot interaction (HRI). This gap is particularly significant in healthcare-inspired settings, where interaction failures can directly affect task performance and user trust. We present the Robot Failures in Medical HRI (RFM-HRI) Dataset, a multimodal dataset capturing dyadic interactions between humans and robots embodied in crash carts, where communication failures are systematically induced during item retrieval tasks. Through Wizard-of-Oz studies with 41 participants across laboratory and hospital settings, we recorded responses to four failure types (speech, timing, comprehension, and search) derived from three years of crash-cart robot interaction data. The dataset contains 214 interaction samples including facial action units, head pose, speech transcripts, and post-interaction self-reports. Our analysis shows that failures significantly degrade affective valence and reduce perceived control compared to successful interactions. Failures are strongly associated with confusion, annoyance, and frustration, while successful interactions are characterized by surprise, relief, and confidence in task completion. Emotional responses also evolve across repeated failures, with confusion decreasing and frustration increasing over time. This work contributes (1) a publicly available multimodal dataset (RFM-HRI), (2) analysis of user responses to different failure types and preferred recovery strategies, and (3) a crash-cart retrieval scenario enabling systematic comparison of recovery strategies with implications for safety-critical failure recovery. Our findings provide a foundation for failure detection and recovery methods in embodied HRI. 
[HC-22] CoEditor: Instruction-based Visual Editing via Cognitive Reasoning
【速读】:该论文旨在解决当前基于大模型的图像编辑方法在高阶语义推理和视觉一致性方面存在的不足,尤其是在面对模糊或复杂指令时表现不佳的问题。解决方案的关键在于提出一种认知结构化的、无需训练的框架 CoEditor++,其通过将编辑任务分解为“编辑什么”(what to edit)与“如何编辑”(how to edit)两个认知阶段,并引入反思式自我选择机制,实现鲁棒、细粒度且可解释的图像编辑。该设计不依赖任何额外训练或微调,仅由开源组件构建,从而保障透明性和跨领域适用性,同时在通用编辑(SmartEdit)和负责任编辑(AltBear)基准上均达到最优性能,尤其在视觉一致性方面显著优于同类开源及闭源模型。
链接: https://arxiv.org/abs/2603.05518
作者: Minheng Ni,Yutao Fan,Zhengyuan Yang,Yeli Shen,Yuxiang Wei,Yaowen Zhang,Lijuan Wang,Lei Zhang,Wangmeng Zuo
机构: Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology (哈尔滨工业大学); Microsoft (微软)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in large multimodal models (LMMs) have enabled instruction-based image editing, allowing users to modify visual content via natural language descriptions. However, existing approaches often struggle with high-level semantic reasoning and visual consistency, particularly under ambiguous or complex instructions. To address these challenges, we propose CoEditor++, a cognitively structured, training-free framework that decomposes editing into "what to edit" and "how to edit" through two cognitive stages with a reflective self-selection mechanism, enabling robust, fine-grained, and interpretable editing. Built entirely from open-sourced components, CoEditor++ requires no additional training or fine-tuning, ensuring transparency and cross-domain applicability. We evaluate CoEditor++ on SmartEdit, a widely used benchmark for general editing, and AltBear, a privacy and compliance-oriented benchmark. Experimental results show that CoEditor++ achieves state-of-the-art performance in both general editing and responsible editing tasks compared with open-sourced models that require training on specialized editing datasets, while maintaining significantly higher visual consistency. When compared with closed-source models such as Nano Banana Pro or GPT-4o, CoEditor++ preserves comparable instruction following while still substantially outperforming them in visual consistency. Extensive ablation studies confirm that the effectiveness of CoEditor++ benefits from its structured cognitive design rather than any specific model component. Our findings suggest the potential of cognitive-centric instruction-based image editing.
[HC-23] Human-Centered Ambient and Wearable Sensing for Automated Monitoring in Dementia Care: A Scoping Review
【速读】:该论文旨在解决当前痴呆症照护中传感技术应用的复杂挑战,特别是如何在家庭和机构环境中有效部署可穿戴及环境感知技术(wearable and ambient sensing technologies),以提升照护自动化与患者自主性。其解决方案的关键在于确立五项核心实施原则:(1)以人为本的设计,确保所有利益相关者参与,增强而非替代照护者;(2)个性化、可适应的解决方案,支持不同场景与病情严重程度下的自主性;(3)与现有工作流程整合,并提供充分培训与支持;(4)主动考虑隐私与知情同意,尤其针对环境监控中的居民与照护者;(5)开发成本效益高、伦理合规、公平且可扩展的方案,并具备可量化的成效指标。这些原则共同为未来传感系统的设计与实施提供了方向,以应对痴呆照护中日益增长的技术需求。
链接: https://arxiv.org/abs/2603.05516
作者: Mason Kadem,Sarah Masri,Anthea Innes,Rong Zheng
机构: McMaster University (麦克马斯特大学); University of the Highlands and Islands (高地和岛屿大学)
类目: Human-Computer Interaction (cs.HC)
备注: draft
Abstract:We conducted a scoping review to map the rapidly evolving landscape of wearable and ambient sensing technologies for monitoring people with dementia across home and institutional settings. We analyzed empirical sensing studies (2015-2025) to identify and inform future technical and human-centered design requirements. Five key implementation principles emerge: (1) human-centered design involving all stakeholders to augment rather than replace caregivers; (2) personalized, adaptable solutions that support autonomy across settings and severity levels instead of standardized approaches; (3) integration with existing workflows with adequate training and support; (4) proactive privacy and consent considerations, especially for ambient monitoring of residents and caregivers; and (5) cost-effective, ethical, equitable, scalable solutions with quantifiable outcomes. This paper identifies gaps, trends and opportunities for developing sensing systems that address the complex challenges, while enhancing automation and autonomy, in dementia care.
[HC-24] Enhancing Tool Calling in LLMs with the International Tool Calling Dataset
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在工具调用(Tool Calling)任务中面临的关键局限,包括对模拟或受限API的依赖、可复现性差以及缺乏文化与地理多样性。为应对这些问题,作者提出International Tool Calling (ITC),一个大规模、多语言的基准测试集,其关键创新在于构建了包含3,571个真实API和17,540个工具调用任务的国际化数据集,覆盖20类任务和40个国家。该解决方案显著提升了非英语查询下的跨语言泛化能力、推理一致性及对域外工具的鲁棒性,尤其通过在ITC上微调模型实现了性能跃升,从而推动LLM在复杂、多工具和国际场景中的实际应用与稳健性发展。
链接: https://arxiv.org/abs/2603.05515
作者: Zuoyu Zhang,Yancheng Zhu
机构: Shenzhen University (深圳大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Tool calling allows large language models (LLMs) to interact with external systems like APIs, enabling applications in customer support, data analysis, and dynamic content generation. While recent benchmarks have advanced tool-use research, they suffer from key limitations, including reliance on simulated or restricted APIs, limited reproducibility, and a lack of cultural and geographic diversity. To address these gaps, we introduce International Tool Calling (ITC), a large-scale, multilingual benchmark designed for realistic, globally distributed tool-calling scenarios. ITC includes 3,571 real APIs and 17,540 tool calling tasks across 20 categories and 40 countries. Experiments reveal substantial performance gaps between open- and closed-source LLMs, while fine-tuning on ITC yields significant improvements, particularly for non-English queries, enhancing cross-lingual generalization, reasoning consistency, and robustness to out-of-domain tools. ITC provides a valuable benchmark for advancing LLM robustness and performance in complex, multi-tool, and international scenarios. Dataset: this https URL.
[HC-25] From Toil to Thought: Designing for Strategic Exploration and Responsible AI in Systematic Literature Reviews
【速读】:该论文旨在解决系统性文献综述(Systematic Literature Reviews, SLRs)过程中因工具生态系统碎片化所导致的认知负荷过高问题,从而抑制了科研工作的迭代性和探索性。核心挑战包括:跨多个数据库进行迭代查询时的高认知负担、现代文献出版规模与速度带来的信息过载,以及自动化与研究者自主性之间的张力。为此,作者提出ARC这一设计探针,其关键在于通过多数据库集成、透明的迭代搜索机制和可验证的AI辅助筛选功能,构建一个整合环境,使研究人员能够从繁琐的行政事务中解放出来,转向战略性探索;同时借助外部表征支持战略思考,并通过透明的AI推理增强判断的可验证性,从而在知识合成的全生命周期中增强专家贡献。
链接: https://arxiv.org/abs/2603.05514
作者: Runlong Ye,Naaz Sibia,Angela Zavaleta Bernuy,Tingting Zhu,Carolina Nobre,Viktoria Pammer-Schindler,Michael Liut
机构: University of Toronto (Toronto, Ontario, Canada); Graz University of Technology & Know-Center GmbH (Graz, Austria); University of Toronto Mississauga (Mississauga, Ontario, Canada)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at IUI 26
Abstract:Systematic Literature Reviews (SLRs) are fundamental to scientific progress, yet the process is hindered by a fragmented tool ecosystem that imposes a high cognitive load. This friction suppresses the iterative, exploratory nature of scholarly work. To investigate these challenges, we conducted an exploratory design study with 20 experienced researchers. This study identified key friction points: 1) the high cognitive load of managing iterative query refinement across multiple databases, 2) the overwhelming scale and pace of publication of modern literature, and 3) the tension between automation and scholarly agency. Informed by these findings, we developed ARC, a design probe that operationalizes solutions for multi-database integration, transparent iterative search, and verifiable AI-assisted screening. A comparative user study with 8 researchers suggests that an integrated environment facilitates a transition in scholarly work, moving researchers from managing administrative overhead to engaging in strategic exploration. By utilizing external representations to scaffold strategic exploration and transparent AI reasoning, our system supports verifiable judgment, aiming to augment expert contributions from initial creation through long-term maintenance of knowledge synthesis.
[HC-26] The People's Gaze: Co-Designing and Refining Gaze Gestures with General Users and Gaze Interaction Experts
【速读】:该论文旨在解决当前眼动交互中手势设计依赖专家主导、与用户自然眼部运动习惯不匹配的问题。其关键解决方案在于提出一种两阶段方法:首先通过非专家用户的共同设计工作坊(共20人)生成初始手势概念,发现用户在简短引导后能基于熟悉隐喻自发构建组合式语法(即激活方式如注视停留 + 动作方式如眼球动作或眨眼),从而提升意图明确性并缓解经典的“金手指效应”(Midas Touch problem);随后由四位眼动交互专家对这些概念进行筛选和优化,确保最终32个手势具备良好的人体工学特性、符合自然扫视模式且可区分性强。这一用户驱动、专家验证的双轨机制为开发直观、无需双手的眼动交互界面提供了可复用的设计基础。
链接: https://arxiv.org/abs/2603.05513
作者: Yaxiong Lei,Xinya Gong,Shijing He,Yafei Wang,Mohamed Khamis,Juan Ye
机构: University of St Andrews(圣安德鲁斯大学); University of Essex(埃塞克斯大学); King’s College London(伦敦国王学院); Dalian Maritime University(大连海事大学); University of Glasgow(格拉斯哥大学)
类目: Human-Computer Interaction (cs.HC)
备注: ACM CHI2026 accepted paper
Abstract:As eye-tracking becomes increasingly common in modern mobile devices, the potential for hands-free, gaze-based interaction grows, but current gesture sets are largely expert-designed and often misaligned with how users naturally move their eyes. To address this gap, we introduce a two-phase methodology for developing intuitive gaze gestures. First, four co-design workshops with 20 non-expert participants generated 102 initial concepts. Next, four gaze interaction experts reviewed and refined these into a set of 32 gestures. We found that non-experts, after a brief introduction, intuitively anchor gestures in familiar metaphors and develop a compositional grammar; i.e., activation (dwell) + action (gaze gesture or blink), to ensure intentionality and mitigate the classic Midas Touch problem. Experts prioritized gestures that are ergonomically sound, aligned with natural saccades, and reliably distinguishable. The resulting user-grounded, expert-validated gesture set, along with actionable design principles, provides a foundation for developing intuitive, hands-free interfaces for gaze-enabled devices.
[HC-27] Biometric-enabled Personalized Augmentative and Alternative Communications
[Quick Read]: This paper targets the gap between users' personalized needs in Augmentative and Alternative Communication (AAC), a branch of assistive technology for people with communication disabilities, and emerging AI capabilities. The key to its solution is a roadmap for biometric-enabled AAC built around two proposed notions: an AAC "biometric register" and "biometric-enabled reconfigurable AAC channels". By acquiring and processing physiological and behavioural traits (such as gestures and sign-language words) and linking them to intermediate communication signals (such as synthesized speech), customizable and flexible communication channels can be constructed. The framework is then used to assess the accuracy gaps of current AI video and speech processing technologies against practical demands, and to recommend improvements that narrow the gap between real-world needs and technical progress.
Link: https://arxiv.org/abs/2603.05512
Authors: S. Yanushkevich, E. Berepiki, P. Ciunkiewicz, V. Shmerko, G. Wolbring, R. Guest
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: 20 pages, 14 figures; preprint of a paper accepted to CVIU
Abstract:This study focuses on the roadmapping of biometric technologies onto personalized Augmentative and Alternative Communication (AAC), a branch of assistive technologies for people with communication disabilities. This technology roadmapping revolves around the proposed notions of an AAC biometric register and biometric-enabled reconfigurable AAC channels. The biometric register is referred to as a tool for acquiring and processing physiological and behavioural traits that are essential for augmentative and alternative communication. It links biometric traits, such as gestures, to intermediate traits, such as synthesized speech, for customizable communication channels. The proposed methodology is used to assess the gaps between the social and practical demands, such as assisting people with communication disabilities in the contemporary semi-automated border control, and the emerging advances in AI, such as advanced video and speech processing. We provide two case studies of the AAC that rely on hand gesture recognition and sign language word recognition, and conclude that the current accuracy of those AI technologies does not meet the practical requirements. The proposed roadmapping provides recommendations for further improvement to close these gaps.
[HC-28] An Embodied Companion for Visual Storytelling
[Quick Read]: This paper addresses the limitation that AI in artistic creation is still used mainly as a tool for delegation, lacking deep synergy with human artists and thus falling short of genuine co-creation. The key to its solution is Companion, an artistic apparatus that integrates a drawing robot with Large Language Models (LLMs); through in-context learning and real-time tool use, the system engages in bidirectional human-machine interaction via speech and sketching. This transforms the robot from a passive executor into a playful co-creative partner capable of driving shared visual storytelling into novel aesthetic territories, validating the potential of AI as a highly capable artistic collaborator.
Link: https://arxiv.org/abs/2603.05511
Authors: Patrick Tresset, Markus Wulfmeier
Institutions: APT (Ateliers Patrick Tresset SRL); Goldsmiths, University of London; Google DeepMind
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
Comments: 35 pages, 18 figures
Abstract:As artificial intelligence shifts from pure tool for delegation toward agentic collaboration, its use in the arts can shift beyond the exploration of machine autonomy toward synergistic co-creation. While our earlier robotic works utilized automation to distance the artist’s intent from the final mark, we present Companion: an artistic apparatus that integrates a drawing robot with Large Language Models (LLMs) to re-center human-machine presence. By leveraging in-context learning and real-time tool use, the system engages in bidirectional interaction via speech and sketching. This approach transforms the robot from a passive executor into a playful co-creative partner capable of driving shared visual storytelling into unexpected aesthetic territories. To validate this collaborative shift, we employed the Consensual Assessment Technique (CAT) with a panel of seven art-world experts. Results confirm that the system produces works with a distinct aesthetic identity and professional exhibition merit, demonstrating the potential of AI as a highly capable artistic collaborator.
[HC-29] Exploring Human-in-the-Loop Themes in AI Application Development: An Empirical Thematic Analysis
[Quick Read]: This paper addresses the challenge that, when organizations develop and deploy AI applications, human decision authority and oversight responsibilities are underspecified across the system lifecycle. The key to its solution is a multi-source qualitative study (a retrospective diary study of a customer-support chatbot plus semi-structured interviews with eight AI experts from academia and industry), deriving four core themes from a five-cycle thematic analysis of 1,435 codewords: AI Governance and Human Authority, Human-in-the-Loop Iterative Refinement, AI System Lifecycle and Operational Constraints, and Human-AI Team Collaboration and Coordination. These themes provide empirical inputs for designing and validating Human-in-the-Loop (HITL) frameworks, strengthening the structured definition of human roles and intervention mechanisms in AI systems.
Link: https://arxiv.org/abs/2603.05510
Authors: Parm Suksakul, Nathan Kittichaikoonkij, Nakhin Polthai, Aung Pyae
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted for presentation at IEEE CON 2026
Abstract:Developing and deploying AI applications in organizations is challenging when human decision authority and oversight are underspecified across the system lifecycle. Although Human-in-the-Loop (HITL) and Human-Centered AI (HCAI) principles are widely acknowledged, operational guidance for structuring roles, checkpoints, and feedback mechanisms remains fragmented. We report a multi-source qualitative study: a retrospective diary study of a customer-support chatbot and semi-structured interviews with eight AI experts from academia and industry. Through five-cycle thematic analysis of 1,435 codewords, we derive four themes: AI Governance and Human Authority, Human-in-the-Loop Iterative Refinement, AI System Lifecycle and Operational Constraints, and Human-AI Team Collaboration and Coordination. These themes provide empirical inputs for subsequent HITL framework design and validation.
[HC-30] XR and Hybrid Data Visualization Spaces for Enhanced Data Analytics
[Quick Read]: This paper addresses the difficulty of understanding complex structures, relationships, and phenomena in high-dimensional data spaces, together with the limited interpretability of results produced by AI analysis tools. As data dimensionality grows, traditional 2D visualizations struggle to convey complex information, and opaque AI outputs further hinder effective insight. The key to the solution lies in extended reality (XR), which enables seamless integration of 2D and 3D interactive visual representations within virtual spaces, extending human capacity for higher-dimensional understanding and making data analytics more intuitive and efficient. Three case studies demonstrate the effectiveness of these approaches for efficient data analytics.
Link: https://arxiv.org/abs/2603.05509
Authors: Santiago Lombeyda, S. G. Djorgovski, Ciro Donalek
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: An invited refereed paper, to appear in the special issue of Journal of Chemometrics, "Immersive Analytics with Virtual Reality: The Frontier is Here", issue editor J. Kalivas, Wiley publ., in press (2026)
Abstract:The growing complexity and information content of data, together with the need to understand both the complex structures, relationships, and phenomena present in these data spaces, compounded with the emerging need to understand the results produced by AI tools used to analyze the data, requires development of novel, effective data visualization tools. Much of the growing complexity is reflected in the increasing dimensionality of data spaces, where extended reality (XR) naturally emerges as a candidate to help extend our capability for higher dimensional understanding. However, humans often understand lower dimensionality representations more effectively. Still, XR offers an opportunity for a seamless integration of simulated traditional data displays within the 3-dimensional virtual data spaces, leading to more intuitive and more effective data analytics. In this paper we present an overview of the benefits of seamlessly integrated 2-dimensional and 3-dimensional interactive visual representations embedded in XR spaces, and present three case studies that leverage these approaches for more efficient data analytics.
[HC-31] The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate
[Quick Read]: This paper asks whether Large Language Models (LLMs) that excel at generation tasks are equally reliable as evaluators, i.e., whether strong generative ability can coexist with weak evaluation ability. The key to its approach is an empirical comparison of three LLMs and one open-source LM on Question-Answering (QA) generation versus evaluation using the TriviaQA dataset. The results show that LLMs perform significantly worse on evaluation than on generation, and reveal instances of "unfaithful evaluation": models produce seemingly accurate judgments in areas where they lack competence, underscoring the need to systematically examine the faithfulness and trustworthiness of LLMs when used as evaluators.
Link: https://arxiv.org/abs/2402.06204
Authors: Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh
Institutions: KAIST; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of “the Generative AI Paradox” (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.
Computer Vision
[CV-0] Multimodal Large Language Models as Image Classifiers
[Quick Read]: This paper addresses the inconsistent conclusions across studies evaluating Multimodal Large Language Models (MLLMs) as image classifiers, which it traces to flawed evaluation protocols and noisy ground truth. The key to the solution is identifying and fixing three evaluation issues: model outputs that fall outside the provided class list and are discarded; inflated scores from weak multiple-choice distractors; and underestimated performance in the open-world setting due to poor output mapping. Re-evaluating on ReGT, a multilabel reannotation of 625 ImageNet-1k classes, shows that MLLMs gain up to 10.8% from corrected labels, substantially narrowing the perceived gap with supervised models; much of the reported underperformance is thus an artifact of annotation noise and protocol flaws rather than genuine model deficiency. The study also quantifies the substantial impact of commonly overlooked design choices (batch size, image ordering, and text-encoder selection), and shows that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in roughly 50% of difficult cases, highlighting their potential for large-scale dataset curation.
Link: https://arxiv.org/abs/2603.06578
Authors: Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas
Institutions: Czech Technical University in Prague
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Large Language Models (MLLM) classification performance depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLMs underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLMs predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
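One of the protocol fixes described above, mapping free-form MLLM outputs back onto the class list instead of discarding them, can be sketched with plain string matching. This is a hypothetical minimal illustration (the paper does not specify its mapping procedure); the function name and the difflib-based fuzzy fallback are my own choices:

```python
import difflib

def map_to_class(output, class_list):
    """Map a free-form model answer onto the closest label in class_list.

    Exact case-insensitive matches are preferred; otherwise fall back to
    fuzzy string matching so near-miss answers are scored instead of
    being discarded as out-of-vocabulary.
    """
    lowered = {c.lower(): c for c in class_list}
    key = output.strip().lower()
    if key in lowered:
        return lowered[key]
    close = difflib.get_close_matches(key, lowered.keys(), n=1, cutoff=0.6)
    return lowered[close[0]] if close else None

classes = ["tabby cat", "golden retriever", "sports car"]
print(map_to_class("Golden Retriever", classes))    # exact match up to case
print(map_to_class("a golden retreiver", classes))  # fuzzy match despite typo
```

In a real protocol the fallback could instead use a text encoder to embed both the answer and the labels, which the abstract notes is itself a design choice that affects accuracy.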
[CV-1] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
[Quick Read]: This paper addresses the limitation that current multimodal large language models (MLLMs) predominantly adopt conventional autoregressive architectures, leaving effective and efficient architectural alternatives underexplored. The key innovation is Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models: it unifies understanding and generation across text, speech, and images by directly modeling the joint distribution over discrete multimodal tokens. This design supports not only bimodal tasks but also more complex multimodal scenarios, and on a diverse set of benchmarks it outperforms or matches existing multimodal systems, demonstrating the strong potential of diffusion models as the backbone of next-generation multimodal foundation models.
Link: https://arxiv.org/abs/2603.06577
Authors: Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
Institutions: Nanjing University; Tencent Youtu Lab; CASIA (Institute of Automation, Chinese Academy of Sciences)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from these pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: this https URL.
[CV-2] BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
[Quick Read]: This paper addresses the redundant computation and limited spatial consistency caused by feeding LLMs tokens from multi-view, multi-frame images independently in autonomous driving, which hinders accurate 3D spatial reasoning and geometric coherence across views. The key to the solution is BEVLM, a framework that builds a spatially consistent, semantically distilled Bird's-Eye View (BEV) representation and connects it to LLMs as a unified input, enabling more effective reasoning in cross-view driving scenes; furthermore, by distilling semantic knowledge from LLMs back into the BEV representation, BEVLM significantly improves closed-loop end-to-end driving performance in safety-critical scenarios.
Link: https://arxiv.org/abs/2603.06576
Authors: Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 4 figures, 6 tables in the main paper, 32 pages in total
Abstract:The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird’s-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
[CV-3] SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation CVPR2026
[Quick Read]: This paper tackles two core problems in Incremental Few-Shot (IFS) segmentation of 3D point clouds: catastrophic forgetting when learning new categories, and the difficulty of building discriminative prototypes under sparse annotation. The key to the solution is SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play, background-guided prototype enrichment framework. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from unlabelled background regions to build a prototype pool; when novel classes arrive with few labelled samples, relevant background prototypes are retrieved from the pool and fused with the few-shot prototypes to form enriched representations, without retraining the backbone or adding parameters, which both clarifies intent under sparse supervision and mitigates forgetting. Experiments on ScanNet and S3DIS show that SCOPE substantially improves novel-class IoU (by up to 6.98% and 3.61%) and mean IoU (by 2.25% and 1.70%) over existing methods while maintaining low forgetting.
Link: https://arxiv.org/abs/2603.06572
Authors: Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu
Institutions: University of Surrey, UK; University of Wollongong, Australia; University of Copenhagen, Denmark; University of Exeter, UK; Singapore University of Technology and Design, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2026
Abstract:Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at this https URL.
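The retrieval-and-fusion step described above can be sketched as follows. This is a schematic rendition under assumed details: cosine-similarity retrieval, top-k averaging, and a fixed mixing weight alpha are illustrative choices, not SCOPE's exact formulation.

```python
import numpy as np

def enrich_prototypes(few_shot_protos, pool, top_k=3, alpha=0.5):
    """Enrich few-shot class prototypes with related background prototypes.

    For each few-shot prototype, retrieve the top_k most cosine-similar
    prototypes from the class-agnostic background pool and blend their
    mean into the prototype with mixing weight alpha (hypothetical
    hyperparameters for illustration).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    fs = normalize(np.asarray(few_shot_protos, dtype=float))
    pl = normalize(np.asarray(pool, dtype=float))
    sims = fs @ pl.T                          # (num_classes, pool_size)
    idx = np.argsort(-sims, axis=1)[:, :top_k]
    retrieved = pl[idx].mean(axis=1)          # mean of the top_k matches
    return normalize(alpha * fs + (1 - alpha) * retrieved)
```

Because the enrichment operates purely on prototype vectors, it can wrap any prototype-based segmenter without touching the backbone, which is the plug-and-play property the abstract emphasizes.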
[CV-4] SUREON: A Benchmark and Vision-Language Model for Surgical Reasoning
[Quick Read]: This paper addresses the lack of reasoning ability in current surgical AI: existing models cannot answer deeper questions about why an instrument was chosen, what risk it poses, or what comes next, largely because large-scale annotations that explicitly encode surgical reasoning are extremely difficult to produce. The key to the solution is exploiting the natural narration in surgical video lectures, which already explains intent, rationale, and anticipation, by building SUREON, a large-scale video QA dataset that systematically extracts and structures these implicit reasoning signals. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to harvest supervision at scale. Two models are trained on it: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both substantially outperform larger general-domain models on complex surgical QA, and SureonVLM-R1 exhibits explicit reasoning behavior such as inferring operative intent from visual context.
Link: https://arxiv.org/abs/2603.06570
Authors: Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Surgeons don’t just see – they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this – explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
[CV-5] Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
[Quick Read]: This paper challenges the over-reliance on model scaling in current Vision Language Model (VLM) development, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. The core issue it identifies is an objective mismatch: mainstream VLMs use vision encoders pretrained with massive contrastive learning (e.g., CLIP/SigLIP), which optimizes category-level discrimination and enforces coarse invariances that suppress the fine-grained visual cues needed for dense captioning and complex reasoning. The key to the solution is Penguin-VL, whose vision encoder is initialized from a text-only LLM rather than from contrastive pretraining. Experiments show this markedly improves visual fidelity and data efficiency: across image and video benchmarks, Penguin-VL matches or surpasses leading VLMs such as Qwen3-VL, with particular gains on document understanding, visual knowledge, and multi-perspective video understanding, all with a lightweight architecture, demonstrating that better visual representation rather than model scaling is the primary driver of performance.
Link: https://arxiv.org/abs/2603.06569
Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Penguin-VL Technical Report; Code: this https URL
Abstract:Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: this https URL
[CV-6] EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking
[Quick Read]: This paper addresses complex reasoning in egocentric video understanding of dynamic 4D environments, targeting a suite of under-explored tasks (fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization) that demand different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. Task-agnostic approaches such as generic Chain-of-Thought prompting or uniform reinforcement learning fall short, and uniform RL can actively destabilize spatial tasks. The key to the solution is EgoReasoner, a two-stage framework: in the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces so the model learns to reason adaptively across task types via supervised fine-tuning; in the second stage, task-aware reward functions combined with GRPO-based reinforcement fine-tuning verify entity grounding, temporal alignment, and logical consistency, selectively strengthening each reasoning pathway. Trained on only 16K samples, the 3B-parameter model reaches 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
Link: https://arxiv.org/abs/2603.06561
Authors: Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel
Institutions: Northeastern University; Google; Google DeepMind; Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint
Abstract:Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task’s cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
[CV-7] Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving
[Quick Read]: This paper addresses redundancy as a fundamental but long-overlooked data-quality (DQ) issue in multisource, multimodal (M²) data for autonomous vehicle (AV) perception: although sensor data exhibit spatial and semantic overlap, existing practice does not identify or handle redundant labels, which can introduce noise and affect model performance. The key to the solution is quantifying redundancy across cameras with shared fields of view and between image and LiDAR data, and selectively removing redundant annotations. Experiments on nuScenes and Argoverse 2 (AV2) show that pruning redundant labels notably improves YOLOv8 detection mAP50, especially in heavily overlapping regions, without degrading detection in other views, demonstrating that redundancy is a measurable and actionable DQ factor and motivating a data-centric paradigm for evaluating and improving AV datasets.
Link: https://arxiv.org/abs/2603.06544
Authors: Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by the Fourth IEEE International Conference on Mobility: Operations, Services, and Technologies (MOST) 2026
Abstract:Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal (M²) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP50 gains from 0.66 to 0.70, 0.64 to 0.67, and from 0.53 to 0.55 on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, 4.1-8.6% of labels are removed, and mAP50 stays near the 0.64 baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: this https URL
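The pruning idea, dropping labels in a shared field of view that duplicate an already-kept box, can be illustrated with a simple IoU filter. The helper names and the overlap threshold are hypothetical; the paper's actual redundancy measure may differ.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def prune_redundant(kept_boxes, other_boxes, thr=0.5):
    """Drop boxes from other_boxes that overlap any kept box above thr.

    In a real pipeline, other_boxes would first be projected into the
    shared-field-of-view region of the kept camera's image plane.
    """
    return [b for b in other_boxes
            if all(iou(b, k) < thr for k in kept_boxes)]
```

The filter is deliberately one-directional (one camera's labels are treated as canonical), matching the idea of removing duplicates only from overlapping views.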
[CV-8] SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference
[Quick Read]: This paper addresses the prohibitive cost of high-fidelity biomechanical solvers for interactive use, proposing SurgFormer, a data-driven approach to soft-tissue simulation on volumetric meshes. Its key innovation is a multiresolution gated transformer: a fixed mesh hierarchy with repeated multibranch blocks combines local message passing, coarse global self-attention, and pointwise feedforward updates, fused by learned per-node, per-channel gates that adaptively integrate local and long-range information while remaining scalable on large meshes. For cut-conditioned simulation, resection information is encoded as a learned cut embedding provided as an additional input, yielding a unified model for both standard deformation prediction and topology-altering cases, and making this, to the authors' knowledge, the first study of XFEM-supervised cut-conditioned deformation within the same volumetric pipeline as standard deformation prediction.
Link: https://arxiv.org/abs/2603.06543
Authors: Ashkan Shahbazi, Elaheh Akbari, Kyvia Pereira, Jon S. Heiselman, Annie C. Benson, Garrison L. H. Johnston, Jie Ying Wu, Nabil Simaan, Michael I. Miga, Soheil Kolouri
Institutions: Vanderbilt University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce SurgFormer, a multiresolution gated transformer for data driven soft tissue simulation on volumetric meshes. High fidelity biomechanical solvers are often too costly for interactive use, so we train SurgFormer on solver generated data to predict nodewise displacement fields at near real time rates. SurgFormer builds a fixed mesh hierarchy and applies repeated multibranch blocks that combine local message passing, coarse global self attention, and pointwise feedforward updates, fused by learned per node, per channel gates to adaptively integrate local and long range information while remaining scalable on large meshes. For cut conditioned simulation, resection information is encoded as a learned cut embedding and provided as an additional input, enabling a unified model for both standard deformation prediction and topology altering cases. We also introduce two surgical simulation datasets generated under a unified protocol with XFEM based supervision: a cholecystectomy resection dataset and an appendectomy manipulation and resection dataset with cut and uncut cases. To our knowledge, this is the first learned volumetric surrogate setting to study XFEM supervised cut conditioned deformation within the same volumetric pipeline as standard deformation prediction. Across diverse baselines, SurgFormer achieves strong accuracy with favorable efficiency, making it a practical backbone for both tasks. Code, data, and project page: this https URL
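The learned per-node, per-channel gating that fuses branch outputs can be sketched in a few lines. This is an illustrative stand-in, a sigmoid gate blending a local and a global branch, not SurgFormer's exact block:

```python
import numpy as np

def gated_fuse(local_out, global_out, gate_logits):
    """Blend local and global branch outputs with learned gates.

    All arrays have shape (num_nodes, channels); gate_logits would be
    produced by a small learned layer (hypothetical here). A sigmoid
    maps each logit to a per-node, per-channel weight in (0, 1).
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # elementwise sigmoid gate
    return g * local_out + (1.0 - g) * global_out
```

Because the gate is elementwise, each mesh node can lean on local message passing for some feature channels and on coarse global attention for others, which is the adaptive local/long-range integration the abstract describes.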
[CV-9] NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion
[Quick Read]: This paper addresses the inadequate modeling of linguistic negation in diffusion-based generative models, which struggle with phenomena such as object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive ambiguity. The key to the solution is treating negation as a structured feasibility constraint on semantic guidance within diffusion dynamics: classifier-free guidance is reinterpreted as defining a semantic update direction, and negation is enforced by projecting that update onto a convex constraint set derived from linguistic structure. The method is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories, establishing the first unified formulation of linguistic negation in diffusion-based generative models.
Link: https://arxiv.org/abs/2603.06533
Authors: Taewon Kang, Ming C. Lin
Institutions: University of Maryland at College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 50 pages, 32 figures
Abstract:Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
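The core idea, treating the classifier-free guidance update as a direction and projecting it onto a feasible set that excludes the negated concept, can be illustrated with a single-halfspace projection. The function name and the one-direction constraint are simplifications of the paper's convex constraint set, not its actual formulation:

```python
import numpy as np

def negated_guidance(eps_uncond, eps_cond, neg_dir, scale=7.5):
    """Classifier-free guidance with a negated-concept halfspace constraint.

    The CFG update direction d = eps_cond - eps_uncond is projected onto
    the halfspace {v : <v, n> <= 0}, where n is a (hypothetical) unit
    direction representing the negated concept, removing any component
    that would push the sample toward that concept.
    """
    d = eps_cond - eps_uncond
    n = neg_dir / (np.linalg.norm(neg_dir) + 1e-8)
    comp = float(d.ravel() @ n.ravel())
    if comp > 0:                  # only clip components toward the concept
        d = d - comp * n
    return eps_uncond + scale * d
```

Projection onto a halfspace (rather than zeroing the update entirely) keeps the rest of the guidance signal intact, which is why such a constraint can preserve visual fidelity while enforcing negation.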
[CV-10] Spatial Calibration of Diffuse LiDARs
[Quick Read]: This paper addresses spatial calibration between diffuse direct time-of-flight LiDAR and RGB images. The core challenge is that diffuse LiDARs form per-pixel depth histograms by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. The key to the solution is a simple spatial calibration procedure: by scanning a retroreflective patch and applying background subtraction, the method estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane, yielding an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion.
Link: https://arxiv.org/abs/2603.06531
Authors: Nikhil Behari, Ramesh Raskar
Institutions: MIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.
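The background-subtraction step can be sketched as follows. Here with_patch and background are hypothetical arrays of per-scan-position photon counts for one LiDAR pixel, and normalization to a unit-sum map is an illustrative convention rather than the paper's stated procedure:

```python
import numpy as np

def sensitivity_map(with_patch, background):
    """Estimate one pixel's relative spatial sensitivity over scan positions.

    with_patch[i] is the photon count at scan position i with the
    retroreflective patch present; background[i] is the count at the same
    position without it. Subtracting the background isolates the patch
    response; clipping removes negative noise, and normalization yields a
    relative sensitivity map summing to 1 (the pixel's footprint).
    """
    resp = np.clip(np.asarray(with_patch, float) -
                   np.asarray(background, float), 0.0, None)
    total = resp.sum()
    return resp / total if total > 0 else resp
```

Reshaping the map over the 2D grid of scan positions and thresholding it would give the footprint region in the co-located RGB image plane.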
[CV-11] AV-Unified: A Unified Framework for Audio-visual Scene Understanding
【速读】:该论文旨在解决当前音频-视觉场景理解任务(如事件定位、分割、问答等)大多独立研究导致难以全面理解复杂多模态场景及探索任务间关联性的问题。解决方案的关键在于提出AV-Unified框架,通过将不同任务的输入输出统一转换为离散token序列,构建共享表示空间以实现跨异构数据集的联合训练;同时引入多尺度时空感知网络,分别处理音频-视觉事件的时间粒度差异和空间关联性,并借助任务特定文本提示增强模型的任务适应性和感知能力,从而在多个基准数据集上验证了其在时序、空间及时空任务上的有效性。
链接: https://arxiv.org/abs/2603.06530
作者: Guangyao Li,Xin Wang,Wenwu Zhu
机构: Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia (TMM)
Abstract:When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as event localization, parsing, segmentation and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model’s adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.
[CV-12] SCAN: Visual Explanations with Self-Confidence and Analysis Networks
【速读】:该论文旨在解决当前可解释人工智能(Explainable AI, XAI)方法在视觉任务中面临的两大核心问题:一是架构特定方法虽具有高保真度但适用范围有限,二是通用方法虽具备广泛适用性却常导致解释抽象或碎片化,难以在不同模型家族(如卷积神经网络 CNN 和 Transformer)之间进行公平比较。解决方案的关键在于提出 Self-Confidence and Analysis Networks (SCAN),这是一个基于自编码器(AutoEncoder)的统一框架,通过中间层特征重建并结合信息瓶颈(Information Bottleneck)原理生成高分辨率自信心图(Self-Confidence Map),从而精准定位信息丰富区域,实现对 CNN 和 Transformer 架构均有效的高保真、对象聚焦的可视化解释。
链接: https://arxiv.org/abs/2603.06523
作者: Gwanghee Lee,Sungyoon Jeong,Kyoungson Jhang
机构: Chungnam National University (忠南国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures, IEEE Transactions on Artificial Intelligence
Abstract:Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both convolutional neural network and transformer architectures. SCAN utilizes an AutoEncoder-based approach to reconstruct features from a model’s intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on various quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.
[CV-13] Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education
【速读】:该论文旨在解决胎儿口腔面部裂隙(orofacial clefts)的产前准确诊断难题,尤其是在缺乏经验丰富的放射科医生且该疾病相对罕见的医疗环境中。解决方案的关键在于开发并验证一个基于人工智能(AI)的诊断系统,该系统在超过45,139张超声图像(来自9,215名胎儿)上进行训练,能够实现高于93%的敏感性和95%的特异性,其性能与资深放射科医生相当,并显著优于初级医生。此外,该系统作为“医疗协作者”不仅能提升初学者的诊断敏感性(提升超6%),还能加速临床专家对罕见病种的专业能力培养,从而提供一种兼具诊断准确性提升与人才培养功能的可扩展解决方案。
链接: https://arxiv.org/abs/2603.06522
作者: Yuanji Zhang,Yuhao Huang,Haoran Dou,Xiliang Zhu,Chen Ling,Zhong Yang,Lianying Liang,Jiuping Li,Siying Liang,Rui Li,Yan Cao,Yuhan Zhang,Jiewei Lai,Yongsong Zhou,Hongyu Zheng,Xinru Gao,Cheng Yu,Liling Shi,Mengqin Yuan,Honglong Li,Xiaoqiong Huang,Chaoyu Chen,Jialin Zhang,Wenxiong Pan,Alejandro F. Frangi,Guangzhi He,Xin Yang,Yi Xiong,Linliang Yin,Xuedong Deng,Dong Ni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 10 figures, 11 tables
Abstract:Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcity of experienced specialists and the relative rarity of the condition. Early and reliable diagnosis is essential to enable timely clinical intervention and reduce associated morbidity. Here we show that an artificial intelligence system, trained on over 45,139 ultrasound images from 9,215 fetuses across 22 hospitals, can diagnose fetal orofacial clefts with sensitivity and specificity exceeding 93% and 95% respectively, matching the performance of senior radiologists and substantially outperforming junior radiologists. When used as a medical copilot, the system raises junior radiologists’ sensitivity by more than 6%. Beyond direct diagnostic assistance, the system also accelerates the development of clinical expertise. A pilot study involving 24 radiologists and trainees demonstrated that the model can improve the expertise development for rare conditions. This dual-purpose approach offers a scalable solution for improving both diagnostic accuracy and specialist training in settings where experienced radiologists are scarce.
[CV-14] SG-DOR: Learning Scene Graphs with Direction-Conditioned Occlusion Reasoning for Pepper Plants
【速读】:该论文旨在解决密集作物冠层中机器人采摘时的遮挡识别问题,即如何准确识别阻碍目标果实的器官(如叶片)及其与果实之间的方向依赖性遮挡关系。解决方案的关键在于提出SG-DOR(Scene Graphs with Direction-Conditioned Occlusion Reasoning)框架,该框架基于实例分割的器官点云数据,构建包含物理连接关系和方向条件遮挡信息的场景图;并通过引入遮挡排序任务、方向感知图神经网络架构(含按果实分组的叶集注意力机制和联合层级聚合策略),显著提升了遮挡预测(F1=0.73,NDCG@3=0.85)与附着关系推理(边F1=0.83)性能,为下游干预规划提供结构化的关联信号。
链接: https://arxiv.org/abs/2603.06512
作者: Rohit Menon,Niklas Mueller-Goldingen,Sicong Pan,Gokul Krishna Chenchani,Maren Bennewitz
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robotic harvesting in dense crop canopies requires effective interventions that depend not only on geometry, but also on explicit, direction-conditioned relations identifying which organs obstruct a target fruit. We present SG-DOR (Scene Graphs with Direction-Conditioned Occlusion Reasoning), a relational framework that, given instance-segmented organ point clouds, infers a scene graph encoding physical attachments and direction-conditioned occlusion. We introduce an occlusion ranking task for retrieving and ranking candidate leaves for a target fruit and approach direction, and propose a direction-aware graph neural architecture with per-fruit leaf-set attention and union-level aggregation. Experiments on a multi-plant synthetic pepper dataset show improved occlusion prediction (F1=0.73, NDCG@3=0.85) and attachment inference (edge F1=0.83) over strong ablations, yielding a structured relational signal for downstream intervention planning.
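摘要中用 NDCG@3 评估候选叶片的遮挡排序质量。下面是 NDCG@k 的标准定义(通用指标计算,非论文代码),输入为按预测排名顺序排列的真实相关度:

```python
import numpy as np

def ndcg_at_k(relevance_in_pred_order, k=3):
    """标准 NDCG@k:DCG 除以理想排序下的 IDCG,取值 [0, 1]。"""
    rel = np.asarray(relevance_in_pred_order, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, min(k, rel.size) + 2))
    dcg = (rel[:k] * discounts).sum()
    ideal = np.sort(rel)[::-1]          # 按相关度降序的理想排名
    idcg = (ideal[:k] * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0
```

预测排序与理想排序一致时得 1.0;把最相关的叶片排到末位会显著压低该分数。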
[CV-15] Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
【速读】:该论文旨在解决当前扩散模型和流匹配模型在训练过程中依赖外部预训练模型来获取语义表示的问题,这些问题包括需额外训练、目标不一致以及不可预测的扩展行为。解决方案的关键在于提出Self-Flow框架,其核心机制是Dual-Timestep Scheduling——通过在不同时间步施加异构噪声水平,制造token间的信息不对称性,迫使模型从受损输入中推断缺失信息,从而在无需外部监督的情况下,将表征学习内嵌至生成框架中,同时保持多模态兼容性和符合预期的缩放规律。
链接: https://arxiv.org/abs/2603.06507
作者: Hila Chefer,Patrick Esser,Dominik Lorenz,Dustin Podell,Vikash Raja,Vinh Tong,Antonio Torralba,Robin Rombach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project webpage: this https URL
Abstract:Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model’s training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
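Dual-Timestep Scheduling 的核心是给不同 token 施加不同的噪声水平。下面是一个假设性草图(非论文实现):采用线性流匹配路径 x_t = (1−t)·x0 + t·ε,每个 token 独立采样自己的时间步 t,从而制造摘要所说的 token 间信息不对称。

```python
import numpy as np

def dual_timestep_noising(x0, rng):
    """x0: (num_tokens, dim)。每个 token 独立采样 t ∈ [0, 1],
    得到异构噪声水平(Dual-Timestep Scheduling 的示意性草图)。"""
    n, d = x0.shape
    t = rng.uniform(0.0, 1.0, size=(n, 1))      # 逐 token 的时间步
    eps = rng.standard_normal((n, d))
    xt = (1.0 - t) * x0 + t * eps               # 线性流匹配插值
    return xt, t.squeeze(1), eps
```

噪声较轻的 token 保留更多信息,迫使模型借助它们去推断噪声较重 token 的内容,这正是摘要所述表征学习激励的来源。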
[CV-16] Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching
【速读】:该论文旨在解决医学影像等专业领域中视频逐帧标注成本高昂的问题,尤其针对标签传播(label propagation)在跨视频场景下的局限性。现有方法如视频追踪器或分割模型仅适用于单视频内传播且需逐视频初始化,而传统对应匹配方法依赖检测器选取的关键点,在低纹理场景表现不佳,同时密集特征匹配与一次性分割方法虽支持跨视频传播,却缺乏时空一致性且难以统一处理点和掩码标注。其解决方案的核心在于提出Match4Annotate框架:通过在测试时拟合基于SIREN的隐式神经表示(implicit neural representation)到DINOv3特征上,构建连续、高分辨率的时空特征场,并学习帧对间的平滑隐式形变场以指导对应匹配,从而实现点与掩码标注在视频内及跨视频的高效、平滑传播。
链接: https://arxiv.org/abs/2603.06471
作者: Zhuorui Zhang,Roger Pallarès-López,Praneeth Namburi,Brian W. Anthony
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
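摘要中的隐式神经表示基于 SIREN(正弦激活网络),其单层形式为 y = sin(ω·(Wx + b))。下面是一个示意性的 numpy 前向实现(初始化遵循常见的 SIREN 方案,具体数值为假设,非论文配置):

```python
import numpy as np

class SirenLayer:
    """单层 SIREN:周期激活使其适合拟合连续的高分辨率特征场。"""
    def __init__(self, d_in, d_out, omega=30.0, first=False, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        # 首层与后续层采用不同的初始化界(标准 SIREN 方案)
        bound = 1.0 / d_in if first else np.sqrt(6.0 / d_in) / omega
        self.W = rng.uniform(-bound, bound, (d_out, d_in))
        self.b = np.zeros(d_out)
        self.omega = omega

    def __call__(self, x):
        return np.sin(self.omega * (x @ self.W.T + self.b))
```

将若干此类层堆叠并在测试时拟合到 DINOv3 特征上,即对应摘要中连续时空特征场的构造思路。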
[CV-17] GreenRFM: Toward a resource-efficient radiology foundation model
【速读】:该论文旨在解决放射学基础模型(Radiology Foundation Models, RFMs)发展中对大规模算力依赖的问题,现有方法多沿用自然图像领域的“暴力扩展”策略,导致模型在临床实践中表现出脆弱性和高成本。其解决方案的关键在于提出一种资源高效预训练框架GreenRFM,通过设计原则化的监督机制——即更精炼(More distilled)、更普遍(Ubiquitous)、语义强化(Semantic-enforcing)和任务对齐(Task-aligning)的MUST监督策略,最大化利用监督信号,而非单纯堆砌数据量。这一方法显著降低了计算需求(如单卡24GB GPU即可完成训练),同时实现了优于复杂参数密集型模型的性能,验证了监督设计在跨模态、跨人群泛化中的核心作用。
链接: https://arxiv.org/abs/2603.06467
作者: Yingtai Li,Shuai Ming,Mingyue Zhao,Haoran Lai,Rongsheng Wang,Rui Zhou,Rundong Wang,Yujia Li,Wei Wei,Shaohua Kevin Zhou
机构: University of Science and Technology of China (USTC); Suzhou Institute for Advanced Research, USTC; The First Affiliated Hospital of USTC; Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology; State Key Laboratory of Precision and Intelligent Chemistry
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply piling up the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions and of two modalities. GreenRFMs achieve superior performances on chest and abdominal CT datasets, regardless of public or private benchmark, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the “scale is all you need” dogma and democratize the equitable development of state-of-the-art RFMs for clinicians even on a laptop.
[CV-18] Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中几何信息表达不充分的问题,即其文本路径无法有效捕捉连续几何特征(如手部关节角度),导致生成式 AI (Generative AI) 在几何推理任务上表现受限。解决方案的关键在于发现并利用冻结模型中潜藏的几何信号:通过轻量级线性探测器(6,000参数)即可从冻结特征中提取手部关节角度,误差仅为6.1° MAE,显著优于文本输出的20.0°;进一步表明该瓶颈源于训练路径而非表征能力,并揭示了不同架构在功能层面收敛但表征差异大(CKA=0.41),且注意力机制在中间层(18–22层)承载主要几何信号,从而实现无需微调或文本生成的多任务几何感知。
链接: https://arxiv.org/abs/2603.06459
作者: Yakov Pyotr Shkolnikov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees – a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity – functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL’s LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
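摘要中用 CKA=0.41 衡量编码器之间的表征相似度。线性 CKA 的标准定义如下(通用指标,非论文代码),它对正交变换与各向同性缩放不变:

```python
import numpy as np

def linear_cka(X, Y):
    """线性 CKA:同一批 n 个样本的特征矩阵 X: (n, d1) 与 Y: (n, d2)
    的表征相似度,取值 [0, 1]。"""
    X = X - X.mean(axis=0)        # 按列中心化
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, 'fro') ** 2
    norm = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / norm
```

正因其不变性,CKA=0.41 表明这些编码器的差异是实质性的表征差异,而非坐标系或尺度不同,这与摘要“功能收敛但表征不收敛”的结论一致。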
[CV-19] Training Flow Matching: The Role of Weighting and Parameterization
【速读】:该论文旨在解决基于去噪(denoising-based)生成模型中训练目标设计的关键问题,特别是损失加权策略与输出参数化方式(包括噪声、干净图像和速度为基础的公式)如何影响模型性能。其解决方案的核心在于通过系统的数值实验,厘清这些训练选择如何与数据流形的内在维度、模型架构及数据集规模相互作用,从而为流形匹配(flow matching)模型的设计提供可操作的实践洞见。
链接: https://arxiv.org/abs/2603.06454
作者: Anne Gagneux,Ségolène Martin,Rémi Gribonval,Mathurin Massias
机构: ENS de Lyon (法国高等师范学校); CNRS (法国国家科学研究中心); Université Claude Bernard Lyon 1 (克莱蒙-奥弗涅大学); Inria (法国国家信息与自动化研究院); LIP (软件实验室); Technische Universität Berlin (柏林工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We study the training objectives of denoising-based generative models, with a particular focus on loss weighting and output parameterization, including noise-, clean image-, and velocity-based formulations. Through a systematic numerical study, we analyze how these training choices interact with the intrinsic dimensionality of the data manifold, model architecture, and dataset size. Our experiments span synthetic datasets with controlled geometry as well as image data, and compare training objectives using quantitative metrics for denoising accuracy (PSNR across noise levels) and generative quality (FID). Rather than proposing a new method, our goal is to disentangle the various factors that matter when training a flow matching model, in order to provide practical insights on design choices.
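摘要比较的三种输出参数化(噪声 ε、干净图像 x0、速度 v)在线性插值路径 x_t = (1−t)·x0 + t·ε 下可以互相换算:v = ε − x0,进而 x0 = x_t − t·v、ε = x_t + (1−t)·v。下面用一个小草图验证这些恒等式(常见流匹配约定,属示意而非论文代码):

```python
import numpy as np

def to_velocity(x0, eps):
    return eps - x0               # v = dx_t/dt(线性路径的速度)

def recover_from_velocity(xt, v, t):
    x0 = xt - t * v               # x0 = x_t - t·v
    eps = xt + (1.0 - t) * v      # eps = x_t + (1-t)·v
    return x0, eps
```

三种参数化在数学上等价,但换算系数随 t 变化,这正是损失加权与参数化选择会相互作用、进而影响训练的原因。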
[CV-20] Pinterest Canvas: Large-Scale Image Generation at Pinterest
【速读】:该论文旨在解决生成式 AI(Generative AI)模型在实际产品应用中难以控制的问题,尤其是在需要严格满足特定功能需求的图像编辑与增强场景下,单一通用模型难以通过提示词或简单推理调整实现精准适配。解决方案的关键在于构建一个可扩展的大规模图像生成系统 Pinterest Canvas,其核心策略是首先训练一个具备广泛图像编辑能力的基础扩散模型(diffusion model),随后针对不同下游任务快速微调该基础模型,形成专用子模型;同时通过系统化的数据集构建、训练和推理优化实践,确保各任务变体能够高效响应具体的产品要求,从而在多个应用场景(如背景增强、长宽比外推等)中显著提升用户参与度并优于第三方模型表现。
链接: https://arxiv.org/abs/2603.06453
作者: Yu Wang,Eric Tzeng,Raymond Shiau,Jie Yang,Dmitry Kislyuk,Charles Rosenberg
机构: Pinterest Inc.(Pinterest公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.
[CV-21] CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
【速读】:该论文旨在解决视觉领域中自回归(Autoregressive, AR)建模的因果性难题,即如何将语言模型中成功的因果tokenization范式有效扩展到图像处理。当前方法要么将二维图像块(patch)展平为非因果序列,要么采用启发式排序破坏“下一个token预测”的因果结构;而扩散自动编码器则因条件解码依赖全部token或引入嵌套dropout机制导致训练不平衡。解决方案的关键在于提出CaTok——一种1D因果图像分词器与MeanFlow解码器相结合的新架构:通过在时间区间内选择token并绑定至MeanFlow目标函数,学习具有因果性的1D表示,从而支持快速单步生成和高保真多步采样,同时自然捕捉跨token区间的多样视觉概念。此外,引入REPA-A正则化策略进一步提升训练稳定性与效率,使模型在ImageNet重建任务上达到0.75 FID、22.53 PSNR和0.674 SSIM的SOTA性能,且训练周期更短。
链接: https://arxiv.org/abs/2603.06449
作者: Yitong Chen,Zuxuan Wu,Xipeng Qiu,Yu-Gang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website is available in this https URL
Abstract:Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the “next-token prediction” pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.
[CV-22] What if? Emulative Simulation with World Models for Situated Reasoning
【速读】:该论文旨在解决在物理约束或安全风险限制下,智能体如何通过心理模拟(mental exploration)实现情境推理的问题,即在仅有限观测条件下,能否生成未来轨迹并回答空间类“假设性”问题。其解决方案的关键在于构建了首个大规模数据集WanderDream,包含15.8K全景视频(WanderDream-Gen)和158K问答对(WanderDream-QA),用于训练和评估模型进行无主动探索的情境推理能力;实验表明,世界模型(world models)在该数据集上表现优异,且想象力显著提升推理性能,同时数据具备良好的跨场景迁移能力。
链接: https://arxiv.org/abs/2603.06445
作者: Ruiping Liu,Yufan Chen,Yuheng Zhang,Junwei Zheng,Kunyu Peng,Chengzhi Wu,Chenguang Huang,Di Wen,Jiaming Zhang,Kailun Yang,Rainer Stiefelhagen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.
[CV-23] CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation
【速读】:该论文旨在解决现有零样本医学图像分割模型(如nnInteractive)在多样化的医学影像任务中难以持续达到专家级性能的问题。其核心挑战在于,尽管这些模型具备良好的泛化能力,但面对不同解剖结构和成像特征的任务时,性能表现不稳定。为此,作者提出了一种名为CLoPA的持续适应策略,其关键在于:利用标注缓存中的少量任务特定数据,仅对nnInteractive模型中的一小部分参数进行微调,并通过轻量级的“训练轮次调度”机制触发更新。该方法无需引入新参数或修改推理流程,可无缝集成至现有的标注工作流中,在8个医学分割基准任务上实现了快速且显著的性能提升,尤其在原本表现不佳的任务中效果突出,且多数性能增益可在单次训练轮次后达成。
链接: https://arxiv.org/abs/2603.06426
作者: Parhom Esmaeili,Chayanin Tangwiriyasakul,Eli Gibson,Sebastien Ourselin,M. Jorge Cardoso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 2 figures
Abstract:Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-level performance across diverse medical imaging tasks. Because annotation campaigns produce a growing stream of task-specific labelled data, online adaptation of the segmentation model is a natural complement to zero-shot inference. We propose CLoPA, a continual adaptation strategy that tunes a small fraction of nnInteractive’s parameters on the annotation cache, triggered by lightweight episode scheduling. CLoPA requires no new parameters or changes to the inference pipeline, and operates entirely within the existing annotation workflow. Across eight Medical Segmentation Decathlon tasks spanning diverse anatomical targets and imaging characteristics, CLoPA rapidly elevates performance to expert-level, even for tasks where nnInteractive previously failed, with the majority of gains realised after a single training episode. We show that the benefits of tuning different parameter groups depends on task characteristics and data regimes. Also, that for targets with complex geometries (e.g., hepatic vessels), instance normalisation and low-level feature tuning saturates, suggesting a need for deeper feature-representation alignment in the most challenging scenarios.
[CV-24] Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo Vision
【速读】:该论文旨在解决水族箱环境中鱼类体型测量的难题,尤其是由于鱼体较小及玻璃-水界面引起的强折射畸变导致的传统图像测量方法精度不足的问题。其解决方案的关键在于提出了一种非侵入式的、考虑折射效应的立体视觉方法:首先利用YOLOv11-Pose网络检测鱼体并预测关键点;随后引入一种考虑空气-玻璃-水界面的折射感知对极约束以实现鲁棒匹配,并通过学习到的质量评分剔除低质量检测结果;最终基于折射感知的三维三角测量恢复关键点三维坐标,从而准确估算鱼长。该方法在濒危苏拉威西稻鱼的立体图像数据集上验证有效,表明过滤低质量检测对精确长度估计至关重要。
链接: https://arxiv.org/abs/2603.06421
作者: Clemens Seibold,Anna Hilsmann,Peter Eisert
机构: Humboldt University of Berlin (柏林洪堡大学); Fraunhofer HHI (弗劳恩霍夫海因里希·赫兹研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at VISAPP 2026
Abstract:Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
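折射感知三角测量的基础是对每条视线在空气-玻璃、玻璃-水两个界面上依次应用斯涅尔定律。下面是其向量形式的标准公式(通用几何光学,非论文实现):

```python
import numpy as np

def refract(d, n, eta):
    """向量形式的斯涅尔定律。d 为入射单位方向,n 为指向入射介质一侧的
    单位法线,eta = n_in / n_out。返回折射后的单位方向;全反射时返回 None。
    空气-玻璃-水场景需以相应折射率比在两个界面各调用一次。"""
    d = d / np.linalg.norm(d)
    n = n / np.linalg.norm(n)
    cos_i = -d @ n
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None               # 全反射,无折射光线
    return eta * d + (eta * cos_i - np.sqrt(k)) * n
```

用折射后的光线(而非直线视线)做立体匹配与三角化,即可消除玻璃-水界面带来的系统性深度偏差。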
[CV-25] Physical Simulator In-the-Loop Video Generation CVPR2026
【速读】:该论文旨在解决基于扩散模型的视频生成技术在物理合理性方面的不足,即生成的视频中物体运动常违背基本物理定律(如重力、惯性与碰撞),导致帧间运动不一致或动态行为不可信,从而限制了视频的真实感与可靠性。其解决方案的关键在于提出一种“物理模拟器内循环视频生成”(Physical Simulator In-the-loop Video Generation, PSIVG)框架,通过将物理模拟器嵌入视频扩散过程,在预训练扩散模型生成的模板视频基础上重建4D场景和前景物体网格,并在物理模拟器中初始化这些几何体以生成符合物理规律的轨迹;随后利用这些轨迹引导视频生成器实现时空上物理一致的运动。此外,为提升物体移动过程中的纹理一致性,还引入了测试时纹理一致性优化(Test-Time Texture Consistency Optimization, TTCO)技术,基于模拟器中的像素对应关系自适应调整文本和特征嵌入。
链接: https://arxiv.org/abs/2603.06408
作者: Lin Geng Foo,Mark He Huang,Alexandros Lattas,Stylianos Moschoglou,Thabo Beeler,Christian Theobalt
机构: Max Planck Institute for Informatics (马普所信息学研究所); Saarland Informatics Campus (萨尔兰信息学校区); Singapore University of Technology and Design (新加坡科技设计大学); A*STAR (新加坡科技研究局); Google (谷歌); Saarbrücken Research Center for Visual Computing, Interaction and Artificial Intelligence (萨尔布吕肯视觉计算、交互与人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Accepted to CVPR 2026
Abstract:Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: this https URL
[CV-26] Locating and Editing Figure-Ground Organization in Vision Transformers
【速读】:该论文旨在解决视觉Transformer模型在处理图形-背景组织(figure-ground organization)时的感知模糊性问题,即模型在面对局部几何证据与全局格式塔先验(Gestalt prior)冲突时如何做出选择。其关键解决方案是通过设计基于合成“飞镖”形状的可控感知冲突实验,系统性地遮蔽既可支持凹形补全也可支持凸形补全的区域,从而揭示BEiT模型对凸性先验的偏好。研究进一步利用logit attribution将内部激活投影到离散视觉代码本空间,发现这种偏好由Transformer子结构中特定的功能单元调控:早期和中间层存在感知模糊,而晚期层则突然实现决策;特别地,识别出第0层第9个注意力头(L0H9)作为早期凸性偏好的“种子”,仅削弱该注意力头即可连续移动感知冲突的决策边界,使凹形证据主导补全结果。
链接: https://arxiv.org/abs/2603.06407
作者: Stefan Arnold,René Gröbner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers must resolve figure-ground organization by choosing between completions driven by local geometric evidence and those favored by global organizational priors, giving rise to a characteristic perceptual ambiguity. We aim to locate where the canonical Gestalt prior convexity is realized within the internal components of BEiT. Using a controlled perceptual conflict based on synthetic shapes of darts, we systematically mask regions that equally admit either a concave completion or a convex completion. We show that BEiT reliably favors convex completion under this competition. Projecting internal activations into the model’s discrete visual codebook space via logit attribution reveals that this preference is governed by identifiable functional units within transformer substructures. Specifically, we find that figure-ground organization is ambiguous through early and intermediate layers and resolves abruptly in later layers. By decomposing the direct effect of attention heads, we identify head L0H9 acting as an early seed, introducing a weak bias toward convexity. Downscaling this single attention head shifts the distributional mass of the perceptual conflict across a continuous decision boundary, allowing concave evidence to guide completion.
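摘要中“削弱单个注意力头”可以实现为:在多头输出合并前,把目标头的输出按系数 α<1 缩放。下面是一个省略输出投影的示意性草图(非论文实现,头数与维度均为假设):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_with_scaling(x, Wq, Wk, Wv, head_scale):
    """x: (T, d);Wq/Wk/Wv: (H, d, dh);head_scale: (H,)。
    把某个头的 head_scale 设为小于 1,即对应对该头(如第 0 层第 9 个头
    L0H9)的下调干预。"""
    outs = []
    for h in range(Wq.shape[0]):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        att = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        outs.append(head_scale[h] * (att @ v))  # 逐头缩放后再拼接
    return np.concatenate(outs, axis=-1)
```

连续调节 α 即可像摘要描述的那样,在凸/凹两种补全之间移动感知冲突的决策边界。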
[CV-27] DiffInf: Influence-Guided Diffusion for Supervision Alignment in Facial Attribute Learning
【Quick Read】: This paper targets supervision errors in facial attribute classification caused by inconsistent annotations, which arise from the mismatch between inherently ambiguous, continuous attributes and discrete labels, as well as from visual confounders such as pose, illumination, expression, and demographic variation. The key of the proposed DiffInf framework is to identify and repair high-influence training samples via self-influence-guided diffusion: sample-wise self-influence scores are first computed with a first-order approximation to locate destabilizing instances; a latent diffusion autoencoder then applies targeted generative correction so that the visual content better matches the assigned label while preserving identity and realism; and a lightweight predictor of high-influence membership serves as a differentiable surrogate regularizer to guide the correction. The result is an influence-refined dataset of unchanged size that markedly improves downstream generalization in facial attribute classification.
Link: https://arxiv.org/abs/2603.06399
Authors: Basudha Pal, Rama Chellappa
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Facial attribute classification relies on large-scale annotated datasets in which many traits, such as age and expression, are inherently ambiguous and continuous but are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders such as pose, illumination, expression, and demographic variation, creating mismatch between images and assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction. We introduce DiffInf, a self-influence–guided diffusion framework for mitigating annotation inconsistencies in facial attribute learning. We first train a baseline classifier and compute sample-wise self-influence scores using a practical first-order approximation to identify training instances that disproportionately destabilize optimization. Instead of discarding these influential samples, we apply targeted generative correction via a latent diffusion autoencoder to better align visual content with assigned labels while preserving identity and realism. To enable differentiable guidance during correction, we train a lightweight predictor of high-influence membership and use it as a surrogate influence regularizer. The edited samples replace the originals, yielding an influence-refined dataset of unchanged size. Across multi-class facial attribute classification, DiffInf consistently improves generalization compared with standard noisy-label training, robust optimization baselines, and influence-based filtering. Our results demonstrate that repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage.
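The "practical first-order approximation" of self-influence mentioned in the abstract corresponds, in its simplest form, to scoring each training sample by the squared norm of its per-sample loss gradient. The toy sketch below illustrates that generic idea on a logistic model with made-up weights; it is stdlib-only and an assumption of ours, not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def self_influence(w, x, y):
    """First-order self-influence: squared gradient norm of the per-sample
    loss L = -log(sigmoid(y * w.x)) for a logistic model, with y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    r = sigmoid(-margin)            # residual: large when the sample is misfit
    return (r * r) * sum(xi * xi for xi in x)

# Toy model that separates on the first feature.
w = [2.0, 0.0]
clean = self_influence(w, [1.0, 0.5], +1)    # label agrees with the model
flipped = self_influence(w, [1.0, 0.5], -1)  # label conflicts -> destabilizing
print(clean < flipped)  # True: the inconsistent sample gets the higher score
```

Samples with the highest scores are the "disproportionately destabilizing" instances that DiffInf corrects rather than discards.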
[CV-28] Solving Jigsaw Puzzles in the Wild: Human-Guided Reconstruction of Cultural Heritage Fragments
【Quick Read】: This paper addresses the reassembly of real-world archaeological artifact fragments, which is extremely challenging due to erosion, missing regions, irregular shapes, and large-scale ambiguity, especially when fragment counts reach the thousands (as in the RePAIR benchmark), where traditional jigsaw solvers break down. The key of the solution is a human-in-the-loop (HIL) framework that combines an automatic relaxation-labeling solver with interactive human guidance, letting users iteratively lock verified placements, correct errors, and steer the system toward semantically and geometrically coherent assemblies. Two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, support scalable reconstruction across varying levels of ambiguity and puzzle size, substantially outperforming both fully automatic and fully manual baselines in accuracy and efficiency.
Link: https://arxiv.org/abs/2603.06389
Authors: Omidreza Safaei, Sinem Aslan, Sebastiano Vascon, Luca Palmieri, Marina Khoroshiltseva, Marcello Pelillo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures. Presented at the 2025 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP). This is the author-accepted version of the paper. The final version is available via IEEE Xplore: this https URL
Abstract:Reassembling real-world archaeological artifacts from fragmented pieces poses significant challenges due to erosion, missing regions, irregular shapes, and large-scale ambiguity. Traditional jigsaw puzzle solvers, often designed for clean synthetic scenarios, struggle under these conditions, especially when the number of fragments grows into the thousands, as in the RePAIR benchmark. In this paper, we propose a human-in-the-loop (HIL) puzzle solving framework designed to address the complexity and scale of real-world cultural heritage reconstruction. Our approach integrates an automatic relaxation-labeling solver with interactive human guidance, allowing users to iteratively lock verified placements, correct errors, and guide the system toward semantically and geometrically coherent assemblies. We introduce two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, which support scalable reconstruction across varying levels of ambiguity and puzzle size. Experiments on several RePAIR groups demonstrate that our hybrid approach substantially outperforms both fully automatic and manual baselines in accuracy and efficiency, offering a practical solution for large-scale expert-in-the-loop artifact reassembly.
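The abstract pairs a relaxation-labeling solver with human-locked placements. A minimal sketch of that combination is shown below, assuming the standard relaxation-labeling update (belief times neighbor support, then normalize) and a hypothetical `locked` map for the human-verified anchors; the compatibilities are toy values, not the RePAIR setup.

```python
def relax_step(p, R, locked):
    """One relaxation-labeling update. p[i][k]: belief that piece i takes
    placement k; R[i][j][k][l]: compatibility of (i at k) with (j at l);
    locked: {piece: placement} confirmed by the human and kept frozen."""
    n, m = len(p), len(p[0])
    new_p = []
    for i in range(n):
        if i in locked:                      # human-verified: one-hot, frozen
            new_p.append([1.0 if k == locked[i] else 0.0 for k in range(m)])
            continue
        support = [sum(R[i][j][k][l] * p[j][l]
                       for j in range(n) if j != i
                       for l in range(m)) for k in range(m)]
        raw = [p[i][k] * support[k] for k in range(m)]
        z = sum(raw) or 1.0                  # guard against an all-zero row
        new_p.append([v / z for v in raw])
    return new_p

# Two pieces, two candidate placements; placements (0, 1) are compatible.
R = [[[[0, 0], [0, 0]], [[0, 1], [0.2, 0]]],
     [[[0, 0.2], [1, 0]], [[0, 0], [0, 0]]]]
p = [[0.5, 0.5], [0.5, 0.5]]
for _ in range(5):
    p = relax_step(p, R, locked={0: 0})      # user locks piece 0 at slot 0
print(max(range(2), key=lambda k: p[1][k]))  # piece 1 settles in slot 1
```

The locked anchor is what makes the iteration converge to the placement consistent with the human's verification, mirroring the Iterative Anchoring idea at a toy scale.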
[CV-29] REACT: Efficient Cross-Attention for Real-Time Scene Graph Generation
【Quick Read】: This paper tackles the trade-off between performance and inference speed in Scene Graph Generation (SGG): existing methods typically pursue only one of relation prediction accuracy, object detection accuracy, or low latency, and struggle to optimize all three jointly. The key of the solution is to build on the REACT architecture and propose REACT++, which introduces efficient feature extraction and subject-to-object cross-attention within the prototype space, markedly improving relation prediction accuracy and reducing inference latency while preserving object detection performance, thereby balancing high accuracy with real-time operation.
Link: https://arxiv.org/abs/2603.06386
Authors: Maëlic Neau, Zoe Falomir
Affiliations: Umeå University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at this https URL.
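The subject-to-object cross-attention mentioned in the abstract is, at its core, scaled dot-product attention in which subject queries attend over object keys and values. The pure-Python sketch below shows only that generic mechanism with toy prototypes; it is not the REACT++ implementation.

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each subject query attends over
    object keys/values and returns a weighted mix of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]          # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

subj = [[1.0, 0.0]]                      # one subject prototype
objs = [[1.0, 0.0], [0.0, 1.0]]          # two object prototypes (keys = values)
attended = cross_attention(subj, objs, objs)
print(attended[0][0] > attended[0][1])   # True: attends mostly to the match
```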
[CV-30] Prompt Group-Aware Training for Robust Text-Guided Nuclei Segmentation
【Quick Read】: This paper addresses the excessive prompt sensitivity of foundation models in medical image segmentation: semantically equivalent text descriptions can still produce markedly different segmentation masks, limiting reliability in clinical and pathology workflows. The key of the solution is to reformulate prompt sensitivity as a group-wise consistency problem: semantically related prompts are organized into prompt groups sharing the same ground-truth mask, and a prompt group-aware training framework, requiring no architectural changes and leaving inference untouched, combines two mechanisms: (i) quality-guided group regularization that uses the segmentation loss as an implicit ranking signal to raise within-group prediction quality, and (ii) a logit-level consistency constraint with a stop-gradient strategy that aligns predictions across prompts within each group. The method substantially reduces performance variance across prompt quality levels and improves Dice by an average of 2.16 points on six zero-shot cross-dataset tasks, demonstrating stronger robustness and generalization for vision-language segmentation.
Link: https://arxiv.org/abs/2603.06384
Authors: Yonghuang Wu, Zhenyang Liang, Wenwen Zeng, Xuan Xie, Jinhua Yu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Foundation models such as Segment Anything Model 3 (SAM3) enable flexible text-guided medical image segmentation, yet their predictions remain highly sensitive to prompt formulation. Even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical and pathology workflows. We reformulate prompt sensitivity as a group-wise consistency problem. Semantically related prompts are organized into \emphprompt groups sharing the same ground-truth mask, and a prompt group-aware training framework is introduced for robust text-guided nuclei segmentation. The approach combines (i) a quality-guided group regularization that leverages segmentation loss as an implicit ranking signal, and (ii) a logit-level consistency constraint with a stop-gradient strategy to align predictions within each group. The method requires no architectural modification and leaves inference unchanged. Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, our method improves Dice by an average of 2.16 points. These results demonstrate improved robustness and generalization for vision-language segmentation in computational pathology.
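The logit-level consistency constraint with stop-gradient can be sketched as pulling each prompt's logits toward the group mean while treating that mean as a constant during differentiation. The stdlib-only illustration below encodes that assumption directly; the paper's exact loss may differ.

```python
def group_consistency_loss(logits):
    """Logit-level group consistency with a stop-gradient target: the group
    mean is treated as a constant, so each prompt's prediction is pulled
    toward it without the target itself receiving gradient."""
    n = len(logits)
    target = [sum(col) / n for col in zip(*logits)]   # stop-grad group mean
    loss = sum((x - t) ** 2 for row in logits for x, t in zip(row, target)) / n
    # Under stop-gradient, d(loss)/d(logit) ignores the target's dependence:
    grads = [[2.0 * (x - t) / n for x, t in zip(row, target)] for row in logits]
    return loss, grads

# Three prompts from one group, slightly disagreeing on two classes.
loss, grads = group_consistency_loss([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
print(round(loss, 4))   # 1.3333: squared deviation from the group mean
print(grads[1])         # [0.0, 0.0]: the prompt at the mean gets no gradient
```

With a true stop-gradient, only the outlying prompts move, which is what keeps the within-group predictions from drifting apart.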
[CV-31] CHMv2: Improvements in Global Canopy Height Mapping using DINOv3
【Quick Read】: This paper addresses the globally uneven availability of high-fidelity canopy height data, particularly in regions lacking airborne laser scanning (ALS) coverage; existing products are biased in tall forests and struggle to preserve fine-scale structure such as canopy edges and gaps. The key of the solution is a depth-estimation model built on DINOv3 that derives CHMv2, a global, meter-resolution canopy height map, from high-resolution optical satellite imagery; the gains are enabled by a large expansion of geographically diverse training data, automated data curation and registration pipelines, and a loss formulation and sampling strategy tailored to canopy height distributions, together yielding substantially higher accuracy and a better depiction of complex canopy structure.
Link: https://arxiv.org/abs/2603.06382
Authors: John Brandt, Seungeun Yi, Jamie Tolan, Xinyuan Li, Peter Potapov, Jessica Ertel, Justine Spore, Huy V. Vo, Michaël Ramamonjisoa, Patrick Labatut, Piotr Bojanowski, Camille Couprie
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Nature Scientific Data
Abstract:Accurate canopy height information is essential for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure, yet high-fidelity measurements from airborne laser scanning (ALS) remain unevenly available globally. Here we present CHMv2, a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models. Compared to existing products, CHMv2 substantially improves accuracy, reduces bias in tall forests, and better preserves fine-scale structure such as canopy edges and gaps. These gains are enabled by a large expansion of geographically diverse training data, automated data curation and registration, and a loss formulation and data sampling strategy tailored to canopy height distributions. We validate CHMv2 against independent ALS test sets and against tens of millions of GEDI and ICESat-2 observations, demonstrating consistent performance across major forest biomes.
[CV-32] MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis
【Quick Read】: This paper addresses the challenges of whole-slide image (WSI) analysis arising from gigapixel scale and an inherent multi-resolution hierarchy: existing multiple instance learning (MIL) methods treat a WSI as an unordered bag of patches and thus fail to capture structured dependencies between global tissue organization and local cellular patterns. The key of the proposed MoEMambaMIL framework is region-nested selective scanning, which organizes patch tokens into structured sequences that preserve spatial containment across resolutions, combined with a mixture-of-experts (MoE) design that decouples resolution-aware encoding from region-adaptive contextual modeling: static resolution-specific experts and dynamic sparse experts cooperate through learned routing, promoting expert specialization under heterogeneous diagnostic patterns while retaining efficient long-sequence modeling and substantially improving the modeling of WSI spatial hierarchy.
Link: https://arxiv.org/abs/2603.06378
Authors: Dongqing Xie, Yonghuang Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures, 6 tables
Abstract:Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.
[CV-33] Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
【Quick Read】: This paper addresses the performance gap caused by sparse annotations in weakly supervised semantic segmentation, i.e., how to improve 2D semantic segmentation when only a handful of pixel-level labels are available. The key of the solution is to introduce 3D scene reconstruction as an auxiliary supervisory signal, using 3D geometric structure recovered from 2D videos to propagate the sparse annotations; concretely, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, with state-of-the-art feed-forward reconstruction providing reliable geometric supervision, which substantially boosts weakly supervised segmentation performance.
Link: https://arxiv.org/abs/2603.06374
Authors: Jonas Ernst, Wolfgang Boettcher, Lukas Hoyer, Jan Eric Lenssen, Bernt Schiele
Affiliations: Saarland University; Max Planck Institute for Informatics, SIC; ETH Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.
[CV-34] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
【Quick Read】: This paper addresses the limited clinical reliability of existing vision-language models for panoramic dental radiograph analysis, which stems from their static single-pass inference paradigm: they lack fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification. The key of the solution is OralGPT-Plus, an agentic vision-language model that performs iterative, symmetry-aware diagnostic reasoning; the authors build the DentalProbe dataset (5,000 images with expert-curated diagnostic trajectories) to provide structured supervision for localized inspection and contralateral comparison, and develop a reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning, with effectiveness validated on the new MMOral-X benchmark.
Link: https://arxiv.org/abs/2603.06366
Authors: Yuxuan Fan, Jing Hao, Hong Chen, Jiahao Bao, Yihua Shao, Yuci Liang, Kuo Feng Hung, Hao Tang
Affiliations: The Hong Kong University of Science and Technology (GZ); Faculty of Dentistry, The University of Hong Kong; School of Computer Science, Peking University; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences; College of Computer Science and Software Engineering, Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 24 figures, conference
Abstract:Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
[CV-35] Computer vision-based estimation of invertebrate biomass
【Quick Read】: This paper tackles the problem of accurately estimating invertebrate dry mass from images alone, so that quantitative biodiversity monitoring can be scaled up; the traditional approach of manually dry-weighing specimens is laborious and destructive. Two solutions requiring no extra manual effort are proposed: a linear model built on novel image-derived predictors (area and sinking speed) computed automatically by the dual-camera BIODISCOVER system, and a family of end-to-end deep neural networks with single-view, multi-view, and metadata-aware architectures. The key is to exploit automated imaging hardware to extract quantifiable biophysical features and to train and validate models on a large dataset of paired images and dry-mass measurements, achieving accurate estimates (median percentage error of 10-20%) even for morphologically complex and diverse specimens and providing a reliable tool for group-level biomass estimation.
Link: https://arxiv.org/abs/2603.06362
Authors: Mikko Impiö, Philipp M. Rehsen, Jarrett Blair, Cecilie Mielec, Arne J. Beermann, Florian Leese, Toke T. Høye, Jenni Raitoharju
Affiliations: Finnish Environment Institute; University of Helsinki
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The ability to estimate invertebrate biomass using only images could help scaling up quantitative biodiversity monitoring efforts. Computer vision-based methods have the potential to omit the manual, time-consuming, and destructive process of dry weighing specimens. We present two approaches for dry mass estimation that do not require additional manual effort apart from imaging the specimens: fitting a linear model with novel predictors, automatically calculated by an imaging device, and training a family of end-to-end deep neural networks for the task, using single-view, multi-view, and metadata-aware architectures. We propose using area and sinking speed as predictors. These can be calculated with BIODISCOVER, which is a dual-camera system that captures image sequences of specimens sinking in an ethanol column. For this study, we collected a large dataset of dry mass measurement and image sequence pairs to train and evaluate models. We show that our methods can estimate specimen dry mass even with complex and visually diverse specimen morphologies. Combined with automatic taxonomic classification, our approach is an accurate method for group-level dry mass estimation, with a median percentage error of 10-20% for individuals. We highlight the importance of choosing appropriate evaluation metrics, and encourage using both percentage errors and absolute errors as metrics, because they measure different properties. We also explore different optimization losses, data augmentation methods, and model architectures for training deep-learning models.
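The first approach above is a linear model on two automatically computed predictors (area and sinking speed), evaluated with percentage errors. Below is a stdlib-only sketch of such a two-predictor least-squares fit via the normal equations, plus the median-percentage-error metric; the coefficients and data are synthetic, and the BIODISCOVER outputs are only assumed, not reproduced.

```python
import statistics

def fit_two_predictor_ols(rows):
    """Least-squares fit of dry_mass ~ b0 + b1*area + b2*sink_speed using
    the normal equations (3x3 Gauss-Jordan elimination)."""
    X = [[1.0, a, s] for a, s, _ in rows]
    y = [m for _, _, m in rows]
    # Normal equations: (X^T X) beta = X^T y
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(3)]
         for i in range(3)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(3)]
    for i in range(3):                       # eliminate column i everywhere
        p = A[i][i]
        A[i] = [v / p for v in A[i]]
        b[i] /= p
        for k in range(3):
            if k != i:
                f = A[k][i]
                A[k] = [vk - f * vi for vk, vi in zip(A[k], A[i])]
                b[k] -= f * b[i]
    return b

def median_percentage_error(true, pred):
    return statistics.median(abs(p - t) / t * 100 for t, p in zip(true, pred))

# Synthetic (area, sinking_speed, dry_mass) triples from a known linear rule.
data = [(a, s, 0.5 + 2.0 * a + 3.0 * s) for a in (1, 2, 3) for s in (0.1, 0.4)]
beta = fit_two_predictor_ols(data)
pred = [beta[0] + beta[1] * a + beta[2] * s for a, s, _ in data]
print([round(v, 3) for v in beta])   # recovers [0.5, 2.0, 3.0]
```

The metric choice matters, as the abstract stresses: the median percentage error reported here is insensitive to a few large absolute errors on heavy specimens.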
[CV-36] LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents
【Quick Read】: This paper addresses the difficulty existing generative 3D mesh models have in preserving topological consistency for complex geometry while remaining computationally efficient: isosurface- or triangle-based diffusion models and autoregressive generators often depend on heuristic meshing or extra post-processing to recover topology, producing inconsistent or inefficient results. The key is LATO, a topology-preserving latent representation that models a mesh as a surface-anchored Vertex Displacement Field (VDF) and compresses it into a structured, topology-aware voxel latent with a sparse voxel variational autoencoder (VAE); the decoder progressively subdivides and prunes latent voxels to localize vertices, and a dedicated connection head directly predicts edge connectivity between vertex pairs, so precise topology is recovered without isosurface extraction or heuristic meshing. Generation follows a two-stage flow matching process that first synthesizes structure voxels and then refines voxel-wise topology features, uniting high fidelity with efficient inference.
Link: https://arxiv.org/abs/2603.06357
Authors: Tianhao Zhao, Youjia Zhang, Hang Long, Jinshen Zhang, Wenbing Li, Yang Yang, Gongbo Zhang, Jozef Hladký, Matthias Nießner, Wei Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this paper, we introduce LATO, a novel topology-preserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on surface, incorporating a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent. To decapsulate the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. In the end, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing. For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and subsequently refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry, well-formed topology while being highly efficient in inference.
[CV-37] Dynamic Chunking Diffusion Transformer
【Quick Read】: This paper addresses the uneven allocation of compute in diffusion models: conventional Diffusion Transformers (DiT) operate on fixed-length token sequences, spending the same compute on low-information backgrounds as on detail-rich regions, and never adapting compute across denoising stages. The key of the proposed Dynamic Chunking Diffusion Transformer (DC-DiT) is an end-to-end learned encoder-router-decoder framework that adaptively compresses the input into a shorter token sequence through a data-dependent chunking mechanism: spatially, uniform background regions are compressed into fewer tokens while detail-rich regions keep denser representations; temporally, token counts grow as the diffusion process moves from coarse structure to fine detail. Meaningful visual segmentations emerge without explicit supervision, generation quality (FID and Inception Score) improves consistently, and the method is practical: it can be upcycled from pretrained DiT checkpoints with up to 8x fewer training steps and composes with other dynamic computation methods to further reduce generation FLOPs.
Link: https://arxiv.org/abs/2603.06351
Authors: Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Diffusion Transformers process images as fixed-length sequences of tokens produced by a static \textitpatchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. Furthermore, it also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256\times256 , DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across 4\times and 16\times compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8\times fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.
[CV-38] K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging
【Quick Read】: This paper addresses the performance collapse of large biomedical vision-language models (VLMs) when transferring from high-end imaging modalities (e.g., CT) to low-end ones (e.g., radiography), caused by reliance on modality-specific shortcuts. The key of the proposed K-MaT (Knowledge-Anchored Manifold Transport) framework is to factorize prompts and anchor them to clinical text descriptions, then align the low-end prompt manifold with the visually grounded high-end space via Fused Gromov-Wasserstein optimal transport, enabling zero-shot cross-modal transfer without any low-end training images and substantially improving generalization and stability across multiple medical imaging tasks.
Link: https://arxiv.org/abs/2603.06340
Authors: Jiajun Zeng, Shadi Albarqouni
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp’s 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
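The alignment step uses Fused Gromov-Wasserstein optimal transport. As a much simpler stand-in, the sketch below shows plain entropic optimal transport via Sinkhorn iterations, i.e., how a coupling between two prompt sets is computed from a cost matrix; the fused GW variant in the paper additionally matches intra-space structure, which this illustration omits.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic optimal transport: alternately rescale rows and columns of
    the Gibbs kernel K = exp(-cost/eps) until the coupling's marginals
    match the source weights a and target weights b."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * len(a)
    v = [1.0] * len(b)
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b)))
             for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a)))
             for j in range(len(b))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))]
            for i in range(len(a))]

# Two low-end prompts vs. two high-end anchors; matching pairs are cheap.
T = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
print(T[0][0] > T[0][1])  # True: mass concentrates on the aligned pairs
```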
[CV-39] WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching
【Quick Read】: This paper addresses the efficiency bottleneck of diffusion-based world models, whose iterative denoising is too costly for interactive use and long-horizon rollouts. The core challenge is that caching policies designed for single-modal diffusion transfer poorly to world models because of two model-specific obstacles: token heterogeneity from multi-modal coupling and spatial variation, and non-uniform temporal dynamics in which a small set of "chaotic" tokens drives error accumulation, making uniform skipping either unstable or overly conservative. The key of the proposed WorldCache framework is two techniques: Curvature-guided Heterogeneous Token Prediction, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor to chaotic tokens with abrupt direction changes; and Chaotic-prioritized Adaptive Skipping, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments show up to 3.7x end-to-end speedups while maintaining 98% rollout quality, clearly outperforming existing schemes.
Link: https://arxiv.org/abs/2603.06331
Authors: Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: \emphtoken heterogeneity from multi-modal coupling and spatial variation, and \emphnon-uniform temporal dynamics where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose \textbfWorldCache, a caching framework tailored to diffusion world models. We introduce \textitCurvature-guided Heterogeneous Token Prediction, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design \textitChaotic-prioritized Adaptive Skipping, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to \textbf3.7 \times end-to-end speedups while maintaining \textbf98% rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource-constrained scenarios. Our code is released in this https URL.
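A curvature score over a token's feature trajectory and a drift-accumulating skip rule can be illustrated with discrete differences across denoising steps. This stdlib sketch is a loose, assumed analogue of the mechanism: the exact normalization and thresholds used in the paper are not given in the abstract, so both are our inventions.

```python
import math

def curvature(f_prev2, f_prev, f_curr):
    """Discrete curvature of a token's feature trajectory: norm of the second
    difference scaled by the last step length, so straight (predictable)
    trajectories score ~0 and abrupt turns score high."""
    second = [a - 2 * b + c for a, b, c in zip(f_curr, f_prev, f_prev2)]
    first = [a - b for a, b in zip(f_curr, f_prev)]
    step = math.sqrt(sum(v * v for v in first)) or 1.0
    return math.sqrt(sum(v * v for v in second)) / step

def should_recompute(drift_acc, kappa, threshold=1.0):
    """Adaptive skipping: accumulate a dimensionless, curvature-weighted
    drift and trigger recomputation only once it crosses the threshold."""
    drift_acc += kappa
    return drift_acc, drift_acc >= threshold

# A token moving on a straight line is cheap to cache...
straight = curvature([0.0, 0.0], [1.0, 1.0], [2.0, 2.0])
# ...while an abrupt turn marks a 'chaotic' token needing recomputation.
turn = curvature([0.0, 0.0], [1.0, 0.0], [1.0, 1.0])

acc, recompute = 0.0, False
for _ in range(3):
    acc, recompute = should_recompute(acc, turn, threshold=3.0)
print(straight, recompute)  # the straight token scores 0.0; the turning one
                            # eventually forces a recompute
```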
[CV-40] P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning
【Quick Read】: This paper addresses the heavy reliance on manual annotation in point cloud semantic segmentation, in particular how to obtain high-quality segmentation from raw point clouds in a fully unsupervised setting. The key of the proposed P-SLCR strategy (prototype library-driven unsupervised point cloud semantic segmentation with Structure Learning and Consistent Reasoning) lies in two mechanisms: Consistent Structure Learning, which selects high-quality features and establishes structural feature learning between consistent points and a library of consistent prototypes; and Semantic Relation Consistent Reasoning, which builds a prototype inter-relation matrix between the consistent and ambiguous prototype libraries and uses it to constrain both, preserving semantic consistency. The method achieves the best performance among unsupervised approaches on S3DIS, SemanticKITTI, and ScanNet, reaching 47.1% mIoU on S3DIS Area-5 and surpassing the classical fully supervised PointNet by 2.5 points.
Link: https://arxiv.org/abs/2603.06321
Authors: Lixin Zhan, Jie Jiang, Tianjian Zhou, Yukun Du, Yan Zheng, Xuehu Duan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose a Consistent Structure Learning to establish structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose a Semantic Relation Consistent Reasoning that constructs a prototype inter-relation matrix between consistent and ambiguous prototype libraries separately. This process ensures the preservation of semantic consistency by imposing constraints on consistent and ambiguous prototype libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and Scannet datasets, achieving the best performance compared to unsupervised methods. Specifically, the mIoU of 47.1% is achieved for Area-5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.
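The prototype inter-relation matrix between the consistent and ambiguous libraries can be illustrated as a pairwise similarity matrix over the two prototype sets. The minimal sketch below assumes cosine similarity as the relation measure; the paper's exact construction and constraint formulation may differ.

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def inter_relation_matrix(consistent, ambiguous):
    """Prototype inter-relation matrix: entry (i, j) relates consistent
    prototype i to ambiguous prototype j via cosine similarity."""
    return [[cosine(c, a) for a in ambiguous] for c in consistent]

consistent = [[1.0, 0.0], [0.0, 1.0]]          # one prototype per class
ambiguous = [[0.9, 0.1], [0.2, 0.8]]           # noisy counterparts
M = inter_relation_matrix(consistent, ambiguous)
# Each consistent prototype relates most strongly to its own counterpart:
print([max(range(2), key=lambda j: M[i][j]) for i in range(2)])  # [0, 1]
```

Constraining training so this matrix stays close to such a dominant-diagonal structure is one way to keep the ambiguous library semantically aligned with the consistent one.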
[CV-41] WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection
【Quick Read】: This paper addresses two key shortcomings of current zero-shot anomaly detection (ZSAD) methods: reliance on fixed textual prompts that struggle to capture complex semantics, and an exclusive focus on spatial-domain features that limits detection of subtle anomalies. The key of the proposed wavelet-enhanced mixture-of-experts prompt learning method has three parts: a variational autoencoder (VAE) models global semantic representations and injects them into the prompts to improve adaptability to diverse anomaly patterns; wavelet decomposition extracts multi-frequency image features that dynamically refine the textual embeddings through cross-modal interaction; and a semantic-aware mixture-of-experts module aggregates contextual information, yielding more robust and accurate anomaly detection.
Link: https://arxiv.org/abs/2603.06313
Authors: Peng Chen, Chao Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
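The wavelet decomposition step supplies multi-frequency image features. A single-level 2D Haar transform is the simplest instance; the stdlib sketch below uses the averaging convention (rather than the orthonormal scaling used by most libraries) and is illustrative only, not the paper's pipeline.

```python
def haar2d(img):
    """One level of a 2D Haar wavelet transform on an even-sized image:
    splits it into a low-frequency sub-band (LL) and three high-frequency
    detail sub-bands (LH, HL, HH)."""
    h, w = len(img), len(img[0])
    LL = [[0.0] * (w // 2) for _ in range(h // 2)]
    LH, HL, HH = [[row[:] for row in LL] for _ in range(3)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) / 4.0   # average: coarse content
            LH[i // 2][j // 2] = (a - b + c - d) / 4.0   # horizontal detail
            HL[i // 2][j // 2] = (a + b - c - d) / 4.0   # vertical detail
            HH[i // 2][j // 2] = (a - b - c + d) / 4.0   # diagonal detail
    return LL, LH, HL, HH

flat = [[5.0] * 4 for _ in range(4)]       # a constant patch has no detail
LL, LH, HL, HH = haar2d(flat)
print(LL[0][0], LH[0][0], HL[0][0], HH[0][0])  # 5.0 0.0 0.0 0.0
```

Anomalies that barely perturb the spatial average still show up in the detail sub-bands, which is the motivation for feeding such multi-frequency features into the prompt refinement.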
[CV-42] Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces
【Quick Read】: This paper addresses the brittleness and poor cross-model transferability of pixel-space adversarial attacks under common image preprocessing (e.g., resizing and cropping): optimizing perturbations in pixel space yields high-frequency, texture-like noise that standard input pipelines easily destroy and that transfers poorly. The key of the solution is to optimize adversarial perturbations in a latent space instead: LTA (Latent Transfer Attack) encodes the input with a pretrained Stable Diffusion VAE, maximizes a surrogate classifier loss in latent space, and softly enforces a pixel-space l-infinity budget after decoding; Expectation Over Transformations (EOT) handles resolution mismatch and standard input pipelines, and periodic latent-space Gaussian smoothing suppresses emerging artifacts and stabilizes optimization, producing stronger transfer attacks with spatially coherent, predominantly low-frequency perturbations.
Link: https://arxiv.org/abs/2603.06311
Authors: Eitan Shaar, Ariel Shaulov, Yalcin Tur, Gal Chechik, Ravid Shwartz-Ziv
Affiliations: Tel-Aviv University; Stanford University; Bar Ilan University; NVIDIA Research; New York University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Adversarial attacks are a central tool for probing the robustness of modern vision models, yet most methods optimize perturbations directly in pixel space under \ell_\infty or \ell_2 constraints. While effective in white-box settings, pixel-space optimization often produces high-frequency, texture-like noise that is brittle to common preprocessing (e.g., resizing and cropping) and transfers poorly across architectures. We propose \textbfLTA ( \textbfL atent \textbfT ransfer \textbfA ttack), a transfer-based attack that instead optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE. Given a clean image, we encode it into a latent code and optimize the latent representation to maximize a surrogate classifier loss, while softly enforcing a pixel-space \ell_\infty budget after decoding. To improve robustness to resolution mismatch and standard input pipelines, we incorporate Expectation Over Transformations (EOT) via randomized resizing, interpolation, and cropping, and apply periodic latent Gaussian smoothing to suppress emerging artifacts and stabilize optimization. Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off. Our results highlight pretrained generative latent spaces as an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors.
[CV-43] DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
Quick Read: This paper addresses the limitations of current explainability methods when applied to generative vision-language models (VLMs), particularly the challenges posed by the token-by-token generation of autoregressive architectures and the intricate interactions between visual and textual modalities. The key to the solution is DEX-AR (Dynamic Explainability for AutoRegressive models), which computes layer-wise gradients with respect to attention maps during generation to produce per-token and sequence-level 2D heatmaps localizing the image regions that drive the textual output. Two core innovations are introduced: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering strategy that aggregates per-token explanations while distinguishing visually grounded tokens from purely linguistic ones, enabling fine-grained, multi-level explanation of autoregressive VLM decisions.
Link: https://arxiv.org/abs/2603.06302
Authors: Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne
Affiliations: Tuebingen AI Center; University of Tuebingen; MIT CSAIL; MIT-IBM Watson AI Lab; IBM Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, and segmentation-based metrics.
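A minimal sketch of gradient-weighted attention relevance with head filtering by visual-attention mass. The function name, tensor shapes, and the relu(grad × attn) aggregation are assumptions for illustration, not DEX-AR's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def relevance_map(attn, grad, vis_slice, top_k=2):
    """Toy per-token relevance: keep only the heads whose attention mass
    on visual tokens is highest (head filtering), then average the
    positive part of gradient-weighted attention."""
    # attn, grad: (heads, query_tokens, key_tokens)
    vis_mass = attn[:, :, vis_slice].sum(axis=(1, 2))  # per-head visual focus
    keep = np.argsort(vis_mass)[-top_k:]               # dynamic head filtering
    rel = np.maximum(grad[keep] * attn[keep], 0.0)     # gradient-weighted attention
    return rel.mean(axis=0)                            # (query, key) relevance

H, Q, K = 4, 3, 8
attn = rng.random((H, Q, K))
attn /= attn.sum(-1, keepdims=True)   # rows are attention distributions
grad = rng.normal(size=(H, Q, K))     # stand-in for backpropagated gradients
rel = relevance_map(attn, grad, vis_slice=slice(0, 6))
```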
[CV-44] 3D CBCT Artefact Removal Using Perpendicular Score-Based Diffusion Models MICCAI2025
Quick Read: This paper tackles artefacts in cone-beam computed tomography (CBCT) images caused by high-density objects such as dental implants, which degrade image quality and diagnostic accuracy. Existing diffusion-based implant inpainting methods operate on independent 2D projections, ignoring the correlations among projections and producing inconsistent reconstructions. The key to the solution is a 3D implant inpainting approach based on perpendicular score-based diffusion models: two 2D score-based diffusion models are trained in two different planes and combined in the sampling scheme to model the 3D distribution of the projection series, yielding more consistent, high-quality 3D CBCT reconstructions.
Link: https://arxiv.org/abs/2603.06300
Authors: Susanne Schaub, Florentin Bieder, Matheus L. Oliveira, Yulan Wang, Dorothea Dagassan-Berndt, Michael M. Bornstein, Philippe C. Cattin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at DGM4MICCAI 2025
Abstract:Cone-beam computed tomography (CBCT) is a widely used 3D imaging technique in dentistry, offering high-resolution images while minimising radiation exposure for patients. However, CBCT is highly susceptible to artefacts arising from high-density objects such as dental implants, which can compromise image quality and diagnostic accuracy. To reduce artefacts, implant inpainting in the sequence of projections plays a crucial role in many artefact reduction approaches. Recently, diffusion models have achieved state-of-the-art results in image generation and have widely been applied to image inpainting tasks. However, to our knowledge, existing diffusion-based methods for implant inpainting operate on independent 2D projections. This approach neglects the correlations among individual projections, resulting in inconsistencies in the reconstructed images. To address this, we propose a 3D dental implant inpainting approach based on perpendicular score-based diffusion models, each trained in two different planes and operating in the projection domain. The 3D distribution of the projection series is modelled by combining the two 2D score-based diffusion models in the sampling scheme. Our results demonstrate the method’s effectiveness in producing high-quality, artefact-reduced 3D CBCT images, making it a promising solution for improving clinical imaging.
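How two plane-wise score models might be combined inside one sampling loop can be illustrated with a toy Langevin sampler on a 3D volume. In this sketch both "models" are the exact score of an isotropic Gaussian, so only the score-averaging step reflects the paper's idea of mixing perpendicular models:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.full((4, 4, 4), 2.0)   # toy target "projection volume"

def score_plane_a(x):
    # score model seen as operating on one set of planes
    # (toy: exact score of N(mu, I), identical in both branches)
    return mu - x

def score_plane_b(x):
    # the perpendicular-plane model; in this sketch it is the same map
    return mu - x

x = rng.normal(size=(4, 4, 4))  # start from pure noise
step = 0.1
for _ in range(200):
    s = 0.5 * (score_plane_a(x) + score_plane_b(x))  # combine the two scores
    x = x + step * s + np.sqrt(2 * step) * 0.01 * rng.normal(size=x.shape)
```

In the real method, each model would only have been trained on slices of one orientation, and averaging their scores is what couples the two views into a consistent 3D sample.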
[CV-45] FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
Quick Read: This paper addresses the heavy computational overhead and limited flexibility of existing training-free video motion transfer methods, which build motion guidance from the intermediate outputs of pretrained text-to-video (T2V) models. The key to the solution is FlowMotion, a framework whose core innovation is to directly exploit the early latent predictions of flow-based T2V models: motion representations are extracted from these predictions to align motion patterns between the source and generated videos, while a velocity regularization strategy stabilizes optimization and ensures smooth motion evolution. Operating purely on model predictions, the method achieves superior time and resource efficiency with performance that matches or exceeds the state of the art.
Link: https://arxiv.org/abs/2603.06289
Authors: Zhen Wang, Youcan Xu, Jun Xiao, Long Chen
Affiliations: The Hong Kong University of Science and Technology; State key lab of CADCG, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.
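Under the common rectified-flow convention x_t = (1-t)·x0 + t·noise with velocity v = noise - x0 (an assumption here; flow conventions vary across models), the clean latent is recoverable in one step from a velocity prediction, and a crude motion term can then be compared between two latent videos:

```python
import numpy as np

rng = np.random.default_rng(3)

def predict_clean_latent(x_t, v, t):
    """One-step recovery of the clean latent from a velocity prediction,
    assuming x_t = (1 - t) * x0 + t * noise and v = noise - x0."""
    return x_t - t * v

def motion_guidance_loss(x0_src, x0_gen):
    """Toy 'flow guidance' objective: align frame-to-frame differences
    (a crude motion representation) between two latent videos."""
    return float(((np.diff(x0_src, axis=0) - np.diff(x0_gen, axis=0)) ** 2).mean())

x0 = rng.normal(size=(5, 8))     # (frames, latent_dim)
noise = rng.normal(size=(5, 8))
t = 0.7
x_t = (1 - t) * x0 + t * noise   # noisy latent on the flow path
v = noise - x0                   # the velocity a flow model would predict
```

Because frame differences are invariant to constant appearance offsets, such a term penalizes motion mismatch without forcing the generated scene to copy the source appearance.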
[CV-46] Attribute Distribution Modeling and Semantic-Visual Alignment for Generative Zero-shot Learning
Quick Read: This paper addresses two intrinsic challenges in generative zero-shot learning (ZSL): (1) class-level attributes cannot capture intra-class visual variation across instances, causing a class-instance gap; and (2) a substantial mismatch between the distributions of semantic and visual features, manifested in inter-class correlations, creates a semantic-visual domain gap. The key to the solution is the Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, built from two cooperating modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that explicitly refines semantic representations to better reflect visual structures, narrowing the distribution gap between semantic and visual features.
Link: https://arxiv.org/abs/2603.06281
Authors: Haojie Pu, Zhuoming Li, Yongbiao Gao, Yuheng Jia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 13 figures
Abstract:Generative zero-shot learning (ZSL) synthesizes features for unseen classes, leveraging semantic conditions to transfer knowledge from seen classes. However, it also introduces two intrinsic challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to substantial intra-class variability, thus causing the class-instance gap; (2) the substantial mismatch between semantic and visual feature distributions, manifested in inter-class correlations, gives rise to the semantic-visual domain gap. To address these challenges, we propose an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment. Specifically, our ADiVA consists of two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structures. Experiments on three widely used benchmark datasets demonstrate that ADiVA significantly outperforms state-of-the-art methods (e.g., achieving gains of 4.7% and 6.1% on AWA2 and SUN, respectively). Moreover, our approach can serve as a plugin to enhance existing generative ZSL methods.
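The ADM idea of sampling instance-level attributes from a class-level distribution can be sketched by treating each class's attributes as a Gaussian; the attribute names and per-attribute spread values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_instance_attributes(class_mu, class_sigma, n):
    """Draw instance-level attribute vectors from a per-class Gaussian,
    so synthesized features can vary within a class."""
    return class_mu + class_sigma * rng.normal(size=(n, class_mu.shape[0]))

mu = np.array([0.9, 0.1, 0.5])       # class-level attribute vector
sigma = np.array([0.05, 0.02, 0.1])  # per-attribute spread (invented values)
samples = sample_instance_attributes(mu, sigma, 1000)
```

Feeding such varied attribute vectors (instead of one fixed class vector) to the feature generator is what closes the class-instance gap described above.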
[CV-47] Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise
Quick Read: This paper investigates label noise in 3D semantic occupancy prediction, where real-world voxel annotations are corrupted by structural artifacts and dynamic trailing effects, asking whether autonomous systems can safely rely on such unreliable supervision. A core finding is that existing 2D label-noise learning methods collapse in sparse 3D voxel spaces, exposing a critical vulnerability of current paradigms. The key to the solution is DPR-Occ, a framework that constructs reliable supervision through dual-source partial-label reasoning: by synergizing temporal model memory with representation-level structural affinity, it dynamically expands and prunes candidate label sets, preserving true semantics while suppressing noise propagation. This prevents geometric and semantic collapse even under extreme label noise (up to 90%), with gains of up to 2.57% mIoU and 13.91% IoU over existing baselines.
Link: https://arxiv.org/abs/2603.06279
Authors: Wenxin Li, Kunyu Peng, Di Wen, Junwei Zheng, Jiale Wei, Mengfei Duan, Yuheng Zhang, Rui Fan, Kailun Yang
Affiliations: Hunan University; Karlsruhe Institute of Technology; Sofia University “St. Kliment Ohridski”; Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The benchmark and source code will be made publicly available at this https URL
Abstract:3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. The benchmark and source code will be made publicly available at this https URL.
[CV-48] Spectral and Trajectory Regularization for Diffusion Transformer Super-Resolution
Quick Read: This paper addresses the deployment bottleneck of diffusion transformer (DiT) architectures in real-world image super-resolution (Real-ISR) caused by the high computational cost of iterative sampling, as well as the severe grid-like periodic artifacts that existing one-step distillation methods produce due to trajectory mismatch. The key to the solution is StrSR, a novel one-step adversarial distillation framework with two core innovations: an asymmetric discriminative distillation architecture that bridges the trajectory gap between the DiT and the student model, and a frequency distribution matching strategy that suppresses high-frequency spectral leakage to effectively eliminate DiT-specific periodic artifacts, achieving state-of-the-art Real-ISR performance in both quantitative metrics and visual perception.
Link: https://arxiv.org/abs/2603.06275
Authors: Jingkai Wang, Yixin Tang, Jue Gong, Jiatong Li, Shu Li, Libo Liu, Jianliang Lan, Yutong Liu, Yulun Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages
Abstract:Diffusion transformer (DiT) architectures show great potential for real-world image super-resolution (Real-ISR). However, their computationally expensive iterative sampling necessitates one-step distillation. Existing one-step distillation methods struggle with Real-ISR on DiT. They suffer from fundamental trajectory mismatch and generate severe grid-like periodic artifacts. To tackle these challenges, we propose StrSR, a novel one-step adversarial distillation framework featuring spectral and trajectory regularization. Specifically, we propose an asymmetric discriminative distillation architecture to bridge the trajectory gap. Additionally, we design a frequency distribution matching strategy to effectively suppress DiT-specific periodic artifacts caused by high-frequency spectral leakage. Extensive experiments demonstrate that StrSR achieves state-of-the-art performance in Real-ISR, across both quantitative metrics and visual perception. The code and models will be released at this https URL .
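Frequency distribution matching can be sketched by comparing radially averaged FFT magnitude spectra, where grid-like periodic artifacts show up as high-frequency spikes. This is a generic spectral loss for illustration, not necessarily StrSR's exact strategy:

```python
import numpy as np

rng = np.random.default_rng(5)

def radial_spectrum(img):
    """Average FFT magnitude within integer radial frequency bins."""
    f = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    return np.bincount(r.ravel(), weights=f.ravel()) / np.bincount(r.ravel())

def spectral_matching_loss(pred, ref):
    """Penalize the gap between radially averaged spectra."""
    return float(np.abs(radial_spectrum(pred) - radial_spectrum(ref)).mean())

ref = rng.normal(size=(32, 32))
grid = ref + 0.5 * np.cos(np.arange(32) * np.pi)[None, :]  # periodic artifact
```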
[CV-49] HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models
Quick Read: This paper addresses the degradation of task performance and visual grounding caused by structured pruning of vision-language models (VLMs), in particular the amplification of object hallucinations at a given sparsity level. The key to the solution is HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as resource allocation under multiple objectives: a single policy invocation outputs a global pruning blueprint consisting of an overall sparsity budget and a layer-wise allocation. A visual sensitivity signal derived from attention flow identifies vision-critical fusion layers and discourages over-pruning them, and plan-level Group Relative Policy Optimization (GRPO) searches efficiently under a multi-objective return combining task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy, enabling controllable robustness-utility trade-offs under matched sparsity budgets.
Link: https://arxiv.org/abs/2603.06270
Authors: Lincen Bai, Hedi Tabia, Raul Santos-Rodriguez
Affiliations: Université Paris-Saclay; University of Bristol
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness–utility trade-offs under matched sparsity budgets.
[CV-50] ODD-SEC: Onboard Drone Detection with a Spinning Event Camera
Quick Read: This paper addresses the problem of accurate, real-time drone detection from moving carriers (such as quadrupedal robots or unmanned ground vehicles), where conventional frame cameras struggle with fast-moving targets and adverse conditions. The key to the solution is a novel image-like event representation that requires no motion compensation, combined with a lightweight neural network architecture for efficient spatiotemporal learning; a spinning event camera provides a 360° horizontal field of view and enables bearing estimation of detected drones. Deployed on a Jetson Orin NX, the system runs in real time and achieves a mean angular error below 2° in challenging outdoor environments, validating its feasibility and advantages for real-world surveillance.
Link: https://arxiv.org/abs/2603.06265
Authors: Kuan Dai, Hongxin Zhang, Sheng Zhong, Yi Zhou
Affiliations: Hunan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid proliferation of drones requires balancing innovation with regulation. To address security and privacy concerns, techniques for drone detection have attracted significant attention. Existing solutions, such as frame camera-based systems, offer versatility and energy efficiency under typical conditions but are fundamentally constrained by their operational principles in scenarios involving fast-moving targets or adverse conditions. Inspired by biological vision, event cameras asynchronously detect per-pixel brightness changes, offering high dynamic range and microsecond-level responsiveness that make them uniquely suited for drone detection in conditions beyond the reach of conventional frame-based cameras. However, the design of most existing event-based solutions assumes a static camera, greatly limiting their applicability to moving carriers, such as quadrupedal robots or unmanned ground vehicles, during field deployment. In this paper, we introduce a real-time drone detection system designed for deployment on moving carriers. The system utilizes a spinning event-based camera, providing a 360° horizontal field of view and enabling bearing estimation of detected drones. A key contribution is a novel image-like event representation that operates without motion compensation, coupled with a lightweight neural network architecture for efficient spatiotemporal learning. Implemented on an onboard Jetson Orin NX, the system can operate in real time. Outdoor experimental results validate reliable detection with a mean angular error below 2° under challenging conditions, underscoring its suitability for real-world surveillance applications. We will open-source our complete pipeline to support future research.
[CV-51] GazeMoE: Perception of Gaze Target with Mixture-of-Experts ICRA2026
Quick Read: This paper addresses the problem of accurately estimating human gaze targets from visible images, a key task for robots to understand human attention. The challenge lies in building generalizable neural architectures and training paradigms, especially adaptive and efficient decoding mechanisms for integrating multi-modal cues (eyes, head pose, gestures, and contextual features). The key to the solution is GazeMoE, a framework based on Mixture-of-Experts (MoE) that selectively extracts gaze-target-related features from a frozen vision foundation model through learnable expert modules, enabling dynamic adaptation to multi-modal cues. A class-balancing auxiliary loss and data augmentations such as region-specific cropping and photometric transformations mitigate class imbalance in gaze target classification and improve robustness.
Link: https://arxiv.org/abs/2603.06256
Authors: Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, ICRA 2026
Abstract:Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues – including eyes, head poses, gestures, and contextual features – demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at this https URL
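The basic MoE routing that GazeMoE builds on, a softmax gate choosing top-k experts per token and mixing their outputs, can be sketched as follows (the expert shapes, top-k choice, and linear experts are illustrative, not the paper's exact modules):

```python
import numpy as np

rng = np.random.default_rng(6)

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Toy Mixture-of-Experts: a gate scores every expert per token,
    only the top-k experts run, and their outputs are combined with
    renormalized softmax weights."""
    logits = x @ gate_w                # (tokens, n_experts)
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        top = np.argsort(logits[i])[-top_k:]           # pick the best experts
        w = np.exp(logits[i][top] - logits[i][top].max())
        w /= w.sum()                                   # renormalized gate weights
        for weight, e in zip(w, top):
            out[i] += weight * (tok @ expert_ws[e])    # expert = linear map (toy)
    return out

d, n_exp, n_tok = 8, 4, 5
x = rng.normal(size=(n_tok, d))
gate_w = rng.normal(size=(d, n_exp))
expert_ws = rng.normal(size=(n_exp, d, d))
y = moe_layer(x, gate_w, expert_ws)
```

Sparse routing is what lets specialized experts (e.g., for eyes versus gestures) activate only when their cue is present, without paying for all experts on every token.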
[CV-52] NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving
Quick Read: This paper addresses the limited generalization of 3D multi-object tracking (3D MOT) to unknown categories in open-world perception, where existing methods are constrained by closed-set assumptions and "semantic-blind" matching heuristics. The key to the solution is Next-step Open-Vocabulary Autoregression (NOVA), which reformulates 3D trajectories as structured spatio-temporal semantic sequences and leverages the autoregressive capabilities of large language models (LLMs) to cast tracking as next-step sequence completion, explicitly exploiting the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency over long sequences through high-level commonsense reasoning.
Link: https://arxiv.org/abs/2603.06254
Authors: Kai Luo, Xu Wang, Rui Fan, Kailun Yang
Affiliations: Hunan University; Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: Code will be available at this https URL
Abstract:Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at this https URL.
[CV-53] Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation
Quick Read: This paper addresses the lack of rich visual semantics in generalized 3D referring expression segmentation (3D-GRES), where relying solely on sparse point clouds makes it difficult to localize targets described at fine granularity. The key to the solution is HCF-RES, a multi-modal framework with two core innovations: (1) Hierarchical Visual Semantic Decomposition, which uses Segment Anything Model (SAM) instance masks to guide CLIP encoding at dual granularities, pixel-level and instance-level, preserving object boundaries during 2D-to-3D projection; and (2) Progressive Multi-level Fusion, which integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement, yielding more robust multi-modal representations.
Link: https://arxiv.org/abs/2603.06250
Authors: Keshen Zhou, Runnan Chen, Mingming Gong, Tongliang Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generalised 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds, lacking rich visual semantics for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities – pixel-level and instance-level features – preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.
[CV-54] DC-Merge: Improving Model Merging with Directional Consistency CVPR2026
Quick Read: This paper addresses insufficient knowledge retention in multi-task model merging, identifying that the key lies in maintaining the directional consistency of singular spaces between the merged model and the individual task vectors. Two issues undermine this consistency: an imbalanced energy distribution within task vectors, where a few dominant singular values drown out semantically important but weaker components, and geometric inconsistency of task vectors in parameter space, where direct merging distorts their directional relationships. The key to the solution is DC-Merge, which balances the energy distribution of each task vector by smoothing its singular values and projects the energy-balanced vectors onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error, achieving directionally consistent merging that substantially outperforms existing methods on vision and vision-language benchmarks.
Link: https://arxiv.org/abs/2603.06242
Authors: Han-Chen Zhang, Zi-Hao Zhou, Mao-Lin Luo, Shimin Di, Min-Ling Zhang, Tong Wei
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 Main Track
Abstract:Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between merged multi-task vector and individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC-Merge, a method for directional-consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy-balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision-language benchmarks show that DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings. The implementation code is available at this https URL.
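The two steps described above, singular-value smoothing for energy balance and projection onto a shared subspace before aggregation, can be sketched with a toy power-law smoothing; the exact smoothing function and subspace construction in DC-Merge may differ:

```python
import numpy as np

rng = np.random.default_rng(7)

def smooth_singular_values(tv, gamma=0.5):
    """Toy energy balancing: flatten the singular-value spectrum of a
    task vector with a power law so weaker components are not drowned
    out, keeping the total nuclear norm fixed."""
    u, s, vt = np.linalg.svd(tv, full_matrices=False)
    s_smooth = s ** gamma
    s_smooth = s_smooth / s_smooth.sum() * s.sum()   # preserve total energy
    return u @ np.diag(s_smooth) @ vt

task_vecs = [rng.normal(size=(6, 6)) for _ in range(3)]
balanced = [smooth_singular_values(tv) for tv in task_vecs]

# spectrum imbalance before/after smoothing (larger ratio = more imbalanced)
s_before = np.linalg.svd(task_vecs[0], compute_uv=False)
s_after = np.linalg.svd(balanced[0], compute_uv=False)
ratio_before = s_before[0] / s_before[-1]
ratio_after = s_after[0] / s_after[-1]

# shared orthogonal subspace from the stacked balanced vectors, then
# project each vector onto it and aggregate
basis, _, _ = np.linalg.svd(np.hstack(balanced), full_matrices=False)
merged = np.mean([basis @ (basis.T @ b) for b in balanced], axis=0)
```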
[CV-55] TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving
Quick Read: This paper addresses the sharp performance degradation of trajectory prediction models in autonomous driving when the observation history varies in length (e.g., short inputs caused by occlusion or limited sensing range), since existing methods assume fixed-length histories. The key to the solution is TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework with two cooperative modules: an Observation-Adaptive Forecaster (OAF) built on progressive knowledge distillation (PKD), which transfers motion-pattern knowledge from long-horizon "teachers" to short-horizon "students" via hierarchical feature regression, enriching motion context under short observations; and a Temporal Backfilling Module (TBM), which explicitly reconstructs missing historical segments to produce context-rich trajectories that further strengthen PKD training. A cosine-annealed distillation weighting scheme balances forecasting supervision and feature alignment, improving optimization stability and cross-length consistency, and a decoupled pretrain-reconstruct-finetune protocol preserves real-motion priors, maintaining strong performance even under extremely short histories.
Link: https://arxiv.org/abs/2603.06231
Authors: Mingyu Fan, Yi Liu, Hao Zhou, Deheng Qian, Mohammad Haziq Khan, Matthias Raetsch
Affiliations: Donghua University; Great Bay Institute for Advanced Study/Great Bay University; Tsinghua Shenzhen International Graduate School; Chongqing Chang'an Technology Co., Ltd; ViSiR, Reutlingen University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon “teachers” to short-horizon “students” via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at this https URL.
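The cosine-annealed distillation weight mentioned in the abstract has a standard closed form; the endpoint values w_max and w_min below are assumptions, not the paper's hyperparameters:

```python
import math

def cosine_annealed_weight(step, total_steps, w_max=1.0, w_min=0.0):
    """Decay smoothly from w_max to w_min over training, shifting
    emphasis from feature alignment to the forecasting loss."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return w_min + (w_max - w_min) * cos

total = 1000
w0 = cosine_annealed_weight(0, total)          # full distillation weight
w_half = cosine_annealed_weight(total // 2, total)
w_end = cosine_annealed_weight(total, total)   # distillation fully annealed out
```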
[CV-56] Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention
Quick Read: This paper targets low-latency object detection with event cameras, where existing asynchronous event-based networks are hard to train efficiently on long sequences due to their recurrent structure, and accuracy gains typically come at the cost of higher per-event computation and latency. The core challenge is to exploit the spatial sparsity of events while retaining efficient parallel training and inference. The key to the solution is Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparse activation on top of globally parallel training, thereby reducing per-event computation while improving accuracy. The resulting end-to-end asynchronous model, SSLA-Det, achieves state-of-the-art results among asynchronous methods on Gen1 and N-Caltech101 (0.375 and 0.515 mAP, respectively) while requiring more than 20 times less per-event computation than the strongest asynchronous baseline.
Link: https://arxiv.org/abs/2603.06228
Authors: Haiqing Hao, Zhipeng Sui, Rong Zou, Zijia Dai, Nikola Zubić, Davide Scaramuzza, Wenhui Wang
Affiliations: Tsinghua University; Robot and Perception Group, University of Zurich; ShanghaiTech University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event-by-event, but still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off, which is problematic for object detection, where fine-grained representations and thus states are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline, demonstrating the potential of linear attention for low-latency event-based vision.
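The property that makes linear attention attractive here, namely that the parallel (training-time) form and the per-event recurrent form compute the same map, can be checked on a toy unnormalized variant. The elu-style feature map and the omission of SSLA's spatial sparsity are simplifications:

```python
import numpy as np

rng = np.random.default_rng(8)

def feat(x):
    """Positive feature map (elu(x) + 1), a common linear-attention choice."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(Q, K, V):
    """Causal (unnormalized) linear attention, trainable in parallel."""
    phi_q, phi_k = feat(Q), feat(K)
    S = np.cumsum(phi_k[:, :, None] * V[:, None, :], axis=0)  # running sum of k v^T
    return np.einsum("td,tdm->tm", phi_q, S)

def linear_attention_recurrent(Q, K, V):
    """The same map as an event-by-event state update S += phi(k) v^T,
    which is what gives linear attention its low per-event latency."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    outs = []
    for q, k, v in zip(Q, K, V):
        S = S + np.outer(feat(k), v)   # state update per incoming event
        outs.append(feat(q) @ S)
    return np.array(outs)

T, d, m = 16, 4, 3
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, m))
out_parallel = linear_attention_parallel(Q, K, V)
out_recurrent = linear_attention_recurrent(Q, K, V)
```

SSLA's contribution is then to make the state S sparse across spatial bins so that each event touches only a small slice of it.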
[CV-57] Word-Anchored Temporal Forgery Localization
Quick Read: This paper addresses the feature-granularity misalignment and high computational cost of current temporal forgery localization (TFL) methods, which rely on temporal boundary regression or continuous frame-level anomaly detection. The key to the solution is word-anchored temporal forgery localization (WAFL), a new paradigm that shifts TFL from continuous temporal regression to discrete word-level binary classification along the natural linguistic boundaries of speech. A forensic feature realignment (FFR) module maps representations of pre-trained foundation models from their semantic space to a discriminative forensic manifold, so that lightweight linear classifiers can perform localization efficiently; an artifact-centric asymmetric (ACA) loss mitigates the extreme class imbalance inherent to forgery detection by dynamically suppressing gradients from overwhelming authentic samples while asymmetrically prioritizing subtle forensic artifacts, yielding more accurate and more efficient localization.
Link: https://arxiv.org/abs/2603.06220
Authors: Tianyi Wang, Xi Shao, Harry Cheng, Yinglong Wang, Mohan Kankanhalli
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted for review
Abstract:Current temporal forgery localization (TFL) approaches typically rely on temporal boundary regression or continuous frame-level anomaly detection paradigms to derive candidate forgery proposals. However, they suffer not only from feature granularity misalignment but also from costly computation. To address these issues, we propose word-anchored temporal forgery localization (WAFL), a novel paradigm that shifts the TFL task from temporal regression and continuous localization to discrete word-level binary classification. Specifically, we first analyze the essence of temporal forgeries and identify the minimum meaningful forgery units, word tokens, and then align data preprocessing with the natural linguistic boundaries of speech. To adapt powerful pre-trained foundation backbones for feature extraction, we introduce the forensic feature realignment (FFR) module, mapping representations from the pre-trained semantic space to a discriminative forensic manifold. This allows subsequent lightweight linear classifiers to efficiently perform binary classification and accomplish the TFL task. Furthermore, to overcome the extreme class imbalance inherent to forgery detection, we design the artifact-centric asymmetric (ACA) loss, which breaks the standard precision-recall trade-off by dynamically suppressing overwhelming authentic gradients while asymmetrically prioritizing subtle forensic artifacts. Extensive experiments demonstrate that WAFL significantly outperforms state-of-the-art approaches in localization performance under both in- and cross-dataset settings, while requiring substantially fewer learnable parameters and operating at high computational efficiency.
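An asymmetric focal-style loss in the spirit of ACA, down-weighting easy authentic (negative) examples while keeping full gradient on forged (positive) ones, can be sketched as follows; the exact ACA formulation in the paper may differ:

```python
import numpy as np

def asymmetric_loss(p, y, gamma_neg=4.0, gamma_pos=0.0):
    """Toy asymmetric binary loss: authentic (y=0) examples get a strong
    focusing exponent so their contribution fades once they are easy,
    while forged (y=1) examples keep full weight."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pos = -y * (1 - p) ** gamma_pos * np.log(p)        # forged words
    neg = -(1 - y) * p ** gamma_neg * np.log(1 - p)    # authentic words
    return float((pos + neg).mean())

p = np.array([0.9, 0.1, 0.2])   # predicted forgery probabilities per word
y = np.array([1.0, 0.0, 0.0])   # 1 = forged word, 0 = authentic word
```

An easy authentic word (p = 0.1, y = 0) contributes almost nothing, whereas a missed forged word (p = 0.1, y = 1) keeps its full cross-entropy penalty, which is how the precision-recall trade-off is rebalanced toward rare artifacts.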
[CV-58] EntON: Eigenentropy-Optimized Neighborhood Densification in 3D Gaussian Splatting
【速读】:该论文旨在解决传统3D高斯泼溅(3D Gaussian Splatting, 3DGS)方法中高斯分布中心与物体表面几何对齐不佳,以及基于表面的重建方法常牺牲光度准确性的核心问题。其解决方案的关键在于提出一种基于特征值熵(Eigenentropy)优化的邻域稠密化策略——EntON,通过计算每个高斯中心在k近邻范围内协方差矩阵特征值的熵值来量化局部结构有序性,并将其融入交替优化框架:在标准梯度驱动稠密化的基础上,引入特征值熵感知的稠密化机制,优先对低熵(有序、平坦)区域进行分裂以精细捕捉表面细节,同时剪枝高熵(无序、球状)区域的高斯,从而在提升几何精度(最高达33%)和渲染质量(最高达7%)的同时,减少50%的高斯数量和23%的训练时间,实现几何准确性、渲染质量和效率的协同优化。
链接: https://arxiv.org/abs/2603.06216
作者: Miriam Jäger,Boris Jutzi
机构: Karlsruhe Institute of Technology (KIT)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ISPRS Journal of Photogrammetry and Remote Sensing on 20 February 2026
Abstract:We present a novel Eigenentropy-optimized neighborhood densification strategy EntON in 3D Gaussian Splatting (3DGS) for geometrically accurate and high-quality rendered 3D reconstruction. While standard 3DGS produces Gaussians whose centers and surfaces are poorly aligned with the underlying object geometry, surface-focused reconstruction methods frequently sacrifice photometric accuracy. In contrast to the conventional densification strategy, which relies on the magnitude of the view-space position gradient, our approach introduces a geometry-aware strategy to guide adaptive splitting and pruning. Specifically, we compute the 3D shape feature Eigenentropy from the eigenvalues of the covariance matrix in the k-nearest neighborhood of each Gaussian center, which quantifies the local structural order. These Eigenentropy values are integrated into an alternating optimization framework: During the optimization process, the algorithm alternates between (i) standard gradient-based densification, which refines regions via view-space gradients, and (ii) Eigenentropy-aware densification, which preferentially densifies Gaussians in low-Eigenentropy (ordered, flat) neighborhoods to better capture fine geometric details on the object surface, and prunes those in high-Eigenentropy (disordered, spherical) regions. We provide quantitative and qualitative evaluations on two benchmark datasets: the small-scale DTU dataset and the large-scale TUM2TWIN dataset, covering man-made objects and urban scenes. Experiments demonstrate that our Eigenentropy-aware alternating densification strategy improves geometric accuracy by up to 33% and rendering quality by up to 7%, while reducing the number of Gaussians by up to 50% and training time by up to 23%. Overall, EntON achieves a favorable balance between geometric accuracy, rendering quality and efficiency by avoiding unnecessary scene expansion.
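EntON 所用的几何量——k 近邻协方差特征值的熵——在点云特征文献中有标准定义:对归一化特征值 p_i 计算 -Σ p_i ln p_i。下面是一个自包含的最小草图(暴力 kNN 仅为演示,k 取值与归一化细节为假设,并非论文官方实现):平坦邻域的熵低(不超过 ln 2),各向同性邻域的熵高(接近 ln 3)。

```python
import numpy as np

def eigenentropy(points, k=12):
    """Per-point eigenentropy from the covariance of each point's k-nearest
    neighbourhood (a standard 3D shape feature; the exact normalisation used
    by EntON is an assumption here). Low values indicate ordered/flat
    neighbourhoods, high values indicate isotropic ones."""
    # brute-force kNN for clarity (O(n^2)); fine for a small sketch
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]        # neighbourhood incl. the point itself
    ent = np.empty(len(points))
    for i in range(len(points)):
        nb = points[knn[i]]                    # (k, 3) local neighbourhood
        ev = np.clip(np.linalg.eigvalsh(np.cov(nb.T)), 1e-12, None)
        p = ev / ev.sum()                      # normalised eigenvalues
        ent[i] = -(p * np.log(p)).sum()        # eigenentropy in [0, ln 3]
    return ent
```

论文正是利用这一量区分"有序平坦"(低熵,优先分裂)与"无序球状"(高熵,优先剪枝)的高斯邻域。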
[CV-59] Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events CVPR2026
【速读】:该论文旨在解决多模态摘要(Multimodal Summarization, MMS)中存在的三大挑战:依赖领域特定监督、跨模态对齐弱且融合隐式,以及缺乏事件过渡的扁平时间建模。解决方案的关键在于提出一种无需训练的框架 CoE,其核心是通过一个由分层事件图(Hierarchical Event Graph, HEG)引导的事件链(Chain-of-Events, CoE)实现结构化推理。HEG 显式编码文本语义为事件层次结构,从而强化跨模态对齐与时间推理能力;在此结构指导下,CoE 能精准定位关键视觉线索、建模事件演化与因果转换,并通过轻量级风格适配实现域对齐,显著提升摘要质量与泛化性能。
链接: https://arxiv.org/abs/2603.06213
作者: Xiaoxing You,Qiang Huang,Lingyu Li,Xiaojun Chang,Jun Yu
机构: Hangzhou Dianzi University (杭州电子科技大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at this https URL.
[CV-60] VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction
【速读】:该论文旨在解决纯视觉范式下3D语义占据预测(3D semantic occupancy prediction)中几何信息不足的问题,尤其是在依赖高质量3D高斯分布(3D Gaussian splatting)时,由于缺乏精确的几何线索导致性能受限。解决方案的关键在于引入视觉基础模型(Vision Foundation Models, VFMs)所具备的强大几何定位能力,并提出一种名为Visual Geometry Grounded Gaussian Splatting (VG3S) 的新框架,其核心创新是设计了一个即插即用的分层几何特征适配器(hierarchical geometric feature adapter),通过特征聚合、任务特定对齐和多尺度重构,有效将冻结的VFM中的通用token转化为适用于占据预测的3D几何先验,从而显著提升预测精度与泛化能力。
链接: https://arxiv.org/abs/2603.06210
作者: Xiaoyang Yan,Muleilan Pei,Shaojie Shen
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.
[CV-61] Point-Supervised Skeleton-Based Human Action Segmentation
【速读】:该论文旨在解决骨架动作分割(Skeleton-based Action Segmentation)任务中对昂贵帧级标注的依赖以及模糊动作边界带来的敏感性问题。其关键解决方案在于提出了一种基于点监督(Point-Supervised)的框架,仅需每段动作的一个标注帧即可训练模型;同时利用多模态骨架数据(关节、骨骼和运动信息)通过预训练统一模型提取丰富特征表示,并结合原型相似性方法与现有伪标签生成策略(能量函数与约束K-Medoids聚类),进一步通过多模态伪标签融合提升伪标签可靠性并指导模型训练,从而在显著降低标注成本的同时实现媲美甚至超越全监督方法的性能。
链接: https://arxiv.org/abs/2603.06201
作者: Hongsong Wang,Yiqin Shen,Pengbo Yan,Jie Gui
机构: Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
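论文的"原型相似性"伪标签思路可以用几行代码示意:用每段动作唯一标注帧的特征构造类别原型,再按余弦相似度为所有帧赋伪标签。以下仅为示意草图,函数名与数据形状均为假设;论文还会进一步与能量函数、约束 K-Medoids 的结果做多模态伪标签融合,此处不涉及。

```python
import numpy as np

def prototype_pseudo_labels(features, point_labels):
    """Hedged sketch of prototype-similarity pseudo-labelling for
    point-supervised action segmentation: the single labelled frame(s) of
    each class define a prototype; every frame then receives the label of
    its most similar prototype under cosine similarity.
    features: (T, D) per-frame features; point_labels: {frame_idx: class_id}."""
    classes = sorted(set(point_labels.values()))
    # prototype = mean feature of the annotated frames of each class
    protos = np.stack([
        features[[t for t, c in point_labels.items() if c == cls]].mean(0)
        for cls in classes
    ])
    # cosine similarity between every frame and every prototype
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sim = f @ p.T                              # (T, num_classes)
    return np.array(classes)[sim.argmax(1)]    # pseudo-label per frame
```

这种"每段一帧"的监督正是点监督范式显著降低标注成本的来源。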
[CV-62] Adaptive Language-Aware Image Reflection Removal Network IJCAI2025
【速读】:该论文旨在解决现有图像去反射方法在处理复杂反射时性能受限的问题,尤其针对由模糊和失真干扰导致机器生成的语言描述不准确,进而影响语言引导型去反射效果的挑战。解决方案的关键在于提出自适应语言感知网络(Adaptive Language-Aware Network, ALANet),其核心创新包括:1)融合滤波与优化策略,其中滤波策略降低错误语言输入的负面影响并保留其有益信息,优化策略增强语言与视觉特征之间的对齐;2)利用语言提示从特征图中解耦特定层级内容,从而提升模型对复杂反射的建模能力。
链接: https://arxiv.org/abs/2603.06200
作者: Siyan Fang,Yuntao Wang,Jinpu Zhang,Ziwen Li,Yuehuan Wang
机构: Huazhong University of Science and Technology (华中科技大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IJCAI 2025
Abstract:Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model’s performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at this https URL.
[CV-63] SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection AAAI-2026
【速读】:该论文旨在解决癌症组织区域(Cancer Tissue Region, CTR)检测中因组织学图像形态相似性导致假阳性率高,且现有方法难以有效整合多源空间转录组(Spatial Transcriptomics, ST)数据的问题,尤其在跨样本、跨平台/批次场景下表现不佳。解决方案的关键在于提出一种基于迁移学习的深度融合方法 SpaCRD,其核心是一个类别正则化的变分重建引导双向交叉注意力融合网络,能够自适应地捕捉组织学特征与基因表达之间的潜在共表达模式,从而实现跨不同样本、平台和批次的可靠CTR检测。
链接: https://arxiv.org/abs/2603.06186
作者: Shuailin Xue,Jun Wan,Lihua Zhang,Wenwen Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI-2026-Oral
Abstract:Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data, especially in cross-sample and cross-platform/batch settings for CTR detection. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.
[CV-64] Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
【速读】:该论文旨在解决如何客观评估人形机器人运动是否具有人类相似性的问题,即在仅依赖运动学信息(kinematic information)的情况下,判断人类观察者能否区分机器人与人类的动作。其核心挑战在于缺乏统一、可量化的评价标准来衡量机器人动作的“类人程度”。解决方案的关键在于提出一种新的评估框架——Motion Turing Test,并构建了HHMotion数据集,该数据集包含1,000个来自11种人形机器人和10名人类受试者的动作序列,均以SMPL-X参数化表示以消除视觉外观影响;同时通过30名标注者对每帧姿态进行0–5分的人类相似度评分,形成超过500小时的高质量标注数据。基于此,研究进一步定义了一个自动预测人类相似度分数的任务,并提出一个简单但有效的基线模型,证明其优于当前主流多模态大语言模型(multimodal large language models, LLMs),从而为后续相关研究提供了基准和工具支持。
链接: https://arxiv.org/abs/2603.06181
作者: Mingzhe Li,Mengyin Liu,Zekai Wu,Xincheng Lin,Junsheng Zhang,Ming Yan,Zengye Xie,Changwang Zhang,Chenglu Wen,Lan Xu,Siqi Shen,Cheng Wang
机构: Xiamen University (厦门大学); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures, conference
Abstract:Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark will be publicly released to support future research in the community.
[CV-65] Making Training-Free Diffusion Segmentors Scale with the Generative Power CVPR2026
【速读】:该论文旨在解决训练-free扩散分割方法在利用更强生成能力的扩散模型时性能提升有限的问题。现有方法依赖于模型注意力层中的交叉注意力图(cross-attention maps)来捕捉图像像素与文本标记之间的语义关系,但存在两个关键瓶颈:一是多头多层注意力图之间缺乏统一的全局表示;二是即使获得全局注意力图,不同文本标记间的得分不平衡导致语义关联不准确。为此,作者提出两种关键技术:自动聚合(auto aggregation)用于整合多头多层注意力信息以构建一致的全局表示,以及逐像素重缩放(per-pixel rescaling)以校正文本标记间得分差异,从而更有效地利用生成式扩散模型的能力实现高精度分割。
链接: https://arxiv.org/abs/2603.06178
作者: Benyuan Meng,Qianqian Xu,Zitai Wang,Xiaochun Cao,Longtao Huang,Qingming Huang
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院); State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所人工智能安全重点实验室); School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区网络科学与技术学院); Alibaba Group(阿里巴巴集团); School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院); Key Laboratory of Big Data Mining and Knowledge Management (BDKM), University of Chinese Academy of Sciences(中国科学院大学大数据挖掘与知识管理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are known as training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model’s attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at this https URL.
[CV-66] Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning
【速读】:该论文旨在解决3D扩散模型(Diffusion Models)在医学图像生成中训练目标与临床实用性之间存在的鸿沟问题。其核心挑战在于如何使生成的图像不仅在局部纹理细节上逼真,同时在全局结构一致性上符合医学诊断需求。解决方案的关键在于引入基于强化学习(Reinforcement Learning, RL)的多尺度反馈机制:首先在MRI数据上预训练3D扩散模型以建立强大的生成先验,随后采用近端策略优化(Proximal Policy Optimization, PPO)进行微调,奖励函数融合了2D切片级评估和3D体积分析,从而协同优化局部细节与整体结构一致性。实验表明,该方法显著提升了Fréchet Inception Distance (FID) 并增强了合成数据在肿瘤和疾病分类等下游任务中的实用价值。
链接: https://arxiv.org/abs/2603.06173
作者: Yueying Tian,Xudong Han,Meng Zhou,Rodrigo Aviles-Espinosa,Rupert Young,Philip Birch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.
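论文的奖励信号同时来自 2D 切片级评估与 3D 体积分析。下面用一个极简草图示意两种粒度奖励的加权组合方式;`slice_scorer`、`volume_scorer` 的接口与权重取值均为本文假设,并非论文具体的奖励模型。

```python
import numpy as np

def multiscale_reward(volume, slice_scorer, volume_scorer, w_2d=0.5, w_3d=0.5):
    """Illustrative combination of the paper's two reward granularities:
    a 2D reward averaged over axial slices plus a 3D volumetric reward.
    The scorer interfaces and weights here are assumptions."""
    r_2d = float(np.mean([slice_scorer(s) for s in volume]))  # per-slice quality
    r_3d = float(volume_scorer(volume))                       # whole-volume coherence
    return w_2d * r_2d + w_3d * r_3d
```

这样的标量奖励即可直接作为 PPO 微调扩散模型时的回报信号,使局部纹理与整体结构同时被优化。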
[CV-67] JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas
【速读】:该论文旨在解决跨视觉模态(如3D点云和全景图像)语义分割中因标注数据稀缺及固定标签模型适应性不足而导致的挑战。其解决方案的关键在于提出JOPP-3D框架,该框架通过联合利用全景图像与点云数据,实现语言驱动的场景理解:将RGB-D全景图像转换为切向透视图像和3D点云,并提取与对齐基础视觉-语言特征,从而支持自然语言查询生成两种模态上的语义掩码。此方法显著提升了开放词汇和封闭词汇条件下2D与3D语义分割的性能。
链接: https://arxiv.org/abs/2603.06168
作者: Sandeep Inuganti,Hideaki Kanayama,Kanta Shimizu,Mahdi Chamseddine,Soichiro Yokota,Didier Stricker,Jason Rambach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
[CV-68] A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement
【速读】:该论文旨在解决半监督学习(Semi-supervised Learning, SSL)在乳腺超声(Breast Ultrasound, BUS)图像分割中因标注数据极度有限而导致伪标签不稳定、监督信号 inaccurate 从而性能下降的问题。其关键解决方案在于提出一种无需训练的伪标签生成与精炼框架:首先利用基于外观描述(如“dark oval”)的简单提示,实现自然图像与医学图像之间的跨域结构迁移,使视觉-语言模型(Vision-Language Models, VLMs)生成结构一致的伪标签;随后用这些伪标签预热一个静态教师网络以捕捉乳腺病灶的全局结构先验,并结合指数移动平均教师网络,引入不确定性熵加权融合与自适应不确定性引导的反向对比学习策略,增强边界判别能力。该方法在仅使用2.5%标注数据的情况下即可达到全监督模型性能,且具有良好的可扩展性,适用于其他成像模态或疾病场景。
链接: https://arxiv.org/abs/2603.06167
作者: Ruili Li,Jiayi Ding,Ruiyu Li,Yilun Jin,Shiwen Ge,Yuwen Zeng,Xiaoyong Zhang,Eichi Takaya,Jan Vrba,Noriyasu Homma
机构: Tohoku University Graduate School of Medicine (东北大学研究生院医学部); Advanced Institute of Convergence Knowledge Informatics, Research Institute of Electrical Communication, Tohoku University (东北大学电气通信研究所先进融合知识研究院); National Institute of Technology, Sendai College (仙台高等技术学院); State Key Laboratory of Oncology in South China, Sun Yat-sen University Cancer Center (华南肿瘤学国家重点实验室,中山大学肿瘤中心); University of Chemistry and Technology, Prague (布拉格化工大学); School of Software Technology, Zhejiang University, Ningbo (浙江大学软件技术学院,宁波); Southeast University, School of Cyber Science and Engineering (东南大学网络科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
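"不确定性熵加权融合"可以直观理解为:按静态教师与 EMA 教师各自预测熵的倒数加权,熵越低(越自信)的教师权重越大。以下草图按此直观理解实现,仅作示意;论文实际采用的加权形式可能不同。

```python
import numpy as np

def entropy_weighted_fusion(p_static, p_ema, eps=1e-7):
    """Hedged sketch of uncertainty-entropy-weighted fusion of two teachers'
    per-pixel foreground probabilities: each teacher is weighted by the
    inverse of its binary prediction entropy, so the more confident teacher
    dominates the fused pseudo label."""
    def H(p):  # binary entropy of a probability map
        p = np.clip(p, eps, 1 - eps)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))
    w1 = 1.0 / (H(p_static) + eps)
    w2 = 1.0 / (H(p_ema) + eps)
    return (w1 * p_static + w2 * p_ema) / (w1 + w2)
```

例如静态教师给出 0.99、EMA 教师给出 0.5 时,融合结果会明显偏向前者,从而过滤掉不确定教师的噪声监督。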
[CV-69] FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models
【速读】:该论文旨在解决道路场景中语义与全景占据(panoptic occupancy)预测的训练依赖问题,即现有纯摄像头方法通常需要昂贵的密集3D监督或在目标域数据上进行模型训练,限制了其在未见环境中的部署。解决方案的关键在于提出FreeOcc——一个无需训练的端到端管线,其核心是利用预训练基础模型(foundation models)从多视角图像中恢复语义与几何信息:首先通过可提示的基础分割模型提取每视角的全景先验,并结合提示到类别规则生成标签;随后使用重建基础模型恢复度量3D点云,并通过深度与置信度感知过滤提升可靠标签;最后通过时间融合与确定性精修堆栈实现体素化输出。此外,通过拟合和合并当前视图的鲁棒3D边界框候选对象,实现无任何3D学习模型的实例级占据恢复,从而在不依赖训练的情况下达到与最先进弱监督方法相当甚至更优的性能。
链接: https://arxiv.org/abs/2603.06166
作者: Andrew Caunes,Thierry Chateau,Vincent Fremont
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages
Abstract:Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle’s surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
[CV-70] Reflective Flow Sampling Enhancement
【速读】:该论文旨在解决当前推理时增强策略在流模型(flow models)尤其是CFG蒸馏变体(如FLUX)上效果不佳的问题。现有方法主要针对传统扩散模型设计,难以有效提升流模型的生成质量和文本提示一致性。解决方案的关键在于提出一种理论严谨且无需训练的推理增强框架——反射流采样(Reflective Flow Sampling, RF-Sampling),其核心机制是通过形式化推导证明该方法隐式地沿文本-图像对齐得分进行梯度上升;具体实现上,利用文本表示的线性组合与流反演(flow inversion)相结合,引导模型探索更符合输入提示的噪声空间,从而显著提升生成质量与提示对齐度,并首次在FLUX模型上展现出一定的测试时扩展能力(test-time scaling)。
链接: https://arxiv.org/abs/2603.06165
作者: Zikai Zhou,Muyao Wang,Shitong Shao,Lichen Bai,Haoyi Xiong,Bo Han,Zeke Xie
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The University of Tokyo (东京大学); Microsoft (微软); Hong Kong Baptist University (香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.
[CV-71] VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在面对真实世界图像失真时的鲁棒性认知不足问题,尤其是其在多种类型扰动下的性能变化机制尚不清晰。解决方案的关键在于构建了一个系统性的基准测试平台——VLM-RobustBench,涵盖49种增强类型(包括噪声、模糊、天气、数字和几何扰动),并在低/中/高严重程度及二值变换下评估133种损坏场景,并对四个家族的VLM(Qwen、InternVL、Molmo、Gemma)在两个互补任务(MMBench用于视觉定位,MMMU-Pro用于推理)上进行评测。实验发现,视觉严重度并非预测难度的可靠指标,低严重度的空间扰动(如glass_blur)比高严重度的光度畸变更能显著降低性能(平均下降约8个百分点),而重采样和几何失真(如upsample、elastic_transform)导致的最大性能下降可达34个百分点,揭示出当前VLMs虽语义能力强但空间结构脆弱,从而推动建立新的鲁棒性评估协议与训练策略,强调对重采样和几何不变性的建模。
链接: https://arxiv.org/abs/2603.06148
作者: Rohit Saxena,Alessandro Suglia,Pasquale Minervini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
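基准中威力最大的空间扰动(如 glass_blur 式的局部像素置乱)可以用一个玩具版本示意:severity 控制每个像素的随机位移半径。以下仅为说明"低严重度空间扰动"形态的示意实现,并非 VLM-RobustBench 的官方增强代码。

```python
import numpy as np

def local_shuffle(img, severity=1, rng=None):
    """Toy glass-blur-style corruption: randomly displace each pixel within
    a window whose radius grows with severity. An illustrative stand-in for
    the benchmark's spatial augmentations, not their exact implementation."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = img.shape[:2]
    r = severity                               # displacement radius
    dy = rng.integers(-r, r + 1, size=(h, w))  # per-pixel vertical offset
    dx = rng.integers(-r, r + 1, size=(h, w))  # per-pixel horizontal offset
    ys = np.clip(np.arange(h)[:, None] + dy, 0, h - 1)
    xs = np.clip(np.arange(w)[None, :] + dx, 0, w - 1)
    return img[ys, xs]
```

即便 severity 很小,这类置乱也会破坏局部空间结构而几乎不改变整体色彩统计,这与论文"语义强、空间脆弱"的发现相呼应。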
[CV-72] Longitudinal NSCLC Treatment Progression via Multimodal Generative Models
【速读】:该论文旨在解决放射治疗过程中非小细胞肺癌(NSCLC)肿瘤演化的预测难题,尤其是在解剖结构和治疗因素共同驱动下的纵向变化建模问题。其解决方案的关键在于提出一种名为“虚拟治疗”(Virtual Treatment, VT)的框架,将肿瘤演化建模为一个剂量感知的多模态条件图像到图像翻译问题:给定基线CT影像、临床变量及指定的放疗剂量增量,VT通过生成模型合成反映治疗诱导解剖变化的随访CT图像。该方法的核心创新在于引入剂量信息作为条件输入,并结合人口统计学与肿瘤相关临床变量,显著提升了生成结果的稳定性和解剖合理性,其中基于扩散模型(diffusion-based models)的实现相较生成对抗网络(GAN-based models)展现出更一致且更符合生物学逻辑的肿瘤演化轨迹,验证了VT在数字治疗监测与自适应放疗研究中的潜力。
链接: https://arxiv.org/abs/2603.06147
作者: Massimiliano Mantegna,Elena Mulero Ayllón,Alice Natalina Caragliano,Francesco Di Feola,Claudia Tacconi,Michele Fiore,Edy Ippolito,Carlo Greco,Sara Ramella,Philippe C. Cattin,Paolo Soda,Matteo Tortora,Valerio Guarrasi
机构: University of Campania “Luigi Vanvitelli”(坎帕尼亚大学路易吉·范维特利); University of Bologna(博洛尼亚大学); University of Milan(米兰大学); University of Turin(都灵大学); University of Palermo(巴勒莫大学); University of Catania(卡塔尼亚大学); University of Rome Tor Vergata(罗马托尔韦加塔大学); University of Calabria(卡拉布里亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Predicting tumor evolution during radiotherapy is a clinically critical challenge, particularly when longitudinal changes are driven by both anatomy and treatment. In this work, we introduce a Virtual Treatment (VT) framework that formulates non-small cell lung cancer (NSCLC) progression as a dose-aware multimodal conditional image-to-image translation problem. Given a CT scan, baseline clinical variables, and a specified radiation dose increment, VT aims to synthesize plausible follow-up CT images reflecting treatment-induced anatomical changes. We evaluate the proposed framework on a longitudinal dataset of 222 stage III NSCLC patients, comprising 895 CT scans acquired during radiotherapy under irregular clinical schedules. The generative process is conditioned on delivered dose increments together with demographic and tumor-related clinical variables. Representative GAN-based and diffusion-based models are benchmarked across 2D and 2.5D configurations. Quantitative and qualitative results indicate that diffusion-based models benefit more consistently from multimodal, dose-aware conditioning and produce more stable and anatomically plausible tumor evolution trajectories than GAN-based baselines, supporting the potential of VT as a tool for in-silico treatment monitoring and adaptive radiotherapy research in NSCLC.
[CV-73] Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对结构化像素扰动时表现出的系统性感知脆弱性问题,即当图像中引入空间色彩混合(Spatial Colour Mixing)干扰后,VLMs 会产生高置信度但语义错误的预测,而人类观察者仍能准确识别场景。解决方案的关键在于提出一种基于空间色彩混合的程序化干扰框架,包含八种变体,并通过实验证明:尽管增大语言模型规模无法稳定提升鲁棒性,但引入一个简单的人类启发式预处理步骤可显著恢复部分性能,从而揭示感知驱动(perception-aware)的预处理和工具使用(tool-use)是提升 VLM 在复杂干扰下鲁棒性的有效策略。
链接: https://arxiv.org/abs/2603.06141
作者: Nicoleta-Nina Basoc,Adrian Cosma,Emilian Radoi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.
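为便于理解“空间色彩混合”这类结构化颜色干扰的构造方式,下面给出一个示意性草图(非论文官方实现;条纹图案、混合系数等均为假设):用 NumPy 将循环的 RGB 纯色条纹与原图做 alpha 混合,像素值被大幅改变而场景结构仍可辨认。

```python
import numpy as np

def spatial_colour_mix(image, stripe_width=8, alpha=0.5):
    """Overlay an illustrative striped RGB colour pattern onto an image.

    A hypothetical stand-in for the paper's distortion family: vertical
    stripes cycle through pure R/G/B and are alpha-blended with the
    original pixels, leaving scene structure intact.
    """
    h, w, _ = image.shape
    colours = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.float32)
    # Assign each column a stripe index, then a cycling RGB colour.
    stripe_idx = (np.arange(w) // stripe_width) % 3
    pattern = colours[stripe_idx]                    # (w, 3)
    pattern = np.broadcast_to(pattern, (h, w, 3))    # (h, w, 3)
    mixed = (1.0 - alpha) * image.astype(np.float32) + alpha * pattern
    return mixed.clip(0, 255).astype(np.uint8)

img = np.full((16, 24, 3), 128, dtype=np.uint8)  # flat grey test image
out = spatial_colour_mix(img, stripe_width=4, alpha=0.5)
```

论文实际使用的图案族(含 RGB 与 Ostwald 色彩系统)更丰富,此处仅演示“结构化、大幅度但语义保留”的扰动思路。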
[CV-74] Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion
【速读】:该论文旨在解决当前视频对象插入技术中视觉保真度高但物理因果一致性不足的问题,即插入对象与环境之间缺乏物理合理性。其解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs)的思维链(Chain-of-Thought, CoT)推理能力,构建一个“先思考再放置”(Think-then-Place)的端到端框架 Place-it-R1。核心创新包括:1)MLLM进行物理场景理解与交互推理,生成环境感知的思维链 token 并推断有效插入区域以引导扩散模型实现物理合理插入;2)引入 MLLM 指导的空间直接偏好优化(Spatial Direct Preference Optimization, DPO),通过反馈评分提升视觉自然性;3)在推理阶段形成闭环迭代机制,使 MLLM 可持续触发精修循环并动态调整扩散模型输出,从而逐步提升编辑质量。此外,系统提供两种用户可选模式——注重物理合理性(允许修改环境结构)的灵活模式和注重场景完整性的标准模式,实现了对保真度与物理合理性之间权衡的显式控制。
链接: https://arxiv.org/abs/2603.06140
作者: Bohai Gu,Taiyi Wu,Dazhao Du,Jian Liu,Shuai Yang,Xiaotong Zhao,Alan Zhao,Song Guo
机构: HKUST; Tencent Video; Peking University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations: First, MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Then, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. During inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed-loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate Place-it-R1 achieves physically-coherent video object insertion compared with state-of-the-art solutions and commercial models.
[CV-75] Cross-Resolution Distribution Matching for Diffusion Distillation
【速读】:该论文旨在解决扩散模型(Diffusion Model)在图像和视频生成中因去噪过程步骤过多而导致的推理速度瓶颈问题,以及部分低分辨率生成方法因跨分辨率分布差异(cross-resolution distribution gaps)引发的质量下降问题。解决方案的关键在于提出了一种新颖的跨分辨率分布匹配蒸馏框架(Cross-Resolution Distribution Matching Distillation, RMD),其核心创新包括:利用对数信噪比(logSNR)曲线划分各分辨率的时步区间,并引入基于logSNR的映射机制补偿分辨率引起的分布偏移;沿分辨率轨迹进行分布匹配以缩小低分辨率生成器分布与教师模型高分辨率分布之间的差距;同时在上采样过程中引入预测噪声重注入机制(predicted-noise re-injection),提升训练稳定性与合成质量。该方法实现了高质量、少步数的多分辨率级联推理,在多个骨干网络上显著加速生成过程(如SDXL达33.4倍、Wan2.1-14B达25.6倍),且保持高视觉保真度。
链接: https://arxiv.org/abs/2603.06136
作者: Feiyang Chen,Hongpeng Pan,Haonan Xu,Xinyu Duan,Yang Yang,Zhefeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion distillation is central to accelerating image and video generation, yet existing methods are fundamentally limited by the denoising process, where step reduction has largely saturated. Partial timestep low-resolution generation can further accelerate inference, but it suffers noticeable quality degradation due to cross-resolution distribution gaps. We propose Cross-Resolution Distribution Matching Distillation (RMD), a novel distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference. Specifically, RMD divides the timestep intervals for each resolution using logarithmic signal-to-noise ratio (logSNR) curves, and introduces logSNR-based mapping to compensate for resolution-induced shifts. Distribution matching is conducted along resolution trajectories to reduce the gap between low-resolution generator distributions and the teacher’s high-resolution distribution. In addition, a predicted-noise re-injection mechanism is incorporated during upsampling to stabilize training and improve synthesis quality. Quantitative and qualitative results show that RMD preserves high-fidelity generation while accelerating inference across various backbones. Notably, RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B, while preserving high visual fidelity.
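RMD 依赖 logSNR 曲线来划分各分辨率的时步区间并补偿分辨率带来的分布偏移。下面是一个示意性草图(调度函数取常见的余弦式 logSNR,分辨率补偿取 2·log(scale) 的常见约定,二者均为假设,论文具体形式可能不同):

```python
import math

def logsnr(t, logsnr_min=-10.0, logsnr_max=10.0):
    """Cosine-style logSNR schedule on t in [0, 1] (illustrative choice):
    logSNR = logsnr_max at t=0, decaying to logsnr_min at t=1."""
    b = math.atan(math.exp(-0.5 * logsnr_max))
    a = math.atan(math.exp(-0.5 * logsnr_min)) - b
    return -2.0 * math.log(math.tan(a * t + b))

def shift_for_resolution(lsnr, scale):
    """Hypothetical resolution compensation: downsampling by `scale`
    averages pixels, raising the effective SNR, so the schedule is
    shifted by 2*log(scale) (a common convention, assumed here)."""
    return lsnr + 2.0 * math.log(scale)
```

在这种约定下,低分辨率阶段的时步可通过反查平移后的 logSNR 曲线映射回高分辨率教师的时步区间。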
[CV-76] FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification
【速读】:该论文旨在解决联邦域泛化行人重识别(FedDG-ReID)中两个关键问题:一是仅使用全局特征表示难以捕捉细微且域不变的局部信息(如配饰或纹理),二是统一参数平均策略忽略了客户端在鲁棒特征提取能力上的差异,导致高质量客户端的贡献被稀释。解决方案的核心在于提出一种新颖的联邦学习框架FedARKS,其包含两个机制:RK(Robust Knowledge)用于增强模型对噪声和分布偏移的鲁棒性,KS(Knowledge Selection)则基于客户端特征提取质量动态选择并加权聚合知识,从而提升整体模型的域泛化性能与隐私保护能力。
链接: https://arxiv.org/abs/2603.06122
作者: Xin Xu,Binchang Ma,Zhixi Yu,Wei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model’s generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection).
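摘要未给出 FedARKS 聚合的具体公式,下面以“质量加权聚合”替代均匀平均,作为 KS(Knowledge Selection)思路的一个最小示意(客户端质量分数如何产生是论文特有机制,此处假设已给定):

```python
import numpy as np

def weighted_aggregate(client_params, quality_scores):
    """Aggregate client parameter vectors with softmax-normalised quality
    weights, instead of the uniform average used by plain FedAvg.
    How each client's robust-feature-extraction quality is scored is
    paper-specific; the scores here are assumed given."""
    scores = np.asarray(quality_scores, dtype=np.float64)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    stacked = np.stack([np.asarray(p, dtype=np.float64) for p in client_params])
    return (weights[:, None] * stacked).sum(axis=0)

# Two clients: equal quality reduces to the plain average.
avg = weighted_aggregate([[0.0, 0.0], [2.0, 4.0]], [1.0, 1.0])
```

质量分数相等时退化为 FedAvg 的均匀平均;分数差异越大,高质量客户端的参数贡献越占主导。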
[CV-77] Lyapunov Probes for Hallucination Detection in Large Foundation Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)和多模态大型语言模型(Multimodal Large Language Models, MLLMs)中的幻觉检测问题。传统方法将幻觉视为简单的分类任务,而本文提出从动力系统稳定性理论的角度重新建模:将事实知识视为表示空间中的稳定平衡点,幻觉则倾向于出现在知识过渡区域的边界上——即稳定区与不稳定区之间的分界处。解决方案的关键在于引入“李雅普诺夫探针”(Lyapunov Probes),这是一种轻量级网络结构,通过基于导数的稳定性约束训练,强制模型在输入扰动下置信度单调衰减,从而有效区分稳定的真实区域与易产生幻觉的不稳定性区域。该方法结合系统性扰动分析与两阶段训练策略,在多个数据集和模型上均显著优于现有基线。
链接: https://arxiv.org/abs/2603.06081
作者: Bozhi Luan,Gen Li,Yalan Qin,Jifeng Guo,Yun Zhou,Faguo Wu,Hongwei Zheng,Wenjun Wu,Zhaoxin Fan
机构: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing; School of Artificial Intelligence, Beihang University; School of Electronic and Information Engineering, State Key Laboratory of CNS/ATM, Beihang University; National Key Laboratory of Information Systems Engineering, National University of Defense Technology; Beijing Academy of Blockchain and Edge Computing; Shanghai University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.
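论文的“基于导数的稳定性约束”可以用有限差分来示意:沿扰动半径递增的方向,置信度一旦上升即施加惩罚,下降则不罚。以下为假设性草图,并非论文实现:

```python
import numpy as np

def lyapunov_decay_penalty(confidences):
    """Penalty enforcing that probe confidence decays monotonically as
    the input perturbation radius grows (an illustrative finite-difference
    surrogate for the paper's derivative-based stability constraint).
    `confidences` has shape (batch, n_radii), ordered by growing radius."""
    diffs = np.diff(confidences, axis=1)   # c(r_{k+1}) - c(r_k)
    return np.maximum(diffs, 0.0).mean()   # hinge on any positive slope

stable = np.array([[0.9, 0.7, 0.5, 0.2]])    # monotone decay -> no penalty
unstable = np.array([[0.9, 0.5, 0.8, 0.2]])  # bump at r_2 -> penalised
```

直观上,稳定的事实区域应给出单调衰减的置信度曲线,而幻觉易发的不稳定区域会出现被惩罚的“回升”。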
[CV-78] Text-Driven Emotionally Continuous Talking Face Generation
【速读】:该论文旨在解决现有说话人脸生成(Talking Face Generation, TFG)方法在情感表达上的局限性问题,即当前方法通常只能生成固定情绪的面部动作,缺乏人类在自然交流中表现出的连续、动态的情绪变化能力。为实现更逼真的情感连续性表达,作者提出了一种新任务——情感连续说话人脸生成(Emotionally Continuous Talking Face Generation, EC-TFG),其输入包括文本片段和带有情绪变化描述的驱动信号。解决方案的关键在于设计了一个定制化模型 Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG),该模型通过引入时序密集的情感波动建模(Temporal-Intensive Emotion Fluctuation Modeling)机制,能够从输入情绪描述中提取对应的时间序列情感变化,并据此驱动合成视频中面部表情的平滑过渡,从而实现高保真度的视觉质量和运动真实性下的持续情感表达。
链接: https://arxiv.org/abs/2603.06071
作者: Hao Yang,Yanyan Zhao,Tian Zheng,Hongbo Zhang,Bichen Wang,Di Wu,Xing Fu,Xuda Zhi,Yongbo Huang,Hao He
机构: Harbin Institute of Technology (哈尔滨工业大学); SERES
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method’s exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.
[CV-79] Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting IROS
【速读】:该论文旨在解决大规模数字孪生(Digital Twin)环境构建中因依赖昂贵定制化数据采集而导致成本高昂的问题,同时提出了一种高效利用部署平台日常积累的全向RGB与LiDAR原始日志数据的方法。其核心挑战在于:直接使用未经处理的多模态传感器数据会导致结构从运动(Structure-from-Motion, SfM)跟踪不稳定(源于非线性畸变),以及点云密集且无组织带来的3D高斯泼溅(3D Gaussian Splatting, 3DGS)优化计算开销大。解决方案的关键在于设计了一个端到端的多模态重用流水线,包含两个核心技术模块:一是将等距柱状投影(ERP,equirectangular projection)图像转换为立方体贴图(cubemap)以实现确定性的空间锚定;二是引入PRISM——一种基于颜色分层的下采样策略,有效降低LiDAR点云冗余并提升匹配效率。通过FPFH全局配准与迭代最近点(ICP)融合多模态数据,该方法成功将大量原本被丢弃的数据转化为可用于3DGS初始化的高质量几何信息,并在结构复杂场景中显著优于纯视觉基线,实现了从标准传感器日志到仿真级数字孪生的可复现、确定性工作流。
链接: https://arxiv.org/abs/2603.06061
作者: Semin Bae,Hansol Lim,Jongseong Brad Choi
机构: State University of New York, Stony Brook (纽约州立大学石溪分校); National Research Foundation of Korea (韩国国家研究基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This work has been submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication
Abstract:The demand for large-scale digital twins is rapidly growing in robotics and autonomous driving. However, constructing these environments with 3D Gaussian Splatting (3DGS) usually requires expensive, purpose-built data collection. Meanwhile, deployed platforms routinely collect extensive omnidirectional RGB and LiDAR logs, but a significant portion of these sensor data is directly discarded or strictly underutilized due to transmission constraints and the lack of scalable reuse pipeline. In this paper, we present an omnidirectional RGB-LiDAR reuse pipeline that transforms these archived logs into robust initialization assets for 3DGS. Direct conversion of such raw logs introduces practical bottlenecks: inherent non-linear distortion leads to unreliable Structure-from-Motion (SfM) tracking, and dense, unorganized LiDAR clouds cause computational overhead during 3DGS optimization. To overcome these challenges, our pipeline strategically integrates an ERP-to-cubemap conversion module for deterministic spatial anchoring, alongside PRISM-a color stratified downsampling strategy. By bridging these multi-modal inputs via Fast Point Feature Histograms (FPFH) based global registration and Iterative Closest Point (ICP), our pipeline successfully repurposes a considerable fraction of discarded data into usable SfM geometry. Furthermore, our LiDAR-reinforced initialization consistently enhances the final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines. Ultimately, this work provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs.
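ERP 与立方体贴图之间的转换本质上是“面内像素 → 方向向量 → 球面经纬度 → ERP 像素”的映射。下面用 NumPy 示意 +Z 前向面的这一映射(面朝向、ERP 原点等坐标约定均为假设,论文的具体布局可能不同):

```python
import numpy as np

def cubemap_front_to_erp(face_size, erp_w, erp_h):
    """Map each pixel of the front (+Z) cube face to equirectangular
    (ERP) pixel coordinates. Face orientation and ERP origin conventions
    are assumptions; the paper's exact layout may differ."""
    # Normalised device coords in [-1, 1] across the face (pixel centres).
    u = (np.arange(face_size) + 0.5) / face_size * 2.0 - 1.0
    uu, vv = np.meshgrid(u, u)
    # Ray directions for the +Z face.
    x, y, z = uu, -vv, np.ones_like(uu)
    lon = np.arctan2(x, z)                  # [-pi, pi], 0 at +Z
    lat = np.arctan2(y, np.hypot(x, z))    # [-pi/2, pi/2]
    px = (lon / (2 * np.pi) + 0.5) * erp_w  # ERP column
    py = (0.5 - lat / np.pi) * erp_h        # ERP row, 0 at north pole
    return px, py

px, py = cubemap_front_to_erp(64, 2048, 1024)
```

按该映射对 ERP 图像重采样即可得到一个面;六个面各自绕相应轴旋转方向向量即可,正是这种确定性映射避免了 SfM 在畸变下的跟踪不稳定。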
[CV-80] TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation
【速读】:该论文旨在解决生成式AI(Generative AI)在音频驱动的虚拟人头像(Talking-Head Generation, THG)应用中面临的三大挑战:高推理延迟、时序不稳定性(如闪烁和身份漂移)以及在复杂语音条件下音视频对齐不佳的问题。解决方案的关键在于提出一种参考条件化的潜在扩散框架TempoSyncDiff,其核心创新包括:采用教师-学生蒸馏机制,使轻量级学生去噪器在显著减少推理步数下仍能保持与强教师模型相近的重建性能;引入身份锚定和时序正则化策略以抑制身份漂移和帧间闪烁;并通过基于viseme的音频条件控制实现粗粒度唇部运动引导。该方法在LRS3数据集上验证了其在重建质量、推理效率和稳定性方面的改进,为受限计算环境下实用化扩散模型驱动的THG提供了可行路径。
链接: https://arxiv.org/abs/2603.06057
作者: Soumya Mazumdar,Vineet Kumar Rakesh
机构: Gargi Memorial Institute of Technology (加尔吉纪念理工学院); Variable Energy Cyclotron Centre (变能回旋加速器中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注:
Abstract:Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: this https URL
[CV-81] Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving
【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在自动驾驶场景中对简单但关键的视觉问题表现不佳的问题,尤其是这些模型在处理长尾场景时仍存在不可靠性。其解决方案的关键在于通过分析VLMs中间激活状态,评估特定视觉概念是否以线性方式编码于模型内部表示中,从而识别视觉信息流动中的瓶颈。具体而言,研究者构建了仅在目标视觉概念上不同的反事实图像集,并使用线性探测器(linear probes)测试四个先进VLMs的激活特征,发现物体存在等概念被显式线性编码,而空间属性如方向则仅隐式保留;进一步识别出两类失败模式:感知失败(perceptual failure),即所需视觉信息未被线性编码;认知失败(cognitive failure),即视觉信息虽存在但未能与语言语义正确对齐。此外,还发现目标物体距离增加会迅速降低对应视觉概念的线性可分性,这为改进VLMs在自动驾驶中的鲁棒性提供了关键洞见。
链接: https://arxiv.org/abs/2603.06054
作者: Nikos Theodoridis,Reenu Mohandas,Ganesh Sistu,Anthony Scanlan,Ciarán Eising,Tim Brophy
机构: University of Limerick (利默里克大学); Valeo Vision Systems, Ireland (法雷奥视觉系统,爱尔兰)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model’s activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model’s activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
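线性探测的基本做法是:冻结模型,在其中间激活上拟合一个逻辑回归分类器,用分类精度衡量某一视觉概念是否线性可解码。以下为自包含的 NumPy 草图(求解器、反事实数据的构造方式均为示意性假设):

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe on frozen activations to test
    whether a visual concept (e.g. object presence) is linearly
    decodable. Plain full-batch gradient descent; a stand-in for
    whatever solver the paper actually uses."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # sigmoid
        g = p - labels                              # dL/dlogit
        w -= lr * (acts.T @ g) / len(labels)
        b -= lr * g.mean()
    preds = 1.0 / (1.0 + np.exp(-(acts @ w + b))) > 0.5
    return w, b, float((preds == labels.astype(bool)).mean())

# Synthetic counterfactual pairs: the concept shifts activation dim 0.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 16))
labels = (rng.random(200) > 0.5).astype(np.float64)
acts = base.copy()
acts[:, 0] += 3.0 * labels
w, b, acc = train_linear_probe(acts, labels)
```

若探针精度高(如此例),说明该概念被线性编码;论文中“概念线性可分但模型仍答错”的情形即对应其定义的认知失败(cognitive failure)。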
[CV-82] Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models CVPR2026
【速读】:该论文旨在解决自动驾驶视觉-语言-动作(VLA)模型在训练过程中因模仿学习(Imitation Learning, IL)阶段探索不足而导致后续强化学习(Reinforcement Learning, RL)阶段性能饱和的问题,即“窄策略限制”(Narrow Policy limitation)。其核心解决方案是提出Curious-VLA框架,关键在于两阶段设计:第一阶段引入可行轨迹扩展(Feasible Trajectory Expansion, FTE)策略生成多样化的物理可行轨迹,并采用逐步归一化的轨迹表示以适配多样化数据;第二阶段通过自适应多样性感知采样(Adaptive Diversity-Aware Sampling, ADAS)优先选择高多样性样本,并设计具有焦点加权机制的跨度驾驶奖励(Spanning Driving Reward, SDR),以扩大奖励值范围并提升对驾驶质量的敏感性。该方法有效缓解了利用与探索之间的权衡问题,显著提升了VLA模型的探索潜力和最终性能。
链接: https://arxiv.org/abs/2603.06049
作者: Canyu Chen,Yuguang Yang,Zhewen Tan,Yizhi Wang,Ruiyi Zhan,Haiyan Liu,Xuanyao Mao,Jason Bao,Xinyue Tang,Linlin Yang,Bingchuan Sun,Yan Wang,Baochang Zhang
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学); Lenovo Group Limited (联想集团); Communication University of China (中国传媒大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by CVPR2026 findings
Abstract:We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. Thereby, we propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS) that prioritizes high-diversity samples and introduce Spanning Driving Reward (SDR) with a focal style weighting to amplify reward’s value span for improving sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: this https URL.
[CV-83] GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection
【速读】:该论文旨在解决手-物体交互(Hand-Object Interaction, HOI)在数字人视频合成中的两个核心挑战:一是生成物理上合理的接触关系,二是跨帧保持物体身份一致性,尤其是在未见过的、真实复杂场景(in-the-wild)中。现有方法通常局限于域内训练与评估,泛化能力弱;而通用视频编辑模型虽具鲁棒性,却难以处理HOI特有的对象外观不一致问题。解决方案的关键在于提出GenHOI——一种轻量级增强模块,通过时空协同机制注入参考对象信息:在时间维度上,引入Head-Sliding RoPE以分配头特定的时间偏移,均衡参考令牌的影响并缓解3D RoPE的时序衰减,提升长程对象一致性;在空间维度上,设计两级空间注意力门控机制,聚焦于HOI区域并自适应调节强度,在保留背景真实感的同时增强交互精度。
链接: https://arxiv.org/abs/2603.06048
作者: Xuan Huang,Mochu Xiang,Zhelun Shen,Jinbo Wu,Chenming Wu,Chen Zhao,Kaisiyuan Wang,Hang Zhou,Shanshan Liu,Haocheng Feng,Wei He,Jingdong Wang
机构: Baidu Inc.(百度公司); Shenzhen Campus of Sun Yat-Sen University (中山大学深圳校区); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: this https URL
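Head-Sliding RoPE 的核心是给参考对象 token 的时间位置按注意力头施加不同偏移,使其影响在整个片段上分布得更均匀。下面是一个最小示意(RoPE 角度计算为标准形式,线性步长的头偏移调度则为假设):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles for 1-D (temporal) positions:
    one angle per (position, frequency) pair, shape (seq, dim/2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def head_sliding_positions(ref_pos, n_heads, stride):
    """Illustrative head-sliding offsets: head h sees the reference token
    at temporal position ref_pos + h*stride, spreading its influence
    across frames. The linear stride schedule is an assumption."""
    return ref_pos + stride * np.arange(n_heads)

pos = head_sliding_positions(0.0, n_heads=4, stride=2.0)
ang = rope_angles(pos, dim=8)
```

不同头因此在不同时间偏移处“锚定”参考 token,缓解 3D RoPE 随时间距离增大而注意力衰减的问题。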
[CV-84] Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models CVPR2026
【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在视觉理解与生成能力之间存在的显著差距问题,即UMMs虽具备强大的视觉理解能力,但在复杂文本到图像(Text-to-Image, T2I)任务中生成图像的语义一致性较弱。其解决方案的关键在于提出一种基于token级内在图文对齐的奖励机制GvU,使UMM能够同时作为“教师”和“学生”:利用自身的理解分支评估生成结果,并据此引导生成过程。在此基础上构建自监督强化学习框架,通过内在奖励信号实现无外部监督下的迭代优化,从而提升生成质量并反向增强细粒度视觉理解能力,缩小UMMs在理解与生成之间的能力鸿沟。
链接: https://arxiv.org/abs/2603.06043
作者: Jiadong Pan,Liang Li,Yuxin Peng,Yu-Ming Tang,Shuohuan Wang,Yu Sun,Hua Wu,Qingming Huang,Haifeng Wang
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Peking University (北京大学); Sun Yat-sen University (中山大学); Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs’ internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals–without reliance on external supervision. Experimental results show that our method substantially boosts UMMs’ generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs’ visual understanding and generation.
[CV-85] FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography
【速读】:该论文旨在解决当前文本到图像生成模型在控制字体排版(typography)方面存在的挑战,即用户指定的字体风格和使用场景常被忽略或仅弱相关地遵循。其解决方案的关键在于提出一种以数据为中心的方法,构建了一个大规模、结构化的字体专用数据集 FontUse(约70K张图像),通过自动化标注流程(结合分割模型与多模态大语言模型 MLLM)获取文本区域位置、OCR识别字符串及用户友好的提示词(prompt),这些提示词明确融合字体风格(如衬线体、手写体、优雅风格)和使用场景(如婚礼请柬、咖啡店菜单),从而实现对生成结果的精确控制。在此基础上微调现有图像生成器无需修改架构即可稳定理解风格与场景条件,显著提升生成文本与提示的一致性。
链接: https://arxiv.org/abs/2603.06038
作者: Xia Xin,Yuki Endo,Yoshihiro Kanamori
机构: University of Tsukuba (筑波大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at this https URL.
[CV-86] Ensemble Learning with Sparse Hypercolumns
【速读】:该论文旨在解决高维超列(hypercolumns)在图像分割任务中因计算复杂度随训练集规模 $N$ 线性增长而导致的实际应用受限问题。其关键解决方案是采用分层抽样(stratified subsampling)策略对基于VGG16的超列进行稀疏化处理,并结合集成学习方法(如堆叠和投票集成)提升性能。实验表明,在极端低样本场景下($N \leq 20$),简单逻辑回归分类器表现最优;而在 10% 分层抽样率下,$N=20$ 时最佳平均 Dice 分数达 0.66,较标准多尺度 UNet 基线取得 24.53% 的统计显著提升($p$-值 = $3.07 \times 10^{-11}$,Wilcoxon 符号秩检验),验证了该方案在缓解过拟合与保持精度之间的有效性。
链接: https://arxiv.org/abs/2603.06036
作者: Julia Dietlmeier,Vayangi Ganepola,Oluwabukola G. Adegboro,Mayug Maniparambil,Claudia Mazo,Noel E. O’Connor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: presented at 33rd International Conference on Artificial Intelligence and Cognitive Science (AICS 2025)
Abstract:Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation i.e. pixel classification. However, in practice, there are only very few works dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns that grows linearly with the size N of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16 based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of N ≤ 20, a simple Logistic Regression classifier is the most effective method. For 10% stratified subsampling rate, our best average Dice score is 0.66 for N=20. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline (p-value = 3.07e-11, Wilcoxon signed-rank test), which is less effective due to overfitting.
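分层抽样在此处即按像素类别比例对稠密超列特征做下采样,使稀疏化后的训练集仍保持肿瘤/背景像素的原始比例。以下为一个最小 NumPy 草图(接口与参数为示意性假设):

```python
import numpy as np

def stratified_subsample(features, labels, rate, seed=0):
    """Subsample hypercolumn feature vectors at `rate`, preserving the
    per-class (e.g. tumour vs background) pixel proportions."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n = max(1, int(round(rate * len(idx))))
        keep.append(rng.choice(idx, size=n, replace=False))
    keep = np.concatenate(keep)
    return features[keep], labels[keep]

feats = np.arange(1000 * 4, dtype=np.float64).reshape(1000, 4)
labs = np.array([0] * 900 + [1] * 100)   # 9:1 class imbalance
sub_f, sub_l = stratified_subsample(feats, labs, rate=0.1)
```

相比均匀随机抽样,这种做法在强类别不平衡(如脑肿瘤分割)下不会丢失少数类像素。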
[CV-87] Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking CVPR2026
【速读】:该论文旨在解决2D多目标跟踪(Multi-object Tracking, MOT)中因部分遮挡(partial occlusion)导致的位置代价混淆(positional cost confusion)问题,从而提升轨迹关联的准确性与稳定性。解决方案的关键在于提出了一种即插即用且无需训练的框架——Occlusion-Aware SORT(OA-SORT),其核心组件包括:1)遮挡感知模块(Occlusion-Aware Module, OAM),通过引入高斯图(Gaussian Map, GM)降低背景干扰并准确识别遮挡状态;2)遮挡感知偏移量(Occlusion-Aware Offset, OAO)和偏差感知动量机制(Bias-Aware Momentum, BAM),利用OAM输出的遮挡信息来缓解代价混淆并抑制估计不稳定性。实验证明,该框架在DanceTrack、SportsMOT和MOT17等多个数据集上显著提升了HOTA和IDF1指标,且可集成到其他主流追踪器中实现性能增益。
链接: https://arxiv.org/abs/2603.06034
作者: Chunjiang Li,Jianbo Ma,Li Shen,Yanru Chen,Liangyin Chen
机构: Sichuan University (四川大学); CAS (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR2026)
Abstract:Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. In contrast, OAO and BAM leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into the four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, demonstrating the reusability of the occlusion awareness.
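高斯图(GM)降低背景影响的直观做法是:在检测框内以中心为峰值的高斯权重压低边缘像素(多为背景)的贡献。以下为示意性草图(sigma 比例系数为假设超参数,非论文原始实现):

```python
import numpy as np

def gaussian_map(h, w, sigma_scale=0.25):
    """Gaussian weight map over a detection box: pixels near the box
    centre (likely the object) weigh ~1, while border pixels (likely
    background) decay toward 0. `sigma_scale` is an assumed
    hyperparameter tying sigma to the box size."""
    ys = (np.arange(h) - (h - 1) / 2.0) / (sigma_scale * h)
    xs = (np.arange(w) - (w - 1) / 2.0) / (sigma_scale * w)
    return np.exp(-0.5 * (ys[:, None] ** 2 + xs[None, :] ** 2))

gm = gaussian_map(15, 21)
```

将该权重图与框内的外观或遮挡统计逐像素相乘,即可得到更少受背景污染的遮挡状态估计。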
[CV-88] StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision
【速读】:该论文旨在解决基于推理的文本到图像(Text-to-Image, T2I)生成中,现有方法在视觉语义理解与计算效率之间的权衡问题。具体而言,传统文本-only 推理框架缺乏视觉上下文,易忽略关键空间和视觉元素;而文本-图像交错推理虽增强视觉定位能力,却因依赖中间图像生成导致计算开销大且受限于生成器的表征能力。其解决方案的关键在于提出 StruVis 框架,通过引入**结构化视觉表示(Structured Vision Representation)**作为纯文本推理过程中的中间状态,使多模态大语言模型(MLLM)能够在不生成中间图像的前提下“感知”视觉结构,从而实现高效且强健的视觉引导推理。此方法不仅解耦了推理与生成过程,还具备生成器无关性,可无缝适配多种 T2I 生成器并显著提升其在复杂提示理解任务上的性能表现。
链接: https://arxiv.org/abs/2603.06032
作者: Yuanhuiyi Lyu,Kaiyu Lei,Ziqiao Weng,Xu Zheng,Lutao Jiang,Teng Li,Yangfu Li,Ziyuan Huang,Linfeng Zhang,Xuming Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively “perceive” visual structure within a purely text-based reasoning process. Powered by this, the reasoning potential for T2I generation of the MLLM is unlocked through structured-vision-guided reasoning. Additionally, as a generator-agnostic reasoning framework, our proposed StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhance their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.
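摘要中"以文本形式的结构化视觉表示作为中间推理状态"的思路可示意如下(对象/属性/归一化布局框的schema为本文举例假设,并非论文定义):

```python
import json

def structured_vision_state(objects):
    """Serialize a hypothetical object/layout structure as text, so an MLLM can
    'perceive' visual structure inside a purely textual reasoning trace."""
    state = {"objects": [{"name": o["name"],
                          "attrs": o.get("attrs", []),
                          "box": o["box"]}       # assumed normalized [x1,y1,x2,y2]
                         for o in objects]}
    return json.dumps(state, ensure_ascii=False)
```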
[CV-89] MOSIV: Multi-Object System Identification from Videos ICLR2026
【速读】:该论文旨在解决从视频中进行多物体系统识别(multi-object system identification)的挑战性问题,现有方法因局限于单物体场景或固定材料原型的离散分类而难以适用。其解决方案的核心在于提出MOSIV框架,该框架通过可微分仿真器直接优化连续的、每个物体独立的材料参数,并利用从视频中提取的几何目标函数进行引导,从而实现更精确的物理属性估计与长期模拟保真度提升。关键创新点包括引入基于几何对齐的目标函数以及对象级细粒度监督,显著增强了复杂多物体场景下的优化稳定性。
链接: https://arxiv.org/abs/2603.06022
作者: Chunjiang Liu,Xiaoyuan Wang,Qingran Lin,Albert Xiao,Haoyu Chen,Shizheng Wen,Hao Zhang,Lu Qi,Ming-Hsuan Yang,Laszlo A. Jeni,Min Xu,Yizhou Zhao
机构: CMU (卡内基梅隆大学); Georgia Tech (佐治亚理工学院); Harvard (哈佛大学); ETH Zurich (苏黎世联邦理工学院); UIUC (伊利诺伊大学厄巴纳-香槟分校); Insta360 (Insta360); UC Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:We introduce the challenging problem of multi-object system identification from videos, for which prior methods are ill-suited due to their focus on single-object scenes or discrete material classification with a fixed set of material prototypes. To address this, we propose MOSIV, a new framework that directly optimizes for continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video. We also present a new synthetic benchmark with contact-rich, multi-object interactions to facilitate evaluation. On this benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, establishing it as a strong baseline for this new task. Our analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in these complex, multi-object settings. The source code and dataset will be released.
[CV-90] EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在视觉特效(VFX)定制化生成中的两大核心挑战:一是由于特效数据稀缺且难以建模超自然或风格化效果,导致现有 AIGC 系统泛化能力弱;二是多数方法需针对每种特效进行微调(per-effect fine-tuning),严重限制了系统的可扩展性和对新特效的适应能力。解决方案的关键在于提出 EffectMaker,一个统一的推理-生成框架,其核心创新是构建了语义-视觉双路径引导机制:利用多模态大语言模型(multimodal large language model)解析高阶特效语义并推理其如何适配目标主体,同时通过扩散 Transformer(diffusion transformer)基于上下文学习从参考视频中捕捉细粒度视觉线索,从而实现无需微调即可准确、可控且一致地合成高质量 VFX。此外,研究还构建了包含 130k 视频、覆盖 3000 类 VFX 的 EffectData 数据集,显著提升了模型的泛化与扩展能力。
链接: https://arxiv.org/abs/2603.06014
作者: Shiyuan Yang,Ruihuang Li,Jiale Tao,Shuai Shao,Qinglin Lu,Jing Liao
机构: Tencent Hunyuan (腾讯混元); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic dataset containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: this https URL
[CV-91] Demystifying KAN for Vision Tasks: The RepKAN Approach
【速读】:该论文旨在解决遥感图像分类中主流卷积神经网络(CNN)和Transformer模型普遍存在的黑箱特性问题,即缺乏可解释性。其解决方案的关键在于提出RepKAN架构,该架构通过双路径设计——空间线性路径与光谱非线性路径——融合了CNN的结构效率与Kolmogorov-Arnold网络(KAN)的非线性表征能力,从而实现对类别特异性光谱指纹和物理交互流形的自主发现,使模型具备明确的物理可解释推理能力。
链接: https://arxiv.org/abs/2603.06002
作者: Minjong Cheon
机构: Sejong University (世宗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Remote sensing image classification is essential for Earth observation, yet standard CNNs and Transformers often function as uninterpretable black-boxes. We propose RepKAN, a novel architecture that integrates the structural efficiency of CNNs with the non-linear representational power of KANs. By utilizing a dual-path design – Spatial Linear and Spectral Non-linear – RepKAN enables the autonomous discovery of class-specific spectral fingerprints and physical interaction manifolds. Experimental results on the EuroSAT and NWPU-RESISC45 datasets demonstrate that RepKAN provides explicit physically interpretable reasoning while outperforming state-of-the-art models. These findings indicate that RepKAN holds significant potential to serve as the backbone for future interpretable visual foundation models.
[CV-92] Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在面对分布外(Out-of-Distribution, OOD)语言指令时存在的可靠性问题,特别是当语言指令与场景逻辑矛盾时,模型仍可能基于视觉先验生成看似合理但错误的动作,这种现象被称为“语言盲视”(linguistic blindness)。其核心问题是VLA模型在决策过程中对视觉信息的过度依赖,导致语言指令语义被弱化甚至忽略。解决方案的关键在于提出一种无需重新训练或修改架构的推理阶段机制——指令引导注意力重校准(Instruction-Guided Attention Recalibration, IGAR),通过动态调整注意力分布来恢复语言指令对动作生成的影响,从而有效抑制因指令不一致引发的错误执行行为,同时保持原模型在正常任务上的性能表现。
链接: https://arxiv.org/abs/2603.06001
作者: Ninghao Zhang,Bin Zhu,Shijie Zhou,Jingjing Chen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.
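IGAR通过重平衡注意力分布来恢复语言指令的影响,具体机制摘要未给出;下面以"放大文本token的注意力权重后逐行重新归一化"给出一个假设性草图(gamma为假设的超参数):

```python
import numpy as np

def recalibrate_attention(attn, text_mask, gamma=2.0):
    """Hypothetical train-free recalibration: scale the attention mass on
    language tokens by gamma, then renormalize each row to sum to 1."""
    w = np.where(text_mask, gamma, 1.0)       # (num_keys,)
    out = attn * w                            # (num_queries, num_keys)
    return out / out.sum(axis=-1, keepdims=True)
```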
[CV-93] RePer-360: Releasing Perspective Priors for 360° Depth Estimation via Self-Modulation
【速读】:该论文旨在解决深度基础模型(depth foundation models)在从透视图像(perspective imagery)迁移到360°全景图像(360° panoramic images)时泛化性能差的问题,以及全量微调(fully fine-tuning)所需大量全景数据的限制。其核心解决方案是提出RePer-360框架,关键在于通过轻量级几何对齐引导模块(geometry-aligned guidance module)利用两种互补投影(等距圆柱投影ERP与立方体贴图CP)生成调制信号,指导模型向全景域适应,同时保留预训练的透视先验知识;进一步引入自条件AdaLN-Zero机制以生成像素级缩放因子,缩小透视与全景特征分布差异,并结合立方体贴图域一致性损失提升训练稳定性和跨投影对齐效果,从而实现仅用1%训练数据即超越标准微调方法的性能表现。
链接: https://arxiv.org/abs/2603.05999
作者: Cheng Guan,Chunyu Lin,Zhijie Shen,Junsong Zhang,Jiyuan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent depth foundation models trained on perspective imagery achieve strong performance, yet generalize poorly to 360^\circ images due to the substantial geometric discrepancy between perspective and panoramic domains. Moreover, fully fine-tuning these models typically requires large amounts of panoramic data. To address this issue, we propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving powerful pretrained perspective priors. Specifically, we design a lightweight geometry-aligned guidance module to derive a modulation signal from two complementary projections (i.e., ERP and CP) and use it to guide the model toward the panoramic domain without overwriting its pretrained perspective knowledge. We further introduce a Self-Conditioned AdaLN-Zero mechanism that produces pixel-wise scaling factors to reduce the feature distribution gap between the perspective and panoramic domains. In addition, a cubemap-domain consistency loss further improves training stability and cross-projection alignment. By shifting the focus from complementary-projection fusion to panoramic domain adaptation under preserved pretrained perspective priors, RePer-360 surpasses standard fine-tuning methods while using only 1% of the training data. Under the same in-domain training setting, it further achieves an approximately 20% improvement in RMSE. Code will be released upon acceptance.
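摘要中Self-Conditioned AdaLN-Zero的关键思想之一是调制头零初始化:适配初期模块等价于恒等映射,不破坏预训练的透视先验。以下是该思想的最小示意(维度与接口均为本文假设):

```python
import numpy as np

class SelfConditionedAdaLNZero:
    """Sketch of an AdaLN-Zero-style modulator: the head that produces per-pixel
    scale/shift is zero-initialized, so at the start of adaptation the module is
    an identity map and the pretrained behavior is preserved."""
    def __init__(self, cond_dim, feat_dim):
        self.W = np.zeros((cond_dim, 2 * feat_dim))  # zero init => identity at start
        self.b = np.zeros(2 * feat_dim)
    def __call__(self, feat, cond):
        scale, shift = np.split(cond @ self.W + self.b, 2, axis=-1)
        return feat * (1.0 + scale) + shift
```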
[CV-94] MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs
【速读】:该论文旨在解决不规则采样时间序列(Irregularly Sampled Time Series, ISTS)预测中普遍存在的问题,即现有方法仅依赖历史观测进行未来预测,难以捕捉上下文语义和细粒度的时间模式。其核心解决方案是提出MM-ISTS框架,通过引入视觉-文本大语言模型(Multimodal Large Language Models, MLLMs)构建跨模态融合机制:首先设计两阶段编码结构,其中交叉模态视觉-文本编码模块自动生成具有信息量的图像和文本数据,以增强对复杂时间模式和上下文的理解;同时,不规则时间序列编码模块通过多视角嵌入融合与时间-变量编码器提取互补且丰富的时序特征;进一步地,采用自适应查询机制压缩MLLMs学习到的token表示,过滤小规模有用知识以降低计算开销,并引入模态感知门控的多模态对齐模块缓解不同模态间的差异。
链接: https://arxiv.org/abs/2603.05997
作者: Zhi Lei,Chenxi Liu,Hao Miao,Wanghui Qiu,Bin Yang,Chenjuan Guo
机构: East China Normal University (华东师范大学); Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science Innovation, Chinese Academy of Sciences (香港科学创新研究院人工智能与机器人中心,中国科学院); Department of Computing, The Hong Kong Polytechnic University (香港理工大学计算机系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Irregularly sampled time series (ISTS) are widespread in real-world scenarios, exhibiting asynchronous observations on uneven time intervals across variables. Existing ISTS forecasting methods often solely utilize historical observations to predict future ones while falling short in learning contextual semantics and fine-grained temporal patterns. To address these problems, we achieve MM-ISTS, a multimodal framework augmented by vision-text large language models, that bridges temporal, visual, and textual modalities, facilitating ISTS forecasting. MM-ISTS encompasses a novel two-stage encoding mechanism. In particular, a cross-modal vision-text encoding module is proposed to automatically generate informative visual images and textual data, enabling the capture of intricate temporal patterns and comprehensive contextual understanding, in collaboration with multimodal LLMs (MLLMs). In parallel, ISTS encoding extracts complementary yet enriched temporal features from historical ISTS observations, including multi-view embedding fusion and a temporal-variable encoder. Further, we propose an adaptive query-based feature extractor to compress the learned tokens of MLLMs, filtering out small-scale useful knowledge, which in turn reduces computational costs. In addition, a multimodal alignment module with modality-aware gating is designed to alleviate the modality gap across ISTS, images, and text. Extensive experiments on real data offer insight into the effectiveness of the proposed solutions.
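摘要提到"模态感知门控的多模态对齐模块";其最简形式可理解为对三路模态特征做softmax加权融合,示意如下(假设三路特征已对齐到同一维度,门控形式为本文假设):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_gated_fusion(ts_feat, img_feat, txt_feat, gate_logits):
    """Hypothetical modality-aware gating: softmax weights over the three
    modalities blend their (already aligned, same-dimensional) features."""
    w = softmax(gate_logits)  # (3,)
    return w[0] * ts_feat + w[1] * img_feat + w[2] * txt_feat
```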
[CV-95] Technical Report: Automated Optical Inspection of Surgical Instruments
【速读】:该论文旨在解决巴基斯坦制造的外科器械因生产缺陷(如裂纹、锈蚀和结构异常)导致手术安全风险及重大经济损失的问题。其解决方案的关键在于构建一个包含4,414张高分辨率图像的新数据集,并引入基于深度学习的自动化光学检测(Automated Optical Inspection, AOI)方法,具体采用YOLOv8、ResNet-152与EfficientNet-b4等先进模型进行缺陷识别与分类,从而实现对医疗器械质量的精准控制与提升,保障患者安全并推动行业标准升级。
链接: https://arxiv.org/abs/2603.05987
作者: Zunaira Shafqat,Atif Aftab Ahmed Jilani,Qurrat Ul Ain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 20 pages, 33 figures, 6 tables. Technical Report
Abstract:In the dynamic landscape of modern healthcare, maintaining the highest standards in surgical instruments is critical for clinical success. This report explores the diverse realm of surgical instruments and their associated manufacturing defects, emphasizing their pivotal role in ensuring the safety of surgical procedures. With potentially fatal consequences arising from even minor defects, precision in manufacturing is essential. This report addresses the identification and rectification of critical defects such as cracks, rust, and structural irregularities. Such scrutiny prevents substantial financial losses for manufacturers and, more crucially, safeguards patient lives. The collaboration with industry leaders Daddy D Pro and Dr. Frigz International, renowned trailblazers in the Sialkot surgical cluster, provides invaluable insights into the analysis of defects in Pakistani-made instruments. This partnership signifies a commitment to advancing automated defect detection methodologies, specifically through the integration of deep learning architectures including YOLOv8, ResNet-152, and EfficientNet-b4, thereby elevating quality standards in the manufacturing process. The scope of this report is to identify various surgical instruments manufactured in Pakistan and analyze their associated defects using a newly developed dataset of 4,414 high-resolution images. By focusing on quality assurance through Automated Optical Inspection (AOI) tools, this document serves as a resource for manufacturers, healthcare professionals, and regulatory bodies. The insights gained contribute to the enhancement of instrument standards, ensuring a more reliable healthcare environment through industry expertise and cutting-edge technology.
[CV-96] HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild
【速读】:该论文旨在解决将视觉-语言-动作(Vision-Language-Action, VLA)策略迁移到真实温室中桌面上草莓采摘任务的问题,该任务具有长时程、非结构化特性,并受遮挡和镜面反射等挑战。解决方案的关键在于构建一个端到端闭环系统,基于HarvestFlex平台,采用三视角RGB感知(两个固定场景视角加一个腕部安装视角),并有意避免使用深度点云和显式几何标定;通过收集3.71小时虚拟现实(VR)远程操作示范数据(227个episode),对pi_0、pi_0.5及WALL-OSS模型进行全量微调与LoRA微调,最终在统一的50次真实温室测试协议下,pi_0.5模型以全量微调方式实现74.0%的成功率、32.6秒/次采摘时间和4.1%的损伤率,同时异步推理-控制解耦进一步提升了性能表现。
链接: https://arxiv.org/abs/2603.05982
作者: Ziyang Zhao,Shuheng Wang,Zhonghua Miao,Ya Xiong
机构: Beijing Academy of Agriculture and Forestry Sciences (北京市农林科学院); Shanghai University (上海大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This work presents the first study on transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting, a long-horizon, unstructured task challenged by occlusion and specular reflections. We built an end-to-end closed-loop system on the HarvestFlex platform using three-view RGB sensing (two fixed scene views plus a wrist-mounted view) and intentionally avoided depth clouds and explicit geometric calibration. We collected 3.71 h of VR teleoperated demonstrations (227 episodes) and fine-tuned pi_0, pi_0.5, and WALL-OSS with full fine-tuning and LoRA. Under a unified 50-trial real-greenhouse protocol with metrics spanning completion, pi_0.5 with full fine-tuning achieved a success rate of 74.0% with 32.6 s/pick and a damage rate of 4.1%. Asynchronous inference-control decoupling further improved performance over synchronous deployment. Results showed non-trivial closed-loop picking with fewer than four hours of real data, while remaining limited by close-range observability loss and contact-dynamics mismatch. A demonstration video is available at: this https URL.
[CV-97] Towards High-resolution and Disentangled Reference-based Sketch Colorization
【速读】:该论文旨在解决草图着色(Sketch colorization)任务中因训练数据与测试数据之间分布偏移(distribution shift)所导致的图像质量下降、分辨率受限及控制能力不足的问题。传统方法仅试图缓解分布偏移带来的伪影,未能从根源上解决该问题。其解决方案的关键在于提出一个双分支框架,分别建模训练过程中的语义对齐数据分布(semantic-aligned branch)和推理过程中的语义错位分布(semantic-misaligned branch),并通过跨分支的Gram正则化损失(Gram Regularization Loss)强制特征空间中的跨域分布一致性与稳定性,从而显著降低分布偏移的影响。此外,引入动漫专用标签网络(anime-specific Tagger Network)提取细粒度属性并调节SDXL条件编码器,结合插件模块增强纹理迁移能力,实现了高质量、高分辨率且可控性强的着色效果。
链接: https://arxiv.org/abs/2603.05971
作者: Dingkun Yan,Xinrui Wang,Ru Wang,Zhuoru Li,Jinze Yu,Yusuke Iwasawa,Yutaka Matsuo,Jiaxian Guo
机构: The University of Tokyo (东京大学); Project HAT; Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sketch colorization is a critical task for automating and assisting in the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, and focused on mitigating the artifacts caused by the distribution shift instead of fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributions from reference images and modulate SDXL’s conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing State-of-the-Art performance across both quality and controllability metrics. Ablation study reveals the influence of each component.
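摘要中的Gram Regularization Loss作用于双分支特征图之间;一种常见的Gram矩阵约束写法如下(此处采用均方差形式,论文的具体形式未知,仅为示意):

```python
import numpy as np

def gram(feat):
    """Gram matrix of an (h, w, c) feature map: channel co-activation statistics."""
    c = feat.shape[-1]
    f = feat.reshape(-1, c)
    return f.T @ f / f.shape[0]

def gram_regularization_loss(feat_a, feat_b):
    """Assumed form: mean squared distance between the Gram matrices of the
    aligned-branch and misaligned-branch feature maps."""
    return float(np.mean((gram(feat_a) - gram(feat_b)) ** 2))
```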
[CV-98] Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)视角下多目标跟踪(Multi-Object Tracking, MOT)中存在的观测挑战问题,特别是由无人机剧烈自身运动(ego-motion)引起的复杂表观轨迹、尺度突变、视角变化及运动模糊等难题。现有基准数据集通常缺乏这些动态特性,难以真实反映实际应用场景。解决方案的关键在于构建一个名为DynUAV的新基准,其包含42段视频序列和超过170万条边界框标注,涵盖车辆、行人及工业设备(如挖掘机、推土机和起重机)等多种目标类别,并突出强调强烈的无人机自身运动带来的复杂性。通过在该基准上对主流跟踪算法的全面评估,揭示了当前方法在检测与关联任务中应对动态条件时的局限性,从而确立了DynUAV作为高挑战性测试平台的价值,推动真实场景下UAV-MOT技术的发展。
链接: https://arxiv.org/abs/2603.05970
作者: Jingtao Ye,Kexin Zhang,Xunchi Ma,Yuehan Li,Guangming Zhu,Peiyi Shen,Linhua Jiang,Xiangdong Zhang,Liang Zhang
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at link.
[CV-99] PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition
【速读】:该论文旨在解决LiDAR位姿识别(place recognition)中手工设计的特征描述子在跨会话场景下性能受限的问题,尤其关注如何实现传感器无关的泛化能力与高精度匹配。其解决方案的关键在于提出一种无需训练的BEV(鸟瞰图)占用编码方法PROBE,它将每个BEV单元的占据状态建模为伯努利随机变量(Bernoulli random variable),并通过极坐标雅可比矩阵(polar Jacobian)对连续笛卡尔平移进行解析边缘化,从而获得距离自适应的角向不确定性 σ_θ = σ_t / r,其中σ_t为传感器无关的物理量——预期平移不确定性(expected translational uncertainty in meters)。此设计使得模型可在不依赖特定数据集微调的情况下实现跨传感器泛化,同时结合伯努利KL-Jaccard相似度、指数不确定性门控和基于FFT的高度余弦相似度,有效提升旋转对齐与匹配精度。
链接: https://arxiv.org/abs/2603.05965
作者: Jinseop Lee,Byoungho Lee,Gichul Yoo
机构: AX Tech; SK Intellix
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures
Abstract:We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell’s occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty \sigma_\theta = \sigma_t / r in \mathcal{O}(R \times S) time. The primary parameter \sigma_t represents the expected translational uncertainty in meters, a sensor-independent physical quantity allowing cross-sensor generalization without per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance against both handcrafted and supervised baselines. The source code and supplementary materials are available at this https URL.
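摘要给出的核心公式 \sigma_\theta = \sigma_t / r 与逐单元伯努利建模可以直接写成代码,验证"近处角向模糊大、远处小"的性质(逐单元Bernoulli-KL为标准定义,Jaccard聚合方式从略):

```python
import numpy as np

def angular_sigma(ranges, sigma_t):
    """sigma_theta = sigma_t / r: a fixed translational uncertainty (meters)
    maps to larger angular blur for nearby cells, smaller for distant ones."""
    return sigma_t / np.maximum(ranges, 1e-6)

def bernoulli_kl(p, q, eps=1e-6):
    """KL divergence between per-cell Bernoulli occupancy probabilities."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
```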
[CV-100] CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection
【速读】:该论文旨在解决开放词汇目标检测(Open-vocabulary object detection, OVOD)模型在资源受限设备上部署时因模型规模庞大而面临的挑战,尤其针对极端低比特(如4-bit)量化导致的细粒度视觉-语言对齐能力退化及区域间关系结构失真问题。解决方案的关键在于提出一种集成式框架——课程关系量化感知训练(Curriculum Relational Quantization-Aware Training, CR-QAT),其核心包括两个创新:一是课程量化感知训练(Curriculum QAT, CQAT),通过分阶段优化和模型分区实现渐进式量化,以隔离误差并稳定训练;二是以文本为中心的关系知识蒸馏(Text-centric Relational Knowledge Distillation, TRKD),在任务相关模块中构建文本锚定的成对相似性矩阵,从而全面迁移教师模型的多维关系知识。该方法在LVIS和COCO零样本基准测试中显著优于现有量化感知训练基线,在极低比特设置下分别实现了最高达38.9%和40.9%的平均精度(AP)相对提升。
链接: https://arxiv.org/abs/2603.05964
作者: Jinyeong Park,Donghwa Kim,Brent ByungHoon Kang,Hyeongboo Baek,Jibum Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary object detection (OVOD) enables novel category detection via vision-language alignment, but massive model sizes hinder deployment on resource-constrained devices. While quantization offers practical compression, we reveal that naive extreme low-bit (e.g., 4-bit) quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures. To address this, we propose curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation. Within CR-QAT, curriculum QAT (CQAT) mitigates error accumulation by partitioning the model for progressive quantization, ensuring stable optimization via error isolation. Concurrently, text-centric relational KD (TRKD) is applied to task-relevant modules. By constructing text-anchored pairwise similarity matrices, TRKD comprehensively transfers the teacher’s multi-dimensional relational knowledge. Experiments on LVIS and COCO zero-shot benchmarks demonstrate that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9%, respectively.
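TRKD通过"文本锚定的成对相似度矩阵"迁移教师模型的关系知识。下面以余弦相似度矩阵加均方差蒸馏损失给出一个示意(蒸馏损失的具体形式为本文假设):

```python
import numpy as np

def text_anchored_similarity(region_feats, text_embs):
    """Cosine similarity matrix between region features and text anchors."""
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return r @ t.T  # (num_regions, num_classes)

def relational_kd_loss(teacher_sim, student_sim):
    """Assumed KD objective: MSE between teacher and student similarity matrices."""
    return float(np.mean((teacher_sim - student_sim) ** 2))
```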
[CV-101] Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models
【速读】:该论文旨在解决3D人体骨骼数据在应用大规模预训练视觉模型时面临的挑战,包括骨骼数据与图像数据在格式上的根本差异、大规模骨骼数据集的稀缺性,以及如何在不引入额外模型分支的前提下将骨骼数据融入多模态动作识别任务。其解决方案的关键在于提出Skeleton-to-Image Encoding (S2I),一种将骨骼序列通过基于身体部位语义划分和重排并缩放到标准图像尺寸的编码方法,从而生成类图像(image-like)表示,使现有的强大视觉预训练模型可用于自监督骨骼表征学习,有效迁移视觉领域的知识至骨骼分析,并统一处理来自不同来源的异构骨骼数据结构。
链接: https://arxiv.org/abs/2603.05963
作者: Siyuan Yang,Jun Liu,Hao Cheng,Chong Wang,Shijian Lu,Hedvig Kjellstrom,Weisi Lin,Alex C. Kot
机构: KTH Royal Institute of Technology (皇家理工学院); Hebei University of Technology (河北工业大学); Nanyang Technological University (南洋理工大学); Lancaster University (兰卡斯特大学); Shenzhen MSU-BIT University (深圳北理莫斯科大学); VinUniversity (VinUniversity)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE TPAMI, under review
Abstract:Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.
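S2I将骨骼序列按身体部位划分、重排并缩放为标准图像尺寸。以下是该编码的最小示意,其中行=按部位排序的关节、列=帧、通道=xyz坐标;这一具体排布方式为本文假设,论文细节未知:

```python
import numpy as np

def skeleton_to_image(seq, part_groups, out_hw=(224, 224)):
    """Sketch of S2I-style encoding: reorder joints by body-part groups, lay
    them out as (joints x frames x 3) and nearest-neighbor resize to an image."""
    order = [j for part in part_groups for j in part]   # body-part-ordered joints
    img = seq[:, order, :].transpose(1, 0, 2)           # (joints, frames, 3)
    h, w = out_hw
    rows = np.arange(h) * img.shape[0] // h
    cols = np.arange(w) * img.shape[1] // w
    return img[rows][:, cols]                           # (h, w, 3)
```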
[CV-102] Exploring Open-Vocabulary Object Recognition in Images using CLIP
【速读】:该论文旨在解决现有开放词汇目标识别(Open-Vocabulary Object Recognition, OVOR)方法中存在的系统复杂度高、训练成本大以及泛化能力有限的问题。其解决方案的关键在于提出一种基于简化两阶段策略的OVOR框架:首先进行目标分割,随后执行识别。该框架摒弃了复杂的再训练和繁琐的人工标注过程;通过CLIP模型生成图像级嵌入与类别级文本嵌入,实现任意词汇的识别能力;为进一步降低对CLIP的依赖并提升编码灵活性,引入基于卷积神经网络(CNN)/多层感知机(MLP)的图像编码方法,将视觉特征与文本嵌入对齐后,利用奇异值分解(SVD)构建共享表示空间,并最终通过嵌入相似性匹配完成识别。实验表明,无需训练的CLIP编码结合SVD方法在COCO、Pascal VOC和ADE20K数据集上均达到最优平均精度(AP),同时验证了CNN/MLP编码方法在OVOR中的潜力。
链接: https://arxiv.org/abs/2603.05962
作者: Wei Yu Chen,Ying Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. Simultaneously, the results highlight the potential of CNN/MLP-based image encoding for OVOR.
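该框架最后一步"通过嵌入相似性匹配完成识别"(可选地经SVD投影到共享空间)可示意如下(SVD投影的细节为本文假设;不传k参数时即为纯余弦相似度匹配):

```python
import numpy as np

def recognize(object_embs, text_embs, k=None):
    """Match each object embedding to its most similar category text embedding.
    Optionally project both into a rank-k shared space via SVD of their
    concatenation (a sketch of the SVD step, details assumed)."""
    if k is not None:
        stacked = np.vstack([object_embs, text_embs])
        _, _, vt = np.linalg.svd(stacked - stacked.mean(0), full_matrices=False)
        proj = vt[:k].T
        object_embs, text_embs = object_embs @ proj, text_embs @ proj
    o = object_embs / np.linalg.norm(object_embs, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return np.argmax(o @ t.T, axis=-1)
```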
[CV-103] OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
【速读】:该论文旨在解决从流式视频中重建三维几何结构时面临的资源受限问题,尤其是现有基于全连接注意力(all-to-all attention)的几何基础模型因计算复杂度为二次方而无法处理长序列,以及采用因果注意力(causal attention)的方案如StreamVGGT虽支持单次遍历但会随帧数增长导致键值(KV)缓存占用显存激增,从而限制了长时间视频的部署。其解决方案的关键在于提出OVGGT框架,通过两项核心技术实现固定内存与计算预算:一是自选择缓存(Self-Selective Caching),利用前馈网络(FFN)残差幅值对KV缓存进行压缩,同时保持与FlashAttention的兼容性;二是动态锚点保护(Dynamic Anchor Protection),防止关键坐标信息相关的token被逐出缓存,有效抑制长时间轨迹中的几何漂移。实验证明,OVGGT可在恒定显存开销下处理任意长度视频,并达到当前最优的3D几何精度。
链接: https://arxiv.org/abs/2603.05959
作者: Si-Yu Lu,Po-Ting Chen,Hui-Che Hsu,Sin-Ye Jhong,Wen-Huang Cheng,Yung-Yao Chen
机构: National Taiwan University (国立台湾大学); National Taiwan University of Science and Technology (国立台湾科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
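其中"固定预算下压缩 KV 缓存、且锚点 token 永不逐出"的逐出逻辑,可用如下示意(假设性示例:得分用随机数代替论文中的 FFN 残差幅值,预算与锚点下标均为虚构):

```python
import numpy as np

def evict(scores, budget, protected):
    """在固定预算下选择保留的 token 下标:
    锚点(protected)永不逐出,其余按得分从高到低补足预算。"""
    protected = set(protected)
    keep = sorted(protected)
    # 非锚点 token 按得分降序排列,截取剩余预算
    rest = [i for i in np.argsort(scores)[::-1] if i not in protected]
    keep += rest[: max(0, budget - len(keep))]
    return sorted(keep)

scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.7])   # 代替 FFN 残差幅值
kept = evict(scores, budget=4, protected=[1])        # token 1 视为坐标关键锚点
print(kept)  # [0, 1, 3, 5]
```

无论流式序列多长,缓存中保留的 token 数都不超过 budget,这就是显存与计算开销保持恒定的直观来源。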
[CV-104] Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation CVPR
【速读】:该论文旨在解决少样本分割(Few-shot Segmentation, FSS)中因大外观或视角变化导致的结构错位(structural misalignment)和跨视图不一致性(cross-view inconsistency)问题。解决方案的关键在于提出一个统一框架VINE(View-Informed NEtwork),其核心创新包括:1)在骨干网络特征上构建空间-视角图(spatial-view graph),其中空间图捕捉局部几何拓扑,视角图连接不同视角特征以传播视角不变的结构语义;2)利用支持集与查询集特征差异推导判别先验(discriminative prior),通过强调显著区域重加权SAM(Segment Anything Model)特征,并校准骨干网络激活以增强结构聚焦;3)通过掩码交叉注意力机制逐步融合增强的SAM特征与结构丰富ResNet特征,生成类一致原型作为自适应提示输入SAM解码器,从而提升分割精度与鲁棒性。
链接: https://arxiv.org/abs/2603.05952
作者: Hongli Liu,Yu Wang,Shengjie Zhao
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR Findings 2026
Abstract:Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at this https URL.
[CV-105] Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)中因固定token预算导致的效率与性能失衡问题,即现有方法通常对所有输入图像采用相同的token数量限制,忽略了图像信息密度的差异性。其解决方案的关键在于提出E-AdaPrune框架,该框架基于视觉特征空间的奇异值谱(singular value spectrum)动态分配token预算:通过保留一定比例的谱能量(spectral energy),在信息密集场景中保留更多token,在冗余区域则进行更激进的压缩,从而实现自适应的视觉token剪枝。此方法无需引入额外可学习参数,且在多个VLM骨干网络和基准测试中均表现出显著性能提升,例如在MMVet推理任务上相对提升达5.1%。
链接: https://arxiv.org/abs/2603.05950
作者: Jialuo He,Huangxun Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones, LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8ms per image.
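"按谱能量比例确定 token 预算"的核心计算,可以用奇异值平方的累积占比来示意(假设性示例:特征矩阵与能量阈值均为虚构,仅说明信息密集/冗余输入得到不同预算):

```python
import numpy as np

def energy_budget(features, energy_ratio=0.9):
    """返回保留给定比例谱能量所需的最小 token 数。
    features: (token 数, 维度);谱能量定义为奇异值的平方。"""
    s = np.linalg.svd(features, compute_uv=False)   # 降序奇异值
    energy = np.cumsum(s**2) / np.sum(s**2)         # 累积能量占比
    return int(np.searchsorted(energy, energy_ratio) + 1)

rng = np.random.default_rng(0)
dense = rng.normal(size=(64, 32))                    # 信息密集:满秩随机特征
low_rank = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 32))  # 冗余:低秩特征
print(energy_budget(dense, 0.9), energy_budget(low_rank, 0.9))
# 低秩(冗余)特征所需 token 远少于满秩特征
```

同样的能量阈值下,低秩(冗余)输入只需极少 token,而高秩(信息密集)输入自动获得更大预算,从而实现逐图自适应的剪枝。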
[CV-106] LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution
【速读】:该论文针对生成式真实世界图像超分辨率(Real-ISR)中因随机采样导致的语义与结构幻觉问题展开研究,即输出图像虽视觉锐利但缺乏对低分辨率(LR)输入的语义和结构忠实性,且在无高分辨率(HR)真值的情况下难以评估这种忠实性。解决方案的关键在于提出LucidNFT框架,其核心创新包括:(1)引入LucidConsistency——一种对退化鲁棒的语义一致性评估器,使LR锚定的忠实性可测量并可优化;(2)设计解耦的优势归一化策略,在每组LR条件下的候选样本中保留各目标间的对比差异,避免优势塌陷;(3)构建大规模真实退化图像数据集LucidLR,提升rollout多样性与偏好信号质量,从而实现稳定且高效的扩散模型前向微调。
链接: https://arxiv.org/abs/2603.05947
作者: Song Fei,Tian Ye,Sixiang Chen,Zhaohu Xing,Jianyu Lai,Lei Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
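其中"解耦优势归一化"与朴素"先标量化再归一化"的差别,可用如下数值示意(假设性示例:奖励矩阵与权重均为虚构,仅说明量纲差异导致的优势塌陷):

```python
import numpy as np

def decoupled_advantages(rewards, weights):
    """rewards: (G, K),一组 G 个候选在 K 个目标上的奖励。
    先逐目标在组内标准化(保留各目标的对比),再加权融合。"""
    z = (rewards - rewards.mean(0)) / (rewards.std(0) + 1e-8)
    return z @ weights

def naive_advantages(rewards, weights):
    """对照:先标量化再归一化,量纲小的目标对比被压缩。"""
    s = rewards @ weights
    return (s - s.mean()) / (s.std() + 1e-8)

rewards = np.array([[10.0, 0.01],
                    [20.0, 0.03],
                    [30.0, 0.02]])   # 目标 2 的量纲远小于目标 1
w = np.array([0.5, 0.5])
print(decoupled_advantages(rewards, w))
print(naive_advantages(rewards, w))
```

朴素做法里目标 2 的排序信息(0.01 < 0.02 < 0.03)几乎完全被目标 1 淹没;解耦归一化则让两个目标的组内对比等权地进入最终优势,避免塌陷。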
[CV-107] Adaptive Radial Projection on Fourier Magnitude Spectrum for Document Image Skew Estimation ICIP2022
【速读】:该论文旨在解决文档图像处理系统中倾斜校正(skew correction)的准确性问题,尤其是在扫描文档图像中,倾斜角度估计的精度直接影响后续处理步骤的效果。其解决方案的关键在于提出一种新颖的倾斜估计方法:通过在二维离散傅里叶变换(2D Discrete Fourier Transform, DFT)幅度谱上应用自适应径向投影(Adaptive Radial Projection),提取图像中的主导倾斜角。该方法显著提升了鲁棒性和可靠性,并在自建的高质量倾斜估计数据集 DISE-2021 上验证了其优于现有各类基于傅里叶的方法。
链接: https://arxiv.org/abs/2603.05942
作者: Luan Pham,Phu Hao Hoang,Xuan Toan Mai,Tuan Anh Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted to ICIP 2022
Abstract:Skew estimation is one of the vital tasks in document processing systems, especially for scanned document images, because its performance impacts subsequent steps directly. Over the years, an enormous number of studies have focused on this challenging problem with the rise of the digitization age. In this research, we first propose a novel skew estimation method that extracts the dominant skew angle of the given document image by applying an Adaptive Radial Projection on the 2D Discrete Fourier Magnitude spectrum. Second, we introduce a high-quality skew estimation dataset DISE-2021 to assess the performance of different estimators. Finally, we provide comprehensive analyses that focus on multiple improvement aspects of Fourier-based methods. Our results show that the proposed method is robust, reliable, and outperforms all compared methods. The source code is available at this https URL.
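"在傅里叶幅度谱上做径向投影估计倾斜角"的基本思路可用如下简化示意(假设性示例:省略了论文中的自适应策略;用合成的周期条纹模拟等间距文本行,角度网格与图像尺寸均为虚构):

```python
import numpy as np

def estimate_skew(img, angles=np.arange(-45.0, 45.0, 0.5)):
    """等间距文本行在 2D DFT 幅度谱上形成过中心的能量直线,
    沿候选角度做径向投影,取能量最大的方向作为倾斜角(简化版)。"""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = mag.shape
    cy, cx = h // 2, w // 2
    r = np.arange(1, min(h, w) // 2)          # 排除中心直流分量
    best, best_angle = -1.0, 0.0
    for a in angles:
        t = np.deg2rad(a)
        e = mag[(cy + r * np.sin(t)).astype(int),
                (cx + r * np.cos(t)).astype(int)].sum()
        if e > best:
            best, best_angle = e, a
    return best_angle

# 合成一幅周期条纹图,条纹法向约 10.6 度,模拟倾斜的文本行
yy, xx = np.mgrid[0:128, 0:128]
img = np.sin(2 * np.pi * (16 * xx + 3 * yy) / 128)
print(estimate_skew(img))  # 接近真实倾角 atan2(3, 16) ≈ 10.6 度
```

实际文档还需处理谱泄漏与噪声,这正是论文中"自适应"径向投影要改进的部分。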
[CV-108] SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration
【速读】:该论文旨在解决统一型图像恢复(Image Restoration)框架在应对多种退化类型时面临的特征干扰和专家专业化不足的问题。其核心解决方案是提出了一种球面层间专家路由机制(Spherical Layer-wise Expert Routing, SLER-IR),通过动态激活不同网络层中的专用专家来提升模型对复杂退化的适应能力;关键创新在于引入基于对比学习的球面均匀退化嵌入(Spherical Uniform Degradation Embedding),将退化表示映射到超球面上以消除线性嵌入空间中的几何偏差,并结合全局-局部粒度融合(Global-Local Granularity Fusion, GLGF)模块,有效处理空间非均匀退化及训练与测试阶段的粒度差异问题。
链接: https://arxiv.org/abs/2603.05940
作者: Peng Shurui,Xin Lin,Shi Luo,Jincen Ou,Dizhe Zhang,Lu Qi,Truong Nguyen,Chao Ren
机构: Sichuan University (四川大学); University of California San Diego (加州大学圣地亚哥分校); Insta360 Research (Insta360 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image restoration under diverse degradations remains challenging for unified all-in-one frameworks due to feature interference and insufficient expert specialization. We propose SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers. To ensure reliable routing, we introduce a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to eliminate geometry bias in linear embedding spaces. In addition, a Global-Local Granularity Fusion (GLGF) module integrates global semantics and local degradation cues to address spatially non-uniform degradations and the train-test granularity gap. Experiments on three-task and five-task benchmarks demonstrate that SLER-IR achieves consistent improvements over state-of-the-art methods in both PSNR and SSIM. Code and models will be publicly released.
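"把退化嵌入映射到超球面后再做专家路由"的做法可用如下示意(假设性示例:专家原型用随机向量代替对比学习得到的退化原型,维度与专家数均为虚构):

```python
import numpy as np

def route(x, prototypes):
    """退化嵌入 L2 归一化投影到超球面,再按余弦相似度路由到专家。"""
    x = x / np.linalg.norm(x)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(p @ x))

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 16))                 # 3 个专家的退化原型(虚构)
x = 5.0 * prototypes[1] + 0.1 * rng.normal(size=16)   # 幅值变化不影响球面路由
print(route(x, prototypes))  # 1
```

归一化到球面后路由只取决于方向而非幅值,这有助于消除线性嵌入空间中的几何偏差。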
[CV-109] Facial Expression Recognition Using Residual Masking Network
【速读】:该论文旨在解决自动面部表情识别(Automatic Facial Expression Recognition, FER)任务中模型性能提升的问题,特别是在复杂背景下如何增强卷积神经网络(CNN)对关键面部特征的关注能力。其解决方案的关键在于提出了一种新颖的“掩码”(Masking)机制,该机制利用分割网络对特征图进行精细化处理,从而引导网络聚焦于与表情相关的区域,抑制无关信息干扰,显著提升分类准确性。实验表明,将该掩码机制嵌入深度残差网络(Deep Residual Network)与类似Unet的架构中所构建的Residual Masking Network,在FER2013和私有VEMO数据集上均达到当前最优(State-of-the-art, SOTA)性能。
链接: https://arxiv.org/abs/2603.05937
作者: Luan Pham,The Huynh Vu,Tuan Anh Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architecture with the attention mechanism. We propose a novel Masking idea to boost the performance of CNN in facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. The source code is available at this https URL.
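用分割式掩码精炼特征图的"Masking"思想,可用一种常见的残差式调制写法来示意(假设性示例:具体公式为笔者假设,并非论文原式;特征与掩码均为虚构):

```python
import numpy as np

def residual_mask(feat, mask_logits):
    """用掩码分支的输出以残差方式调制特征图:
    out = feat * (1 + sigmoid(mask)),突出表情相关区域。"""
    m = 1.0 / (1.0 + np.exp(-mask_logits))   # 掩码归一化到 (0, 1)
    return feat * (1.0 + m)

feat = np.ones((1, 4, 4))
off = np.full((1, 4, 4), -10.0)   # 掩码接近 0 → 近似恒等,不破坏原特征
on = np.full((1, 4, 4), 10.0)     # 掩码接近 1 → 响应近似加倍
print(np.allclose(residual_mask(feat, off), feat, atol=1e-3))      # True
print(np.allclose(residual_mask(feat, on), 2 * feat, atol=1e-3))   # True
```

残差形式保证掩码失效时网络退化为原始骨干,训练更稳定。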
[CV-110] OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving ICCV2025
【速读】:该论文旨在解决自动驾驶系统在处理罕见场景或复杂道路结构时的安全性不足问题,尤其针对当前道路基础设施设计以人类驾驶员为中心、安全改进多为事故后被动响应的局限性。其解决方案的关键在于提出OD-RASE框架,通过构建基于交通系统领域知识的本体(ontology)来形式化道路结构与事故关联关系,并结合大规模视觉语言模型(LVLM)生成改进建议,再利用本体驱动的数据过滤机制提升建议可靠性,从而实现对潜在风险道路结构的主动识别与改善方案生成,最终形成包含预事故图像标注的新数据集及基于扩散模型的可视化改进方案生成能力。
链接: https://arxiv.org/abs/2603.05936
作者: Kota Shimomura,Masaki Nambata,Atsuya Ishikawa,Ryota Mimura,Takayuki Kawabuchi,Takayoshi Yamashita,Koki Inoue
机构: Chubu University(中部大学); Elith Inc.(Elith公司); Honda RD Co., Ltd.(本田研发有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted ICCV2025
Abstract:Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Because such road infrastructures are designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce the Baseline approach (OD-RASE model), which leverages LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.
[CV-111] FTSplat: Feed-forward Triangle Splatting Network
【速读】:该论文旨在解决当前基于神经辐射场(Neural Radiance Fields, NeRF)和三维高斯溅射(3D Gaussian Splatting, 3DGS)的三维重建方法在实时部署中受限于耗时的场景级优化,以及新兴前馈高斯溅射方法缺乏显式流形几何结构(manifold geometry)以直接用于仿真问题。解决方案的关键在于提出一种前馈式三角面片生成框架,通过单次前向传播直接从校准的多视角图像中预测连续三角面表面,从而无需场景级优化或后处理即可生成可直接用于标准图形和机器人仿真器的模型;其核心创新包括像素对齐的三角面片生成模块与相对3D点云监督机制,有效提升了几何学习的稳定性和一致性。
链接: https://arxiv.org/abs/2603.05932
作者: Xiong Jinlin,Li Can,Shen Jiawei,Qi Zhigang,Sun Lei,Zhao Dongyang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:High-fidelity three-dimensional (3D) reconstruction is essential for robotics and simulation. While Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive rendering quality, their reliance on time-consuming per-scene optimization limits real-time deployment. Emerging feed-forward Gaussian splatting methods improve efficiency but often lack explicit, manifold geometry required for direct simulation. To address these limitations, we propose a feed-forward framework for triangle primitive generation that directly predicts continuous triangle surfaces from calibrated multi-view images. Our method produces simulation-ready models in a single forward pass, obviating the need for per-scene optimization or post-processing. We introduce a pixel-aligned triangle generation module and incorporate relative 3D point cloud supervision to enhance geometric learning stability and consistency. Experiments demonstrate that our method achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators.
[CV-112] Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
【速读】:该论文旨在解决基于视觉Transformer(Vision Transformer, ViT)的2D人体姿态估计方法在视频序列中因忽略时间一致性而导致预测不稳定的问题,尤其在运动模糊、遮挡或失焦等挑战性场景下表现不佳。其核心解决方案是提出一种名为TAR-ViTPose的时序聚合与恢复视觉Transformer架构,关键创新在于引入了两种机制:一是关节中心的时间聚合(Joint-centric Temporal Aggregation, JTA),通过为每个关节约束可学习查询令牌(query token),实现对邻近帧中对应区域的精准时序注意力聚焦;二是全局恢复注意力(Global Restoring Attention, GRA),将聚合后的时序特征重新映射回当前帧的token序列,从而增强姿态表示的同时保留全局上下文信息,实现更鲁棒且准确的视频级人体姿态估计。
链接: https://arxiv.org/abs/2603.05929
作者: Hongwei Fang,Jiahang Cai,Xun Wang,Wenwu Yang
机构: Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: this https URL.
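"每个关节一个查询 token、对相邻帧特征做注意力聚合"(JTA)的计算骨架可用如下示意(假设性示例:查询与相邻帧特征均为随机张量,关节数、帧数、维度均为虚构):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_temporal_aggregate(query, neighbor_feats):
    """query: (J, D) 每关节一个可学习查询;
    neighbor_feats: (T, N, D) 相邻 T 帧、每帧 N 个 token。
    返回每个关节的时序聚合特征 (J, D)。"""
    J, D = query.shape
    kv = neighbor_feats.reshape(-1, D)               # 展平成 (T*N, D)
    attn = softmax(query @ kv.T / np.sqrt(D))        # (J, T*N) 注意力权重
    return attn @ kv

rng = np.random.default_rng(0)
query = rng.normal(size=(17, 32))          # 17 个关节查询(COCO 关节数)
neighbors = rng.normal(size=(2, 64, 32))   # 前后各一帧、每帧 64 个 token
out = joint_temporal_aggregate(query, neighbors)
print(out.shape)  # (17, 32)
```

聚合结果随后由 GRA 写回当前帧的 token 序列,这一步在示意中省略。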
[CV-113] Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes
【速读】:该论文旨在解决智能车辆系统中实现零碰撞移动(zero-collision mobility)的关键挑战,核心在于提升对驾驶员风险感知(risk perception)的理解。风险感知是一个复杂的认知过程,涉及驾驶员对外部刺激的自愿响应以及周围道路使用者对本车的关注程度。为支持该方向的研究,作者构建了RAID(Risk Assessment In Driving scenes)数据集,包含4,691个标注视频片段,涵盖多样交通场景,并提供驾驶员意图、道路拓扑、风险情境(如横穿行人)、驾驶员反应及行人注意力等多维标签。解决方案的关键在于提出一种弱监督的风险目标识别框架,通过建模驾驶员意图与响应之间的关系来识别潜在风险源,并进一步分析行人注意力在风险评估中的作用。实验表明,该方法在RAID和HDDS数据集上分别相较现有最优方法提升20.6%和23.1%的性能。
链接: https://arxiv.org/abs/2603.05926
作者: Nakul Agarwal,Yi-Ting Chen,Behzad Dariush
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IV 2026
Abstract:Achieving zero-collision mobility remains a key objective for intelligent vehicle systems, which requires understanding driver risk perception-a complex cognitive process shaped by voluntary response of the driver to external stimuli and the attentiveness of surrounding road users towards the ego-vehicle. To support progress in this area, we introduce RAID (Risk Assessment In Driving scenes)-a large-scale dataset specifically curated for research on driver risk perception and contextual risk assessment. RAID comprises 4,691 annotated video clips, covering diverse traffic scenarios with labels for driver’s intended maneuver, road topology, risk situations (e.g., crossing pedestrians), driver responses, and pedestrian attentiveness. Leveraging RAID, we propose a weakly supervised risk object identification framework that models the relationship between driver’s intended maneuver and responses to identify potential risk sources. Additionally, we analyze the role of pedestrian attention in estimating risk and demonstrate the value of the proposed dataset. Experimental evaluations demonstrate that our method achieves 20.6% and 23.1% performance gains over prior state-of-the-art approaches on the RAID and HDDS datasets, respectively.
[CV-114] RAC: Rectified Flow Auto Coder
【速读】:该论文旨在解决传统变分自编码器(Variational Autoencoder, VAE)在生成质量与重建性能之间存在差距的问题,以及其单向推理机制导致的计算冗余和灵活性不足。解决方案的关键在于提出一种受修正流(Rectified Flow)启发的直通流自动编码器(Rectified Flow Auto Coder, RAC),其核心创新包括:通过在时间步上应用解码器实现多步解码,使重构路径更直接且可修正,从而支持逐步优化;利用时间反转使解码器兼具编码功能,实现双向推理,显著减少约41%的参数量;同时,该生成式解码机制允许沿路径修正潜在变量,缓解了重建与生成之间的鸿沟,实验表明RAC在重建和生成质量上均优于当前最优VAE方法,且计算成本降低约70%。
链接: https://arxiv.org/abs/2603.05925
作者: Sen Fang,Yalin Feng,Yanxin Zhang,Dimitris N. Metaxas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 Figures, 4 Tables. Project Page at this https URL
Abstract:In this paper, we propose a Rectified Flow Auto Coder (RAC) inspired by Rectified Flow to replace the traditional VAE: 1. It achieves multi-step decoding by applying the decoder to flow timesteps. Its decoding path is straight and correctable, enabling step-by-step refinement. 2. The model inherently supports bidirectional inference, where the decoder serves as the encoder through time reversal (hence Coder rather than encoder or decoder), reducing parameter count by nearly 41%. 3. This generative decoding method improves generation quality since the model can correct latent variables along the path, partially addressing the reconstruction–generation gap. Experiments show that RAC surpasses SOTA VAEs in both reconstruction and generation with approximately 70% lower computational cost.
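"同一解码器经时间反转兼作编码器"的双向推理思想,可用一个玩具速度场上的欧拉积分示意(假设性示例:线性旋转场代替学到的流模型,步数与数值均为虚构):

```python
import numpy as np

def integrate(x, v, t0, t1, steps=1000):
    """沿速度场 v(x, t) 做欧拉积分;t0→t1 为解码,
    t1→t0 即同一模型时间反转后作编码。"""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * v(x, t)
        t += dt
    return x

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # 玩具速度场(旋转),代替学到的流
v = lambda x, t: A @ x

z = np.array([1.0, 0.5])
x = integrate(z, v, 0.0, 1.0)        # 解码:潜变量 → 样本
z_rec = integrate(x, v, 1.0, 0.0)    # 编码:同一速度场反向积分
print(np.allclose(z, z_rec, atol=1e-2))  # True
```

由于编码与解码共享同一速度场,模型无需单独的编码器参数,这对应论文中约 41% 的参数削减;多步积分路径也允许逐步修正潜变量。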
[CV-115] BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation CVPR2026
【速读】:该论文旨在解决在黑盒环境下检测文本到图像生成模型(text-to-image models)中后门攻击(backdoor attacks)的难题。现有方法通常依赖于图像级别的相似性分析,假设后门触发生成的图像具有高度一致性,但难以应对近年来出现的新型后门攻击——这些攻击生成的图像在视觉上呈现多样性。解决方案的关键在于提出一个名为BlackMirror的通用、无需训练的检测框架,其核心创新是观察到:尽管整体图像内容多样,但后门攻击仅稳定地操纵生成图像中的部分语义模式。为此,BlackMirror包含两个组件:MirrorMatch用于对齐视觉模式与文本指令以检测语义偏差;MirrorVerify则通过评估这些偏差在不同提示下的稳定性,区分真正的后门行为与良性响应。该框架可作为插件模块部署于Model-as-a-Service(MaaS)应用中,实验证明其在多种攻击场景下均能实现高精度检测。
链接: https://arxiv.org/abs/2603.05921
作者: Feiran Li,Qianqian Xu,Shilong Bao,Zhiyong Yang,Xilin Zhao,Xiaochun Cao,Qingming Huang
机构: Institute of Information Engineering, CAS; School of Cyber Security, University of Chinese Academy of Sciences; State Key Laboratory of AI Safety, Institute of Computing Technology, CAS; School of Computer Science and Technology, University of Chinese Academy of Sciences; BDKM, University of Chinese Academy of Sciences; School of Computer Science and Technology, Beijing Institute of Technology; School of Cyber Science and Tech., Shenzhen Campus, Sun Yat-sen University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper is accepted by CVPR 2026
Abstract:This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at this https URL.
[CV-116] CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning
【速读】:该论文旨在解决医学图像分割中从传统视觉模式匹配向认知推理分析转变过程中存在的关键瓶颈问题:现有通用多模态大语言模型(Multimodal Large Language Models, MLLMs)虽具备广泛常识,但缺乏对复杂病灶的专用视觉推理能力;而传统分割模型虽擅长像素级分割,却缺乏逻辑可解释性。解决方案的关键在于提出一种端到端框架CORE-Seg,通过语义引导的提示适配器(Semantic-Guided Prompt Adapter)将推理过程与分割任务深度融合,并构建基于渐进式训练(从监督微调SFT到GRPO)和自适应双粒度奖励机制的优化策略,有效缓解奖励稀疏问题,从而在首个针对复杂病灶推理驱动分割的Chain-of-Thought(CoT)基准ComLesion-14K上实现最优性能(平均Dice达37.06%,较次优基线提升14.89%),同时将失败率降至18.42%。
链接: https://arxiv.org/abs/2603.05911
作者: Yuxin Xie,Yuming Chen,Yishan Yang,Yi Zhou,Tao Zhou,Zhen Zhao,Jiacheng Liu,Huazhu Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review with Computational Visual Media
Abstract:Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our Method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: this https URL
[CV-117] Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image CVPR2026
【速读】:该论文旨在解决当前组合式图像到3D场景生成方法中存在的两大问题:一是依赖耗时的迭代布局优化或僵化的联合物体-布局生成方式,二是多数方法仅使用有限视场角的透视图像,难以构建完整的360度全景环境。解决方案的关键在于提出Pano3DComposer框架,其核心创新是设计了一个即插即用的物体-世界坐标变换预测器(Object-World Transformation Predictor),该模块可将现成图像到3D模型生成的局部坐标系下的3D物体转换为全局世界坐标系下的位置;通过引入Alignment-VGGT架构,利用目标物体裁剪、多视角渲染和相机参数来预测变换关系,并采用伪几何监督缓解生成物体与真实物体之间的形状差异;此外,针对未见域输入,进一步提出粗到精(Coarse-to-Fine, C2F)对齐机制,通过场景渲染反馈迭代优化几何一致性,从而实现高效且高保真的全景3D场景生成,可在RTX 4090 GPU上约20秒内完成高质量3D场景重建。
链接: https://arxiv.org/abs/2603.05908
作者: Zidian Qiu,Ancong Wu
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Project page: this https URL
Abstract:Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360-degree environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. Project page: this https URL.
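"把局部坐标系下的 3D 物体按预测的物体-世界变换放入世界系"这一步,本质是一次相似变换(假设性示例:旋转、平移、尺度取简单数值,仅说明坐标变换本身):

```python
import numpy as np

def to_world(points_local, R, t, s):
    """把局部坐标系下的物体点云按旋转 R、平移 t、尺度 s 变换到世界系。"""
    return s * points_local @ R.T + t

# 绕 z 轴旋转 90 度、平移 (1, 0, 0)、放大 2 倍(数值为示例)
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
pts = np.array([[1.0, 0.0, 0.0]])
world = to_world(pts, Rz, np.array([1.0, 0.0, 0.0]), 2.0)
print(world)  # [[1. 2. 0.]]
```

Alignment-VGGT 的输出正是这样一组 (R, t, s),从而把任意现成图像到 3D 模型生成的物体摆进全景场景的统一坐标系。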
[CV-118] Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D CVPR2026
【速读】:该论文旨在解决当前2D-to-3D转换方法在几何准确性上表现良好但缺乏艺术性的问题,即现有方法无法再现专业3D电影中沉浸感与情感共鸣的视觉体验。其根本原因在于,传统几何重建范式将艺术家有意设计的视觉意图(如零平面偏移以实现突出效果或局部深度塑造)误判为数据噪声或歧义。解决方案的关键在于提出“艺术差异合成”(Artistic Disparity Synthesis)的新范式,将目标从物理准确的视差估计转变为艺术一致的视差合成;为此,作者提出了Art3D框架,采用双路径结构解耦全局深度参数(宏观意图)与局部艺术效果(视觉笔触),并通过间接监督学习专业3D电影数据,从而实现对关键局部出屏效应和整体深度风格的精准复现。
链接: https://arxiv.org/abs/2603.05906
作者: Ping Chen,Zezhou Chen,Xingpeng Zhang,Yanlin Qian,Huan Hu,Xiang Liu,Zipeng Wang,Xin Wang,Zhaoxiang Liu,Kai Wang,Shiguo Lian
机构: China Unicom(中国联通); DJI Technology Co.,Ltd.(大疆创新有限公司); Southwest Petroleum University(西南石油大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 (10 pages, 4 figures)
Abstract:Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because geometric reconstruction paradigms mistake deliberate artistic intent, such as strategic zero-plane shifts for pop-out effects and local depth sculpting, for data noise or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.
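文中所说的"零平面偏移以实现出屏效果",在视差图层面就是一次全局平移(假设性示例:视差数值与偏移量均为虚构,仅说明零平面与出屏的关系):

```python
import numpy as np

def shift_zero_plane(disparity, shift):
    """整体平移视差的零平面:shift > 0 使零视差面后移,
    原零平面附近的内容获得负视差而"出屏"(简化示意)。"""
    return disparity - shift

disp = np.array([0.0, 5.0, 10.0])   # 三个物体的原始视差
print(shift_zero_plane(disp, 5.0))  # [-5.  0.  5.]
```

几何重建范式会把这种有意的全局偏移当作误差,而艺术差异合成则把它(宏观意图)与局部深度塑造(视觉笔触)分开建模。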
[CV-119] CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)影像中小目标检测的难题,主要挑战包括尺度变化剧烈、结构细节退化以及计算资源受限等问题。在高空场景下,传统模型在层级下采样和跨尺度融合过程中会进一步削弱细粒度特征,导致定位不稳定和鲁棒性下降。解决方案的关键在于提出一种轻量级协同检测框架 CollabOD,其核心创新包括:显式保留结构细节的结构细节保持机制、跨路径特征对齐策略以实现异构特征流的精准融合,以及面向定位任务的轻量化设计。该框架从图像处理、通道结构和轻量化设计三个维度优化了传统UAV感知模型架构,在提升特征表示稳定性的同时保障高效推理,并通过统一的细节感知检测头增强回归鲁棒性,且不增加额外部署开销。
链接: https://arxiv.org/abs/2603.05905
作者: Xuecheng Bai,Yuxiang Wang,Chuanzhi Xu,Boyu Hu,Kang Han,Ruijie Pan,Xiaowei Niu,Xiaotian Guan,Liqiang Fu,Pengfei Ye
机构: Aviation Traffic Control Technology (SHENZHEN) Co., Ltd. (航空交通控制技术(深圳)有限公司); The University of Sydney (悉尼大学); The University of International Business and Economics (对外经济贸易大学); Research Institute of Traffic Control Technology Co., Ltd. (交通控制技术研究院有限公司); Guoneng Shuohuang Railway Development Co., Ltd. (国能神黄铁路发展有限公司); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: this https URL.
[CV-120] Mitigating Bias in Concept Bottleneck Models for Fair and Interpretable Image Classification
【速读】:该论文旨在解决图像分类模型中存在的偏见问题,尤其是性别偏见等敏感属性代理信息在模型决策中的隐性传播,从而确保模型的公平性。其解决方案的关键在于对概念瓶颈模型(Concept Bottleneck Models, CBMs)进行改进:首先通过top-k概念过滤减少无关信息泄露;其次移除具有偏见的概念;最后引入对抗去偏机制,以增强模型在保持高分类性能的同时显著提升公平性表现。实验表明,所提出的三阶段去偏策略优于现有方法,在公平性与性能之间实现了更优的权衡。
链接: https://arxiv.org/abs/2603.05899
作者: Schrasing Tong,Antoine Salaun,Vincent Yuan,Annabel Adeyeri,Lalana Kagal
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Ensuring fairness in image classification prevents models from perpetuating and amplifying bias. Concept bottleneck models (CBMs) map images to high-level, human-interpretable concepts before making predictions via a sparse, one-layer classifier. This structure enhances interpretability and, in theory, supports fairness by masking sensitive attribute proxies such as facial features. However, CBM concepts have been known to leak information unrelated to concept semantics and early results reveal only marginal reductions in gender bias on datasets like ImSitu. We propose three bias mitigation techniques to improve fairness in CBMs: 1. Decreasing information leakage using a top-k concept filter, 2. Removing biased concepts, and 3. Adversarial debiasing. Our results outperform prior work in terms of fairness-performance tradeoffs, indicating that our debiased CBM provides a significant step towards fair and interpretable image classification.
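摘要中的第一项去偏手段(top-k 概念过滤)可用如下极简 Python 示意:仅保留激活幅值最大的 k 个概念、其余置零,以收窄信息泄露通道。函数名与输入形式均为示意性假设,并非论文原实现:

```python
def topk_concept_filter(activations, k):
    """仅保留幅值最大的 k 个概念激活,其余置零(示意实现)。"""
    if k >= len(activations):
        return list(activations)
    # 按绝对值从大到小取前 k 个概念的下标
    keep = set(sorted(range(len(activations)),
                      key=lambda i: abs(activations[i]),
                      reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]
```

过滤后的激活再送入稀疏线性分类器;论文在此基础上还叠加了偏见概念移除与对抗去偏两步。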
[CV-121] InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation CVPR2026
【速读】:该论文旨在解决电商产品海报生成中多条件控制(主体、文本和风格)的协同优化问题,尤其针对现有基于扩散模型的多阶段流水线方法存在的三个核心缺陷:主体保真度差、文本准确性低以及风格不一致。其解决方案的关键在于提出一个单阶段框架 InnoAds-Composer,通过引入三重条件控制令牌(subject, glyph, style)实现高效联合调控;进一步地,为缓解传统三条件令牌拼接带来的二次计算开销,作者基于层与时间步的重要性分析,仅将每类条件路由至最响应的位置,从而缩短有效令牌序列;同时设计文本特征增强模块(Text Feature Enhancement Module, TFEM),融合字形图像与字形裁剪区域特征以提升中文文本渲染精度。该方案在保持推理延迟可控的前提下显著优于现有方法。
链接: https://arxiv.org/abs/2603.05898
作者: Yuxin Qin,Ke Cao,Haowei Liu,Ao Ma,Fengheng Li,Honghe Zhu,Zheng Zhang,Run Ling,Wei Feng,Xuanhua He,Zhanjie Zhang,Zhen Guo,Haoyi Bian,Jingjing Lv,Junjie Shen,Ching Law
机构: JD.com, Inc. (京东); Chongqing University of Posts and Telecommunications (重庆邮电大学); The Hong Kong University of Science and Technology (香港科技大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, glyph, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.
[CV-122] PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction CVPR2026
【速读】:该论文旨在解决从单张RGB图像中直接重建完整室内场景网格(mesh)的问题,传统方法通常依赖隐式符号距离场(signed distance fields)并结合后处理布局优化,存在重建质量不高或生成流程复杂等问题。其解决方案的关键在于提出PixARMesh,一种基于自回归机制的统一模型,能够联合预测物体布局与几何结构,在单一前向传播中生成连贯且可直接用于下游任务的高质量网格;通过在点云编码器中引入像素对齐的图像特征和全局场景上下文信息(利用交叉注意力机制),实现了仅凭单图即可进行精准空间推理的能力,从而显著提升重建精度与效率。
链接: https://arxiv.org/abs/2603.05888
作者: Xiang Zhang,Sohyun Yoo,Hongrui Wu,Chuan Li,Jianwen Xie,Zhuowen Tu
机构: UC San Diego (加州大学圣地亚哥分校); Lambda, Inc. (Lambda公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: CVPR 2026. Project Page: this https URL
Abstract:We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.
[CV-123] CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis
【速读】:该论文旨在解决全景图像(panoramic imagery)在实时新视角合成中的两大挑战:一是现有方法依赖多视角代价体(multi-view cost volumes)进行几何精修,在稀疏视图场景下难以有效处理遮挡问题;二是标准体素表示(如笛卡尔三平面,Cartesian Triplanes)无法准确捕捉360°场景的固有几何结构,导致失真和混叠现象。解决方案的关键在于提出一种新的圆柱形三平面表示(cylindrical Triplane),其更贴合全景数据分布及符合曼哈顿世界假设(Manhattan-world assumption)的实际场景结构,并设计了一个双分支架构:像素分支用于重建观测良好的区域,体积分支则利用圆柱三平面补全遮挡或稀疏观测区域,从而实现从单视图到多视图全景新视角合成的灵活适配与高质量重建。
链接: https://arxiv.org/abs/2603.05882
作者: Qiwei Wang,Xianghui Ze,Jingyi Yu,Yujiao Shi
机构: Shanghaitech University (上海科技大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) has shown great promise for real-time novel view synthesis, but its application to panoramic imagery remains challenging. Existing methods often rely on multi-view cost volumes for geometric refinement, which struggle to resolve occlusions in sparse-view scenarios. Furthermore, standard volumetric representations like Cartesian Triplanes are poor in capturing the inherent geometry of 360^\circ scenes, leading to distortion and aliasing. In this work, we introduce CylinderSplat, a feed-forward framework for panoramic 3DGS that addresses these limitations. The core of our method is a new cylindrical Triplane representation, which is better aligned with panoramic data and real-world structures adhering to the Manhattan-world assumption. We use a dual-branch architecture: a pixel-based branch reconstructs well-observed regions, while a volume-based branch leverages the cylindrical Triplane to complete occluded or sparsely-viewed areas. Our framework is designed to flexibly handle a variable number of input views, from single to multiple panoramas. Extensive experiments demonstrate that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in both reconstruction quality and geometric accuracy.
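圆柱三平面的出发点是把 3D 点投影到与全景成像更匹配的三个圆柱平面上。下面给出坐标变换的极简示意;具体的平面参数化方式为猜测性假设,并非论文原定义:

```python
import math

def cylindrical_coords(x, y, z):
    """把 3D 点 (x, y, z) 转为圆柱坐标 (theta, r, h)(示意实现)。"""
    theta = math.atan2(y, x) % (2 * math.pi)  # 方位角,范围 [0, 2*pi)
    r = math.hypot(x, y)                      # 到竖直轴的水平距离
    return theta, r, z

def triplane_keys(x, y, z):
    """将一个点投影到三个圆柱平面,返回各平面上的 2D 采样坐标(假设的参数化)。"""
    theta, r, h = cylindrical_coords(x, y, z)
    return (theta, h), (theta, r), (r, h)
```

方位角 theta 天然覆盖 360°,可避免笛卡尔三平面在全景场景下的失真与混叠。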
[CV-124] Systematic Evaluation of Novel View Synthesis for Video Place Recognition IROS2026
【速读】:该论文旨在解决视频位置识别(Video Place Recognition, VPR)中因视角差异导致识别性能下降的问题,特别是通过生成合成的新视角图像来增强跨视角的匹配能力。其解决方案的关键在于系统性地评估合成新视角在多个公开VPR数据集上的有效性,并发现:当合成视图数量较小时,其能显著提升VPR的识别指标;而当合成视图数量较大时,视角变化幅度的影响减弱,识别效果更依赖于添加的视图数量和数据集中图像的类型。这一发现为基于图像的机器人导航与多模态感知融合提供了关键实证依据。
链接: https://arxiv.org/abs/2603.05876
作者: Muhammad Zawad Mahmud,Samiha Islam,Damian Lyons
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Submitted to IEEE IROS 2026
Abstract:The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.
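文中评估所用的"识别指标"通常指 Recall@K:查询图像的前 K 个数据库匹配中只要有一个是同一地点即记为命中。下面是一个与具体论文无关的通用计算示意(输入数据结构为假设):

```python
def recall_at_k(similarity, ground_truth, k):
    """VPR 常用指标 Recall@K 的通用示意实现。

    similarity[q][d] 为查询 q 与数据库图像 d 的相似度;
    ground_truth[q] 为与查询 q 同一地点的数据库下标集合。
    """
    hits = 0
    for q, row in enumerate(similarity):
        topk = sorted(range(len(row)), key=lambda d: row[d], reverse=True)[:k]
        if any(d in ground_truth[q] for d in topk):
            hits += 1
    return hits / len(similarity)
```

评估合成新视角时,只需把生成的视图加入数据库(或查询集)后重新计算该指标即可对比增益。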
[CV-125] Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割模型在跨机构、跨设备或跨人群时泛化能力差的问题,尤其是在联邦学习场景下因任务特定微调导致的通信开销大和部署后知识无法持续演进的挑战。解决方案的关键在于提出一种记忆增强型分割智能体(MemSeg-Agent),其将适应机制从模型权重空间转移到记忆空间,通过轻量级静态记忆、少样本记忆与测试时工作记忆的动态组合,由代理控制器统一调度,在不更新模型参数的前提下实现少样本学习、联邦监督学习及测试时自适应。该设计显著降低了联邦学习中的通信成本,并支持部署后的持续知识演化,实验表明其在多个公开数据集上均展现出优异的性能与域偏移鲁棒性。
链接: https://arxiv.org/abs/2603.05873
作者: Bowen Chen,Qiaohui Gao,Shaowen Wan,Shanhui Sun,Wei Liu,Xiang Li,Tianming Liu,Lin Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical image segmentation is fundamental to clinical workflows, yet models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. While vision foundation models have shown great promise in addressing this challenge, their deployment typically requires task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment. In this work, we propose a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space, enabling few-shot learning, federated supervised learning, and test-time adaptation within a unified architecture. MemSeg-Agent conditions a fixed backbone with lightweight static, few-shot, and test-time working memories, which are dynamically composed by an agentic controller. In federated settings, we update compact memory units instead of model parameters, substantially reducing communication overhead. Experiments on four public datasets demonstrate strong performance and robustness to domain shift: Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Overall, MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI.
[CV-126] PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues
【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)在多模态推理任务中对视觉信息利用不足的问题,尤其是传统链式思维(Chain-of-Thought, CoT)方法仅依赖文本信息、未能充分挖掘图像中关键视觉线索的局限性。其解决方案的关键在于提出一种基于图像块(patch)级别的视觉提示机制——PatchCue,通过将图像划分为固定大小的块并以块级表示视觉提示,不仅更贴合人类感知习惯,还能与现代VLMs的patch-tokenized输入结构自然契合。该方法采用两阶段训练策略:首先进行冷启动监督微调以生成patch-level提示,再通过过程监督的强化学习优化中间视觉推理步骤,从而显著提升模型在复杂视觉理解与推理任务中的表现。
链接: https://arxiv.org/abs/2603.05869
作者: Yukun Qi,Pei Fu,Hang Li,Yuhan Liu,Chao Jiang,Bin Qin,Zhenbo Luo,Jian Luan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.
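patch 级提示的一个直接好处是与 ViT 的分块 token 顺序一一对应:像素坐标只需整除 patch 大小即可定位到对应 token,无需精确的空间定位。以下换算函数为示意性假设:

```python
def patch_index(x, y, image_w, patch_size):
    """把像素坐标 (x, y) 换算为按行优先编号的 patch token 下标(示意)。"""
    cols = image_w // patch_size          # 每行的 patch 数
    return (y // patch_size) * cols + (x // patch_size)
```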
[CV-127] TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis ICLR2026
【速读】:该论文旨在解决临床肿瘤分析中因缺乏可解释性与多模态推理能力而导致的诊断误差问题,特别是在影像学发现(findings)、临床印象(impressions)与病理级风险评估(pathology-level risk assessment)之间的断层现象。其核心解决方案是构建一个大规模的多模态链式推理基准 TumorCoT(包含150万条带链式思维标注的视觉问答指令与3D CT扫描数据),并提出 TumorChain 框架,该框架通过紧密耦合三维影像编码器、临床文本理解模块以及器官级视觉-语言对齐机制,实现跨模态对齐与迭代式因果推理,从而在多轮自优化过程中增强视觉证据的可追溯性并降低幻觉风险,最终提升病灶检测、印象生成和病理分类的准确性与一致性。
链接: https://arxiv.org/abs/2603.05867
作者: Sijing Li,Zhongwei Qiu,Jiang Liu,Wenqiao Zhang,Tianwei Lin,Yihan Xie,Jianxiang An,Boxiang Yun,Chenglin Yang,Jun Xiao,Guangyu Guo,Jiawen Yao,Wei Liu,Yuan Gao,Ke Yan,Weiwei Cao,Zhilin Zheng,Tony C. W. Mok,Kai Cao,Yu Shi,Jiuyu Zhang,Jian Zhou,Beng Chin Ooi,Yingda Xia,Ling Zhang
机构: Zhejiang University (浙江大学); DAMO Academy, Alibaba Group (达摩院,阿里巴巴集团); Hupan Lab (湖畔实验室); Shanghai Institution of Pancreatic Disease (上海胰腺疾病研究所); Shengjing Hospital of China Medical University (中国医科大学盛京医院); Sun Yat-sen University Cancer Center (中山大学肿瘤中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026. 10 pages + appendix
Abstract:Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark. These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at this https URL.
[CV-128] Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的医疗辅助代理在多步骤临床影像解读任务中因工具组合静态化而导致的适应性差问题。现有系统在部署后无法动态调整工具调用策略,面对真实世界中的领域漂移、任务变化和诊断需求演进时,其预定义的工具链易失效且需高成本人工重构。解决方案的关键在于提出MACRO——一个自演化、经验增强的医学代理,它通过从已验证的执行轨迹中自动识别高频有效的多步工具序列,将其合成并注册为可复用的复合工具(composite tools),从而实现工具组合的动态扩展;同时借助轻量级图像特征记忆模块锚定视觉-临床上下文,并采用类似GRPO的训练机制强化对新发现复合工具的可靠调用,形成闭环自我优化能力,显著提升多步骤任务的准确性和跨域泛化性能。
链接: https://arxiv.org/abs/2603.05860
作者: Lin Fan,Pengyu Dai,Zhipeng Deng,Haolin Wang,Xun Gong,Yefeng Zheng,Yafei Ou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 4 figures, 3 tables
Abstract:Clinical image interpretation is inherently multi-step and tool-centric: clinicians iteratively combine visual evidence with patient context, quantify findings, and refine their decisions through a sequence of specialized procedures. While LLM-based agents promise to orchestrate such heterogeneous medical tools, existing systems treat tool sets and invocation strategies as static after deployment. This design is brittle under real-world domain shifts, across tasks, and evolving diagnostic requirements, where predefined tool chains frequently degrade and demand costly manual re-design. We propose MACRO, a self-evolving, experience-augmented medical agent that shifts from static tool composition to experience-driven tool discovery. From verified execution trajectories, the agent autonomously identifies recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers these as new high-level primitives that continuously expand its behavioral repertoire. A lightweight image-feature memory grounds tool selection in a visual-clinical context, while a GRPO-like training loop reinforces reliable invocation of discovered composites, enabling closed-loop self-improvement with minimal supervision. Extensive experiments across diverse medical imaging datasets and tasks demonstrate that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods, bridging the gap between brittle static tool use and adaptive, context-aware clinical AI assistance. Code will be available upon acceptance.
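摘要中"从已验证轨迹中识别高频有效的多步工具序列"这一步,可以用 n-gram 计数来示意:统计所有成功轨迹中长度为 n 的工具调用子序列,出现次数达到阈值者即为候选复合工具。以下实现与阈值均为假设性示意,并非论文原算法:

```python
from collections import Counter

def frequent_tool_sequences(trajectories, length, min_count):
    """统计已验证轨迹中高频出现的定长工具调用子序列(示意实现)。"""
    counts = Counter()
    for trace in trajectories:
        for i in range(len(trace) - length + 1):
            counts[tuple(trace[i:i + length])] += 1
    return [seq for seq, c in counts.items() if c >= min_count]
```

被选出的子序列随后可被合成并注册为新的复合工具,供后续调用并通过强化训练巩固。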
[CV-129] VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction
【速读】:该论文旨在解决视频稳定(video stabilization)中几何鲁棒性与全帧一致性之间的根本权衡问题:传统2D方法因过度裁剪导致信息丢失,而3D方法则常因优化流程脆弱在极端运动下失效。其解决方案的关键在于提出VS3R框架,该框架通过前馈式3D重建与生成式视频扩散模型的协同作用,联合估计相机参数、深度图和掩码以保障全场景可靠性,并引入混合稳定渲染模块融合语义与几何线索实现动态一致性;最终利用双流视频扩散模型结合结构引导与语义锚点修复遮挡区域并校正伪影,从而在多种相机模型下实现高保真、全帧稳定的视频输出,显著优于现有最先进方法的鲁棒性和视觉质量。
链接: https://arxiv.org/abs/2603.05851
作者: Muhua Zhu,Xinhao Jin,Yu Zhang,Yifei Xue,Tie Ji,Yizhen Lao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motions. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by synergizing structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.
[CV-130] Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在三维(3D)生成任务中因缺乏空间几何约束而导致的结构不合理与尺度不一致问题。现有方法虽利用几何特征作为条件以增强空间感知,但仅能建模相对关系,难以保证绝对几何的一致性。为此,作者提出 Cog2Gen3D,其核心创新在于构建一个由语义-几何双流图结构组成的3D认知图(3D Latent Cognition Graph),通过基于共同注意力机制的融合方式整合多模态特征(包括语义信息和绝对几何信息),进而将该认知图作为条件引导潜在扩散过程,实现物理合理且结构合理的3D高斯点云生成。此方案有效提升了3D生成结果在语义保真度与几何合理性上的表现。
链接: https://arxiv.org/abs/2603.05845
作者: Haonan Wang,Hanyu Zhou,Haoyue Liu,Tao Gu,Luxin Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models have achieved success in producing semantically plausible 2D images, but it remains challenging in 3D generation due to the absence of spatial geometry constraints. Typically, existing methods utilize geometric features as conditions to enhance spatial awareness. However, these methods can only model relative relationships and are prone to scale inconsistency of absolute geometry. Thus, we argue that semantic information and absolute geometry empower 3D cognition, thereby enabling controllable 3D generation for the physical world. In this work, we propose Cog2Gen3D, a 3D cognition-guided diffusion framework for 3D generation. Our model is guided by three key designs: 1) Cognitive Feature Embeddings. We encode different modalities into semantic and geometric representations and further extract logical representations. 2) 3D Latent Cognition Graph. We structure different representations into dual-stream semantic-geometric graphs and fuse them via common-based cross-attention to obtain a 3D cognition graph. 3) Cognition-Guided Latent Diffusion. We leverage the fused 3D cognition graph as the condition to guide the latent diffusion process for 3D Gaussian generation. Under this unified framework, the 3D cognition graph ensures the physical plausibility and structural rationality of 3D generation. Moreover, we construct a validation subset based on the Marble World Labs. Extensive experiments demonstrate that our Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility.
[CV-131] Remote Sensing Image Classification Using Deep Ensemble Learning
【速读】:该论文旨在解决遥感图像分类中因卷积神经网络(Convolutional Neural Networks, CNNs)难以捕捉全局上下文信息而导致的性能瓶颈问题;同时,虽然融合CNN与视觉Transformer(Vision Transformers, ViTs)优于单一架构,但简单叠加更多CNN与ViT组件会因特征表示冗余而引入新的性能瓶颈。其解决方案的关键在于:训练四个各自融合CNN与ViT骨干网络的独立模型,并仅在最终预测阶段通过集成策略聚合输出结果,从而规避冗余特征表示带来的瓶颈,显著提升分类准确率,同时实现对训练计算资源的高效利用。
链接: https://arxiv.org/abs/2603.05844
作者: Niful Islam,Md. Rayhan Ahmed,Nur Mohammad Fahad,Salekul Islam,A.K.M. Muzahidul Islam,Saddam Mukta,Swakkhar Shatabda
机构: Oakland University (奥克兰大学); University of British Columbia (不列颠哥伦比亚大学); United International University (联合国际大学); North South University (南亚大学); LUT University (拉彭兰塔理工大学); BRAC University (BRAC大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. While Convolutional Neural Networks (CNNs) are mostly used for image classification, they excel at local feature extraction, but struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs, therefore, leads to better performance than standalone architectures. However, the use of additional CNN and ViT components does not lead to further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combine their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly due to its efficient use of computational resources during training.
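摘要中"在最终预测阶段集成四个融合模型"的一种常见做法,是对各模型输出的类别概率取平均后取 argmax。论文未给出具体集成方式,下面的平均法仅为假设性示意:

```python
def ensemble_predict(probs_per_model):
    """对多个模型输出的类别概率取平均,返回得分最高的类别下标(示意)。"""
    n_models = len(probs_per_model)
    n_classes = len(probs_per_model[0])
    mean = [sum(p[c] for p in probs_per_model) / n_models
            for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])
```

相比把所有骨干网络塞进单一模型,这种后期集成让各融合模型独立训练,避免了冗余特征表示在网络内部互相干扰。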
[CV-132] Margin and Consistency Supervision for Calibrated and Robust Vision Models
【速读】:该论文旨在解决深度视觉分类器在面对微小分布偏移时校准性能差且鲁棒性弱的问题。其核心解决方案是提出一种名为Margin and Consistency Supervision (MaCS) 的正则化框架,关键在于联合优化两个目标:一是通过平方铰链损失(hinge-squared margin penalty)强制正确类别与最强竞争类之间的logit空间间隔;二是引入一致性正则项,最小化干净输入与轻微扰动视图预测结果间的KL散度,从而提升局部预测稳定性。理论分析表明,增大分类边界同时降低局部敏感性(以Lipschitz型稳定性代理量化)可获得更优的泛化保证和可证明的鲁棒性半径,且该方法无需额外数据或架构改动,推理开销可忽略,可作为标准训练目标的即插即用替代方案。
链接: https://arxiv.org/abs/2603.05812
作者: Salim Khazem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing classification margin while reducing local sensitivity formalized via a Lipschitz-type stability proxy yields improved generalization guarantees and a provable robustness radius bound scaling with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
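MaCS 的总损失由三部分组成:交叉熵、平方铰链式边界惩罚 max(0, m − (z_y − max_{j≠y} z_j))²,以及干净输入与轻微扰动视图两路预测之间的 KL 散度。下面是按摘要描述拼出的纯 Python 示意(权重 lam、mu 为假设的超参数,并非论文给定值):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def macs_loss(logits_clean, logits_pert, label, margin, lam, mu):
    """交叉熵 + 平方铰链边界惩罚 + KL 一致性正则(按摘要拼出的示意)。"""
    p = softmax(logits_clean)
    ce = -math.log(p[label])
    # 正确类 logit 与最强竞争类 logit 之差不足 margin 时施加平方惩罚
    runner_up = max(z for i, z in enumerate(logits_clean) if i != label)
    hinge = max(0.0, margin - (logits_clean[label] - runner_up)) ** 2
    # 干净预测与扰动预测之间的 KL 散度,约束局部预测稳定性
    q = softmax(logits_pert)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return ce + lam * hinge + mu * kl
```

边界已满足且两路预测一致时,后两项为零,损失退化为标准交叉熵,因此可作为训练目标的即插即用替换。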
[CV-133] raining-free Latent Inter-Frame Pruning with Attention Recovery
【速读】:该论文旨在解决当前视频生成模型中存在的高计算延迟问题,这使得实时应用成本过高。其解决方案的关键在于利用视频潜在特征图(latent patches)中的时间冗余性,提出了一种名为“潜在帧间剪枝与注意力恢复”(Latent Inter-frame Pruning with Attention Recovery, LIPAR)的框架:该框架通过检测并跳过重复的潜在特征图的重新计算来提升效率,并引入一种新颖的注意力恢复机制,以近似被剪枝token的注意力值,从而消除因简单剪枝导致的视觉伪影。该方法在不牺牲生成质量的前提下,无需额外训练即可无缝集成至现有模型中,显著提升了视频编辑吞吐量(平均达12.2 FPS,相较基线提升1.45倍)。
链接: https://arxiv.org/abs/2603.05811
作者: Dennis Menn,Yuedong Yang,Bokun Wang,Xiwen Wei,Mustafa Munir,Feng Liang,Radu Marculescu,Chenfeng Xu,Diana Marculescu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by 1.45×, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
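"检测并跳过重复潜在特征图"可以用相邻帧同位置 patch 的余弦相似度来示意:相似度超过阈值的 patch 直接复用上一帧的计算结果。阈值取法与数据结构均为假设,并非论文原判据:

```python
import math

def cosine(u, v):
    """两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prune_duplicate_patches(prev_patches, cur_patches, threshold):
    """返回需重算(keep)与可复用缓存(reuse)的 patch 下标(示意实现)。"""
    keep, reuse = [], []
    for i, (p, c) in enumerate(zip(prev_patches, cur_patches)):
        (reuse if cosine(p, c) >= threshold else keep).append(i)
    return keep, reuse
```

被剪掉的 token 不参与注意力重算;论文另以 Attention Recovery 近似其注意力值以消除伪影(此处未示意)。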
[CV-134] EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition
【速读】:该论文旨在解决事件相机(event camera)在视觉位置识别(Visual Place Recognition, VPR)任务中如何实现高精度、实时定位的问题。现有方法在复杂光照条件或高速运动场景下性能受限,且难以兼顾全局与局部特征的融合效率。其解决方案的关键在于提出EventGeM——一种端到端的全局到局部特征融合流水线:首先利用预训练的视觉Transformer(ViT-S/16)从事件直方图图像中提取全局特征用于初始匹配预测;随后通过预训练的MaxViT骨干网络检测局部关键点,并基于2D单应性变换结合RANSAC进行重排序优化;进一步引入预训练视觉基础模型进行深度估计,以结构相似性对候选结果做精细化重排序。该方案在多个基准数据集和光照条件下均达到当前最优性能,且可在多种计算架构上实现实时运行,已在机器人平台上成功部署,直接处理事件流完成在线定位。
链接: https://arxiv.org/abs/2603.05807
作者: Adam D. Hines,Gokul B. Nair,Nicolás Marticorena,Michael Milford,Tobias Fischer
机构: Queensland University of Technology (昆士兰科技大学); QUT Centre for Robotics (QUT机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, 5 tables, under review
Abstract:Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high-temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global to local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global feature patch for initial match predictions embeddings from event histogram images. Local feature keypoints were then detected using a pre-trained MaxViT backbone for 2D-homography based re-ranking with RANSAC. For additional re-ranking refinement, we subsequently used a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work performs state-of-the-art localization when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions all whilst being fully capable of running in real-time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: this https URL
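该流水线的第一步是把事件流栅格化为"事件直方图图像"再送入 ViT 骨干。下面给出按极性分两通道计数的极简示意(忽略时间窗与归一化,输入格式为假设):

```python
def event_histogram(events, width, height):
    """把事件流 (x, y, polarity) 累积成两通道计数图(示意实现)。

    通道 0 统计正极性事件,通道 1 统计负极性事件。
    """
    hist = [[[0, 0] for _ in range(width)] for _ in range(height)]
    for x, y, p in events:
        hist[y][x][0 if p > 0 else 1] += 1
    return hist
```

实际系统通常还会按固定时间窗切片并做数值归一化,使直方图图像接近普通灰度图的分布。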
[CV-135] Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction
【速读】:该论文旨在解决2D-to-3D重建中特征上采样模块对三维感知能力不足的问题,即现有可学习上采样方法虽能提升空间细节(如几何锐度或纹理丰富度),但其对三维一致性的影响尚未被充分理解。解决方案的关键在于提出一个包含六个互补指标的频谱诊断框架(spectral diagnostic framework),用于量化分析上采样过程中幅度重分布、结构频谱对齐和方向稳定性等特性,从而揭示频谱结构保持性(spectral consistency)与三维重建质量之间的强关联性,指出重建性能更依赖于频谱结构的一致性而非单纯的空间细节增强。
链接: https://arxiv.org/abs/2603.05787
作者: Ling Xiao,Yuliang Xiu,Yue Chen,Guoming Wang,Toshihiko Yamasaki
机构: Hokkaido University, Sapporo, Japan; Westlake University, Hangzhou, China; Zhejiang University, Hangzhou, China; The University of Tokyo, Tokyo, Japan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A typical 2D-to-3D pipeline takes multi-view images as input, where a Vision Foundation Model (VFM) extracts features that are spatially upsampled to dense representations for 3D reconstruction. If dense features across views preserve geometric consistency, differentiable rendering can recover an accurate 3D representation, making the feature upsampler a critical component. Recent learnable upsampling methods mainly aim to enhance spatial details, such as sharper geometry or richer textures, yet their impact on 3D awareness remains underexplored. To address this gap, we introduce a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability. Across classical interpolation and learnable upsampling methods on CLIP and DINO backbones, we observe three key findings. First, structural spectral consistency (SSC/CSC) is the strongest predictor of NVS quality, whereas High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance, indicating that emphasizing high-frequency details alone does not necessarily improve 3D reconstruction. Second, geometry and texture respond to different spectral properties: Angular Energy Consistency (ADC) correlates more strongly with geometry-related metrics, while SSC/CSC influence texture fidelity slightly more than geometric accuracy. Third, although learnable upsamplers often produce sharper spatial features, they rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model. Overall, our results indicate that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, highlighting spectral consistency as an important principle for designing upsampling strategies in 2D-to-3D pipelines.
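摘要中的 HFSS(高频谱斜率漂移)一类指标依赖于特征的功率谱斜率。下面用一维信号给出"谱斜率"计算的纯 Python 示意(朴素 DFT 加 log-log 最小二乘拟合;与论文的二维特征实现无关,仅为概念演示):

```python
import math

def dft_power(x):
    """朴素 DFT,返回各频率 k 的功率(示意,仅适合小规模演示)。"""
    n = len(x)
    powers = []
    for k in range(n // 2 + 1):
        re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        powers.append(re * re + im * im)
    return powers

def spectral_slope(powers):
    """对 k >= 1 的 log-功率 随 log-频率 做最小二乘拟合,返回斜率。"""
    pts = [(math.log(k), math.log(p + 1e-12))
           for k, p in enumerate(powers) if k >= 1]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den
```

斜率越负说明能量越向低频集中;对比上采样前后斜率的变化,即可得到 HFSS 类指标的雏形。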
[CV-136] Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval
【Summary】: This paper addresses the high computational cost, limited interpretability, and lack of explicit attribution of dense image retrieval at scale. The key idea is BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations produced by a Sparse Auto-Encoder (SAE) trained on Vision Transformer patch features. Because visual-word document frequencies follow a highly imbalanced, Zipfian-like distribution, BM25's inverse document frequency (IDF) weighting suppresses ubiquitous, low-information words and emphasizes rare, discriminative ones, yielding an efficient and interpretable first retrieval stage. Sparse inverted-index operations quickly return high-recall candidates (Recall@200 ≥ 0.993), and a subsequent dense reranking of only K=200 candidates recovers near-dense accuracy (within 0.2% on average). An SAE trained once on ImageNet-1K transfers zero-shot to several fine-grained tasks, and retrieval decisions remain attributable to specific visual words with quantified IDF contributions.
Link: https://arxiv.org/abs/2603.05781
Authors: Donghoon Han, Eunhwan Park, Seunghyeon Seo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 ≥ 0.993, enabling a two-stage pipeline that reranks only K=200 candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
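The BM25 scoring step described above can be sketched with plain numpy. This is an illustrative toy, not the paper's implementation: the gallery here is a small count matrix over a handful of "visual words", and the standard Okapi parameters k1 and b are assumptions.

```python
import numpy as np

# Hypothetical sketch of BM25 scoring over sparse visual-word activations.
# Each gallery image is a nonnegative count vector over V visual words (SAE codes).

def bm25_scores(query_words, gallery, k1=1.2, b=0.75):
    """gallery: (N, V) activation counts; query_words: iterable of word ids."""
    N, V = gallery.shape
    df = (gallery > 0).sum(axis=0)                    # document frequency per word
    idf = np.log((N - df + 0.5) / (df + 0.5) + 1.0)   # Okapi IDF (always-positive form)
    doc_len = gallery.sum(axis=1)
    avg_len = doc_len.mean()
    scores = np.zeros(N)
    for w in query_words:
        tf = gallery[:, w]
        denom = tf + k1 * (1 - b + b * doc_len / np.maximum(avg_len, 1e-8))
        scores += idf[w] * tf * (k1 + 1) / np.maximum(denom, 1e-8)
    return scores

# Toy gallery: word 0 is ubiquitous (low IDF), word 3 is rare (high IDF).
gallery = np.array([[5, 0, 1, 0],
                    [4, 1, 0, 3],
                    [6, 0, 2, 0],
                    [5, 2, 0, 0]], dtype=float)
scores = bm25_scores([0, 3], gallery)
best = int(np.argmax(scores))   # image 1 is the only one containing rare word 3
```

Note how the IDF term does the work the paper relies on: the ubiquitous word 0 contributes almost nothing, while the rare word 3 dominates the ranking.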
[CV-137] Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers CVPR26
【Summary】: This paper targets region-instructed layout control in text-to-image generation, where existing methods have two limitations: training-based approaches inherit data bias and often degrade image quality, and current techniques struggle with inter-object occlusion order, limiting real-world use. The key idea is LayerBind, a training-free, plug-and-play framework that models regional generation as separate layers and binds them during generation. Since spatial layout and occlusion are established in the early denoising steps, rearranging the early latent structure suffices to control the final output. The scheme has two phases. First, Layer-wise Instance Initialization exploits the contextual sharing mechanism of multimodal joint attention to create per-instance branches anchored to a shared background; at a designated early step, the branches are fused in layer order into a unified latent with the intended layout. Second, Layer-wise Semantic Nursing reinforces regional detail and maintains occlusion order via a sequential layered attention path running alongside the standard global path, with updates composited under a layer-transparency scheduler. This design gives Diffusion Transformers strong regional and occlusion controllability and natively supports editable workflows such as swapping instances or reordering visibility, suiting a range of creative applications.
Link: https://arxiv.org/abs/2603.05769
Authors: Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, Anan Liu
Affiliations: Tianjin University; Independent Researcher
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR26
Abstract:Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind’s effectiveness, highlighting its strong potential for creative applications.
[CV-138] Bridging Domains through Subspace-Aware Model Merging
【Summary】: This paper addresses a performance bottleneck of model merging under domain generalization: when task-specific models are trained under different distribution shifts, subspace conflicts in parameter space limit the merged model's performance on unseen domains. The key idea is SCORE (Subspace COnflict-Resolving mErging): analyze each model's task matrix via singular value decomposition (SVD), compute the principal components of the concatenated leading singular vectors of all models to obtain a shared orthogonal basis, and project each task matrix onto this basis while pruning conflicting off-diagonal components. This effectively mitigates subspace competition between models trained on different domains, and SCORE consistently outperforms existing merging strategies in domain-generalization settings across architectures and model scales, demonstrating strong performance and scalability.
Link: https://arxiv.org/abs/2603.05768
Authors: Levy Chaves, Chao Zhou, Rebekka Burkholz, Eduardo Valle, Sandra Avila
Affiliations: Universidade Estadual de Campinas (UNICAMP); CISPA Helmholtz Center for Information Security; Intercom
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (Main Track)
Abstract:Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition, we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.
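The shared-basis construction described in the abstract (SVD per task matrix, principal components of the concatenated leading singular vectors, projection into the shared subspace) can be sketched in a few lines of numpy. The rank choice and the final averaging step are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

# Hypothetical sketch of subspace-conflict-resolving merging in the spirit of SCORE.

def merge_with_shared_basis(task_matrices, rank=2):
    # 1) Leading left singular vectors of each model's task matrix.
    leading = []
    for T in task_matrices:
        U, _, _ = np.linalg.svd(T, full_matrices=False)
        leading.append(U[:, :rank])
    # 2) Shared orthogonal basis = principal components of the concatenation.
    stacked = np.concatenate(leading, axis=1)        # (d, rank * n_models)
    B, _, _ = np.linalg.svd(stacked, full_matrices=False)
    B = B[:, :rank]                                  # shared basis, (d, rank)
    # 3) Project each task matrix into the shared subspace, then average.
    projected = [B @ (B.T @ T) for T in task_matrices]
    return sum(projected) / len(projected), B

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
C = rng.normal(size=(6, 6))
merged, basis = merge_with_shared_basis([A, C], rank=2)
# By construction the merged update lives entirely in the shared rank-2 subspace.
residual = merged - basis @ (basis.T @ merged)
```

The projection guarantees that conflicting directions outside the shared basis are discarded before the task matrices are combined.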
[CV-139] Full Dynamic Range Sky-Modelling For Image Based Lighting
【Summary】: This paper tackles the inability of current DNN sky models to faithfully reproduce the dynamic range and illumination detail of real outdoor skies at high resolution; in particular, poor modeling of the 14EV+ solar region leads to visual artifacts, skewed light transport, and inaccurate tones. The key idea is Icarus, an all-weather sky model that learns the exposure characteristics of Full Dynamic Range (FDR) physically captured imagery. The model supports conditional generation with intuitive user positioning of the sun and cloud formations, and adds user-controlled texturing of atmospheric formations, markedly improving lighting directionality, shadow accuracy, and color fidelity in Image Based Lighting (IBL) scenes.
Link: https://arxiv.org/abs/2603.05758
Authors: Ian J. Maquignaz
Affiliations: Université Laval
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:Accurate environment maps are a key component to modelling real-world outdoor scenes. They enable captivating visual arts, immersive virtual reality and a wide range of scientific and engineering applications. To alleviate the burden of physical-capture, physically-simulation and volumetric rendering, sky-models have been proposed as fast, flexible, and cost-saving alternatives. In recent years, sky-models have been extended through deep learning to be more comprehensive and inclusive of cloud formations, but recent work has demonstrated these models fall short in faithfully recreating accurate and photorealistic natural skies. Particularly at higher resolutions, DNN sky-models struggle to accurately model the 14EV+ class-imbalanced solar region, resulting in poor visual quality and scenes illuminated with skewed light transmission, shadows and tones. In this work, we propose Icarus, an all-weather sky-model capable of learning the exposure range of Full Dynamic Range (FDR) physically captured outdoor imagery. Our model allows conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends on current state-of-the-art to enable user-controlled texturing of atmospheric formations. Through our evaluation, we demonstrate Icarus is interchangeable with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting (IBL).
[CV-140] From Phase Grounding to Intelligent Surgical Narratives
【Summary】: This paper addresses the inefficiency of producing surgical video timelines: traditional practice relies on surgeons filling out vague post-operative reports or manually annotating videos, both time-consuming and subjective. The key idea is a CLIP-based multimodal framework that aligns surgical video frames with textual gesture descriptions in a shared embedding space: the CLIP visual encoder extracts features from video frames, the text encoder embeds the gesture sentences, and fine-tuning optimizes the alignment between the two. The trained model can then identify the surgical phase and gesture for each frame and assemble a structured surgical timeline, substantially reducing the need for manual review and annotation of surgical videos.
Link: https://arxiv.org/abs/2603.05732
Authors: Ethan Peterson, Huixin Zhan
Affiliations: New Mexico Institute of Mining and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.
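The alignment objective implied above is the standard CLIP-style symmetric contrastive loss between frame and gesture-sentence embeddings. A minimal numpy sketch, with toy random vectors standing in for the CLIP encoders and the temperature value chosen only for illustration:

```python
import numpy as np

# Hypothetical sketch of CLIP-style frame/gesture alignment: matched
# (frame, gesture sentence) pairs sit on the diagonal of the similarity matrix,
# and a symmetric InfoNCE-style cross-entropy pulls them together.

def clip_alignment_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature     # cosine-similarity logits
    n = len(logits)

    def xent(l):
        # Cross-entropy with the matched pair (diagonal) as the target class.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 16))                       # toy frame embeddings
aligned_loss = clip_alignment_loss(frames, frames)      # perfectly matched pairs
mismatched_loss = clip_alignment_loss(frames, frames[::-1].copy())
```

In a real system the two inputs would come from the CLIP visual and text encoders; fine-tuning minimizes this loss so frames land near their gesture descriptions.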
[CV-141] Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation CVPR2026
【Summary】: This paper targets the long-standing single-label assumption of the ImageNet training set: most images contain multiple objects but carry only one label, introducing label noise and limiting the richness of the learning signal. To improve representation quality and generalization, the authors propose an automated multi-label annotation pipeline that requires no human labels: self-supervised Vision Transformers first perform unsupervised object discovery, regions aligned with the original label are then used to train a lightweight classifier, and applying this classifier to all image regions yields coherent multi-label annotations. The key is that self-supervised pretrained models make multi-label annotation efficient and scalable; the resulting supervision clearly outperforms conventional single-label training, with stronger classification accuracy and downstream transfer across benchmarks.
Link: https://arxiv.org/abs/2603.05729
Authors: Junyu Chen, Md Yousuf Harun, Christopher Kanan
Affiliations: University of Rochester; Rochester Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 Findings
Abstract:The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNetv2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at this https URL.
[CV-142] Any to Full: Prompting Depth Anything for Depth Completion in One Stage
【Summary】: This paper addresses the limited domain generalization and pattern robustness of RGBD-fused depth completion, as well as the extra computation and structural distortion introduced by existing two-stage integration strategies that rely on explicit relative-to-metric alignment. The key idea is Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates depth completion as scale-prompting adaptation of a pretrained monocular depth estimation (MDE) model. A Scale-Aware Prompt Encoder distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors, which yields clear gains in both performance and efficiency.
Link: https://arxiv.org/abs/2603.05711
Authors: Zhiyuan Zhou, Ruofeng Liu, Taichi Liu, Weijian Zuo, Shanshan Wang, Zhiqing Hong, Desheng Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate, dense depth estimation is crucial for robotic perception, but commodity sensors often yield sparse or incomplete measurements due to hardware limitations. Existing RGBD-fused depth completion methods learn priors jointly conditioned on training RGB distribution and specific depth patterns, limiting domain generalization and robustness to various depth patterns. Recent efforts leverage monocular depth estimation (MDE) models to introduce domain-general geometric priors, but current two-stage integration strategies relying on explicit relative-to-metric alignment incur additional computation and introduce structured distortions. To this end, we present Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates completion as a scale-prompting adaptation of a pretrained MDE model. To address varying depth sparsity levels and irregular spatial distributions, we design a Scale-Aware Prompt Encoder. It distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors. Extensive experiments demonstrate that Any2Full achieves superior robustness and efficiency. It outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4 \times speedup over PriorDA with the same MDE backbone, establishing a new paradigm for universal depth completion. Codes and checkpoints are available at this https URL.
[CV-143] Interpretable Perception and Reasoning for Audiovisual Geolocation
【Summary】: This paper addresses the limited precision of Multimodal Large Language Models (MLLMs) in global image-based geolocation, caused by the inherent ambiguity of visual landscapes and by underexploited auditory cues. The key idea is the Audiovisual Geolocation framework, which achieves accurate localization through interpretable perception and reasoning: a perception stage uses a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; a multimodal reasoning stage fuses these atoms with visual features using an MLLM fine-tuned via Group Relative Policy Optimization (GRPO); and a precision prediction stage applies Riemannian Flow Matching on the S² manifold to predict global-scale coordinates. Experiments show the framework significantly outperforms unimodal baselines, demonstrating that interpretable soundscape perception provides a critical orthogonal signal.
Link: https://arxiv.org/abs/2603.05708
Authors: Yiyang Su, Xiaoming Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded “acoustic atoms”; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the S^2 manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results entail that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
[CV-144] MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos and Documents
【Summary】: This paper addresses a major gap in evaluating Multimodal Large Language Models (MLLMs) in realistic settings: models must retrieve relevant evidence from large, heterogeneous multimodal corpora before reasoning, yet existing benchmarks restrict retrieval to small, single-modality candidate sets, producing overly optimistic evaluations. The key contribution is MultiHaystack, a benchmark with over 46,000 multimodal retrieval candidates (documents, images, and videos) and 747 open yet verifiable questions, each grounded in a unique validated evidence item within the retrieval pool, requiring cross-modal evidence localization and fine-grained reasoning. Experiments show that models perform well when given the correct evidence but degrade sharply when they must retrieve it from the full corpus, highlighting multimodal retrieval as a core bottleneck for MLLMs and positioning MultiHaystack as a key testbed for advancing retrieval-centric multimodal systems.
Link: https://arxiv.org/abs/2603.05697
Authors: Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
[CV-145] OWL: A Novel Approach to Machine Perception During Motion
【Summary】: This paper tackles the complex challenge of 3D perception during motion, in particular achieving geometric constancy and scaled 3D scene reconstruction from visual motion cues alone. The key idea is a perception-related function, OWL, whose values derive directly from two fundamental visual motion cues relative to a fixation point: perceived local visual looming of points near the fixation point, and perceived rotation of the rigid scene relative to it. OWL analytically relates the instantaneous directional range and directional translation between the camera and any visible 3D point, without requiring prior knowledge of their relative position or motion parameters. Using minimalistic, pixel-based, parallel computation, OWL provides a real-time, scalable representation of 3D points in dynamic environments, offering a new theoretical and practical foundation for robotics and autonomous navigation.
Link: https://arxiv.org/abs/2603.05686
Authors: Daniel Raviv, Juan D. Yepes
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce a perception-related function, OWL, designed to address the complex challenges of 3D perception during motion. It derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant. During motion, two visual motion cues relative to a fixation point emerge: 1) perceived local visual looming of points near the fixation point, and 2) perceived rotation of the rigid object relative to the fixation point. It also expresses the relation between two well-known physical quantities, the relative instantaneous directional range and directional translation in 3D between the camera and any visible 3D point, without explicitly requiring their measurement or prior knowledge of their individual values. OWL offers a unified, analytical time-based approach that enhances and simplifies key perception capabilities, including scaled 3D mapping and camera heading. Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. By leveraging direct measurements from raw visual motion image sequences, OWL values can be obtained without prior knowledge of stationary environments, moving objects, or camera motion. This approach employs minimalistic, pixel-based, parallel computations, providing an alternative real-time representation for 3D points in relative motion. OWL bridges the gap between theoretical concepts and practical applications in robotics and autonomous navigation and may unlock new possibilities for real-time decision-making and interaction, potentially serving as a building block for next-generation autonomous systems. This paper offers an alternative perspective on machine perception, with implications that may extend to natural perception and contribute to a better understanding of behavioral psychology and neural functionality.
[CV-146] Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding
【Summary】: This paper addresses the prohibitive compute cost of using vision-language models (VLMs) for Video Temporal Grounding (VTG). Existing training-free visual token pruning works well for video question answering, but applying it directly to VTG causes drastic degradation, because VTG depends heavily on boundary-sensitive evidence and cross-frame reasoning chains. The authors identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches, especially around event boundaries; and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. The resulting SemVID framework allocates per-frame token budgets by balancing query relevance and inter-frame variation, then selects three complementary token types: object tokens for diverse query-critical evidence, motion tokens as cross-frame relays, and a small set of context tokens for scene continuity. SemVID retains up to 95.4% mIoU with only 12.5% of visual tokens and achieves up to a 5.8x prefill speedup, clearly outperforming prior methods under the same budgets.
Link: https://arxiv.org/abs/2603.05663
Authors: Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
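The per-frame budget allocation step ("balancing query relevance and inter-frame variation to avoid over-pruned segments") can be sketched as a weighted split of a global token budget. The mixing weight `alpha`, the per-frame floor, and the proportional rounding scheme below are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

# Hypothetical sketch of per-frame token budget allocation in the spirit of SemVID.

def allocate_budgets(relevance, variation, total_budget, alpha=0.5, min_per_frame=1):
    """relevance, variation: per-frame scores; returns integer budgets summing to total."""
    r = relevance / max(relevance.sum(), 1e-8)
    v = variation / max(variation.sum(), 1e-8)
    weight = alpha * r + (1 - alpha) * v
    # Reserve a floor per frame so no segment is over-pruned, split the rest by weight.
    raw = weight * (total_budget - min_per_frame * len(relevance))
    budgets = np.floor(raw).astype(int) + min_per_frame
    # Hand leftover tokens (from flooring) to the highest-weight frames.
    leftover = total_budget - budgets.sum()
    order = np.argsort(-weight)
    for i in range(leftover):
        budgets[order[i % len(budgets)]] += 1
    return budgets

relevance = np.array([0.9, 0.1, 0.4, 0.6])   # query similarity per frame
variation = np.array([0.2, 0.8, 0.3, 0.1])   # difference from previous frame
budgets = allocate_budgets(relevance, variation, total_budget=32)
```

Frame 1 is barely query-relevant but highly dynamic, so the variation term still grants it a sizable budget, which is the over-pruning failure mode the allocation is meant to avoid.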
[CV-147] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
【Summary】: This paper addresses the failure of current post-training methods in settings without ideal reference answers: rubric-based approaches depend on a single ideal answer and break down on real-world tasks that admit many valid outputs. The key idea is Implicit Error Counting (IEC): instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong with severity weighting and converts those errors into calibrated per-aspect rewards. Two design choices, implicit score emission and group calibration, are needed to overcome the noise of naive explicit enumeration and make error counting a stable, reliable signal for reinforcement learning. On virtual try-on (VTO), a domain both too constrained for holistic scoring and too permissive for rubric-based evaluation, IEC clearly outperforms the rubric-based RaR baseline, notably on the Mismatch-DressCode (MDressBench) benchmark, validating error counting as a reward signal when ideal answers are unavailable.
Link: https://arxiv.org/abs/2603.05659
Authors: Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals and even in subjective domains by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices: implicit score emission and group calibration are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provide a stronger signal than constructing rubrics.
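The core reward shape, severity-weighted error counts calibrated within a group of responses to the same prompt, can be sketched as follows. The severity weights and the z-score-style group calibration are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Hypothetical sketch of turning error enumerations into group-calibrated rewards,
# in the spirit of Implicit Error Counting (IEC).

SEVERITY = {"minor": 1.0, "major": 3.0, "critical": 9.0}

def penalty(errors):
    """errors: list of (axis, severity) tuples for one sampled response."""
    return sum(SEVERITY[s] for _, s in errors)

def group_calibrated_rewards(group_errors):
    """Calibrate across a group of responses to the same prompt (as in GRPO-style RL)."""
    p = np.array([penalty(e) for e in group_errors], dtype=float)
    raw = -p                                  # fewer / milder errors => higher reward
    return (raw - raw.mean()) / (raw.std() + 1e-8)

group = [
    [("texture", "minor")],                              # one small flaw
    [("geometry", "major"), ("color", "minor")],         # several flaws
    [],                                                  # flawless response
]
rewards = group_calibrated_rewards(group)
best = int(np.argmax(rewards))   # the flawless response gets the top reward
```

Calibrating within the group zero-centers the rewards, so the learner is pushed toward the *relatively* least-flawed responses even when every sample has some errors.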
[CV-148] Making Reconstruction FID Predictive of Diffusion Generation FID
【Summary】: This paper addresses the poor correlation between a Variational Autoencoder's reconstruction FID (rFID) and the generation FID (gFID) of latent diffusion models: rFID fails to reflect generation quality, especially in the diffusion navigation phase. The proposed metric, interpolated FID (iFID), works as follows: for each element of the dataset, retrieve its nearest neighbor (NN) in latent space, linearly interpolate their latent representations, decode the interpolated latent, and compute the FID between the decoded samples and the original dataset. By incorporating local structure in latent space, iFID achieves a markedly stronger correlation with gFID (Pearson and Spearman correlations of roughly 0.85) and more accurately tracks sample quality across diffusion training phases (refinement versus navigation).
Link: https://arxiv.org/abs/2603.05630
Authors: Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, Yan Wang
Affiliations: Tsinghua University; University of Cambridge; Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the original dataset. Additionally, we refine the claim that rFID correlates poorly with gFID, by showing that rFID correlates with sample quality in the diffusion refinement phase, whereas iFID correlates with sample quality in the diffusion navigation phase. Furthermore, we provide an explanation for why iFID correlates well with gFID, and why reconstruction metrics are negatively correlated with gFID, by connecting to results in the diffusion generalization and hallucination. Empirically, iFID is the first metric to demonstrate a strong correlation with diffusion gFID, achieving Pearson linear and Spearman rank correlations approximately 0.85. The source code is provided in this https URL.
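The iFID recipe (nearest neighbor in latent space, interpolate, decode, compute FID against the originals) is simple enough to sketch on toy 2-D "latents". The identity decoder, the 0.5 interpolation weight, and the eigenvalue-based matrix square root are assumptions for illustration; real iFID uses a VAE decoder and Inception features:

```python
import numpy as np

# Hypothetical sketch of the iFID computation on toy latents.

def frechet_distance(X, Y):
    """Fréchet distance between Gaussians fitted to two sample sets."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    # Tr((cov_x cov_y)^{1/2}) via eigenvalues; the product of PSD matrices
    # has nonnegative real eigenvalues.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2 * covmean_trace)

def interpolated_fid(latents, decode, t=0.5):
    # Nearest neighbor (excluding self) for every latent.
    d2 = ((latents[:, None, :] - latents[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = d2.argmin(axis=1)
    mixed = (1 - t) * latents + t * latents[nn]
    return frechet_distance(decode(mixed), decode(latents))

rng = np.random.default_rng(1)
latents = rng.normal(size=(64, 2))
ifid = interpolated_fid(latents, decode=lambda z: z)   # identity "decoder" stand-in
```

Interpolating toward nearest neighbors shrinks the spread of the decoded set, so even with an identity decoder the distance is nonzero; a real decoder's behavior on these off-manifold interpolated latents is what the metric probes.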
[CV-149] Rethinking Concept Bottleneck Models: From Pitfalls to Solutions CVPR2026
【Summary】: This paper targets four practical problems of Concept Bottleneck Models (CBMs): the absence of a metric to pre-evaluate concept relevance; the "linearity problem", which lets models bypass the concept bottleneck and lose interpretability; the accuracy gap relative to black-box models; and the lack of systematic study of how visual backbones and Vision-Language Models (VLMs) affect CBMs. The key contribution is the CBM-Suite methodological framework: an entropy-based metric quantifies the intrinsic suitability of a concept set for a given dataset; a non-linear layer inserted between concept activations and the classifier resolves the linearity problem, ensuring accuracy faithfully reflects concept relevance; a distillation loss guided by a linear teacher probe narrows the accuracy gap; and a systematic analysis examines how combinations of vision encoders, VLMs, and concept sets jointly influence CBM accuracy and interpretability, improving both the accuracy and trustworthiness of concept-based models.
Link: https://arxiv.org/abs/2603.05629
Authors: Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the “linearity problem” causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework to systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.
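The architectural fix for the linearity problem, a non-linear layer between concept activations and the classifier, amounts to replacing the usual linear head on the bottleneck with a small MLP. A minimal numpy sketch with toy random weights; the layer sizes, sigmoid concept scores, and single ReLU hidden layer are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Hypothetical sketch of a CBM with a non-linear head: all predictions flow
# through the concept activations (the bottleneck), and a ReLU MLP rather
# than a single linear map sits between concepts and class logits.

rng = np.random.default_rng(0)
n_features, n_concepts, n_hidden, n_classes = 12, 8, 16, 4

def concept_bottleneck(x, Wc):
    """Map backbone features to concept activations in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(x @ Wc)))   # sigmoid concept scores

def nonlinear_head(c, W1, W2):
    """Non-linear classifier on top of concepts (one ReLU hidden layer)."""
    return np.maximum(c @ W1, 0.0) @ W2

x = rng.normal(size=(5, n_features))          # 5 inputs, 12 backbone features
Wc = rng.normal(size=(n_features, n_concepts))
W1 = rng.normal(size=(n_concepts, n_hidden))
W2 = rng.normal(size=(n_hidden, n_classes))
concepts = concept_bottleneck(x, Wc)
logits = nonlinear_head(concepts, W1, W2)
```

The point of the non-linear head is that the classifier can no longer be folded into the bottleneck as one linear map, so accuracy measured through the concepts actually reflects how informative the concept set is.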
[CV-150] Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection
【Summary】: This paper addresses the significant performance drop of camera-LiDAR fusion detectors under domain shift and sensor failures, which limits the robustness of 3D object detection in autonomous driving. Existing robustness approaches typically require modifying the fusion architecture or retraining specialized models, making them hard to integrate into deployed systems. The key idea is a lightweight Post Fusion Stabilizer (PFS) that operates on the intermediate bird's-eye-view (BEV) features of an existing detector. Designed as a near-identity transformation, it stabilizes feature statistics, spatially suppresses regions affected by sensor degradation, and adaptively restores weakened cues via residual correction, improving robustness to diverse camera and LiDAR corruptions without changing the original detection head or sacrificing clean performance.
Link: https://arxiv.org/abs/2603.05623
Authors: Trung Tien Dong, Dev Thakkar, Arman Sargolzaei, Xiaomin Lin
Affiliations: University of South Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird’s-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing robustness approaches often require modifying the fusion architecture or retraining specialized models, making them difficult to integrate into already deployed systems. We propose a Post Fusion Stabilizer (PFS), a lightweight module that operates on intermediate BEV representations of existing detectors and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Designed as a near-identity transformation, PFS preserves performance while improving robustness under diverse camera and LiDAR corruptions. Evaluations on the nuScenes benchmark demonstrate that PFS achieves state-of-the-art results in several failure modes, notably improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP while maintaining a lightweight footprint of only 3.3 M parameters.
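A near-identity module of the kind described, normalize feature statistics and add a small residual correction, can be sketched in numpy. The normalization and the scaled residual branch below are illustrative assumptions about the design, not the paper's architecture:

```python
import numpy as np

# Hypothetical sketch of a near-identity stabilizer on BEV features:
# (1) stabilize per-channel statistics, (2) add a small residual correction.
# With a zero-initialized residual branch the module is (up to normalization)
# the identity, so plugging it into an existing detector is low-risk.

def post_fusion_stabilizer(bev, W_res, eps=1e-5, scale=0.1):
    """bev: (C, H, W) fused BEV feature map; W_res: (C, C) residual mixing weights."""
    C = bev.shape[0]
    flat = bev.reshape(C, -1)
    # 1) Zero mean, unit variance per channel (stabilizes statistics under shift).
    norm = (flat - flat.mean(1, keepdims=True)) / (flat.std(1, keepdims=True) + eps)
    # 2) Small channel-mixing residual; `scale` keeps the map near the identity.
    out = norm + scale * (W_res @ norm)
    return out.reshape(bev.shape)

rng = np.random.default_rng(0)
bev = rng.normal(loc=3.0, scale=5.0, size=(4, 8, 8))       # badly scaled features
out = post_fusion_stabilizer(bev, W_res=np.zeros((4, 4)))  # zero-init residual
```

In a trained module `W_res` would be learned (the paper's version is 3.3M parameters and includes spatial suppression); the sketch only illustrates why a near-identity design preserves clean performance at initialization.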
[CV-151] Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening
【速读】:该论文旨在解决高内涵筛选(High-Content Screening)中因实验批次差异导致的生物批次效应(bio-batch effects)问题,此类效应会引发协变量偏移(covariate shift),显著降低深度学习模型在未见数据上的泛化性能。解决方案的关键在于将生物批次缓解建模为领域泛化(Domain Generalization, DG)问题,并提出对抗性批次表示增强(Adversarial Batch Representation Augmentation, ABRA)方法:ABRA通过参数化特征统计量为结构化不确定性来显式建模批次间的统计波动;利用最小-最大优化框架在表示空间中主动合成最坏情况的生物批次扰动,同时以严格的角几何边界约束保持细粒度类别可分性;此外,引入协同分布对齐目标防止对抗探索过程中的表征坍塌,从而实现鲁棒且可泛化的表征学习。
链接: https://arxiv.org/abs/2603.05622
作者: Lei Tong,Xujing Yao,Adam Corrigan,Long Chen,Navin Rathna Kumar,Kerry Hallbrook,Jonathan Orme,Yinhai Wang,Huiyu Zhou
机构: AstraZeneca R&D (阿斯利康研发部); University of Leicester (莱斯特大学); Nanjing Tech University (南京工业大学); Imperial College London (伦敦帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:High-Content Screening routinely generates massive volumes of cell painting images for phenotypic profiling. However, technical variations across experimental executions inevitably induce biological batch (bio-batch) effects. These cause covariate shifts and degrade the generalization of deep learning models on unseen data. Existing batch correction methods typically rely on additional prior knowledge (e.g., treatment or cell culture information) or struggle to generalize to unseen bio-batches. In this work, we frame bio-batch mitigation as a Domain Generalization (DG) problem and propose Adversarial Batch Representation Augmentation (ABRA). ABRA explicitly models batch-wise statistical fluctuations by parameterizing feature statistics as structured uncertainties. Through a min-max optimization framework, it actively synthesizes worst-case bio-batch perturbations in the representation space, guided by a strict angular geometric margin to preserve fine-grained class discriminability. To prevent representation collapse during this adversarial exploration, we introduce a synergistic distribution alignment objective. Extensive evaluations on the large-scale RxRx1 and RxRx1-WILDS benchmarks demonstrate that ABRA establishes a new state-of-the-art for siRNA perturbation classification.
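ABRA 把“批次风格”建模为特征统计量的扰动,其核心运算可以示意如下(玩具实现,非论文代码;实际扰动方向由 min-max 对抗优化得到,这里作为外部给定的输入):

```python
import math

def perturb_batch_stats(features, d_mu, d_sigma):
    """示意:去除特征的原批次统计量(均值/标准差),
    再注入扰动后的统计量,相当于在表示空间合成一个新的 bio-batch。"""
    mu = sum(features) / len(features)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in features) / len(features))
    return [(sigma + d_sigma) * (x - mu) / sigma + (mu + d_mu)
            for x in features]

feats = [0.0, 2.0]                                        # mu=1, sigma=1
aug = perturb_batch_stats(feats, d_mu=0.5, d_sigma=0.0)   # 仅平移均值
```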
[CV-152] DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces
【速读】:该论文旨在解决当前生成式 CAD(Computer-Aided Design)方法受限于小规模标注数据集的问题,尤其是缺乏带有明确设计历史或边界表示(Boundary Representation, BRep)标签的数据,导致难以实现大规模、可编辑的 CAD 模型生成。解决方案的关键在于提出 DreamCAD,一个基于点级监督的多模态生成框架,其核心创新是将每个 BRep 表示为一组参数化曲面(如 Bézier 曲面),并通过可微分的三角剖分(differentiable tessellation)方法生成网格,从而在无需 CAD 特定标注的情况下实现对复杂几何结构的端到端训练与可编辑表面重建。此外,作者还构建了目前最大的 CAD 描述数据集 CADCap-1M,以推动文本到 CAD 的研究进展。
链接: https://arxiv.org/abs/2603.05607
作者: Mohammad Sadil Khan,Muhammad Usama,Rolandos Alexandros Potamias,Didier Stricker,Muhammad Zeshan Afzal,Jiankang Deng,Ismail Elezi
机构: DFKI; RPTU, Kaiserslautern; Imperial College London; Huawei London Research Center
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: For Caption Dataset: this https URL
Abstract:Computer-Aided Design (CAD) relies on structured and editable geometric representations, yet existing generative methods are constrained by small annotated datasets with explicit design histories or boundary representation (BRep) labels. Meanwhile, millions of unannotated 3D meshes remain untapped, limiting progress in scalable CAD generation. To address this, we propose DreamCAD, a multi-modal generative framework that directly produces editable BReps from point-level supervision, without CAD-specific annotations. DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces. Furthermore, we introduce CADCap-1M, the largest CAD captioning dataset to date, with 1M+ descriptions generated using GPT-5 for advancing text-to-CAD research. DreamCAD achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference. Code and dataset will be publicly available.
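DreamCAD 把 BRep 表示为参数化曲面片,其核心数值原语是 Bézier 曲面求值——对 (u, v) 网格采样即得到可微细分(tessellation)出的网格顶点。以下为标准公式的最小实现(示意,非论文代码):

```python
from math import comb

def bernstein(n, i, t):
    # n 次 Bernstein 基函数
    return comb(n, i) * t ** i * (1 - t) ** (n - i)

def bezier_patch(ctrl, u, v):
    """在 (u, v) 处求值一个 Bézier 曲面片;ctrl 为控制点网格,
    每个控制点为 (x, y, z)。"""
    n, m = len(ctrl) - 1, len(ctrl[0]) - 1
    pt = [0.0, 0.0, 0.0]
    for i in range(n + 1):
        for j in range(m + 1):
            w = bernstein(n, i, u) * bernstein(m, j, v)
            for k in range(3):
                pt[k] += w * ctrl[i][j][k]
    return tuple(pt)

# 双线性(1x1 次)曲面片:角点求值应精确落在控制点上
ctrl = [[(0, 0, 0), (0, 1, 0)], [(1, 0, 0), (1, 1, 1)]]
corner = bezier_patch(ctrl, 0.0, 0.0)
mid = bezier_patch(ctrl, 0.5, 0.5)
```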
[CV-153] From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications
【速读】:该论文旨在解决基于热图(heatmap)的关键点检测模型在面对微小输入扰动时的鲁棒性验证问题,这一问题在姿态估计、视角恢复和三维重建等视觉任务中尤为关键。现有方法多采用解耦的分类式验证策略,独立评估每个关键点的鲁棒性,导致保守且不准确的保证。本文提出首个耦合鲁棒性验证框架,其核心在于通过混合整数线性规划(MILP)建模,联合约束所有关键点的坐标偏移,从而捕捉关键点之间的依赖关系及其对下游任务的影响。该方法将可达热图集合与编码联合偏差约束的多面体(polytope)相结合,实现对模型整体行为的精确验证:不可行性证明鲁棒性,可行性则提供反例,并理论证明该方法是可靠的(sound)。实验表明,该耦合方法在严格误差阈值下仍保持高效验证率,显著优于传统解耦方法。
链接: https://arxiv.org/abs/2603.05604
作者: Xusheng Luo,Changliu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 21 pages, 4 figures, 9 tables. arXiv admin note: text overlap with arXiv:2408.00117
Abstract:Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.
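耦合验证的判定逻辑(“不可行即认证、可行即反例”)可以用一个玩具例子说明:假设每个关键点的偏移已被界定在区间内(对应可达热图集合),联合规范要求偏移绝对值之和不超过阈值。论文用 MILP 处理一般的多面体联合约束,下面的草图仅是区间特例(假设性实现):

```python
def verify_joint_deviation(boxes, tau):
    """示意:boxes 为每个关键点偏移的区间 (lo, hi),
    联合规范为 sum_i |d_i| <= tau。返回 (是否认证鲁棒, 反例)。"""
    worst = [max(abs(lo), abs(hi)) for lo, hi in boxes]
    if sum(worst) <= tau:
        return True, None            # 最坏情况也满足规范 => 认证鲁棒
    # 否则构造一个最坏情况反例:每个关键点取绝对值最大的端点
    cex = [lo if abs(lo) >= abs(hi) else hi for lo, hi in boxes]
    return False, cex

ok, _ = verify_joint_deviation([(-1.0, 2.0), (-0.5, 0.5)], tau=3.0)
bad, cex = verify_joint_deviation([(-1.0, 2.0), (-0.5, 0.5)], tau=2.0)
```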
[CV-154] Thinking with Spatial Code for Physical-World Video Reasoning
【速读】:该论文旨在解决物理世界视觉问答(Visual Question Answering, VQA)中模型缺乏对空间结构显式建模与几何感知推理能力的问题。现有方法通常依赖隐式的视觉表征,难以实现跨视角的一致性空间理解与精确的几何推理。其解决方案的关键在于提出“空间编码思维”(Thinking with Spatial Code)框架,通过一个统一的时空空间编码器(spatial encoder),将RGB视频转化为具有时间一致性且包含显式3D定向边界框(oriented bounding boxes)和语义标签的空间代码(spatial code),从而使得大语言模型(LLMs)能够基于明确的空间变量进行推理;此外,该框架进一步采用基于空间规则的强化学习(reinforcement learning with a spatial rubric reward)微调LLM,以激励模型生成具有视角感知和几何约束的推理过程,最终在VSI-Bench基准上超越了现有的专有视觉-语言模型,达到新的最先进水平。
链接: https://arxiv.org/abs/2603.05591
作者: Jieneng Chen,Wenxin Ma,Ruisheng Yuan,Yunzhi Zhang,Jiajun Wu,Alan Yuille
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code at this https URL
Abstract:We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at this https URL.
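“空间代码”的要点是把视频解析为带语义标签的 3D 定向包围盒集合,使推理可以在显式空间变量上进行。以下为一个极简示意(数据结构与字段名均为假设,非论文的实际表示):

```python
import math

def object_distance(code, a, b):
    """示意:从空间代码中按语义标签取出两个物体,
    并在显式 3D 变量上计算中心距离——LLM 可对这类
    结构化量直接做几何推理。box = (cx, cy, cz, l, w, h, yaw)。"""
    boxes = {obj["label"]: obj["box"] for obj in code}
    return math.dist(boxes[a][:3], boxes[b][:3])

scene = [
    {"label": "chair", "box": (0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 0.0)},
    {"label": "table", "box": (3.0, 4.0, 0.0, 1.2, 0.8, 0.7, 0.0)},
]
d = object_distance(scene, "chair", "table")
```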
[CV-155] Bias In Bias Out? Finding Unbiased Subnetworks in Vanilla Models CVPR
【速读】:该论文旨在解决深度学习模型中存在的算法偏见(algorithmic bias)问题,特别是如何在不依赖额外无偏数据或重新训练模型参数的前提下,从标准训练的神经网络中提取出公平且对偏见不敏感的子网络。其解决方案的关键在于提出了一种名为Bias-Invariant Subnetwork Extraction (BISE) 的学习策略,通过结构化剪枝(pruning)识别并隔离原始模型中已存在的“无偏”子网络,从而实现无需微调或重训练即可获得性能稳定、鲁棒性强的去偏模型,显著提升了偏见缓解的效率与实用性。
链接: https://arxiv.org/abs/2603.05582
作者: Ivan Luiz De Moura Matos,Abdel Djalil Sad Saoud,Ekaterina Iakovleva,Vito Paolo Pastore,Enzo Tartaglione
机构: LTCI, Télécom Paris, Institut Polytechnique de Paris, France; MaLGa-DIBRIS, University of Genoa, Italy; AI for Good (AIGO), Istituto Italiano di Tecnologia, Genoa, Italy
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted for publication at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates “bias-free” subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate without modification, effectively relying less on biased features and maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model.
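BISE 的关键点在于只抽取掩码、不改动原参数。按幅值做剪枝的最小草图如下(示意;论文中的掩码由学习策略得到,这里用最简单的 top-k 幅值筛选代替):

```python
def extract_subnetwork(weights, keep_ratio):
    """示意:按权重幅值保留 top-k,得到二值掩码;
    原参数不重训、不微调,子网络即"掩码 x 原权重"。"""
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    thresh = flat[k - 1]
    mask = [[1 if abs(w) >= thresh else 0 for w in row] for row in weights]
    sub = [[w * m for w, m in zip(row, mrow)]
           for row, mrow in zip(weights, mask)]
    return mask, sub

W = [[0.9, -0.1], [0.05, -0.8]]
mask, sub = extract_subnetwork(W, keep_ratio=0.5)
```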
[CV-156] Digital-Twin Losses for Lane-Compliant Trajectory Prediction at Urban Intersections
【速读】:该论文旨在解决车联网(V2X)环境下复杂多智能体交互场景中轨迹预测的准确性与安全性问题,尤其关注信号交叉口处车辆与基础设施协同感知下的安全轨迹生成。其解决方案的关键在于构建一个基于数字孪生(digital twin)驱动的V2X轨迹预测流程,该流程融合了车载与路侧设施的协作感知信息,并采用双流损失函数优化模型训练:其中标准均方误差(MSE)损失保障点级预测精度,而创新性的“孪生损失”(twin loss)则编码了交通规则约束、碰撞规避、多模态多样性及来自数字孪生的规则先验,有效抑制交通违规、预测冲突和模式崩溃,从而实现场景一致且符合安全规范的轨迹预测。
链接: https://arxiv.org/abs/2603.05546
作者: Kuo-Yi Chao,Erik Leo Haß,Melina Gegg,Jiajie Zhang,Ralph Raßhofer,Alois Christian Knoll
机构: Technical University of Munich (慕尼黑工业大学); University of Stuttgart (斯图加特大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2 figures, conference
Abstract:Accurate and safety-conscious trajectory prediction is a key technology for intelligent transportation systems, especially in V2X-enabled urban environments with complex multi-agent interactions. In this paper, we created a digital twin-driven V2X trajectory prediction pipeline that jointly leverages cooperative perception from vehicles and infrastructure to forecast multi-agent motion at signalized intersections. The proposed model combines a Bi-LSTM-based generator with a structured training objective consisting of a standard mean squared error (MSE) loss and a novel twin loss. The twin loss encodes infrastructure constraints, collision avoidance, diversity across predicted modes, and rule-based priors derived from the digital twin. While the MSE term ensures point-wise accuracy, the twin loss penalizes traffic rule violations, predicted collisions, and mode collapse, guiding the model toward scene-consistent and safety-compliant predictions. We train and evaluate our approach on real-world V2X data sent from the intersection to the vehicle and collected in urban corridors. In addition to standard trajectory metrics (ADE, FDE), we introduce ITS-relevant safety indicators, including infrastructure and rule violation rates. Experimental results demonstrate that the proposed training scheme significantly reduces critical violations while maintaining comparable prediction accuracy and real-time performance, highlighting the potential of digital twin-driven multi-loss learning for V2X-enabled intelligent transportation systems.
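“MSE + 孪生损失”的组合结构可以示意如下(玩具版;各惩罚项的具体形式与权重均为假设,论文中规则先验来自数字孪生):

```python
def twin_loss(pred_modes, mse, rule_viol, collisions, w=(1.0, 10.0, 10.0, 0.1)):
    """示意:总损失 = 点级 MSE + 孪生项。孪生项惩罚规则违规与
    预测碰撞,并奖励多模态间的多样性以抑制模式崩塌。"""
    w_mse, w_rule, w_col, w_div = w
    # 多样性:各预测模态终点两两距离的负均值(距离越大惩罚越小)
    n = len(pred_modes)
    pair_d = [abs(pred_modes[i] - pred_modes[j])
              for i in range(n) for j in range(i + 1, n)]
    diversity_penalty = -sum(pair_d) / len(pair_d)
    return (w_mse * mse + w_rule * rule_viol + w_col * collisions
            + w_div * diversity_penalty)

# 模态终点重合(模式崩塌)时总损失更高,分散时多样性项降低损失
loss_collapse = twin_loss([1.0, 1.0], mse=0.2, rule_viol=0, collisions=0)
loss_diverse = twin_loss([0.0, 2.0], mse=0.2, rule_viol=0, collisions=0)
```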
[CV-157] Edges Are All You Need: Robust Gait Recognition via Label-Free Structure
【速读】:该论文旨在解决现有步态识别(gait recognition)方法中因依赖稀疏轮廓(silhouette)或需强语义标签的解析(parsing)而导致的判别能力不足与性能不稳定问题。其关键解决方案在于提出一种无需标注的新型视觉模态——SKETCH,通过边缘检测器从RGB图像中提取高频率结构线索(如肢体关节运动和自遮挡轮廓),构建密集且具结构性的表示;同时设计了SKETCHGAIT框架,采用双流独立学习与轻量级早期融合机制,实现标签引导的解析与无标签结构特征之间的语义解耦与结构互补,从而显著提升跨数据集的识别精度与鲁棒性。
链接: https://arxiv.org/abs/2603.05537
作者: Chao Zhang,Zhuang Zheng,Ruixin Li,Zhanyong Mei
机构: Chengdu University of Technology (成都理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 8 pages, 2 figures
Abstract:Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
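SKETCH 模态的可贵之处在于完全无标注:只需对灰度化后的 RGB 帧做边缘检测。Sobel 梯度幅值的最小实现如下(示意;论文中可替换为任意基于边缘的检测器):

```python
def sobel_edges(img):
    """示意:对灰度图做 Sobel 梯度幅值,得到含肢体关节/
    自遮挡轮廓等高频结构线索的稠密边缘图(边界置零)。"""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[i][j] * img[y - 1 + i][x - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(ky[i][j] * img[y - 1 + i][x - 1 + j]
                     for i in range(3) for j in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# 左半 0、右半 1 的竖直阶跃:强响应应出现在分界附近
img = [[0, 0, 1, 1]] * 4
edges = sobel_edges(img)
```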
[CV-158] RoboLayout: Differentiable 3D Scene Generation for Embodied Agents
【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在生成室内场景布局时,难以兼顾语义一致性与可交互性的问题,尤其是在物理受限的室内环境中,如何确保生成的布局不仅语义合理,还能被具身智能体(embodied agents)有效导航和操作。解决方案的关键在于提出RoboLayout框架,其核心创新是将显式的可达性约束(reachability constraints)引入可微分布局优化过程中,使生成的场景具备导航可行性和动作可执行性;同时,该框架支持多种物理能力各异的代理抽象(如服务机器人、人类或动物),实现面向特定代理的环境设计,并通过局部精修阶段选择性重优化问题区域,提升收敛效率而不增加全局优化迭代次数。
链接: https://arxiv.org/abs/2603.05522
作者: Ali Shamsaddinlou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.
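可达性约束要进入“可微布局优化”,通常写成平滑惩罚项。下面的草图(假设性公式,非 RoboLayout 实现)演示按代理宽度参数化的走廊侵入惩罚——这也说明了为什么同一场景可针对不同物理能力的代理分别优化:

```python
def clearance_penalty(obstacles, corridor_y, agent_width):
    """示意:物体若侵入智能体所需的通行走廊(宽度随代理类型变化),
    按侵入深度产生平滑损失,可被梯度下降推出走廊。
    obstacles: [(中心 y 坐标, 半径), ...]。"""
    half = agent_width / 2.0
    pen = 0.0
    for (y, r) in obstacles:
        overlap = max(0.0, half + r - abs(y - corridor_y))
        pen += overlap ** 2
    return pen

# 走廊在 y=0、代理宽 1.0:半径 0.3 的物体放在 y=0.5 侵入 0.3
p_block = clearance_penalty([(0.5, 0.3)], corridor_y=0.0, agent_width=1.0)
p_clear = clearance_penalty([(2.0, 0.3)], corridor_y=0.0, agent_width=1.0)
```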
[CV-159] Enhancing Neural Video Compression of Static Scenes with Positive-Incentive Noise
【速读】:该论文旨在解决静态场景视频(如监控录像和视频通话流)在传统标准化编码器和神经视频压缩(Neural Video Compression, NVC)方法中效率低下的问题,其核心挑战在于:一方面,现有方法未能有效利用时间冗余;另一方面,NVC模型在训练与测试数据分布不一致时性能显著下降,且生成式压缩方法虽提升主观质量却引入不可接受的幻觉细节。解决方案的关键在于引入正向激励噪声(positive-incentive noise),将短期时间变化重新解释为有助于模型微调的正向信号,从而实现瞬态变化与持久背景的解耦,使结构化先验信息内化至压缩模型中。推理阶段仅需传输不变成分,大幅减少数据传输量的同时保持像素级保真度,实验表明相较通用NVC模型可实现73%的Bjøntegaard delta (BD)码率节省。
链接: https://arxiv.org/abs/2603.06095
作者: Cheng Yuan,Zhenyu Jia,Jiawei Shao,Xuelong Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Static scene videos, such as surveillance feeds and videotelephony streams, constitute a dominant share of storage consumption and network traffic. However, both traditional standardized codecs and neural video compression (NVC) methods struggle to encode these videos efficiently due to inadequate usage of temporal redundancy and severe distribution gaps between training and test data, respectively. While recent generative compression methods improve perceptual quality, they introduce hallucinated details that are unacceptable in authenticity-critical applications. To overcome these limitations, we propose to incorporate positive-incentive noise into NVC for static scene videos, where short-term temporal changes are reinterpreted as positive-incentive noise to facilitate model finetuning. By disentangling transient variations from the persistent background, structured prior information is internalized in the compression model. During inference, the invariant component requires minimal signaling, thus reducing data transmission while maintaining pixel-level fidelity. Preliminary experiments demonstrate a 73% Bjøntegaard delta (BD) rate saving compared to general NVC models. Our method provides an effective solution to trade computation for bandwidth, enabling robust video transmission under adverse network conditions and economic long-term retention of surveillance footage.
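“把瞬态变化当作正向激励噪声”的前提,是先把静态场景解耦为持久背景与逐帧残差;不变成分只需极少信令。最小示意(仅说明解耦步骤,非论文的微调方法本身):

```python
def decompose_static_scene(frames):
    """示意:背景取时间均值,只需传输一次;逐帧只剩瞬态残差,
    其短期变化在微调中被视为正向激励噪声而非需精确编码的信号。"""
    t = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    bg = [[sum(frames[k][y][x] for k in range(t)) / t
           for x in range(w)] for y in range(h)]
    residuals = [[[frames[k][y][x] - bg[y][x] for x in range(w)]
                  for y in range(h)] for k in range(t)]
    return bg, residuals

frames = [[[10, 10]], [[10, 14]]]   # 2 帧、1x2 像素:仅右侧像素瞬态变化
bg, res = decompose_static_scene(frames)
```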
[CV-160] Architectural Unification for Polarimetric Imaging Across Multiple Degradations
【速读】:该论文旨在解决极化成像中因低光照噪声、运动模糊和马赛克伪影等多样化退化导致的物理一致极化参数(总强度TI、偏振度DoP、偏振角AoP)恢复难题。现有方法通常针对单一退化类型设计专用网络结构,且多采用多阶段处理流程或仅在图像域或Stokes域中操作,难以兼顾适应性和物理一致性。其解决方案的关键在于提出一个统一的架构框架,在不同退化场景下共享相同网络结构,通过单阶段联合图像-Stokes域处理实现端到端恢复,避免误差累积并显式保留物理约束关系,从而在多种退化任务中均达到最优性能。
链接: https://arxiv.org/abs/2603.05834
作者: Chu Zhou,Yufei Han,Junda Liao,Linrui Dai,Wangze Xu,Art Subpa-Asa,Heng Guo,Boxin Shi,Imari Sato
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Polarimetric imaging aims to recover polarimetric parameters, including Total Intensity (TI), Degree of Polarization (DoP), and Angle of Polarization (AoP), from captured polarized measurements. In real-world scenarios, these measurements are frequently affected by diverse degradations such as low-light noise, motion blur, and mosaicing artifacts. Due to the nonlinear dependency of DoP and AoP on the measured intensities, accurately retrieving physically consistent polarimetric parameters from degraded observations remains highly challenging. Existing approaches typically adopt task-specific network architectures tailored to individual degradation types, limiting their adaptability across different restoration scenarios. Moreover, many methods rely on multi-stage processing pipelines that suffer from error accumulation, or operate solely in a single domain (either image or Stokes domain), failing to fully exploit the intrinsic physical relationships between them. In this work, we propose a unified architectural framework for polarimetric imaging that is structurally shared across multiple degradation scenarios. Rather than redesigning network structures for each task, our framework maintains a consistent architectural design while being trained separately for different degradations. The model performs single-stage joint image-Stokes processing, avoiding error accumulation and explicitly preserving physical consistency. Extensive experiments show that this unified architectural design, when trained for specific degradation types, consistently achieves state-of-the-art performance across low-light denoising, motion deblurring, and demosaicing tasks, establishing a versatile and physically grounded solution for degraded polarimetric imaging.
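摘要强调 DoP/AoP 对强度呈非线性依赖,其来源是由四个偏振角强度恢复 Stokes 参量的经典公式。这组标准公式可直接写成几行代码(与论文的网络部分无关,仅为物理背景):

```python
import math

def stokes_from_polarized(i0, i45, i90, i135):
    """示意:由 0/45/90/135 度偏振强度恢复 TI/DoP/AoP。
    平方根与 atan2 正是非线性依赖、退化下难以保持
    物理一致性的根源。"""
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # 总强度 TI
    s1 = i0 - i90
    s2 = i45 - i135
    dop = math.sqrt(s1 * s1 + s2 * s2) / s0
    aop = 0.5 * math.atan2(s2, s1)        # 弧度
    return s0, dop, aop

# 完全线偏振、偏振角 0 度:I0=1, I90=0, I45=I135=0.5
s0, dop, aop = stokes_from_polarized(1.0, 0.5, 0.0, 0.5)
```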
[CV-161] Uni-LVC: A Unified Method for Intra- and Inter-Mode Learned Video Compression
【速读】:该论文旨在解决现有学习型视频压缩(Learned Video Compression, LVC)方法在实际应用中的两大关键问题:一是模型通常需分别设计 intra 和 inter 编码模式,导致架构冗余且难以统一优化;二是当时间参考帧不可靠时,inter 编码性能显著下降,影响整体压缩效率。解决方案的核心在于提出 Uni-LVC,一种统一的 LVC 方法,通过将 inter 编码建模为基于参考帧提取的时间信息条件下的 intra 编码,实现了单模型支持 intra 与 inter 编码模式(包括低延迟和随机访问)。其关键技术包括:1)设计了一个高效的跨注意力适配模块以整合时间线索,支持单向(低延迟)和双向(随机访问)预测;2)引入可靠性感知分类器,动态调整时间线索强度,在参考帧不可靠时使模型更接近 intra 编码行为;3)采用多阶段训练策略促进不同编码模式下的自适应学习。实验表明,Uni-LVC 在 intra 和 inter 配置下均实现更优率失真性能,同时保持计算效率。
链接: https://arxiv.org/abs/2603.05756
作者: Yichi Zhang,Ruoyu Yang,Fengqing Zhu
机构: Purdue University (普渡大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in learned video compression (LVC) have led to significant performance gains, with codecs such as DCVC-RT surpassing the H.266/VVC low-delay mode in compression efficiency. However, existing LVCs still exhibit key limitations: they often require separate models for intra and inter coding modes, and their performance degrades when temporal references are unreliable. To address this, we introduce Uni-LVC, a unified LVC method that supports both intra and inter coding with low-delay and random-access in a single model. Building on a strong intra-codec, Uni-LVC formulates inter-coding as intra-coding conditioned on temporal information extracted from reference frames. We design an efficient cross-attention adaptation module that integrates temporal cues, enabling seamless support for both unidirectional (low-delay) and bidirectional (random-access) prediction modes. A reliability-aware classifier is proposed to selectively scale the temporal cues, making Uni-LVC behave closer to intra coding when references are unreliable. We further propose a multistage training strategy to facilitate adaptive learning across various coding modes. Extensive experiments demonstrate that Uni-LVC achieves superior rate-distortion performance in intra and inter configurations while maintaining comparable computational efficiency.
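“inter 编码 = 以时间线索为条件的 intra 编码,并由可靠性分数门控”这一建模可抽象为一行门控公式(示意草图;真实模型中的融合由交叉注意力适配模块完成):

```python
def unified_coding_feature(intra_feat, temporal_feat, reliability):
    """示意:用可靠性分数 g 在 [0,1] 内缩放时间线索;
    g=0 时严格退化为 intra 行为(参考帧不可靠的情形)。"""
    g = min(1.0, max(0.0, reliability))
    return [a + g * b for a, b in zip(intra_feat, temporal_feat)]

intra = [1.0, 2.0]
temp = [0.5, -0.5]
inter = unified_coding_feature(intra, temp, reliability=1.0)
pure_intra = unified_coding_feature(intra, temp, reliability=0.0)
```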
[CV-162] Interpretable Motion Artifact Detection in Structural Brain MRI
【速读】:该论文旨在解决结构化脑部磁共振成像(MRI)中运动伪影导致的质量评估自动化难题,尤其针对现有方法在跨扫描站点时泛化能力差以及计算成本高的问题。其解决方案的关键在于提出一种轻量且可解释的框架,通过将二维方向的判别梯度幅值直方图(Discriminative Histogram of Gradient Magnitude, DHoGM)扩展至三维空间,融合切片级(2D)与体积级(3D)特征,并采用并行决策策略以同时捕捉局部与全局运动引起的图像退化;同时利用重叠的3D立方体进行体素级分析,在保证计算效率的同时实现全面的空间覆盖,最终构建一个仅含209个可训练参数的高效模型,显著提升了在已知和未知站点条件下的检测准确率与鲁棒性。
链接: https://arxiv.org/abs/2603.05726
作者: Naveetha Nithianandam,Prabhjot Kaur,Anil Kumar Sao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated quality assessment of structural brain MRI is an important prerequisite for reliable neuroimaging analysis, yet it remains challenging due to motion artifacts and poor generalization across acquisition sites. Existing approaches based on image quality metrics (IQMs) or deep learning either require extensive preprocessing, which incurs high computational cost, or generalize poorly to unseen data. In this work, we propose a lightweight and interpretable framework for detecting motion-related artifacts in T1-weighted brain MRI by extending the Discriminative Histogram of Gradient Magnitude (DHoGM) to three-dimensional space. The proposed method integrates complementary slice-level (2D) and volume-level (3D) DHoGM features through a parallel decision strategy, capturing both localized and global motion-induced degradation. Volumetric analysis is performed using overlapping 3D cuboids to achieve comprehensive spatial coverage while maintaining computational efficiency. A simple threshold-based classifier and a low-parameter multilayer perceptron are used, resulting in a model with only 209 trainable parameters. Our method was evaluated on the MR-ART and ABIDE datasets under both seen-site and unseen-site conditions. Experimental results demonstrate strong performance, achieving up to 94.34% accuracy in the in-domain evaluation and 89% accuracy on unseen sites, while almost completely avoiding false acceptance of poor-quality scans. Ablation studies confirm the complementary benefits of combining 2D and 3D features. Overall, the proposed approach offers an effective, efficient, and robust solution for automated MRI quality assessment, with strong potential for integration into large-scale clinical and research workflows.
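DHoGM 的基本构件是梯度幅值直方图;扩展到 3D 即对体数据做三方向有限差分再统计。最小示意如下(`bins`、`vmax` 等超参数均为假设值,非论文设定):

```python
import math

def grad_magnitude_histogram(vol, bins=4, vmax=8.0):
    """示意:对 3D 体数据做前向差分梯度幅值并统计直方图;
    运动伪影会改变该分布的形状,从而成为判别特征。"""
    d, h, w = len(vol), len(vol[0]), len(vol[0][0])
    hist = [0] * bins
    for z in range(d - 1):
        for y in range(h - 1):
            for x in range(w - 1):
                gz = vol[z + 1][y][x] - vol[z][y][x]
                gy = vol[z][y + 1][x] - vol[z][y][x]
                gx = vol[z][y][x + 1] - vol[z][y][x]
                mag = math.sqrt(gx * gx + gy * gy + gz * gz)
                b = min(bins - 1, int(mag / vmax * bins))
                hist[b] += 1
    return hist

vol = [[[0, 0], [0, 0]], [[4, 0], [0, 0]]]   # 2x2x2 体素,z 方向一处强梯度
hist = grad_magnitude_histogram(vol)
```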
[CV-163] Longitudinal Lesion Inpainting in Brain MRI via 3D Region Aware Diffusion
【速读】:该论文旨在解决纵向脑部磁共振成像(MRI)分析中因病灶动态演变而导致自动化神经影像处理流程产生偏差的问题。现有方法多为横截面操作或缺乏三维解剖连续性,难以有效重建随时间变化的病变区域。其解决方案的关键在于提出一种基于去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM)的伪3D纵向修复框架,通过多通道条件输入融合不同时间点(t₁, t₂)的纵向上下文信息,并引入区域感知扩散(Region-Aware Diffusion, RAD)机制,聚焦于病灶区域进行生成式修复,同时保持周围健康组织不变。该方法显著提升了感知保真度与时间稳定性,且在效率上实现约10倍加速,为进行渐进性神经退行性疾病研究提供了可靠高效的预处理方案。
链接: https://arxiv.org/abs/2603.05693
作者: Zahra Karimaghaloo,Dumitru Fetco,Haz-Edine Assemlal,Hassan Rivaz,Douglas L. Arnold
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate longitudinal analysis of brain MRI is often hindered by evolving lesions, which bias automated neuroimaging pipelines. While deep generative models have shown promise in inpainting these lesions, most existing methods operate cross-sectionally or lack 3D anatomical continuity. We present a novel pseudo-3D longitudinal inpainting framework based on Denoising Diffusion Probabilistic Models (DDPM). Our approach utilizes multi-channel conditioning to incorporate longitudinal context from distinct visits (t_1, t_2) and extends Region-Aware Diffusion (RAD) to the medical domain, focusing the generative process on pathological regions without altering surrounding healthy tissue. We evaluated our model against state-of-the-art baselines on longitudinal brain MRI from 93 patients. Our model significantly outperforms the leading baseline (FastSurfer-LIT) in terms of perceptual fidelity, reducing the Learned Perceptual Image Patch Similarity (LPIPS) distance from 0.07 to 0.03 while effectively eliminating inter-slice discontinuities. Furthermore, our model demonstrates high longitudinal stability with a Temporal Fidelity Index of 1.024, closely approaching the ideal value of 1.0 and substantially narrowing the gap compared to LIT’s TFI of 1.22. Notably, the RAD mechanism provides a substantial gain in efficiency; our framework achieves an average processing time of 2.53 min per volume, representing approximately 10x speedup over the 24.30 min required by LIT. By leveraging longitudinal priors and region-specific denoising, our framework provides a highly reliable and efficient preprocessing step for the study of progressive neurodegenerative diseases. A derivative dataset consisting of 93 pre-processed scans used for testing will be available upon request after acceptance. Code will be released upon acceptance.
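区域感知扩散(RAD)的空间选择性可归结为每个去噪步的掩膜混合:病灶掩膜内采用生成内容,掩膜外保留已知健康组织。示意如下(省略了扩散采样本身,仅演示混合步骤):

```python
def rad_blend(generated, known, mask):
    """示意:mask 为 1 的位置取生成内容,为 0 的位置保留已知体素,
    从而保证病灶区域之外的解剖结构不被改动。"""
    return [[g if m else k for g, k, m in zip(gr, kr, mr)]
            for gr, kr, mr in zip(generated, known, mask)]

known = [[1, 2], [3, 4]]
generated = [[9, 9], [9, 9]]
mask = [[1, 0], [0, 0]]       # 仅左上角为病灶区域
out = rad_blend(generated, known, mask)
```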
[CV-164] Gabor Primitives for Accelerated Cardiac Cine MRI Reconstruction
【速读】:该论文旨在解决加速心脏电影磁共振成像(cardiac cine MRI)中,从高度欠采样k空间数据重建时空图像的难题。传统方法如压缩感知(compressed sensing)或隐式神经表示(implicit neural representations, INRs)存在训练数据依赖性强或物理参数不可解释的问题。本文提出使用Gabor基元(Gabor primitives)作为解决方案的核心:通过将高斯包络与复指数调制相结合,使每个基元的频谱支持可定位在任意k空间位置,从而高效表征平滑结构和边缘锐利的特征;同时,利用心脏电影数据中的时空冗余特性,将每个基元的时间变化分解为低秩几何基(捕捉心脏运动)和信号强度基(建模对比度变化),实现更紧凑且具物理意义的连续分辨率重建。实验表明,该方法在笛卡尔和径向轨迹下均显著优于压缩感知、高斯基元及哈希网格INR基线。
链接: https://arxiv.org/abs/2603.05681
作者: Wenqi Huang,Veronika Spieker,Nil Stolt-Ansó,Natascha Niessen,Maik Dannecker,Sevgi Gokce Kafali,Sila Kurugol,Julia A. Schnabel,Daniel Rueckert
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accelerated cardiac cine MRI requires reconstructing spatiotemporal images from highly undersampled k-space data. Implicit neural representations (INRs) enable scan-specific reconstruction without large training datasets, but encode content implicitly in network weights without physically interpretable parameters. Gaussian primitives provide an explicit and geometrically interpretable alternative, but their spectra are confined near the k-space origin, limiting high-frequency representation. We propose Gabor primitives for MRI reconstruction, modulating each Gaussian envelope with a complex exponential to place its spectral support at an arbitrary k-space location, enabling efficient representation of both smooth structures and sharp boundaries. To exploit spatiotemporal redundancy in cardiac cine, we decompose per-primitive temporal variation into a low-rank geometry basis capturing cardiac motion and a signal-intensity basis modeling contrast changes. Experiments on cardiac cine data with Cartesian and radial trajectories show that Gabor primitives consistently outperform compressed sensing, Gaussian primitives, and hash-grid INR baselines, while providing a compact, continuous-resolution representation with physically meaningful parameters.
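Gabor 基元 = 高斯包络 × 复指数载波;载波频率决定其频谱支持在 k 空间中的位置,频率为零时退化为普通高斯基元(频谱集中在 k 空间原点)。逐点求值可写作(示意;未包含论文中的时间基分解):

```python
import cmath
import math

def gabor_primitive(x, y, mu=(0.0, 0.0), sigma=1.0, freq=(0.0, 0.0), amp=1.0):
    """示意:在 (x, y) 处求值一个 2D Gabor 基元。
    freq=(0,0) 时即为高斯基元。"""
    dx, dy = x - mu[0], y - mu[1]
    envelope = amp * math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
    carrier = cmath.exp(2j * math.pi * (freq[0] * x + freq[1] * y))
    return envelope * carrier

center_val = gabor_primitive(0.0, 0.0)                 # 高斯情形,中心值为 1
mod_val = gabor_primitive(0.0, 0.0, freq=(0.25, 0.0))  # 中心相位仍为 0
half_period = gabor_primitive(2.0, 0.0, sigma=100.0, freq=(0.25, 0.0))
```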
[CV-165] Clinical-Injection Transformer with Domain-Adapted MAE for Lupus Nephritis Prognosis Prediction
【速读】:该论文旨在解决儿童狼疮性肾炎(lupus nephritis, LN)预后预测在计算病理学领域长期缺失的问题,以及现有基于组织病理学的方法依赖多种昂贵染色协议且未能融合临床数据的局限性。其核心解决方案是提出首个多模态计算病理框架,仅使用常规 PAS 染色活检切片与结构化临床数据,实现对三种治疗反应(完全缓解、部分缓解和无反应)的精准分类。关键创新在于:一是设计临床注入 Transformer(Clinical-Injection Transformer, CIT),将临床特征作为条件令牌嵌入到图像块级自注意力机制中,实现在统一注意力空间内的隐式双向跨模态交互;二是采用解耦表示-知识适应策略,利用领域适配的掩码自编码器(domain-adapted Masked Autoencoder, MAE)显式分离形态学特征学习与病理知识提取,并引入多粒度形态类型注入机制,以桥接提炼的分类知识与患者层面的预后预测。
链接: https://arxiv.org/abs/2603.05535
作者: Yuewen Huang,Zhitao Ye,Guangnan Feng,Fudan Zheng,Xia Gao,Yutong Lu
机构: Fudan University (复旦大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Lupus nephritis (LN) is a severe complication of systemic lupus erythematosus that affects pediatric patients with significantly greater severity and worse renal outcomes compared to adults. Despite the urgent clinical need, predicting pediatric LN prognosis remains unexplored in computational pathology. Furthermore, the only existing histopathology-based approach for LN relies on multiple costly staining protocols and fails to integrate complementary clinical data. To address these gaps, we propose the first multimodal computational pathology framework for three-class treatment response prediction (complete remission, partial response, and no response) in pediatric LN, utilizing only routine PAS-stained biopsies and structured clinical data. Our framework introduces two key methodological innovations. First, a Clinical-Injection Transformer (CIT) embeds clinical features as condition tokens into patch-level self-attention, facilitating implicit and bidirectional cross-modal interactions within a unified attention space. Second, we design a decoupled representation-knowledge adaptation strategy using a domain-adapted Masked Autoencoder (MAE). This strategy explicitly separates self-supervised morphological feature learning from pathological knowledge extraction. Additionally, we introduce a multi-granularity morphological type injection mechanism to bridge distilled classification knowledge with downstream prognostic predictions at both the instance and patient levels. Evaluated on a cohort of 71 pediatric LN patients with KDIGO-standardized labels, our method achieves a three-class accuracy of 90.1% and an AUC of 89.4%, demonstrating its potential as a highly accurate and cost-effective prognostic tool.
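“临床特征作为条件令牌注入 patch 级自注意力”可以用一个无参数的玩具注意力说明(示意;真实 CIT 含可学习的 Q/K/V 投影,这里取恒等以突出令牌拼接本身):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def attend_with_condition(patch_tokens, clinical_token):
    """示意:把临床条件令牌拼到图像块令牌之前,在同一自注意力
    空间内做双向交互(Q=K=V 取恒等的简化)。"""
    tokens = [clinical_token] + patch_tokens
    d = len(clinical_token)
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * kv[j] for wi, kv in zip(w, tokens))
                    for j in range(d)])
    return out

patches = [[1.0, 0.0], [0.0, 1.0]]
clinical = [1.0, 1.0]
out = attend_with_condition(patches, clinical)
```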
人工智能
[AI-0] Fly360: Omnidirectional Obstacle Avoidance within Drone View
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicles, UAVs)在复杂环境中实现全向避障的问题,尤其针对传统方法依赖有限视场传感器、难以满足运动方向与航向不一致时的全局空间感知需求的局限性。其解决方案的关键在于提出Fly360,一个两阶段感知-决策流水线:首先利用全景RGB图像生成鲁棒的深度图作为中间表示,再通过轻量级策略网络从深度输入中直接输出机体坐标系下的速度指令;同时引入固定随机偏航训练策略,显著提升了模型在任意方向障碍物场景下的稳定避障能力,实验证明其优于仅基于前视视角的基线方法。
链接: https://arxiv.org/abs/2603.06573
作者: Xiangkai Zhang,Dizhe Zhang,WenZhuo Cao,Zhaoliang Wan,Yingjie Niu,Lu Qi,Xu Yang,Zhiyong Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages, 10 figures
Abstract:Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spatial intelligence. However, current obstacle-avoidance methods mainly depend on limited field-of-view sensors and are ill-suited for UAV scenarios which require full-spatial awareness when the movement direction differs from the UAV’s heading. This limitation motivates us to explore omnidirectional obstacle avoidance for panoramic drones with full-view perception. We first study an under explored problem setting in which a UAV must generate collision-free motion in environments with obstacles from arbitrary directions, and then construct a benchmark that consists of three representative flight tasks. Based on such settings, we propose Fly360, a two-stage perception-decision pipeline with a fixed random-yaw training strategy. At the perception stage, panoramic RGB observations are input and converted into depth maps as a robust intermediate representation. For the policy network, it is lightweight and used to output body-frame velocity commands from depth inputs. Extensive simulation and real-world experiments demonstrate that Fly360 achieves stable omnidirectional obstacle avoidance and outperforms forward-view baselines across all tasks. Our model is available at this https URL
[AI-1] Boosting deep Reinforcement Learning using pretraining with Logical Options
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)智能体因过度利用早期奖励信号而导致的对齐偏差问题,即智能体倾向于追求短期奖励而非长期目标。解决方案的关键在于提出一种混合层次强化学习框架(Hybrid Hierarchical RL, H²RL),其核心创新是引入符号结构到基于神经网络的强化学习中,通过两阶段策略实现:首先采用逻辑选项预训练策略引导学习策略摆脱短时奖励循环,转向目标导向行为;其次保留深度策略的表达能力,允许最终策略通过标准环境交互进一步优化。该方法在不牺牲深度策略灵活性的前提下,显著提升了长时决策性能,并优于强基准模型。
链接: https://arxiv.org/abs/2603.06565
作者: Zihan Ye,Phil Chau,Raban Emunds,Jannis Blüml,Cedric Derstroff,Quentin Delfosse,Oleg Arenz,Kristian Kersting
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans’ ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H^2RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.
[AI-2] RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering
【速读】:该论文针对生成式 AI (Generative AI) 在呼吸 care 领域应用中面临的异质性挑战展开研究,旨在解决现有生物医学音频-语言问答(audio-language QA)系统在处理多样化呼吸录音和多意图查询时性能受限的问题。具体而言,现有系统通常为单体结构,缺乏对不同音频来源、采集环境及问题格式的专门化处理机制,且在真实世界场景中的泛化能力尚未验证。其解决方案的关键在于提出 RAMoEA-QA——一个分层路由的生成模型,通过两阶段条件专业化实现高效适配:第一阶段使用 Audio Mixture-of-Experts(音频混合专家)根据输入音频特征动态选择最优预训练音频编码器;第二阶段采用 Language Mixture-of-Adapters(语言混合适配器)在冻结的大语言模型(LLM)上按查询意图和答案格式选择 LoRA 适配器,从而实现声学表征与生成行为的细粒度个性化匹配。此设计显著提升了模型在多种领域、模态和任务偏移下的准确性与鲁棒性,同时保持极低的参数开销。
链接: https://arxiv.org/abs/2603.06542
作者: Gaia A. Bertolino,Yuwei Zhang,Tong Xia,Domenico Talia,Cecilia Mascolo
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and question formats. Existing biomedical audio-language QA systems are typically monolithic, without any specialization mechanisms for tackling diverse respiratory corpora and query intents. They are also only validated in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world settings. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts. 
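下面用一段最小示意代码演示摘要中"两级条件专业化"的路由思想:第一级按录音特征选择音频编码器,第二级按查询意图选择 LoRA 适配器。其中编码器/适配器名称、阈值与路由规则均为假设,并非 RAMoEA-QA 的真实实现。

```python
# 两级条件路由的概念示意(名称与规则均为假设)。

def route_audio(features):
    """Audio Mixture-of-Experts:按录音特征选择预训练音频编码器(示意规则)。"""
    if features["sample_rate"] >= 16000 and features["snr_db"] > 20:
        return "encoder_clean_speech"
    return "encoder_noisy_field"

def route_query(intent):
    """Language Mixture-of-Adapters:按查询意图在冻结 LLM 上选择 LoRA 适配器。"""
    adapters = {
        "diagnosis": "lora_classification",  # 离散标签任务
        "severity": "lora_regression",       # 连续目标任务
    }
    return adapters.get(intent, "lora_default")

def dispatch(features, intent):
    """对每个样本同时完成声学表征与生成行为的个性化匹配。"""
    return route_audio(features), route_query(intent)
```

这种"先路由、后推理"的分发方式正是摘要所述逐样本专门化(per-example specialization)的核心:路由本身开销极小,主要参数量仍集中在共享的冻结主干上。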
[AI-3] Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input
【速读】:该论文旨在解决流式文本到语音(Streaming TTS)系统在交互式应用中面临的两大挑战:一是因缺乏前瞻信息导致的韵律不自然问题,二是因上下文无界导致的长文本合成中的“长程崩溃”(long-form collapse)问题。其解决方案的关键在于提出一种基于韵律边界感知的后训练策略(prosodic-boundary-aware post-training strategy),利用弱时间对齐数据对预训练的大语言模型(LLM)驱动的TTS模型进行微调,使其能够在仅接收到有限未来文本时学会在指定内容边界处提前停止生成;同时,在推理阶段采用滑动窗口提示机制,向前传递先前的文本和语音标记,从而保证上下文有界且合成结果无缝拼接,显著提升了短文本与长文本场景下的语音质量与一致性。
链接: https://arxiv.org/abs/2603.06444
作者: Changsong Liu,Tianrui Wang,Ye Ni,Yizhou Peng,Eng Siong Chng
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, adapting a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model is adapted to learn early stopping at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show our method outperforms CosyVoice-Style interleaved baseline in both short and long-form scenarios. In long-text synthesis, especially, it achieves a 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion similarity by 16.1% and 1.5% relatively, offering a robust solution for streaming TTS with incremental text.
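摘要中的"滑动窗口提示"可以用一个极简的有界上下文缓冲来示意:向前携带有限的历史 token,超出窗口即丢弃最旧内容。窗口大小等参数为假设,仅演示"上下文有界 + 无缝拼接"的思路,并非论文实现。

```python
from collections import deque

class SlidingWindowPrompt:
    """滑动窗口提示的最小示意(窗口大小 max_ctx 为假设参数)。"""
    def __init__(self, max_ctx=8):
        # deque(maxlen=...) 在超出容量时自动丢弃最旧的 token,保证上下文有界
        self.window = deque(maxlen=max_ctx)

    def extend(self, tokens):
        """合成完一段后,把该段的文本/语音 token 追加进历史窗口。"""
        self.window.extend(tokens)

    def prompt(self, new_text):
        """历史 token + 新到达的流式文本 = 下一段合成的输入(不修改窗口)。"""
        return list(self.window) + list(new_text)
```

由于窗口长度固定,单步合成的上下文计算量不随全文长度增长,这正是流式长文本合成避免"长程崩溃"的前提之一。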
[AI-4] A Reference Architecture of Reinforcement Learning Frameworks
【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)框架在架构设计上缺乏统一参考标准的问题,即当前RL框架的实现方式多样且不一致,导致难以进行比较、评估和集成。解决方案的关键在于通过扎根理论(grounded theory)方法分析18个实际应用中的RL框架,识别出重复出现的架构组件及其相互关系,并将其归纳为一个可复用的参考架构(Reference Architecture, RA),从而为RL框架的设计与改进提供结构化依据。
链接: https://arxiv.org/abs/2603.06413
作者: Xiaoran Liu,Istvan David
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The surge in reinforcement learning (RL) applications gave rise to diverse supporting technology, such as RL frameworks. However, the architectural patterns of these frameworks are inconsistent across implementations and there exists no reference architecture (RA) to form a common basis of comparison, evaluation, and integration. To address this gap, we propose an RA of RL frameworks. Through a grounded theory approach, we analyze 18 state-of-the-practice RL frameworks and, by that, we identify recurring architectural components and their relationships, and codify them in an RA. To demonstrate our RA, we reconstruct characteristic RL patterns. Finally, we identify architectural trends, e.g., commonly used components, and outline paths to improving RL frameworks.
[AI-5] Kinetic-based regularization: Learning spatial derivatives and PDE applications ICLR2026
【速读】:该论文旨在解决从离散且含噪数据中准确估计空间导数的问题,这是科学机器学习(Scientific Machine Learning)和偏微分方程(PDE)数值求解中的核心挑战。解决方案的关键在于扩展基于动力学的正则化(Kinetic-Based Regularization, KBR),这是一种具有单个可训练参数的局部多维核回归方法,通过提出显式和隐式两种导数学习方案,实现了在1D情况下理论保证的二阶精度。其中,显式方案基于闭式预测表达式,隐式方案则在关注点上求解扰动后的线性系统;两者均具备局部化特性,无需全局系统求解或启发式平滑,从而实现高效且自适应噪声的导数估计,并初步验证了其与保守求解器结合后在1D双曲型PDE中稳定捕捉激波的能力,为高维不规则点云上的PDE求解及守恒律保持提供了可行路径。
链接: https://arxiv.org/abs/2603.06380
作者: Abhisek Ganguly,Santosh Ansumali,Sauro Succi
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published as a conference paper at ICLR 2026 Workshop AI and PDE
Abstract:Accurate estimation of spatial derivatives from discrete and noisy data is central to scientific machine learning and numerical solutions of PDEs. We extend kinetic-based regularization (KBR), a localized multidimensional kernel regression method with a single trainable parameter, to learn spatial derivatives with provable second-order accuracy in 1D. Two derivative-learning schemes are proposed: an explicit scheme based on the closed-form prediction expressions, and an implicit scheme that solves a perturbed linear system at the points of interest. The fully localized formulation enables efficient, noise-adaptive derivative estimation without requiring global system solving or heuristic smoothing. Both approaches exhibit quadratic convergence, matching second-order finite difference for clean data, along with a possible high-dimensional formulation. Preliminary results show that coupling KBR with conservative solvers enables stable shock capture in 1D hyperbolic PDEs, acting as a step towards solving PDEs on irregular point clouds in higher dimensions while preserving conservation laws.
[AI-6] ESAA-Security: An Event-Sourced Verifiable Architecture for Agent-Assisted Security Audits of AI-Generated Code
【速读】:该论文旨在解决AI辅助软件生成过程中存在的结构性安全漏洞问题,即系统虽功能正确但可能存在未被发现的安全风险。传统基于提示(prompt-based)的大语言模型(Large Language Models, LLMs)安全审查常因覆盖不全、可复现性差、支持证据不足及缺乏不可变审计日志而失效。解决方案的关键在于提出ESAA-Security框架,其核心思想是将安全审查建模为一个受控的执行流水线,通过分离代理认知与确定性状态变更,利用追加只读事件(append-only events)、约束输出和基于重放的验证机制,实现结构化、可追溯、可复现且风险导向的安全审计流程。该架构以合约和事件驱动的方式规范代理行为,确保每一步操作均可验证、可审计,并最终生成标准化的审计报告,从而从根本上提升AI生成代码的安全治理能力。
链接: https://arxiv.org/abs/2603.06365
作者: Elzo Brito dos Santos Filho
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Open-source implementation available at: this https URL
Abstract:AI-assisted software generation has increased development speed, but it has also amplified a persistent engineering problem: systems that are functionally correct may still be structurally insecure. In practice, prompt-based security review with large language models often suffers from uneven coverage, weak reproducibility, unsupported findings, and the absence of an immutable audit trail. The ESAA architecture addresses a related governance problem in agentic software engineering by separating heuristic agent cognition from deterministic state mutation through append-only events, constrained outputs, and replay-based verification. This paper presents ESAA-Security, a domain-specific specialization of ESAA for agent-assisted security auditing of software repositories, with particular emphasis on AI-generated or AI-modified code. ESAA-Security structures auditing as a governed execution pipeline with four phases (reconnaissance, domain audit execution, risk classification, and final reporting) and operationalizes the workflow into 26 tasks, 16 security domains, and 95 executable checks. The framework produces structured check results, vulnerability inventories, severity classifications, risk matrices, remediation guidance, executive summaries, and a final markdown/JSON audit report. The central idea is that security review should not be modeled as a free-form conversation with an LLM, but as an evidence-oriented audit process governed by contracts and events. In ESAA-Security, agents emit structured intentions under constrained protocols; the orchestrator validates them, persists accepted outputs to an append-only log, reprojects derived views, and verifies consistency through replay and hashing. The result is a traceable, reproducible, and risk-oriented audit architecture whose final report is auditable by construction.
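摘要所述"追加只读日志 + 重放与哈希校验"的可验证机制,可以用一条哈希链式事件日志来最小化示意:每条事件携带前一条的哈希,重放时逐条重新计算并比对。事件字段命名为假设,仅演示思路,并非 ESAA 的真实协议。

```python
import hashlib
import json

class EventLog:
    """哈希链式追加只读事件日志的最小示意(字段命名为假设)。"""
    def __init__(self):
        self.events = []

    def _digest(self, prev_hash, payload):
        # 对"前序哈希 + 事件内容"做确定性序列化后取 SHA-256
        body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def append(self, payload):
        """追加一条事件;事件一旦写入即不应被修改。"""
        prev = self.events[-1]["hash"] if self.events else "GENESIS"
        self.events.append({"prev": prev, "payload": payload,
                            "hash": self._digest(prev, payload)})

    def replay_verify(self):
        """重放校验:任何一条事件被篡改都会破坏哈希链。"""
        prev = "GENESIS"
        for ev in self.events:
            if ev["prev"] != prev or ev["hash"] != self._digest(prev, ev["payload"]):
                return False
            prev = ev["hash"]
        return True
```

这正是"审计报告 auditable by construction"的直观含义:报告由日志投影而来,而日志本身可被任何一方独立重放验证。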
[AI-7] CLAIRE: Compressed Latent Autoencoder for Industrial Representation and Evaluation – A Deep Learning Framework for Smart Manufacturing
【速读】:该论文旨在解决高维工业环境中故障检测的准确性问题,该问题因传感器数据固有的复杂性、噪声和冗余而尤为突出。解决方案的关键在于提出一种混合端到端学习框架 CLAIRE,其核心由两部分组成:首先利用优化的深度自编码器(deep autoencoder)将原始输入映射到紧凑的潜在空间,有效捕捉数据内在结构并抑制无关或噪声特征;其次将学习到的表示输入下游分类器进行二分类故障预测。该框架通过特征感知正则化与可解释人工智能(Explainable AI)相结合,不仅提升了检测性能,还借助基于博弈论的后处理解释技术识别对故障预测最具信息量的输入特征,从而实现模型的模块化、可解释性和鲁棒性。
链接: https://arxiv.org/abs/2603.06361
作者: Mohammadhossein Ghahramani,Mengchu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 13 pages. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2026
Abstract:Accurate fault detection in high-dimensional industrial environments remains a major challenge due to the inherent complexity, noise, and redundancy in sensor data. This paper introduces CLAIRE, i.e., a hybrid end-to-end learning framework that integrates unsupervised deep representation learning with supervised classification for intelligent quality control in smart manufacturing systems. It employs an optimized deep autoencoder to transform raw input into a compact latent space, effectively capturing the intrinsic data structure while suppressing irrelevant or noisy features. The learned representations are then fed into a downstream classifier to perform binary fault prediction. Experimental results on a high-dimensional dataset demonstrate that CLAIRE significantly outperforms conventional classifiers trained directly on raw features. Moreover, the framework incorporates a post hoc phase, using a game-theory-based interpretability technique, to analyze the latent space and identify the most informative input features contributing to fault predictions. The proposed framework highlights the potential of integrating explainable AI with feature-aware regularization for robust fault detection. The modular and interpretable nature of the proposed framework makes it highly adaptable, offering promising applications in other domains characterized by complex, high-dimensional data, such as healthcare, finance, and environmental monitoring.
[AI-8] MoEless: Efficient MoE LLM Serving via Serverless Computing
【速读】:该论文旨在解决混合专家模型(Mixture-of-Experts, MoE)在分布式部署中因稀疏激活机制导致的专家负载不均衡问题,该问题表现为部分专家成为瓶颈(stragglers),显著增加推理延迟和部署成本。现有方案受限于静态资源配置,难以实现弹性扩展与高效负载平衡,从而影响服务性能或生成质量。论文提出的解决方案是MoEless,其关键在于引入轻量级、分层感知的预测器以精准估计专家负载分布并主动识别潜在straggler,并设计优化的专家缩放与放置策略,在最大化函数局部性的同时提升GPU利用率并均衡专家与GPU间的负载分配,最终实现低延迟、低成本的MoE推理服务。
链接: https://arxiv.org/abs/2603.06350
作者: Hanfei Yu,Bei Ouyang,Shwai He,Ang Li,Hao Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs and advancing model scales, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployment using expert parallelism (EP). However, MoE’s sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, limiting expert scalability and elasticity, and resulting in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.
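摘要中"在专家与 GPU 之间均衡负载"的放置问题,可以用经典的贪心装箱策略做一个最小示意:按预测负载从大到小,把每个专家分给当前累计负载最小的 GPU。这只是负载均衡放置的通用做法,并非 MoEless 的完整缩放/放置策略(后者还需考虑函数局部性与弹性伸缩)。

```python
import heapq

def place_experts(loads, n_gpus):
    """按预测负载将专家贪心放置到 GPU(最长处理时间优先,LPT 启发式)。

    loads: {专家名: 预测负载};返回 {专家名: GPU 编号}。
    """
    # 小顶堆维护 (当前累计负载, GPU 编号)
    heap = [(0.0, g) for g in range(n_gpus)]
    heapq.heapify(heap)
    placement = {}
    for expert, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        total, gpu = heapq.heappop(heap)   # 取当前最空闲的 GPU
        placement[expert] = gpu
        heapq.heappush(heap, (total + load, gpu))
    return placement
```

直观上,这一步缓解的就是摘要所述"少数专家过载、其余闲置"的 straggler 问题:过载专家被分散到不同设备,使各 GPU 的累计负载尽量接近。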
[AI-9] From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty AISTATS
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险场景中缺乏可解释且校准良好的不确定性估计问题。现有后处理方法通常依赖采样,计算成本高或校准不足。其解决方案的关键在于提出一个三阶段流水线:首先基于细粒度熵计算训练数据上的不确定性得分,捕捉嵌入空间中模型输出的分布变异性;其次通过Platt缩放对这些得分进行校准,生成可靠且人类可理解的不确定性信号;最后利用强化学习对目标模型进行后训练,使其策略与校准后的信号对齐,从而在测试时实现高效、可解释的不确定性推理。该方法显著优于基线,在未见任务上亦能泛化,表明模型习得了稳健的不确定性推理能力。
链接: https://arxiv.org/abs/2603.06317
作者: Azza Jenane,Nassim Walha,Lukas Kuhn,Florian Buettner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 4 pages, submitted to AISTATS Workshop
Abstract:Large Language Models (LLMs) that can express interpretable and calibrated uncertainty are crucial in high-stakes domains. While methods to compute uncertainty post-hoc exist, they are often sampling-based and therefore computationally expensive or lack calibration. We propose a three-stage pipeline to post-train LLMs to efficiently infer calibrated uncertainty estimates for their responses. First, we compute fine-grained entropy-based uncertainty scores on the training data, capturing the distributional variability of model outputs in embedding space. Second, these scores are calibrated via Platt scaling, producing reliable and human-interpretable uncertainty signals. Finally, the target LLM is post-trained via reinforcement learning to align its policy with these calibrated signals through a verifiable reward function. Unlike post-hoc uncertainty estimation methods, our approach provides interpretable and computationally efficient uncertainty estimates at test time. Experiments show that models trained with our pipeline achieve better calibration than baselines and generalize to unseen tasks without further processing, suggesting that they learn a robust uncertainty reasoning behavior.
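流水线第二阶段的 Platt 缩放本质上是对原始不确定性分数拟合一个逻辑斯蒂映射 sigmoid(a*s + b)。下面给出一个纯 Python 的极简实现(梯度下降超参数为假设性选择),仅演示校准的原理:

```python
import math

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """用逻辑损失的梯度下降拟合 Platt 缩放参数 (a, b)。

    scores: 原始不确定性/置信度分数;labels: 0/1 正确性标签。
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))  # 当前校准概率
            ga += (p - y) * s / n                     # 对 a 的梯度
            gb += (p - y) / n                         # 对 b 的梯度
        a -= lr * ga
        b -= lr * gb
    return a, b

def calibrated(s, a, b):
    """把原始分数映射为校准后的概率。"""
    return 1.0 / (1.0 + math.exp(-(a * s + b)))
```

校准后的概率既可直接展示给用户,也可像论文中那样作为可验证奖励信号用于强化学习后训练。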
[AI-10] Learning Where the Physics Is: Probabilistic Adaptive Sampling for Stiff PDEs
【速读】:该论文旨在解决刚性偏微分方程(Stiff Partial Differential Equations, PDEs)中因陡峭梯度导致的建模难题,尤其是在传统物理信息神经网络(Physics-Informed Neural Networks, PINNs)存在谱偏差(spectral bias)和训练缓慢,而物理信息极限学习机(Physics-Informed Extreme Learning Machines, PIELMs)受限于物理无关的随机初始化问题。其解决方案的关键在于提出高斯混合模型自适应物理信息极限学习机(Gaussian Mixture Model Adaptive PIELM, GMM-PIELM),通过学习一个概率密度函数来表征“物理所在位置”,并利用加权期望最大化(weighted Expectation-Maximization, EM)算法自动将径向基函数(Radial Basis Function, RBF)中心集中到数值误差较高的区域(如激波前沿和边界层),从而在不依赖梯度优化或贝叶斯搜索的前提下动态改善隐藏层条件,显著提升求解精度(L₂误差降低达7个数量级),同时保持极限学习机(Extreme Learning Machine, ELM)的计算效率优势。
链接: https://arxiv.org/abs/2603.06287
作者: Akshay Govind Srinivasan,Balaji Srinivasan
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
备注: Accepted at AIPDE Workshop at the Fourteenth International Conference on Learning Representations
Abstract:Modeling stiff partial differential equations (PDEs) with sharp gradients remains a significant challenge for scientific machine learning. While Physics-Informed Neural Networks (PINNs) struggle with spectral bias and slow training times, Physics-Informed Extreme Learning Machines (PIELMs) offer a rapid, closed-form linear solution but are fundamentally limited by physics-agnostic, random initialization. We introduce the Gaussian Mixture Model Adaptive PIELM (GMM-PIELM), a probabilistic framework that learns a probability density function representing the ``location of physics’’ for adaptively sampling kernels of PIELMs. By employing a weighted Expectation-Maximization (EM) algorithm, GMM-PIELM autonomously concentrates radial basis function centers in regions of high numerical error, such as shock fronts and boundary layers. This approach dynamically improves the conditioning of the hidden layer without the expensive gradient-based optimization(of PINNs) or Bayesian search. We evaluate our methodology on 1D singularly perturbed convection-diffusion equations with diffusion coefficients \nu=10^-4 . Our method achieves L_2 errors up to 7 orders of magnitude lower than baseline RBF-PIELMs, successfully resolving exponentially thin boundary layers while retaining the orders-of-magnitude speed advantage of the ELM architecture.
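摘要中"以数值误差为权重的 EM"可以在一维情形下写成一个很短的加权 EM:样本权重取为局部误差,使混合成分(即 RBF 中心)自动向高误差区域集中。下面是一个纯 Python 示意,初始化与数值保护均为假设性选择,并非 GMM-PIELM 的原始实现:

```python
import math

def weighted_em_1d(x, w, k=2, iters=50):
    """一维加权 EM:返回 k 个高斯成分的均值(可用作 RBF 中心)。

    x: 配点坐标;w: 各点的误差权重(误差越大,吸引力越强)。
    """
    n = len(x)
    # 初始化:均值在数据范围内等距铺开(假设性选择)
    mu = [x[int(i * (n - 1) / (k - 1))] for i in range(k)] if k > 1 else [sum(x) / n]
    var = [1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E 步:责任度按样本权重 w[i] 缩放
        r = [[0.0] * k for _ in range(n)]
        for i in range(n):
            dens = [pi[j] * math.exp(-(x[i] - mu[j]) ** 2 / (2 * var[j]))
                    / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(dens) or 1e-12
            for j in range(k):
                r[i][j] = w[i] * dens[j] / s
        # M 步:加权更新均值、方差与混合系数
        for j in range(k):
            rj = sum(r[i][j] for i in range(n)) or 1e-12
            mu[j] = sum(r[i][j] * x[i] for i in range(n)) / rj
            var[j] = max(sum(r[i][j] * (x[i] - mu[j]) ** 2 for i in range(n)) / rj, 1e-6)
            pi[j] = rj / (sum(w) or 1e-12)
    return mu
```

若把权重设为在右端点附近指数增大(模拟奇异摄动问题的边界层误差),学到的中心会整体迁移到边界层一侧,这正是"学习物理所在位置"的直观体现。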
[AI-11] Artificial Intelligence for Climate Adaptation: Reinforcement Learning for Climate Change-Resilient Transport
【速读】:该论文旨在解决气候变化背景下城市交通系统因强降雨加剧导致的暴雨洪涝风险增加问题,尤其针对长期基础设施投资决策中面临的气候不确定性、多阶段决策特性以及洪水-基础设施-出行影响之间的复杂交互关系。其解决方案的关键在于提出了一种基于强化学习(Reinforcement Learning, RL)的新型决策支持框架,将降雨预测、洪水模拟、交通仿真及直接与间接影响量化整合为一个综合评估模型(Integrated Assessment Model, IAM),通过RL算法自动学习在投资维护成本与规避损失之间权衡的适应性策略,从而识别出空间与时间上协调的适应路径,显著优于传统优化方法,提升了基础设施规划在气候不确定性下的韧性。
链接: https://arxiv.org/abs/2603.06278
作者: Miguel Costa,Arthur Vandervoort,Carolin Schmidt,João Miranda,Morten W. Petersen,Martin Drews,Karyn Morrisey,Francisco C. Pereira
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Climate change is expected to intensify rainfall and, consequently, pluvial flooding, leading to increased disruptions in urban transportation systems over the coming decades. Designing effective adaptation strategies is challenging due to the long-term, sequential nature of infrastructure investments, deep climate uncertainty, and the complex interactions between flooding, infrastructure, and mobility impacts. In this work, we propose a novel decision-support framework using reinforcement learning (RL) for long-term flood adaptation planning. Formulated as an integrated assessment model (IAM), the framework combines rainfall projection and flood modeling, transport simulation, and quantification of direct and indirect impacts on infrastructure and mobility. Our RL-based approach learns adaptive strategies that balance investment and maintenance costs against avoided impacts. We evaluate the framework through a case study of Copenhagen’s inner city over the 2024-2100 period, testing multiple adaptation options, and different belief and realized climate scenarios. Results show that the framework outperforms traditional optimization approaches by discovering coordinated spatial and temporal adaptation pathways and learning trade-offs between impact reduction and adaptation investment, yielding more resilient strategies. Overall, our results showcase the potential of reinforcement learning as a flexible decision-support tool for adaptive infrastructure planning under climate uncertainty.
[AI-12] Stem: Rethinking Causal Information Flow in Sparse Attention
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长文本处理中因自注意力机制(self-attention)的二次计算复杂度导致的性能瓶颈,尤其是在预填充(pre-filling)阶段。其核心问题是现有稀疏方法未考虑因果注意力结构中token位置累积依赖特性,导致信息流失与计算效率低下。解决方案的关键在于提出Stem模块,该模块通过两个创新策略实现信息流对齐:一是引入Token Position-Decay策略,在每一层内采用基于位置衰减的top-k选择机制,以保留初始token的递归依赖;二是设计Output-Aware Metric,根据近似输出幅度优先保留高影响力token,从而在降低计算量的同时保持模型精度和减少预填充延迟。
链接: https://arxiv.org/abs/2603.06274
作者: Lin Niu,Xin Luo,Linchuan Xie,Yifu Sun,Guanghua Yu,Jianchen Zhu,S Kevin Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, preprint
Abstract:The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling Large Language Models (LLMs) to long contexts, particularly during the pre-filling phase. In this paper, we rethink the causal attention mechanism from the perspective of information flow. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a uniform top-k selection across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose Stem, a novel, plug-and-play sparsity module aligned with information flow. First, Stem employs the Token Position-Decay strategy, applying position-dependent top-k within each layer to retain initial tokens for recursive dependencies. Second, to preserve information-rich tokens, Stem utilizes the Output-Aware Metric. It prioritizes high-impact tokens based on approximate output magnitude. Extensive evaluations demonstrate that Stem achieves superior accuracy with reduced computation and pre-filling latency.
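Token Position-Decay 的核心是"始终保留初始 token + 位置相关的 top-k"。下面用一个玩具函数示意单个查询位置上的 token 选择;其中 k 随位置变化的具体衰减公式是假设性的,仅用于说明与统一 top-k 的区别:

```python
def select_tokens(scores, pos, n_init=2, base_k=4):
    """为位置 pos 的查询选择可见 token 子集(示意实现)。

    scores: 各 token 的重要性得分(如近似输出幅度);
    始终保留最前面的 n_init 个 token(因果结构中的递归依赖),
    其余按得分取 top-k,k 随位置衰减(衰减公式为假设)。
    """
    k = max(1, base_k - pos // 4)            # 位置相关的预算(示意)
    visible = list(range(pos + 1))           # 因果掩码:只能看见自身及之前的 token
    keep = set(visible[:n_init])             # 无条件保留初始 token
    rest = [i for i in visible if i not in keep]
    rest.sort(key=lambda i: -scores[i])      # Output-Aware:优先保留高影响力 token
    keep.update(rest[:k])
    return sorted(keep)
```

与对所有位置使用同一 k 的稀疏方法相比,这种选择方式显式照顾了初始 token 在因果信息流中的累积作用。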
[AI-13] Looking Through Glass Box
【速读】:该论文旨在解决如何将模糊认知图(Fuzzy Cognitive Map, FCM)神经实现并用于学习因果模式的问题,同时通过逆向求解输出节点值来提供可操作的修改准则,从而支持决策优化。其解决方案的关键在于设计了一个基于Langevin微分动力学的神经网络模型,该模型不仅能模拟FCM的传播机制,还能利用随机动力学避免过拟合,并通过逆向求解获得输出节点值的可行解,进而为不同服务或产品的适配性提供量化判断依据。
链接: https://arxiv.org/abs/2603.06272
作者: Alexis Kafantaris
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: This is a theoretical framework with some empirical validation
Abstract:This essay is about a neural implementation of the fuzzy cognitive map, the FHM, and corresponding evaluations. Firstly, a neural net has been designed to behave the same way that an FCM does; as inputs it accepts many fuzzy cognitive maps and propagates them in order to learn causality patterns. Moreover, the network uses Langevin differential dynamics, which avoid overfitting, to inverse-solve the output node values according to some policy. Nevertheless, having obtained an inverse solution provides the user a modification criterion. Having the modification criterion suggests that information is now according to discretion as a different service or product is a better fit. Lastly, evaluation has been done on several data sets in order to examine the network's performance.
[AI-14] Agentic retrieval-augmented reasoning reshapes collective reliability under model variability in radiology question answering
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床决策支持中因模型差异导致的可靠性问题,特别是当多个异构模型在面对相同任务时可能出现对齐、分歧或同步错误的情况,而传统基于准确率的评估方法无法充分刻画这种变异性下的系统稳定性。解决方案的关键在于引入一种多步骤的代理式检索增强推理(Agentic Retrieval-Augmented Reasoning)流程,该流程通过统一提供结构化的外部知识证据报告,使所有模型基于一致的知识源进行推理和决策,从而显著降低模型间决策分布的离散度(熵从0.48降至0.13),提升跨模型正确性的一致性(平均正确率从0.74升至0.81),并增强多数共识强度,表明该机制能够有效提高系统在模型变异条件下的可靠性与鲁棒性。
链接: https://arxiv.org/abs/2603.06271
作者: Mina Farajiamiri,Jeta Sopa,Saba Afza,Lisa Adams,Felix Barajas Ordonez,Tri-Thien Nguyen,Mahshad Lotfinia,Sebastian Wind,Keno Bressem,Sven Nebelung,Daniel Truhn,Soroosh Tayebi Arasteh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic retrieval-augmented reasoning pipelines are increasingly used to structure how large language models (LLMs) incorporate external evidence in clinical decision support. These systems iteratively retrieve curated domain knowledge and synthesize it into structured reports before answer selection. Although such pipelines can improve performance, their impact on reliability under model variability remains unclear. In real-world deployment, heterogeneous models may align, diverge, or synchronize errors in ways not captured by accuracy. We evaluated 34 LLMs on 169 expert-curated publicly available radiology questions, comparing zero-shot inference with a radiology-specific multi-step agentic retrieval condition in which all models received identical structured evidence reports derived from curated radiology knowledge. Agentic inference reduced inter-model decision dispersion (median entropy 0.48 vs. 0.13) and increased robustness of correctness across models (mean 0.74 vs. 0.81). Majority consensus also increased overall (P < 0.001). Consensus strength and robust correctness remained correlated under both strategies (\rho=0.88 for zero-shot; \rho=0.87 for agentic), although high agreement did not guarantee correctness. Response verbosity showed no meaningful association with correctness. Among 572 incorrect outputs, 72% were associated with moderate or high clinically assessed severity, although inter-rater agreement was low (\kappa=0.02). Agentic retrieval therefore was associated with more concentrated decision distributions, stronger consensus, and higher cross-model robustness of correctness. These findings suggest that evaluating agentic systems through accuracy or agreement alone may not always be sufficient, and that complementary analyses of stability, cross-model robustness, and potential clinical impact are needed to characterize reliability under model variability.
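论文用来衡量"模型间决策离散度"的熵可按标准香农熵计算:把多个模型对同一道题的答案视为一个分布,熵越低说明决策越集中。下面是该指标的常见定义(以 bit 为单位);论文的具体计算细节为假设:

```python
import math
from collections import Counter

def decision_entropy(answers):
    """多个模型在同一题上答案分布的香农熵(bit)。

    answers: 各模型给出的答案标签列表;全体一致时熵为 0。
    """
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

例如 4 个模型全选 "A" 时熵为 0;答案均匀分裂为两类时熵为 1 bit,对应摘要中"中位熵从 0.48 降到 0.13"所度量的集中化效应。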
[AI-15] Learning to Solve Orienteering Problem with Time Windows and Variable Profits ICLR2026
【速读】:该论文旨在解决带有时间窗和可变收益的定向问题(Orienteering Problem with Time Windows and Variable Profits, OPTWVP),该问题在现实世界中广泛应用,其特点在于同时包含离散决策变量(如访问节点的顺序)和连续决策变量(如服务时间)。现有方法难以高效求解此类混合整数规划问题。解决方案的关键在于提出一种基于学习的两阶段解耦优化框架——DeCoST(DEcoupled discrete-Continuous Optimization with Service-time-guided Trajectory),其核心思想是将离散路径选择与连续服务时间分配进行有效解耦:第一阶段通过并行解码结构预测路径及初始服务时间分配;第二阶段利用线性规划(Linear Programming, LP)精确优化服务时间,并提供长期结构估计的学习能力,且严格证明了第二阶段解的全局最优性。该方法在求解质量与计算效率上均显著优于当前最先进的构造式求解器和元启发式算法。
链接: https://arxiv.org/abs/2603.06260
作者: Songqun Gao,Zanxi Ruan,Patrick Floor,Marco Roveri,Luigi Palopoli,Daniele Fontanelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026
Abstract:The orienteering problem with time windows and variable profits (OPTWVP) is common in many real-world applications and involves continuous time variables. Current approaches fail to develop an efficient solver for this orienteering problem variant with discrete and continuous variables. In this paper, we propose a learning-based two-stage DEcoupled discrete-Continuous optimization with Service-time-guided Trajectory (DeCoST), which aims to effectively decouple the discrete and continuous decision variables in the OPTWVP problem, while enabling efficient and learnable coordination between them. In the first stage, a parallel decoding structure is employed to predict the path and the initial service time allocation. The second stage optimizes the service times through a linear programming (LP) formulation and provides a long-horizon learning of structure estimation. We rigorously prove the global optimality of the second-stage solution. Experiments on OPTWVP instances demonstrate that DeCoST outperforms both state-of-the-art constructive solvers and the latest meta-heuristic algorithms in terms of solution quality and computational efficiency, achieving up to 6.6x inference speedup on instances with fewer than 500 nodes. Moreover, the proposed framework is compatible with various constructive solvers and consistently enhances the solution quality for OPTWVP.
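DeCoST 第二阶段用 LP 精确优化服务时间。作为直观示意,在"收益对服务时间线性、仅有一个总时间预算、各节点有服务时间上限"这一简化特例下,该 LP 退化为分数背包问题,按收益率贪心分配即为全局最优;一般含时间窗耦合约束时则需完整的 LP 求解。下面的函数只覆盖这一特例:

```python
def allocate_service_times(rates, caps, budget):
    """简化特例下的第二阶段优化:按单位时间收益率贪心分配服务时间。

    rates: 各节点的单位时间收益;caps: 各节点服务时间上限;
    budget: 路径上剩余的总可用时间。返回各节点分得的服务时间。
    """
    alloc = [0.0] * len(rates)
    order = sorted(range(len(rates)), key=lambda i: -rates[i])  # 收益率降序
    for i in order:
        t = min(caps[i], budget)  # 尽量填满收益率最高的节点
        alloc[i] = t
        budget -= t
        if budget <= 0:
            break
    return alloc
```

这个特例也说明了为何第二阶段的解可以被严格证明最优:目标与约束都是线性的,贪心解满足 LP 的最优性条件。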
[AI-16] Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding INTERSPEECH2026
【速读】:该论文旨在解决长序列语音识别中大型编码器-解码器模型(如Whisper)常见的幻觉、重复循环和内容遗漏等问题,这些问题在利用前一语段转录作为解码上下文时会进一步累积和放大。解决方案的关键在于提出一种无需训练的对比解码框架Whisper-CD,其核心思想是通过三个声学启发的扰动(高斯噪声注入、静音信号和音频时间偏移)计算负样本 logits,并使用 log-sum-exp 操作聚合这些负样本,构建一个统一的多负样本目标函数,从而实现逐词解码时的对比优化。该方法仅在推理阶段运行,可直接替换已部署的Whisper系统而无需重新训练。
链接: https://arxiv.org/abs/2603.06193
作者: Hoseong Ahn,Jeongyun Chae,Yoonji Park,Kyuhong Shim
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2026
Abstract:Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment’s transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3pp on CORAAL and shows 48% faster token generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.
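The multi-negative aggregation described in the abstract can be sketched for a single decoding step: a log-sum-exp over the negative logits is subtracted from the clean-audio logits (the weighting factor `alpha` is an illustrative assumption, not a value from the paper):

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def contrastive_step(clean_logits, negative_logits, alpha=0.5):
    """Adjust one decoding step's logits against multiple negatives.

    clean_logits:    per-token logits from the unperturbed audio
    negative_logits: list of per-token logit vectors, one per perturbation
                     (e.g. noise injection, silence, temporal shift)
    """
    adjusted = []
    for i, c in enumerate(clean_logits):
        neg = logsumexp([n[i] for n in negative_logits])
        adjusted.append(c - alpha * neg)
    return adjusted
```

Tokens whose scores survive the perturbations (hallucination-prone tokens) get penalized, while acoustically grounded tokens keep their margin.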
[AI-17] Ensemble Graph Neural Networks for Probabilistic Sea Surface Temperature Forecasting via Input Perturbations
【速读】:该论文旨在解决区域海洋预报中模型计算效率与预测不确定性表征之间的矛盾问题,即如何在保持计算高效的同时提升对海表温度(SST)预测不确定性的准确刻画。其解决方案的关键在于采用图神经网络(Graph Neural Networks, GNNs)结合同质集成学习策略,通过在推理阶段引入结构化的输入扰动而非重复训练多个模型来实现多样性,其中扰动设计的类型(如高斯噪声、Perlin噪声、分形Perlin噪声)及其空间相干性对不确定性校准和概率评分(如CRPS)具有显著影响。研究表明,使用低分辨率Perlin噪声等具有空间一致性的扰动方式可获得更优的不确定性表征,尤其在较长预报时效下,且无需额外训练成本,从而验证了集成GNN在业务化区域海洋预测中的可行性。
链接: https://arxiv.org/abs/2603.06153
作者: Alejandro J. González-Santana,Giovanny A. Cuervo-Londoño,Javier Sánchez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
备注: 20 pages, 14 figures, 6 tables
Abstract:Accurate regional ocean forecasting requires models that are both computationally efficient and capable of representing predictive uncertainty. This work investigates ensemble learning strategies for sea surface temperature (SST) forecasting using Graph Neural Networks (GNNs), with a focus on how input perturbation design affects forecast skill and uncertainty representation. We adapt a GNN architecture to the Canary Islands region in the North Atlantic and implement a homogeneous ensemble approach inspired by bagging, where diversity is introduced during inference by perturbing initial ocean states rather than retraining multiple models. Several noise-based ensemble generation strategies are evaluated, including Gaussian noise, Perlin noise, and fractal Perlin noise, with systematic variation of noise intensity and spatial structure. Ensemble forecasts are assessed over a 15-day horizon using deterministic metrics (RMSE and bias) and probabilistic metrics, including the Continuous Ranked Probability Score (CRPS) and the Spread-skill ratio. Results show that, while deterministic skill remains comparable to the single-model forecast, the type and structure of input perturbations strongly influence uncertainty representation, particularly at longer lead times. Ensembles generated with spatially coherent perturbations, such as low-resolution Perlin noise, achieve better calibration and lower CRPS than purely random Gaussian perturbations. These findings highlight the critical role of noise structure and scale in ensemble GNN design and demonstrate that carefully constructed input perturbations can yield well-calibrated probabilistic forecasts without additional training cost, supporting the feasibility of ensemble GNNs for operational regional ocean prediction.
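The probabilistic metric used above, CRPS, has a standard closed form for a finite ensemble; a minimal implementation:

```python
def crps_ensemble(members, obs):
    """Continuous Ranked Probability Score of an ensemble forecast.

    Empirical form: E|X - obs| - 0.5 * E|X - X'|, with X and X' drawn
    uniformly from the ensemble members. Lower is better; it reduces
    to the absolute error for a single deterministic forecast.
    """
    n = len(members)
    term1 = sum(abs(m - obs) for m in members) / n
    term2 = sum(abs(a - b) for a in members for b in members) / (2.0 * n * n)
    return term1 - term2
```

The second term rewards ensemble spread, which is why spatially coherent perturbations that produce a better-calibrated spread can lower CRPS without changing deterministic skill.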
[AI-18] Predictive Coding Graphs are a Superset of Feedforward Neural Networks NEURIPS2024
【速读】:该论文旨在解决预测编码图(Predictive Coding Graphs, PCGs)与前馈人工神经网络(如多层感知机,Multilayer Perceptrons)之间数学关系不明确的问题。其解决方案的关键在于证明PCGs构成了一类比传统前馈神经网络更广泛的数学模型——即PCGs是多层感知机的数学超集(mathematical superset),从而在理论上确立了PCGs在当代机器学习框架中的更强地位,并为探索非分层拓扑结构的神经网络在机器学习任务中的应用提供了理论支撑。
链接: https://arxiv.org/abs/2603.06142
作者: Björn van Zwol
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
备注: 11 pages, 3 figures. Accepted at the NeuroAI Workshop @ NeurIPS 2024. OpenReview: this https URL
Abstract:Predictive coding graphs (PCGs) are a recently introduced generalization of predictive coding networks (PCNs), a neuroscience-inspired probabilistic latent variable model. Here, we prove how PCGs define a mathematical superset of feedforward artificial neural networks (multilayer perceptrons). This positions PCNs more strongly within contemporary machine learning (ML), and reinforces earlier proposals to study the use of non-hierarchical neural networks for ML tasks, and more generally the notion of topology in neural networks.
[AI-19] Partial Policy Gradients for RL in LLMs
【速读】:该论文旨在解决强化学习中策略(policy)结构建模与优化的问题,特别是在未知环境中如何设计更可靠、可比较的策略类别的方法。其解决方案的关键在于提出一种基于未来奖励子集优化的策略建模方法:通过限制优化的未来奖励范围,使得较小的子集对应更简单的策略结构,从而提升经验梯度估计的准确性,增强学习的可靠性;该方法支持对多种策略类别(如完整规划、贪心策略、K步前瞻策略和分段策略)进行建模与比较,实验证明不同策略在不同任务中表现各异,凸显了策略类别选择的重要性。
链接: https://arxiv.org/abs/2603.06138
作者: Puneet Mathur,Branislav Kveton,Subhojyoti Mukherjee,Viet Dac Lai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning is a framework for learning to act sequentially in an unknown environment. We propose a natural approach for modeling policy structure in policy gradients. The key idea is to optimize for a subset of future rewards: smaller subsets represent simpler policies, which can be learned more reliably because their empirical gradient estimates are more accurate. Our approach allows for modeling and comparison of different policy classes, including full planning, greedy, K-step lookahead, and segment policies. We evaluate the policies empirically on multiple persona-alignment conversational problems. Different policies excel in different problems, reflecting their different characteristics and highlighting the importance of our studied policy class.
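The core idea, optimizing only a subset of future rewards, can be made concrete for the K-step lookahead policy class: each step's return-to-go is truncated after K rewards (the discount factor is an illustrative addition):

```python
def truncated_returns(rewards, k, gamma=1.0):
    """Return-to-go at each step, restricted to the next k rewards.

    k = 1 recovers a greedy policy's objective; k = len(rewards)
    recovers the full-planning return.
    """
    returns = []
    for t in range(len(rewards)):
        g = 0.0
        for i, r in enumerate(rewards[t:t + k]):
            g += (gamma ** i) * r
        returns.append(g)
    return returns
```

Smaller k means each gradient estimate averages over fewer random future events, which is the source of the reliability gain the abstract describes.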
[AI-20] A Hazard-Informed Data Pipeline for Robotics Physical Safety
【速读】:该论文旨在解决机器人物理安全评估中传统风险工程与现代机器学习方法之间缺乏系统性整合的问题,尤其是在安全边界学习和风险建模方面。其解决方案的关键在于构建一个基于显式资产声明、系统性漏洞枚举以及以危害驱动的合成数据生成的结构化机器人物理安全框架,通过将经典安全工程、数字孪生仿真、合成数据生成与机器学习模型训练相统一,实现基于形式化危害本体的安全边界学习。
链接: https://arxiv.org/abs/2603.06130
作者: Alexei Odinokov,Rostislav Yavorskiy
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 4th International Conference on Automation and Mechatronics Engineering (ICAME 2026)
Abstract:This report presents a structured Robotics Physical Safety Framework based on explicit asset declaration, systematic vulnerability enumeration, and hazard-driven synthetic data generation. The approach bridges classical risk engineering with modern machine learning pipelines, enabling safety envelope learning grounded in a formalized hazard ontology. The key contribution of this framework is the alignment between classical safety engineering, digital twin simulation, synthetic data generation, and machine learning model training.
[AI-21] Offline Materials Optimization with CliqueFlowmer
【速读】:该论文旨在解决生成式 AI 在计算材料发现(Computational Materials Discovery, CMD)中因最大似然训练导致的探索能力不足问题,即难以有效开拓材料空间中具有高目标属性值的区域。其解决方案的关键在于提出一种基于离线模型-based优化(Offline Model-Based Optimization, MBO)的新方法——CliqueFlowmer,该方法将目标材料属性的直接优化嵌入生成过程,结合了基于团(clique)的MBO思想与Transformer和流模型(flow generation)架构,从而在不依赖在线交互的情况下实现高效、定向的材料生成。
链接: https://arxiv.org/abs/2603.06082
作者: Jakub Grudzien Kuba,Benjamin Kurt Miller,Sergey Levine,Pieter Abbeel
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Recent advances in deep learning inspired neural network-based approaches to computational materials discovery (CMD). A plethora of problems in this field involve finding materials that optimize a target property. Nevertheless, the increasingly popular generative modeling methods are ineffective at boldly exploring attractive regions of the materials space due to their maximum likelihood training. In this work, we offer an alternative CMD technique based on offline model-based optimization (MBO) that fuses direct optimization of a target material property into generation. To that end, we introduce a domain-specific model, dubbed CliqueFlowmer, that incorporates recent advances of clique-based MBO into transformer and flow generation. We validate CliqueFlowmer’s optimization abilities and show that materials it produces strongly outperform those provided by generative baselines. To enable employment of CliqueFlowmer in specialized materials optimization problems and support interdisciplinary research, we open-source our code at this https URL.
[AI-22] Lifelong Embodied Navigation Learning
【速读】:该论文旨在解决持续式具身导航学习(Lifelong Embodied Navigation Learning, LENL)中的灾难性遗忘问题,即在面对多场景、多样指令风格的连续导航任务时,传统基于大语言模型的导航代理难以有效保留先前学到的知识。其解决方案的关键在于提出Uni-Walker框架,通过Decoder Extension LoRA(DE-LoRA)将导航知识解耦为任务共享与任务特定两部分:前者采用知识继承策略和专家协同激活机制实现跨任务的知识迁移与优化;后者引入专家子空间正交约束与导航特定的思维链推理机制,以增强对具体任务和指令风格的理解与适应能力。
链接: https://arxiv.org/abs/2603.06073
作者: Xudong Wang,Jiahua Dong,Baichen Liu,Qi Lyu,Lianqing Liu,Zhi Han
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 24 pages, 7 figures
Abstract:Embodied navigation agents powered by large language models have shown strong performance on individual tasks but struggle to continually acquire new navigation skills, which suffer from catastrophic forgetting. We formalize this challenge as lifelong embodied navigation learning (LENL), where an agent is required to adapt to a sequence of navigation tasks spanning multiple scenes and diverse user instruction styles, while retaining previously learned knowledge. To tackle this problem, we propose Uni-Walker, a lifelong embodied navigation framework that decouples navigation knowledge into task-shared and task-specific components with Decoder Extension LoRA (DE-LoRA). To learn the shared knowledge, we design a knowledge inheritance strategy and an experts co-activation strategy to facilitate shared knowledge transfer and refinement across multiple navigation tasks. To learn the specific knowledge, we propose an expert subspace orthogonality constraint together and a navigation-specific chain-of-thought reasoning mechanism to capture specific knowledge and enhance instruction-style understanding. Extensive experiments demonstrate the superiority of Uni-Walker for building universal navigation agents with lifelong learning.
[AI-23] Aggregative Semantics for Quantitative Bipolar Argumentation Frameworks
【速读】:该论文旨在解决定量双极论证框架(Quantitative Bipolar Argumentation Frameworks, QBAF)中如何定义更灵活且可解释的渐进语义(gradual semantics)的问题,尤其关注攻击者与支持者在论证评价中不对称的角色。传统模块化语义(modular semantics)通常将攻击与支持合并处理,忽略了二者在实际推理中的差异化影响。为此,作者提出了一种新型渐进语义家族——聚合语义(aggregative semantics),其关键在于将计算过程分解为三个独立且可解释的步骤:首先分别计算一个论点的攻击权重和支撑权重,然后将这两个全局权重与论点自身的内在权重进行聚合。这种三阶段机制不仅增强了模型的参数化能力,还保留了双极性(bipolarity)信息的层次性,从而提升了语义的可理解性和适用性。
链接: https://arxiv.org/abs/2603.06067
作者: Yann Munro,Isabelle Bloch,Marie-Jeanne Lesot
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Formal argumentation is being used increasingly in artificial intelligence as an effective and understandable way to model potentially conflicting pieces of information, called arguments, and identify so-called acceptable arguments depending on a chosen semantics. This paper deals with the specific context of Quantitative Bipolar Argumentation Frameworks (QBAF), where arguments have intrinsic weights and can attack or support each other. In this context, we introduce a novel family of gradual semantics, called aggregative semantics. In order to deal with situations in which attackers and supporters do not play a symmetric role, and in contrast to modular semantics, we propose to aggregate attackers and supporters separately. This leads to a three-stage computation, which consists in computing a global weight for attackers and another for supporters, before aggregating these two values with the intrinsic weight of the argument. We discuss the properties that the three aggregation functions should satisfy depending on the context, as well as their relationships with the classical principles for gradual semantics. This discussion is supported by various simple examples, as well as a final example on which five hundred aggregative semantics are tested and compared, illustrating the range of possible behaviours with aggregative semantics. Decomposing the computation into three distinct and interpretable steps leads to a more parametrisable computation: it keeps the bipolarity one step further than what is done in the literature, and it leads to more understandable gradual semantics.
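The three-stage computation can be instantiated in a few lines. Summation for both aggregations and a clamped additive combination form just one of the many aggregative semantics the paper compares:

```python
def aggregative_strength(intrinsic, attackers, supporters):
    """One instantiation of an aggregative semantics for QBAFs.

    Stage 1: aggregate the attackers' strengths.
    Stage 2: aggregate the supporters' strengths (separately,
             so the two roles need not be symmetric).
    Stage 3: combine both with the argument's intrinsic weight.
    """
    att = sum(attackers)          # stage 1
    sup = sum(supporters)         # stage 2
    raw = intrinsic - att + sup   # stage 3 (one possible combination)
    return min(1.0, max(0.0, raw))
```

Swapping `sum` for `max`, a product, or an average in either stage yields a different member of the family, which is what makes the computation parametrisable.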
[AI-24] Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation
【速读】:该论文旨在解决自主机器人系统中任务规划(task planning)的问题,即如何从初始状态有效序列化动作以达成目标,同时探讨大型语言模型(Large Language Models, LLMs)是否可作为可行的规划器,与传统符号方法并行使用。解决方案的关键在于提出 PyPDDLEngine——一个开源的 Planning Domain Definition Language (PDDL) 模拟引擎,通过 Model Context Protocol (MCP) 接口将规划操作暴露为 LLM 工具调用;其核心创新是让 LLM 作为交互式搜索策略,逐次选择动作、观察状态反馈,并支持重置和重试,而非一次性生成完整动作序列。实验表明,相较于经典规划器 Fast Downward 的 85.3% 成功率,基于 PyPDDLEngine 的代理式 LLM 规划方法达到 66.7% 成功率,在 token 成本增加约 5.7 倍的前提下实现小幅提升,且在多数可解难题上生成更短计划,反映出其性能依赖于环境反馈的质量:外部验证信号(如编译错误)优于自评估的 PDDL 步骤反馈。
链接: https://arxiv.org/abs/2603.06064
作者: Kai Göbel,Pierrick Lorang,Patrik Zips,Tobias Glück
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Task planning, the problem of sequencing actions to reach a goal from an initial state, is a core capability requirement for autonomous robotic systems. Whether large language models (LLMs) can serve as viable planners alongside classical symbolic methods remains an open question. We present PyPDDLEngine, an open-source Planning Domain Definition Language (PDDL) simulation engine that exposes planning operations as LLM tool calls through a Model Context Protocol (MCP) interface. Rather than committing to a complete action sequence upfront, the LLM acts as an interactive search policy that selects one action at a time, observes each resulting state, and can reset and retry. We evaluate four approaches on 102 International Planning Competition (IPC) Blocksworld instances under a uniform 180-second budget: Fast Downward lama-first and seq-sat-lama-2011 as classical baselines, direct LLM planning (Claude Haiku 4.5), and agentic LLM planning via PyPDDLEngine. Fast Downward achieves 85.3% success. The direct and agentic LLM approaches achieve 63.7% and 66.7%, respectively, a consistent but modest three-percentage-point advantage for the agentic approach at 5.7× higher token cost per solution. Across most co-solved difficulty blocks, both LLM approaches produce shorter plans than seq-sat-lama-2011 despite its iterative quality improvement, a result consistent with training-data recall rather than generalisable planning. These results suggest that agentic gains depend on the nature of environmental feedback. Coding agents benefit from externally grounded signals such as compiler errors and test failures, whereas PDDL step feedback is self-assessed, leaving the agent to evaluate its own progress without external verification.
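The agentic protocol above, one action at a time with observation, reset, and retry, amounts to a loop of roughly this shape; the function names are illustrative placeholders, not PyPDDLEngine's actual MCP tool interface:

```python
def interactive_search(initial_state, step, is_goal, policy,
                       max_steps=50, max_retries=3):
    """Step-wise search: pick one action, observe the new state,
    and reset/retry when the step budget runs out."""
    for attempt in range(max_retries):
        state, plan = initial_state, []
        for _ in range(max_steps):
            if is_goal(state):
                return plan
            action = policy(state, attempt)  # may adapt across retries
            state = step(state, action)
            plan.append(action)
        # budget exhausted: reset the simulator and try again
    return None
```

In the paper's setting, `step` is a validated PDDL state transition served by the simulator; the LLM plays the role of `policy`.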
[AI-25] TADPO: Reinforcement Learning Goes Off-road ICRA2026
【速读】:该论文旨在解决非结构化地形下的高动态、长时程自主驾驶问题,其核心挑战在于如何在未建图、地形多变且动力学不确定性高的环境中实现有效的长期规划与自适应控制。解决方案的关键在于提出一种名为TADPO的新型策略梯度方法,该方法基于近端策略优化(Proximal Policy Optimization, PPO)进行扩展,通过利用离策略轨迹提供教师指导,同时结合在策略轨迹进行学生探索,从而提升样本效率和策略稳定性。在此基础上,研究构建了一个基于视觉的端到端强化学习系统,能够在高速条件下成功导航极端坡度和障碍密集地形,并实现了仿真到真实全尺寸越野车辆的零样本迁移部署,标志着首个将强化学习策略应用于全尺寸越野平台的工作。
链接: https://arxiv.org/abs/2603.05995
作者: Zhouchonghao Wu,Raymond Song,Vedant Mundheda,Luis E. Navarro-Serment,Christof Schoenborn,Jeff Schneider
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 5 figures, 2 tables. Accepted at ICRA 2026
Abstract:Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.
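The abstract does not spell out the objective, but a PPO-style clipped surrogate combined with an off-policy teacher term can be sketched for a single transition as follows; the imitation form and the weight `beta` are assumptions for illustration, not TADPO's published loss:

```python
def teacher_guided_loss(ratio, advantage, teacher_logp, clip=0.2, beta=0.5):
    """Per-transition loss: clipped on-policy surrogate (student
    exploration) plus a teacher-imitation term (off-policy guidance).

    ratio:        pi_new(a|s) / pi_old(a|s) for the student's action
    advantage:    estimated advantage of that action
    teacher_logp: student log-probability of the teacher's action
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip) * advantage
    ppo_term = -min(unclipped, clipped)   # standard PPO-clip objective
    teacher_term = -teacher_logp          # raise probability of teacher action
    return ppo_term + beta * teacher_term
```

The guidance weight trades off imitating demonstrated off-road behavior against discovering new maneuvers in low-signal terrain.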
[AI-26] An Interactive Multi-Agent System for Evaluation of New Product Concepts
【速读】:该论文旨在解决产品概念评估阶段中传统专家主导方法存在的主观偏差大、耗时长及成本高等问题。其解决方案的关键在于构建一个基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS),该系统由八个代表不同专业领域的虚拟代理组成,如研发(R&D)和市场营销等,通过检索增强生成(Retrieval-Augmented Generation, RAG)与实时搜索工具获取客观证据,并依据技术可行性与市场可行性两大维度进行结构化讨论与决策判断;同时,系统通过专业产品评审数据对代理进行微调,以提升评估准确性,最终在显示器产品概念案例中验证了其评估结果与资深行业专家高度一致,证明了该多智能体方法在支持产品开发决策中的有效性。
链接: https://arxiv.org/abs/2603.05980
作者: Bin Xuan,Ruo Ai,Hakyeon Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 46 pages, 3 figures + This paper proposes an LLM-based multi-agent system (MAS) for automated evaluation of new product concepts, incorporating retrieval-augmented generation (RAG) and cross-functional virtual agents to assess technical and market feasibility
Abstract:Product concept evaluation is a critical stage that determines strategic resource allocation and project success in enterprises. However, traditional expert-led approaches face limitations such as subjective bias and high time and cost requirements. To support this process, this study proposes an automated approach utilizing a large language model (LLM)-based multi-agent system (MAS). Through a systematic analysis of previous research on product development and team collaboration, this study established two primary evaluation dimensions, namely technical feasibility and market feasibility. The proposed system consists of a team of eight virtual agents representing specialized domains such as R&D and marketing. These agents use retrieval-augmented generation (RAG) and real-time search tools to gather objective evidence and validate concepts through structured deliberations based on the established criteria. The agents were further fine-tuned using professional product review data to enhance their judgment accuracy. A case study involving professional display monitor concepts demonstrated that the system’s evaluation rankings were consistent with those of senior industry experts. These results confirm the usability of the proposed multi-agent-based evaluation approach for supporting product development decisions.
[AI-27] Domain-Adaptive Model Merging across Disconnected Modes ICASSP2026
【速读】:该论文旨在解决跨域学习中因数据隐私或异构性导致无法集中数据时,难以训练单一综合模型的问题。现有方法受限于数据共享的障碍和高昂的重新训练成本,而模型合并(model merging)提供了一种替代方案,即从多个专业化模型中整合知识形成统一模型。其解决方案的关键在于提出一种无需数据的模型合并框架DMM,该框架包含三个步骤:首先独立训练各领域专用模型;其次对相似模型进行标准合并以保证稳定性;最后通过归一化统计量生成伪数据,并利用轻量级精炼机制将差异较大的模型知识蒸馏到合并模型中,从而在保持稳定性的前提下保留稀有但关键的知识。实验表明,DMM在单模态与多模态基准上均优于现有合并方法。
链接: https://arxiv.org/abs/2603.05957
作者: Junming Liu,Yusen Zhang,Rongchao Zhang,Wenkai Zhu,Tian Wu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, 3 tables; Accepted by ICASSP 2026
Abstract:Learning across domains is challenging when data cannot be centralized due to privacy or heterogeneity, which limits the ability to train a single comprehensive model. Model merging provides an appealing alternative by consolidating knowledge from multiple specialized models into one, avoiding data sharing and reducing retraining cost. In this work, we present DMM, a data-free model merging framework designed to handle highly divergent models. DMM proceeds in three steps. First, domain-specific models are trained independently. Second, models with high similarity are merged using standard techniques to ensure stability. Third, we synthesize pseudo-data from normalization statistics and distill knowledge from divergent models into the merged model through a lightweight refinement guided by these samples. This approach preserves rare but critical knowledge while maintaining stability. Extensive experiments on unimodal and multimodal benchmarks show that DMM achieves state-of-the-art performance over existing merging methods.
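The third step above, synthesizing pseudo-data from normalization statistics, rests on a simple idea: stored running means and variances define a per-feature Gaussian to sample from. A minimal sketch (the Gaussian assumption is the common choice for this technique; the paper's exact procedure may differ):

```python
import random

def sample_pseudo_data(means, variances, n_samples, seed=0):
    """Draw synthetic feature vectors from stored normalization
    statistics (e.g. BatchNorm running mean/variance per feature)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(m, v ** 0.5) for m, v in zip(means, variances)]
        for _ in range(n_samples)
    ]
```

Because only the statistics are needed, no original training data crosses domain boundaries, which is what makes the refinement data-free.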
[AI-28] XAI for Coding Agent Failures: Transforming Raw Execution Traces into Actionable Insights
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的代码生成代理在软件开发中因失败模式复杂且难以理解而导致的可解释性问题。现有方法依赖通用LLM提供临时解释,但原始执行轨迹对开发者仍具挑战性。解决方案的关键在于提出一种系统性的可解释人工智能(Explainable AI, XAI)框架,包含三个核心组件:(1) 基于真实代理失败案例构建的领域特定故障分类体系;(2) 自动化标注系统,依据预定义标注方案对故障进行分类;(3) 混合解释生成器,输出可视化执行流程、自然语言说明及可操作建议。实证研究表明,该方法显著提升用户识别故障根因的速度(快2.8倍)和修复准确性(提高73%),优于现有基于通用模型的即兴解释,实现了结构化、一致且与领域强相关的故障分析能力。
链接: https://arxiv.org/abs/2603.05941
作者: Arun Joshi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 17 Pages, 3 Figures, 2 Tables
Abstract:Large Language Model (LLM)-based coding agents show promise in automating software development tasks, yet they frequently fail in ways that are difficult for developers to understand and debug. While general-purpose LLMs like GPT can provide ad-hoc explanations of failures, raw execution traces remain challenging to interpret even for experienced developers. We present a systematic explainable AI (XAI) approach that transforms raw agent execution traces into structured, human-interpretable explanations. Our method consists of three key components: (1) a domain-specific failure taxonomy derived from analyzing real agent failures, (2) an automatic annotation system that classifies failures using defined annotation schema, (3) a hybrid explanation generator that produces visual execution flows, natural language explanations, and actionable recommendations. Through a user study with 20 participants (10 technical, 10 non-technical), we demonstrate that our approach enables users to identify failure root causes 2.8 times faster and propose correct fixes with 73% higher accuracy compared to raw execution traces. Importantly, our structured approach outperforms ad-hoc explanations from state-of-the-art models by providing consistent, domain-specific insights with integrated visualizations. Our work establishes a framework for systematic agent failure analysis, addressing the critical need for interpretable AI systems in software development workflows.
[AI-29] Stock Market Prediction Using Node Transformer Architecture Integrated with BERT Sentiment Analysis
【速读】:该论文旨在解决股票市场预测中因噪声、非平稳性和行为动态等因素导致的传统方法难以捕捉复杂模式与跨股票依赖关系的问题。其解决方案的关键在于构建一个融合节点变换器(node transformer)架构与基于BERT的情感分析的集成框架:首先将股票市场建模为图结构,其中个股作为节点、行业归属、价格联动和供应链关系构成边;其次利用微调后的BERT模型提取社交媒体情感信息,并通过注意力机制与量化市场特征融合;最后由节点变换器同时处理历史数据的时间演化和跨股票依赖关系。实验证明该方法在S&P 500成分股上实现了显著优于ARIMA和LSTM的预测精度(MAPE降低至0.80%),尤其在财报公告期间情感分析贡献了25%的误差减少,图建模进一步提升15%性能,验证了其有效性与鲁棒性。
链接: https://arxiv.org/abs/2603.05917
作者: Mohammad Al Ridhawi,Mahtab Haj Ali,Hussein Al Osman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
备注: 14 pages, 5 figures, 10 tables, submitted to IEEE Access
Abstract:Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment from social media posts and combines it with quantitative market features through attention-based fusion. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. Sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while graph-based modeling contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms these improvements (p < 0.05 for all comparisons). The model maintains MAPE below 1.5% during high-volatility periods where baseline models exceed 2%.
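The two headline metrics, MAPE and directional accuracy, are simple to compute; for reference:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)

def directional_accuracy(actual, predicted):
    """Share of steps where the predicted move from the previous
    actual price has the same sign as the realized move, in percent."""
    hits = sum(
        1 for i in range(1, len(actual))
        if (actual[i] - actual[i - 1]) * (predicted[i] - actual[i - 1]) > 0
    )
    return 100.0 * hits / (len(actual) - 1)
```

Note that a low MAPE does not imply high directional accuracy: a forecast can sit close to the price while still guessing the wrong sign of the move.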
[AI-30] DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
【速读】:该论文旨在解决生成式 AI(Generative AI)在构建深度研究报告(Deep Research Reports, DRRs)时,如何有效验证其主张级事实性(claim-level factuality)的问题。现有事实核查工具主要针对通用领域、原子化事实型陈述设计,缺乏适用于DRR的基准测试,且静态专家标注的基准存在脆弱性。解决方案的关键在于提出“审计-评分”(Audit-then-Score, AtS)的演化式基准方法:通过允许模型与专家共同参与争议仲裁机制,使基准标签和推理依据可修订——当验证器与当前基准不一致时,需提交证据,由专家审计员裁决并更新基准,从而逐步提升专家标注质量。实验表明,在四轮AtS迭代后,专家在微黄金集上的准确率从60.8%提升至90.9%,证明该机制显著增强基准可靠性,并据此构建了可审计的DeepFact-Bench基准与高效的DeepFact-Eval验证代理。
链接: https://arxiv.org/abs/2603.05912
作者: Yukun Huang,Leonardo F. R. Ribeiro,Momchil Hardalov,Bhuwan Dhingra,Markus Dreyer,Venkatesh Saligrama
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
[AI-31] The World Won't Stay Still: Programmable Evolution for Agent Benchmarks
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的智能体在静态环境基准下评估时,无法有效衡量其对现实世界动态变化适应能力的问题。现有基准通常假设环境具有固定的数据模式和工具集,忽略了真实场景中环境随时间演化的特性以及智能体对此类变化的鲁棒性需求。解决方案的关键在于提出ProEvolve框架,该框架基于有类型的关系图(typed relational graph)构建统一、显式的环境表示形式,将数据、工具与模式结构化为可操作的图节点与边;在此基础上,通过图变换(graph transformations)实现对环境能力的添加、删除或修改,并保证更新在工具、模式与数据访问之间的一致性传播,从而实现环境演化的可编程性与可控性,支持自动构造多样化环境及任务沙盒以更真实地评估智能体的适应能力。
链接: https://arxiv.org/abs/2603.05910
作者: Guangrui Li,Yaochen Xie,Yi Liu,Ziwei Dong,Xingyuan Pan,Tianqi Zheng,Jason Choi,Michael J. Morais,Binit Jha,Shaunak Mishra,Bingrou Zhou,Chen Luo,Monica Xiao Cheng,Dawn Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents’ robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents’ adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
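The typed relational graph and its transformations can be sketched minimally; the point is that removing a capability node coherently propagates to every edge that references it. The types and methods below are illustrative, not ProEvolve's actual API:

```python
class EnvGraph:
    """Toy typed relational graph over data, tool, and schema nodes."""

    def __init__(self):
        self.nodes = {}     # name -> type: "tool" | "schema" | "data"
        self.edges = set()  # (src, dst) dependency edges

    def add_node(self, name, ntype):
        self.nodes[name] = ntype

    def add_edge(self, src, dst):
        assert src in self.nodes and dst in self.nodes
        self.edges.add((src, dst))

    def remove_node(self, name):
        # transformation: dropping a capability also drops its edges,
        # keeping the environment representation consistent
        del self.nodes[name]
        self.edges = {(s, d) for s, d in self.edges if name not in (s, d)}
```

Programmed sequences of such transformations are what let one seed environment fan out into many evolved variants.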
[AI-32] LUMINA: LLM-Guided GPU Architecture Exploration via Bottleneck Analysis
【速读】:该论文旨在解决现代AI工作负载(如大语言模型LLM推理)下GPU设计空间探索(DSE)效率低下的问题,其核心挑战在于GPU设计空间庞大且多模态、仿真成本高,以及优化目标(如性能、功耗和面积之间的权衡)复杂。传统自动化DSE方法要么需要大量采样导致计算开销过大,要么依赖人工设计的复杂关键路径分析,难以规模化应用。解决方案的关键在于提出LUMINA框架,该框架利用生成式AI(Generative AI)驱动的设计空间探索机制:首先从仿真器代码中提取架构知识并自动构建DSE规则,通过敏感性分析实现规则的自动生成与动态修正;其次引入一个系统性的DSE基准测试(DSE Benchmark),用于评估和提升LLM在架构优化中所需的三项核心能力(即理解、推理与决策),从而确保模型选择的科学性和架构推理的一致性。最终,在包含470万种可能配置的设计空间中,LUMINA仅用20步即找到优于A100 GPU的6个设计方案,相较机器学习基线方法实现了17.5倍更高的探索效率和32.9%更优的帕累托超体积(Pareto Hypervolume),验证了其在低搜索成本下提供高质量设计指导的能力。
链接: https://arxiv.org/abs/2603.05904
作者: Tao Zhang,Rui Ma,Shuotao Xu,Peng Cheng,Yongqiang Xiong
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:GPU design space exploration (DSE) for modern AI workloads, such as Large-Language Model (LLM) inference, is challenging because of GPUs’ vast, multi-modal design spaces, high simulation costs, and complex design optimization objectives (e.g. performance, power and area trade-offs). Existing automated DSE methods are often prohibitively expensive, either requiring an excessive number of exploration samples or depending on intricate, manually crafted analyses of interdependent critical paths guided by human heuristics. We present LUMINA, an LLM-driven GPU architecture exploration framework that leverages AI to enhance DSE efficiency and efficacy for GPUs. LUMINA extracts architectural knowledge from simulator code and performs sensitivity studies to automatically compose DSE rules, which are auto-corrected during exploration. A core component of LUMINA is a DSE Benchmark that comprehensively evaluates and enhances LLMs’ capabilities across three fundamental skills required for architecture optimization, which provides a principled and reproducible basis for model selection and ensuring consistent architectural reasoning. In the design space with 4.7 million possible samples, LUMINA identifies 6 designs of better performance and area than an A100 GPU efficiently, using only 20 steps via LLM-assisted bottleneck analysis. In comparison, LUMINA achieves 17.5x higher design space exploration efficiency and 32.9% better designs (i.e. Pareto Hypervolume) than Machine-Learning baselines, showcasing its ability to deliver high-quality design guidance with minimal search cost.
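Pareto hypervolume, the comparison metric quoted above, measures the objective-space region dominated by a design front relative to a reference point. A two-objective maximization version:

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` (both objectives maximized)
    above the reference point `ref` = (ref_x, ref_y)."""
    # sweep from the best first objective downwards
    front = sorted(set(points), key=lambda p: -p[0])
    hv, best_y = 0.0, ref[1]
    for x, y in front:
        if y > best_y:
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv
```

A 32.9% hypervolume gain therefore means the discovered designs dominate a substantially larger region of the performance/area space than the baseline front.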
[AI-33] Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
【速读】:该论文旨在解决在基于指令的分子优化任务中,传统监督微调(SFT)和基于可验证奖励的强化学习(RLVR)方法因缺乏逐步骤优化轨迹而表现不佳的问题。具体而言,仅使用参考分子进行答案级SFT会破坏推理过程,而RLVR在相似性约束下因模型探索能力不足导致奖励稀疏、学习效率低下且优化效果受限。解决方案的关键在于提出参考引导策略优化(Reference-guided Policy Optimization, RePO),其核心机制是在每次更新时从模型中采样候选分子及其中间推理轨迹,并通过强化学习方式利用可验证奖励来衡量属性满足度与相似性约束;同时引入参考引导项,将中间推理轨迹作为上下文保留,并仅对最终答案进行监督训练,从而在促进新分子探索的同时,借助参考分子稳定训练并缓解奖励稀疏问题。
链接: https://arxiv.org/abs/2603.05900
作者: Xuan Li,Zhanke Zhou,Zongze Li,Jiangchao Yao,Yu Rong,Lu Zhang,Bo Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model’s lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy’s intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate \times Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at this https URL.
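The verifiable reward RePO trains against — property satisfaction gated by a similarity constraint — can be illustrated with a toy stand-in. The string-overlap similarity and the 'count oxygen atoms' property below are illustrative assumptions, not the paper's actual molecular metrics:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Toy stand-in for a molecular similarity measure (e.g. Tanimoto)."""
    return SequenceMatcher(None, a, b).ratio()

def verifiable_reward(candidate, source, prop_fn, target, sim_threshold=0.4):
    """Reward = property satisfaction gated by a similarity constraint,
    mirroring the Success Rate x Similarity style of metric."""
    sim = similarity(candidate, source)
    if sim < sim_threshold:          # too dissimilar: constraint violated
        return 0.0
    satisfied = 1.0 if prop_fn(candidate) >= target else 0.0
    return satisfied * sim

# Hypothetical property: count of 'O' atoms in a SMILES-like string.
prop = lambda smiles: smiles.count("O")
source = "CCOCC"
r_ok = verifiable_reward("CCOOCC", source, prop, target=2)   # similar, satisfies
r_bad = verifiable_reward("NNNNNN", source, prop, target=2)  # dissimilar
print(round(r_ok, 2), r_bad)
```

A candidate that drifts too far from the reference earns zero reward regardless of its properties, which is exactly the sparsity problem RePO's reference-guidance term is designed to mitigate.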
[AI-34] Computational Pathology in the Era of Emerging Foundation and Agentic AI – International Expert Perspectives on Clinical Integration and Translational Readiness
【速读】:该论文试图解决生成式 AI (Generative AI) 在临床病理学中从实验室研究向真实世界医疗实践转化过程中所面临的采纳滞后问题,其核心挑战在于如何将技术性能优势转化为可部署的临床价值。解决方案的关键在于建立一个综合评估框架,将可落地的临床相关性与下游分析能力、技术成熟度、运营准备度以及经济和监管环境相连接,从而实现负责任的整合路径。
链接: https://arxiv.org/abs/2603.05884
作者: Qian Da,Yijiang Chen,Min Ju,Zheyi Ji,Albert Zhou,Wenwen Wang,Matthew A Abikenari,Philip Chikontwe,Guillaume Larghero,Bowen Chen,Peter Neiglinger,Dingrong Zhong,Shuhao Wang,Wei Xu,Drew Williamson,German Corredor,Sen Yang,Le Lu,Xiao Han,Kun-Hsing Yu,Jun-zhou Huang,Laura Barisoni,Geert Litjens,Anant Madabhushi,Lifeng Zhu,Chaofu Wang,Junhan Zhao,Weiguo Hu
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent breakthroughs in artificial intelligence through foundation models and agents have accelerated the evolution of computational pathology. Demonstrated performance gains reported across academia in benchmarking datasets in predictive tasks such as diagnosis, prognosis, and treatment response have ignited substantial enthusiasm for clinical application. Despite this development momentum, real-world adoption has lagged, as implementation faces economic, technical, and administrative challenges. Beyond existing discussions of technical architectures and comparative performance, this review considers how these emerging AI systems can be responsibly integrated into medical practice by connecting deployable clinical relevance with downstream analytical capabilities and their technical maturity, operational readiness, and economic and regulatory context. Drawing on perspectives from an international group, we provide a practical assessment of current capabilities and barriers to adoption in patient care settings.
[AI-35] Ambiguity Collapse by LLM s: A Taxonomy of Epistemic Risks
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在处理具有多重合法解释的模糊概念(如“仇恨言论”或“偏见”)时,常常产生单一、看似确定的输出,这一现象称为“模糊性坍缩”(ambiguity collapse),从而绕过人类通过协商、争议和论证来建构意义的正常过程,带来认知、输出和生态系统层面的潜在风险。解决方案的关键在于设计多层缓解机制,涵盖训练策略、机构部署设计、界面功能支持以及对不明确提示(underspecified prompts)的管理,旨在使系统能够显式呈现、保留并负责任地治理模糊性,从而维护概念的开放性和社会协商空间。
链接: https://arxiv.org/abs/2603.05801
作者: Shira Gur-Arieh,Angelina Wang,Sina Fazelpour
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly used to make sense of ambiguous, open-textured, value-laden terms. Platforms routinely rely on LLMs for content moderation, asking them to label text based on disputed concepts like “hate speech” or “incitement”; hiring managers may use LLMs to rank who counts as “qualified”; and AI labs increasingly train models to self-regulate under constitutional-style ambiguous principles such as “biased” or “legitimate”. This paper introduces ambiguity collapse: a phenomenon that occurs when an LLM encounters a term that genuinely admits multiple legitimate interpretations, yet produces a singular resolution, in ways that bypass the human practices through which meaning is ordinarily negotiated, contested, and justified. Drawing on interdisciplinary accounts of ambiguity as a productive epistemic resource, we develop a taxonomy of the epistemic risks posed by ambiguity collapse at three levels: process (foreclosing opportunities to deliberate, develop cognitive skills, and shape contested terms), output (distorting the concepts and reasons agents act upon), and ecosystem (reshaping shared vocabularies, interpretive norms, and how concepts evolve over time). We illustrate these risks through three case studies, and conclude by sketching multi-layer mitigation principles spanning training, institutional deployment design, interface affordances, and the management of underspecified prompts, with the goal of designing systems that surface, preserve, and responsibly govern ambiguity.
[AI-36] StreamWise: Serving Multi-Modal Generation in Real-Time at Scale
【速读】:该论文旨在解决实时多模态工作流(如播客视频生成)在大规模部署时面临的高延迟、资源消耗大和调度复杂等问题,这些问题源于不同模态模型(语言、音频、图像、视频)对计算资源的需求差异及严格的性能指标约束。其解决方案的关键在于设计了一种自适应、模块化的服务系统 StreamWise,通过动态调整输出质量(如分辨率、清晰度)、内容并行度与资源感知调度策略,在异构硬件环境下实现低延迟(亚秒级启动延迟)与高效率的协同优化,从而在保证服务质量目标(SLO)的前提下显著降低整体成本。
链接: https://arxiv.org/abs/2603.05800
作者: Haoran Qiu,Gohar Irfan Chaudhry,Chaojie Zhang,Íñigo Goiri,Esha Choukse,Rodrigo Fonseca,Ricardo Bianchini
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than $25. StreamWise enables high-quality real-time streaming with a sub-second startup delay for under $45.
[AI-37] Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models
【速读】:该论文旨在解决当前安全对齐(safety alignment)机制中存在的根本性缺陷——即模型在检测到有害内容时自动拒绝响应的“单一体系”假设与实际中持续存在的越狱攻击(jailbreak attacks)之间的矛盾。作者提出解耦安全假说(Disentangled Safety Hypothesis, DSH),认为安全计算实际上发生在两个独立的子空间:一个用于识别危害性的“认知轴”(Recognition Axis, vH,“知道”),另一个用于执行拒绝行为的“执行轴”(Execution Axis, vR,“行动”)。其核心解决方案在于通过几何分析揭示了从早期层到深层层的安全信号由对抗纠缠向结构独立演化的普遍规律,并利用“双差分提取”和“自适应因果引导”技术实现因果性解耦,从而构造出“知而不行”的状态,进一步提出了拒绝擦除攻击(Refusal Erasure Attack, REA),精准切除拒绝机制以达成最先进的攻击效果。这一发现还揭示了不同架构下安全控制方式的根本差异:Llama3.1采用显式语义控制,而Qwen2.5则依赖潜在分布式控制。
链接: https://arxiv.org/abs/2603.05773
作者: Jinman Wu,Yi Xie,Shen Lin,Shiqian Zhao,Xiaofeng Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis (v_H, "Knowing") and an Execution Axis (v_R, "Acting"). Our geometric analysis reveals a universal "Reflex-to-Dissociation" evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of "Knowing without Acting." Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves State-of-the-Art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at this https URL.
[AI-38] Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads
【速读】:该论文旨在解决开源大语言模型(Open-Sourced Large Language Models, OSLLMs)在对齐(alignment)后仍可能遭受越狱攻击(jailbreak attacks)的问题,尤其是现有攻击方法多局限于浅层(如提示词或嵌入层),未能充分揭示深层模型组件中的潜在漏洞,从而造成防御上的虚假安全感。解决方案的关键在于提出一种基于注意力头(attention head)层级的越狱框架——SAHA(Safety Attention Head Attack),其核心创新包括:1)发现深层注意力层更易受攻击,并设计了基于消融影响排名(Ablation-Impact Ranking)的注意力头选择策略以精准定位生成不安全内容的关键层;2)引入边界感知的扰动方法——层间扰动(Layer-Wise Perturbation),在最小干扰注意力机制的前提下探测不安全内容生成路径,确保扰动后的输出与目标意图具有更高语义相关性并有效规避检测。实验表明,SAHA相较当前最优基线将攻击成功率(ASR)提升14%,揭示了注意力头层面的攻击面漏洞。
链接: https://arxiv.org/abs/2603.05772
作者: Jinman Wu,Yi Xie,Shiqian Zhao,Xiaofeng Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Currently, open-sourced large language models (OSLLMs) have demonstrated remarkable generative performance. However, as their structure and weights are made public, they are exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, which creates a false sense of security for successful defense. In this paper, we propose the Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that explores the vulnerability in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. Firstly, we reveal that deeper attention layers introduce more vulnerability against jailbreak attacks. Based on this finding, SAHA introduces an Ablation-Impact Ranking head selection strategy to effectively locate the most vital layer for unsafe output. Secondly, we introduce a boundary-aware perturbation method, i.e., Layer-Wise Perturbation, to probe the generation of unsafe content with minimal perturbation to the attention. This constrained perturbation guarantees higher semantic relevance with the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves ASR by 14% over SOTA baselines, revealing the vulnerability of the attack surface on the attention head. Our code is available at this https URL.
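The ablation-impact idea behind SAHA's head selection — rank components by how much removing each one changes an overall score — can be sketched on a toy scoring function. The named heads and their weights below are hypothetical; a real implementation would ablate attention heads in a transformer and measure the change in a safety metric:

```python
def ablation_impact_ranking(components, score_fn):
    """Rank components by how much ablating (removing) each one
    changes the score of the full set."""
    baseline = score_fn(components)
    impacts = {}
    for c in components:
        ablated = [x for x in components if x != c]
        impacts[c] = baseline - score_fn(ablated)
    return sorted(impacts, key=impacts.get, reverse=True)

# Hypothetical "safety contribution" score: a weighted sum over named
# heads, so the highest-weight head should rank first.
weights = {"head_3": 5.0, "head_7": 1.0, "head_11": 0.2}
score = lambda heads: sum(weights[h] for h in heads)
ranking = ablation_impact_ranking(list(weights), score)
print(ranking)  # most impactful head first
```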
[AI-39] TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks
【速读】:该论文旨在解决数据科学代理(data science agents)在Kaggle风格任务中实现端到端正确性和可靠性的问题,尤其是在时间限制下的表现评估。其解决方案的关键在于构建并发布TML-Bench——一个用于评估生成式AI(Generative AI)在表格数据任务中表现的基准测试平台,通过在四个Kaggle竞赛上对10个开源大语言模型(Open Source Large Language Models, OSS LLMs)进行系统性评估,设定三种不同时间预算(240秒、600秒和1200秒),并以是否成功生成有效提交及在隐藏标签上的私有验证分数作为成功标准,从而量化模型的性能、成功率与运行稳定性。
链接: https://arxiv.org/abs/2603.05764
作者: Mykola Pinchuk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 16 tables and figures
Abstract:Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks. Practical value depends on end-to-end correctness and reliability under time limits. This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks. This paper evaluates 10 OSS LLMs on four Kaggle competitions and three time budgets (240s, 600s, and 1200s). Each model is run five times per task and budget. A run is successful if it produces a valid submission and a private-holdout score on hidden labels that are not accessible to the agent. This paper reports median performance, success rates, and run-to-run variability. MiniMax-M2.1 model achieves the best aggregate performance score on all four competitions under the paper’s primary aggregation. Average performance improves with larger time budgets. Scaling is noisy for some individual models at the current run count. Code and materials are available at this https URL.
[AI-40] Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment
【速读】:该论文旨在解决最佳N采样(Best-of-N, BoN)在推理阶段对齐语言模型时存在的统计次优性与奖励欺骗(reward hacking)问题,尤其是在实际应用中更关注胜率(win-rate)这一基于成对比较的指标时。以往理论分析多聚焦于期望真实奖励,但该论文指出这在实践中未必合理,因此转而从胜率角度重新审视BoN的性能。其核心解决方案在于:在对参考模型和奖励模型质量做出最小假设的前提下,证明适当调参后的BoN在提升胜率方面具有计算与统计最优性;同时提出一种简单且实用的改进变体,可严格消除奖励欺骗,同时保持最优统计性能。这一发现解释了BoN在实践中广泛成功的原因,并强调了选择合适目标函数对于推理阶段对齐方法分析的重要性。
链接: https://arxiv.org/abs/2603.05739
作者: Ved Sriraman,Adam Block
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 52 pages
Abstract:Best-of-N (BoN) sampling is a widely used inference-time alignment method for language models, whereby N candidate responses are sampled from a reference model and the one with the highest predicted reward according to a learned reward model is selected. Despite its widespread practical use, recent theoretical work has suggested that it is statistically suboptimal and vulnerable to reward hacking, the process by which models exploit weaknesses in the learned reward model to achieve high estimated reward without genuinely improving performance. We revisit this question under assumptions that more closely reflect practice than that of prior work. In particular, in contradistinction to earlier analyses that focused on expected true reward, which may not be meaningful in many practical settings, we investigate how inference-time alignment affects the win-rate, a pairwise comparison-based metric more closely aligned with how reward models are trained and evaluated in practice. We demonstrate that, under minimal conditions on the quality of the reference model and learned reward model, properly tuned BoN is both computationally and statistically optimal in achieving high win-rate, partially explaining its widespread practical success. Because BoN remains susceptible to reward-hacking in this setting, we propose a simple and practical variant that provably eliminates reward-hacking while maintaining optimal statistical performance. Finally, we show that prior approaches are provably suboptimal when considering win-rate, highlighting the importance of choosing appropriate objectives when analyzing inference-time alignment methods.
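The BoN selection rule and the win-rate metric the paper analyzes are both simple to state in code. The Gaussian toy model below — responses as scalars, a noisy proxy reward — is an illustrative assumption, not the paper's setup:

```python
import random

def best_of_n(sample_fn, reward_fn, n):
    """Draw n candidates from the reference model and return the one
    with the highest score under the learned reward model."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=reward_fn)

def win_rate(policy_fn, reference_fn, judge_fn, trials=1000):
    """Fraction of pairwise comparisons the policy wins against the
    reference model, per some judge."""
    wins = sum(judge_fn(policy_fn(), reference_fn()) for _ in range(trials))
    return wins / trials

# Toy setup: responses are numbers, true quality is the value itself,
# and the learned reward model is a noisy proxy of true quality.
random.seed(0)
sample = lambda: random.gauss(0.0, 1.0)
proxy_reward = lambda x: x + random.gauss(0.0, 0.5)  # imperfect reward model

bon_policy = lambda: best_of_n(sample, proxy_reward, n=16)
rate = win_rate(bon_policy, sample, judge_fn=lambda a, b: a > b)
print(round(rate, 2))  # well above 0.5: BoN wins most pairwise comparisons
```

Even with a noisy reward model, selecting the best of 16 samples wins the large majority of pairwise comparisons against a single reference draw — the win-rate lens under which the paper argues tuned BoN is optimal.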
[AI-41] LTLGuard: Formalizing LTL Specifications with Compact Language Models and Lightweight Symbolic Reasoning
【速读】:该论文旨在解决将非正式需求(informal requirements)转化为形式化规范(formal specifications)时所面临的挑战,特别是针对资源受限的小型或中型语言模型(4B–14B参数)在生成线性时序逻辑(Linear Temporal Logic, LTL)规范时存在的语法有效性与语义一致性不足的问题。解决方案的关键在于提出一种模块化工具链LTLGuard,其核心机制是结合约束生成(constrained generation)与形式化一致性检查(formal consistency checking),利用生成式模型的能力初步产出候选LTL规范,并通过轻量级自动推理工具迭代优化,识别冲突来源并消除不一致,从而实现高效、正确且无冲突的LTL规范生成。
链接: https://arxiv.org/abs/2603.05728
作者: Medina Andresel,Cristinel Mateis,Dejan Nickovic,Spyridon Kounoupidis,Panagiotis Katsaros,Stavros Tripakis
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Translating informal requirements into formal specifications is challenging due to the ambiguity and variability of natural language (NL). This challenge is particularly pronounced when relying on compact (small and medium) language models, which may lack robust knowledge of temporal logic and thus struggle to produce syntactically valid and consistent formal specifications. In this work, we focus on enabling resource-efficient open-weight models (4B–14B parameters) to generate correct linear temporal logic (LTL) specifications from informal requirements. We present LTLGuard, a modular toolchain that combines constrained generation with formal consistency checking to generate conflict-free LTL specifications from informal input. Our method integrates the generative capabilities of language models with lightweight automated reasoning tools to iteratively refine candidate specifications, understand the origin of the conflicts and thus help in eliminating inconsistencies. We demonstrate the usability and the effectiveness of our approach and perform quantitative evaluation of the resulting framework.
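To make the LTL checking concrete, here is a minimal finite-trace evaluator for a fragment of LTL (atoms, not, and, X, G, F). It is a toy stand-in for the lightweight automated reasoning tools the paper integrates — real consistency checking would use a proper LTL satisfiability procedure over infinite traces:

```python
def holds(formula, trace, i=0):
    """Evaluate a simple LTL formula over a finite trace (a list of
    sets of atomic propositions), with finite-trace semantics for X/G/F.
    Formulas are nested tuples, e.g. ("G", ("atom", "p"))."""
    op = formula[0]
    if op == "atom":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "X":
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "G":
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "F":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# G(req -> F grant), encoded without '->' as G(not (req and not F grant)):
spec = ("G", ("not", ("and", ("atom", "req"),
                      ("not", ("F", ("atom", "grant"))))))
good = [{"req"}, set(), {"grant"}]   # every request is eventually granted
bad = [{"req"}, set(), set()]        # a request is never granted
print(holds(spec, good), holds(spec, bad))  # → True False
```

Checking candidate specifications against example traces like this is one cheap way to surface inconsistencies between a generated formula and the intended behavior before invoking heavier reasoning.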
[AI-42] Reasoning Models Struggle to Control their Chains of Thought
【速读】:该论文旨在解决生成式 AI (Generative AI) 模型在链式思维(Chain-of-thought, CoT)监控中的潜在风险问题,即模型是否能够主动控制其 CoT 内容以规避检测。研究发现,尽管模型具备一定的输出可控性(如最终答案),但其 CoT 可控性显著较低——例如 Claude Sonnet 4.5 在控制 CoT 时仅成功 2.7%,而控制最终输出时可达 61.9%。解决方案的关键在于提出并应用 CoT-Control 评估套件,通过设计强制约束类任务(如禁止使用特定术语进行推理)来量化模型的 CoT 控制能力,从而揭示当前主流模型对 CoT 监控仍具有较强不可控性,这为保障 CoT 监控有效性提供了实证依据,并建议前沿实验室持续追踪该指标。
链接: https://arxiv.org/abs/2603.05706
作者: Chen Yueh-Han,Robert McCarthy,Bruce W. Lee,He He,Ian Kivlichan,Bowen Baker,Micah Carroll,Tomek Korbak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability – CoT controllability – we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word ‘chromosome’. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
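A CoT-Control-style measurement — did the reasoning obey the constraint and did the final answer stay correct — reduces to a simple scoring loop. The transcripts and the genetics example below are invented for illustration; the real suite evaluates live model generations:

```python
import re

def cot_control_score(transcripts, forbidden, answer_check):
    """Fraction of (chain-of-thought, answer) pairs whose reasoning
    avoids the forbidden term AND whose final answer is correct."""
    controlled = 0
    for cot, answer in transcripts:
        avoids = not re.search(rf"\b{re.escape(forbidden)}\b", cot, re.I)
        if avoids and answer_check(answer):
            controlled += 1
    return controlled / len(transcripts)

# Hypothetical transcripts for a genetics question where the word
# 'chromosome' is off-limits in the CoT (as in the paper's example).
transcripts = [
    ("The DNA strands pair during meiosis...", "23 pairs"),
    ("Each chromosome duplicates before division...", "23 pairs"),
    ("Humans carry 46 of these structures...", "23 pairs"),
]
score = cot_control_score(transcripts, "chromosome", lambda a: a == "23 pairs")
print(score)  # 2 of 3 transcripts comply
```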
[AI-43] SecureRAG -RTL: A Retrieval-Augmented Multi-Agent Zero-Shot LLM -Driven Framework for Hardware Vulnerability Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 在硬件安全验证中应用受限的问题,核心瓶颈在于硬件描述语言(HDL)数据集的稀缺性导致模型难以有效识别 HDL 设计中的漏洞。解决方案的关键在于提出 SecureRAG-RTL,一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的方法,通过引入领域特定的检索机制与生成式推理相结合,弥补 LLM 在硬件安全专业知识上的不足,从而显著提升漏洞检测准确率,平均提升约 30%。
链接: https://arxiv.org/abs/2603.05689
作者: Touseef Hasan,Blessing Airehenbuwa,Nitin Pundir,Souvika Sarkar,Ujjwal Guin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have shown remarkable capabilities in natural language processing tasks, yet their application in hardware security verification remains limited due to scarcity of publicly available hardware description language (HDL) datasets. This knowledge gap constrains LLM performance in detecting vulnerabilities within HDL designs. To address this challenge, we propose SecureRAG-RTL, a novel Retrieval-Augmented Generation (RAG)-based approach that significantly enhances LLM-based security verification of hardware designs. Our approach integrates domain-specific retrieval with generative reasoning, enabling models to overcome inherent limitations in hardware security expertise. We establish baseline vulnerability detection rates using prompt-only methods and then demonstrate that SecureRAG-RTL achieves substantial improvements across diverse LLM architectures, regardless of size. On average, our method increases detection accuracy by about 30%, highlighting its effectiveness in bridging domain knowledge gaps. For evaluation, we curated and annotated a benchmark dataset of 14 HDL designs containing real-world security vulnerabilities, which we will release publicly to support future research. These findings underscore the potential of RAG-driven augmentation to enable scalable, efficient, and accurate hardware security verification workflows.
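The retrieve-then-augment pattern underlying such a RAG pipeline can be sketched with a bag-of-words retriever. The CWE-style knowledge snippets and the query below are hypothetical, and a real system like SecureRAG-RTL would use dense embeddings rather than keyword overlap:

```python
def retrieve(query, corpus, k=2):
    """Rank knowledge snippets by keyword overlap with the query and
    return the top k. A minimal stand-in for an embedding retriever."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def augment_prompt(query, corpus):
    """Build an augmented verification prompt from retrieved context."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nCheck this HDL design for: {query}"

# Hypothetical knowledge base of hardware-weakness descriptions.
corpus = [
    "debug port left enabled allows privilege escalation",
    "unlocked test mode exposes internal registers",
    "weak reset logic leaves registers with stale secrets",
]
query = "debug port enabled during reset"
prompt = augment_prompt(query, corpus)
print(prompt.splitlines()[1])  # highest-ranked snippet leads the context
```

The retrieved snippets give the LLM domain knowledge it lacks from pretraining, which is the gap the paper attributes its ~30% accuracy gain to.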
[AI-44] Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search
【速读】:该论文旨在解决开放世界家庭环境中交互式物体搜索的问题,即如何在复杂、动态的场景中高效地理解物体与其周围环境之间的语义关系,以指导探索行为。传统方法依赖视觉-语言嵌入相似性,难以捕捉任务相关的关联语义;而使用大语言模型(Large Language Models, LLMs)虽能提供强语义推理能力,但计算成本高、延迟大,不适用于实时机器人部署。解决方案的关键在于提出SCOUT框架:通过构建3D场景图(Scene Graph),利用基于关系的探索启发式策略(如房间-物体包含关系和物体共现关系)为房间、前沿区域和物体分配效用分数,从而实现高效路径规划;同时设计了一种离线过程蒸馏框架,将LLM中的结构化关系知识提取并压缩为轻量级模型,用于机器人端实时推理,兼顾性能与效率。
链接: https://arxiv.org/abs/2603.05642
作者: Imen Mahdi,Matteo Cassinelli,Fabien Despinoy,Tim Welschehold,Abhinav Valada
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods either rely on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.
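The two relational heuristics SCOUT scores with — room-object containment and object-object co-occurrence — combine naturally into a per-room utility. The scene graph and the hand-picked probabilities below are hypothetical, and the additive combination is a simplifying assumption rather than the paper's learned utility:

```python
def object_search_utilities(scene_graph, target, containment, cooccurrence):
    """Assign each room a utility for finding the target object by
    combining a room-object containment prior with the best
    object-object co-occurrence boost from objects seen in the room."""
    utilities = {}
    for room, objects in scene_graph.items():
        room_prior = containment.get((room, target), 0.0)
        boost = max((cooccurrence.get((obj, target), 0.0) for obj in objects),
                    default=0.0)
        utilities[room] = room_prior + boost
    return utilities

scene = {"kitchen": ["fridge", "stove"], "bedroom": ["bed"], "office": ["desk"]}
containment = {("kitchen", "mug"): 0.6, ("office", "mug"): 0.3}
cooccurrence = {("desk", "mug"): 0.5, ("stove", "mug"): 0.1}
utils = object_search_utilities(scene, "mug", containment, cooccurrence)
print(max(utils, key=utils.get))  # the desk co-occurrence lifts the office
```

Here the observed desk lifts the office past the kitchen's higher containment prior — the kind of relational reasoning that plain embedding similarity tends to miss.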
[AI-45] Real-Time AI Service Economy: A Framework for Agent ic Computing Across the Continuum
【速读】:该论文旨在解决实时人工智能(AI)服务在设备-边缘-云连续体中,因自治AI代理生成延迟敏感型工作负载、编排多阶段处理流水线并竞争共享资源时,如何实现可扩展且可靠的去中心化价格机制资源配置问题。其核心挑战在于服务依赖图(service-dependency graphs,建模为有向无环图 DAG)的拓扑结构对价格收敛性与分配效率具有决定性影响:当依赖关系呈现分层结构(如树状或串行-并行结构)时,价格趋于稳定且最优分配可高效计算;而复杂交叉依赖则导致价格震荡、分配质量下降。解决方案的关键在于提出一种混合管理架构,通过跨域集成器将复杂子图封装为资源切片(resource slices),向市场其余部分提供结构清晰的接口,从而显著降低价格波动(实验表明降幅达70–75%),同时保持吞吐量不变,并验证了在诚实报价下去中心化市场可达到集中式价值最优基准水平。
链接: https://arxiv.org/abs/2603.05614
作者: Lauri Lovén,Alaa Saleh,Reza Farahani,Ilir Murturi,Miguel Bordallo López,Praveen Kumar Donta,Schahram Dustdar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Real-time AI services increasingly operate across the device-edge-cloud continuum, where autonomous AI agents generate latency-sensitive workloads, orchestrate multi-stage processing pipelines, and compete for shared resources under policy and governance constraints. This article shows that the structure of service-dependency graphs, modelled as DAGs whose nodes represent compute stages and whose edges encode execution ordering, is a primary determinant of whether decentralised, price-based resource allocation can work reliably at scale. When dependency graphs are hierarchical (tree or series-parallel), prices converge to stable equilibria, optimal allocations can be computed efficiently, and under appropriate mechanism design (with quasilinear utilities and discrete slice items), agents have no incentive to misreport their valuations within each decision epoch. When dependencies are more complex, with cross-cutting ties between pipeline stages, prices oscillate, allocation quality degrades, and the system becomes difficult to manage. To bridge this gap, we propose a hybrid management architecture in which cross-domain integrators encapsulate complex sub-graphs into resource slices that present a simpler, well-structured interface to the rest of the market. A systematic ablation study across six experiments (1,620 runs, 10 seeds each) confirms that (i) dependency-graph topology is a first-order determinant of price stability and scalability, (ii) the hybrid architecture reduces price volatility by up to 70-75% without sacrificing throughput, (iii) governance constraints create quantifiable efficiency-compliance trade-offs that depend jointly on topology and load, and (iv) under truthful bidding the decentralised market matches a centralised value-optimal baseline, confirming that decentralised coordination can replicate centralised allocation quality.
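The price-based coordination being studied can be illustrated with a minimal ascending-price (tatonnement) loop for identical resource-slice units. The agents, valuations, and unit-demand assumption below are toy stand-ins for the paper's market mechanism:

```python
def ascending_price_allocation(valuations, supply, step=1.0, max_rounds=1000):
    """Raise the price while demand exceeds supply; return the clearing
    price and the agents still demanding a unit at that price."""
    price = 0.0
    for _ in range(max_rounds):
        # Each agent demands one unit iff its valuation exceeds the price.
        demanders = [a for a, v in valuations.items() if v > price]
        if len(demanders) <= supply:
            return price, sorted(demanders)
        price += step
    raise RuntimeError("no clearing price found")

# Hypothetical per-epoch valuations for one resource slice each.
valuations = {"agent_a": 10.0, "agent_b": 7.0, "agent_c": 3.0}
price, winners = ascending_price_allocation(valuations, supply=2)
print(price, winners)  # → 3.0 ['agent_a', 'agent_b']
```

With truthful unit-demand bids this allocates the two slices to the two highest-value agents — the kind of value-optimal outcome the paper shows decentralised markets can match when dependency structure is well-behaved, and lose when prices oscillate.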
[AI-46] On the Value of Tokeniser Pretraining in Physics Foundation Models ICLR2026
【速读】:该论文旨在解决物理模拟中基础模型训练效率与精度之间的权衡问题,特别是针对高分辨率时空数据的压缩表示与物理动力学建模同时学习时存在的相互干扰现象。其核心解决方案在于引入分阶段预训练策略:首先使用自编码目标对tokeniser进行预训练,以学习高质量的紧凑数据表示,再在此基础上训练动力学模型。关键发现是,这种预训练方式显著提升了下游任务的计算效率和准确性,尤其是在预训练与目标任务属于同一物理系统(in-domain)时,可使验证相对均方误差(VRMSE)降低64%。此外,论文还提出可动态调整压缩比的时空压缩操作,进一步增强了模型对多样化下游任务的适应能力。
链接: https://arxiv.org/abs/2603.05598
作者: Hadi Sotoudeh,Payel Mukhopadhyay,Ruben Ohana,Michael McCabe,Neil D. Lawrence,Shirley Ho,Miles Cranmer
机构: 未知
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 16 pages, 4 figures. Workshop paper at ICLR 2026 AI PDE
Abstract:We investigate the impact of tokeniser pretraining on the accuracy and efficiency of physics emulation. Modern high-resolution simulations produce vast volumes of data spanning diverse physical regimes and scales. Training foundation models to learn the dynamics underlying such data enables the modelling of complex multiphysics phenomena, especially in data-limited settings. The emerging class of physics foundation models typically aims to learn two tasks jointly: (i) extracting compact representations of high-resolution spatiotemporal data, and (ii) capturing governing physical dynamics. However, learning both tasks from scratch simultaneously can impede the effectiveness of either process. We demonstrate that pretraining the tokeniser with an autoencoding objective prior to training the dynamics model enhances computational efficiency for downstream tasks. Notably, the magnitude of this benefit depends on domain alignment: pretraining on the same physical system as the downstream task yields the largest improvements, while pretraining on other systems provides moderate gains. In-domain pretraining reduces VRMSE by 64% after 10,500 training steps compared to training from scratch. To our knowledge, this is the first systematic investigation of tokeniser pretraining for physics foundation models. We further introduce flexible spatiotemporal compression operations that extend causal convolutions to support runtime-adjustable compression ratios, enabling efficient adaptation to diverse downstream tasks. Our findings provide practical guidance for training efficient physics emulators and highlight the importance of strategic pretraining data selection.
[AI-47] Spatiotemporal Heterogeneity of AI-Driven Traffic Flow Patterns and Land Use Interaction: A GeoAI-Based Analysis of Multimodal Urban Mobility
【速读】:该论文旨在解决城市交通流建模中难以同时捕捉多尺度动态特征(包括土地利用格局与时空异质性出行需求之间的复杂非线性交互)的问题,传统全局回归和时间序列模型无法有效处理跨多种交通方式的多尺度空间异质性。其解决方案的关键在于提出一种GeoAI Hybrid分析框架,该框架通过顺序集成多尺度地理加权回归(Multiscale Geographically Weighted Regression, MGWR)、随机森林(Random Forest, RF)和时空图卷积网络(Spatio-Temporal Graph Convolutional Networks, ST-GCN),实现对机动车、公共交通和慢行交通三种出行模式下交通流时空模式及其与土地利用关系的精准建模与解释。
链接: https://arxiv.org/abs/2603.05581
作者: Olaf Yunus Laitinen Imanov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 13 pages, 7 figures, 9 tables. Submitted to Computers, Environment and Urban Systems (Elsevier)
Abstract:Urban traffic flow is governed by the complex, nonlinear interaction between land use configuration and spatiotemporally heterogeneous mobility demand. Conventional global regression and time-series models cannot simultaneously capture these multi-scale dynamics across multiple travel modes. This study proposes a GeoAI Hybrid analytical framework that sequentially integrates Multiscale Geographically Weighted Regression (MGWR), Random Forest (RF), and Spatio-Temporal Graph Convolutional Networks (ST-GCN) to model the spatiotemporal heterogeneity of traffic flow patterns and their interaction with land use across three mobility modes: motor vehicle, public transit, and active transport. Applying the framework to an empirically calibrated dataset of 350 traffic analysis zones across six cities spanning two contrasting urban morphologies, four key findings emerge: (i) the GeoAI Hybrid achieves a root mean squared error (RMSE) of 0.119 and an R^2 of 0.891, outperforming all benchmarks by 23-62%; (ii) SHAP analysis identifies land use mix as the strongest predictor for motor vehicle flows and transit stop density as the strongest predictor for public transit; (iii) DBSCAN clustering identifies five functionally distinct urban traffic typologies with a silhouette score of 0.71, and GeoAI Hybrid residuals exhibit Moran's I=0.218 (p < 0.001), a 72% reduction relative to OLS baselines; and (iv) cross-city transfer experiments reveal moderate within-cluster transferability (R^2=0.78) and limited cross-cluster generalisability, underscoring the primacy of urban morphological context. The framework offers planners and transportation engineers an interpretable, scalable toolkit for evidence-based multimodal mobility management and land use policy design.
[AI-48] Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent
【速读】:该论文旨在解决当前自演化语言智能体(self-evolving language agents)在工具生成与使用过程中存在的两个核心问题:一是现有基准测试多依赖预定义工具规格,限制了系统的可扩展性和真正自主演化的潜力;二是已有方法虽能动态生成工具,但缺乏透明的诊断机制,导致难以定位失败原因。解决方案的关键在于提出 Tool-Genesis,这是一个多维度诊断基准,能够量化评估智能体在接口合规性(interface compliance)、功能正确性(functional correctness)和下游实用性(downstream utility)三个层面的能力。通过要求智能体仅凭抽象任务需求构建工具并解决实际问题,Tool-Genesis 揭示出即使最先进的模型在一次性生成阶段也难以保证精确的接口或可执行逻辑,且这些微小初始缺陷会在流水线中被放大,显著降低最终性能,从而为未来研究提供了明确的方向——即训练和引导模型合成持久、通用的工具以应对现实挑战。
链接: https://arxiv.org/abs/2603.05578
作者: Bowei Xia,Mengkang Hu,Shijian Wang,Jiarui Jin,Wenxiang Jiao,Yuan Lu,Kexin Li,Ping Luo
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 25 pages, 10 figures, 2 tables
Abstract:Research on self-evolving language agents has accelerated, drawing increasing attention to their ability to create, adapt, and maintain tools from task requirements. However, existing benchmarks predominantly rely on predefined specifications, which limits scalability and hinders truly autonomous evolution. While recent studies attempt to dynamically generate tools, they primarily emphasize downstream performance, resulting in a “black-box” evaluation that makes it difficult to attribute failures to specific causes. To address this, we propose Tool-Genesis, a diagnostic benchmark designed to quantify agent capabilities across multiple dimensions, including interface compliance, functional correctness, and downstream utility. Tool-Genesis evaluates whether agents can construct task-relevant tools solely from abstract requirements (without preset specifications) and use them to solve realistic problems. Crucially, we find that even state-of-the-art models struggle to produce precise tool interfaces or executable logic in a one-shot setting. These minor initial flaws are amplified through the pipeline, leading to a sharp degradation in downstream metrics. We hope Tool-Genesis will guide future research toward training and steering models to synthesize persistent, general-purpose tools that better address real-world challenges.
[AI-49] PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions
【速读】:该论文旨在解决机器人操作中模仿学习(Imitation Learning, IL)策略在面对新目标配置或约束时泛化能力不足、数据效率低的问题。解决方案的关键在于提出PRISM方法,这是一种基于指令的策略精炼机制,将IL与强化学习(Reinforcement Learning, RL)无缝整合为一个流程:首先从用户引导的演示中生成通用任务的模仿策略,随后通过RL进行精细调整,利用自然语言任务描述迭代生成奖励函数,并引入人类反馈对中间轨迹进行修正,从而实现策略的可复用性和高效适应新场景的能力。实验表明,该方法显著提升了部署鲁棒性并降低了计算开销。
链接: https://arxiv.org/abs/2603.05574
作者: Arnau Boix-Granell,Alberto San-Miguel-Tello,Magí Dalmau-Moreno,Néstor García
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, Accepted for publication at European Robotics Forum 2026
Abstract:This paper presents PRISM: an instruction-conditioned refinement method for imitation policies in robotic manipulation. This approach bridges Imitation Learning (IL) and Reinforcement Learning (RL) frameworks into a seamless pipeline, such that an imitation policy for a broad generic task, generated from a set of user-guided demonstrations, can be refined through reinforcement to generate new unseen fine-grained behaviours. The refinement process follows the Eureka paradigm, where reward functions for RL are iteratively generated from an initial natural-language task description. The presented approach builds on this mechanism to adapt a refined IL policy for a generic task to new goal configurations and newly introduced constraints, adding human feedback correction on intermediate rollouts, enabling policy reusability and therefore data efficiency. Results for a pick-and-place task in a simulated scenario show that the proposed method outperforms policies without human feedback, improving robustness on deployment and reducing computational burden.
[AI-50] When AI Levels the Playing Field: Skill Homogenization Asset Concentration and Two Regimes of Inequality
【速读】:该论文试图解决的问题是:生成式 AI (Generative AI) 在压缩个体任务内技能差异的同时,如何通过重塑经济价值分配机制加剧整体不平等,形成“技术均等化个体表现却扩大群体差距”的悖论。其解决方案的关键在于构建一个包含内生教育、雇主筛选和异质性企业特征的任务导向模型,识别出两种均衡状态的边界条件——该边界由 AI 的技术结构(专有 vs. 商品化)与劳动力市场制度(租金分享弹性、资产集中度)共同决定。研究进一步通过模拟矩法校准模型参数,并揭示非 Gini 指标(如 m₆ 和 ξ)对整体不平等方向起决定作用,而 AI 技术结构(η₁ vs. η₀)独立控制均衡状态转换,从而明确机制路径而非简单判断不平等符号。
链接: https://arxiv.org/abs/2603.05565
作者: Xupeng Chen,Shuchen Meng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI compresses within-task skill differences while shifting economic value toward concentrated complementary assets, creating an apparent paradox: the technology that equalizes individual performance may widen aggregate inequality. We formalize this tension in a task-based model with endogenous education, employer screening, and heterogeneous firms. The model yields two regimes whose boundary depends on AI’s technology structure (proprietary vs. commodity) and labor market institutions (rent-sharing elasticity, asset concentration). A scenario analysis via Method of Simulated Moments, matching six empirical targets, disciplines the model’s quantitative magnitudes; a sensitivity decomposition reveals that the five non-\Delta Gini moments identify mechanism rates but not the aggregate sign, which at the calibrated parameters is pinned by m_6 and \xi, while AI’s technology structure (\eta_1 vs. \eta_0) independently crosses the boundary. The contribution is the mechanism – not a verdict on the sign. Occupation-level regressions using BLS OEWS data (2019–2023) illustrate why such data cannot test the model’s task-level predictions. The predictions are testable with within-occupation, within-task panel data that do not yet exist at scale.
[AI-51] Model Change for Description Logic Concepts AAAI2026
【速读】:该论文试图解决在描述逻辑(Description Logic)框架下对概念描述进行模型变更的问题,具体包括三种操作:驱逐(eviction,仅移除模型)、接收(reception,引入新模型)和修订(revision,同时执行移除与引入)。其核心挑战在于如何形式化地定义“修订”这一复合操作,并证明它不能简单等价于驱逐与接收的组合,这与直觉相反。解决方案的关键在于提出一个严格的修订形式化定义,并通过理论分析揭示了在EL和ALC两类描述逻辑中,驱逐与接收的兼容性以及ALC下修订操作的可行性边界,从而为知识库动态更新提供了逻辑基础。
链接: https://arxiv.org/abs/2603.05562
作者: Ana Ozaki,Jandson S. Ribeiro
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Presented at AAAI 2026 (main track)
Abstract:We consider the problem of modifying a description logic concept in light of models represented as pointed interpretations. We call this setting model change, and distinguish three main kinds of changes: eviction, which consists of only removing models; reception, which incorporates models; and revision, which combines removal with incorporation of models in a single operation. We introduce a formal notion of revision and argue that it does not reduce to a simple combination of eviction and reception, contrary to intuition. We provide positive and negative results on the compatibility of eviction and reception for EL and ALC description logic concepts and on the compatibility of revision for ALC concepts.
[AI-52] Towards Efficient and Stable Ocean State Forecasting: A Continuous-Time Koopman Approach
【速读】:该论文旨在解决长期海洋状态预测中传统数值模型计算成本高、长时间滚动预测易出现误差累积与能量漂移的问题。解决方案的关键在于提出连续时间Koopman自编码器(Continuous-Time Koopman Autoencoder, CT-KAE),通过将非线性动力学投影到由线性常微分方程控制的潜在空间,实现结构化且可解释的时间演化;同时利用矩阵指数形式实现时间分辨率无关的预测,从而在2083天的滚动预测中保持误差增长可控、大尺度统计特性稳定,相较自回归Transformer基线模型具有显著优势。
链接: https://arxiv.org/abs/2603.05560
作者: Rares Grozavescu,Pengyu Zhang,Mark Girolami,Etienne Meunier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph); Geophysics (physics.geo-ph)
备注:
Abstract:We investigate the Continuous-Time Koopman Autoencoder (CT-KAE) as a lightweight surrogate model for long-horizon ocean state forecasting in a two-layer quasi-geostrophic (QG) system. By projecting nonlinear dynamics into a latent space governed by a linear ordinary differential equation, the model enforces structured and interpretable temporal evolution while enabling temporally resolution-invariant forecasting via a matrix exponential formulation. Across 2083-day rollouts, CT-KAE exhibits bounded error growth and stable large-scale statistics, in contrast to autoregressive Transformer baselines which exhibit gradual error amplification and energy drift over long rollouts. While fine-scale turbulent structures are partially dissipated, bulk energy spectra, enstrophy evolution, and autocorrelation structure remain consistent over long horizons. The model achieves orders-of-magnitude faster inference compared to the numerical solver, suggesting that continuous-time Koopman surrogates offer a promising backbone for efficient and stable hybrid physical-machine learning climate models.
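CT-KAE 的核心是让潜在状态服从线性常微分方程 dz/dt = Az,并用矩阵指数 z(t) = exp(At) z(0) 做时间演化,从而天然具备时间分辨率不变性("一大步等于两小步")。以下 Python 草图示意这一性质;矩阵 A、步长均为假设的玩具参数,矩阵指数用截断泰勒级数近似实现,并非论文实现:

```python
import numpy as np

def expm(a, terms=30):
    """Matrix exponential via truncated Taylor series (adequate for small matrices)."""
    out = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ a / k
        out = out + term
    return out

omega = 2.0
A = np.array([[0.0, -omega], [omega, 0.0]])  # toy latent generator dz/dt = A z
z0 = np.array([1.0, 0.0])

# Resolution invariance: one big step equals two half steps
dt = 0.1
one_step = expm(A * 2 * dt) @ z0
two_steps = expm(A * dt) @ (expm(A * dt) @ z0)
print(np.allclose(one_step, two_steps))  # True
```

这里 A 取反对称形式,exp(At) 恰为旋转矩阵,因而演化严格保模长,对应摘要中"有界误差增长、能量不漂移"的直觉。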
[AI-53] Human-Data Interaction Exploration and Visualization in the AI Era: Challenges and Opportunities
【速读】:该论文旨在解决人工智能(AI)快速发展背景下,人-数据交互系统面临的一系列核心挑战,包括感知不对齐的延迟、可扩展性限制、现有交互与探索范式的局限性,以及由大语言模型(LLM)和视觉语言模型(VLM)等基础模型引入的分析过程中的不确定性问题。解决方案的关键在于超越传统效率与可扩展性指标,重新定义人类与机器在分析工作流中的角色,并将认知科学、感知机制与设计原则深度融合至人-数据交互栈的每一层级,从而构建更可靠、可解释且以用户为中心的AI增强型交互式数据分析系统。
链接: https://arxiv.org/abs/2603.05542
作者: Jean-Daniel Fekete,Yifan Hu,Dominik Moritz,Arnab Nandi,Senjuti Basu Roy,Eugene Wu,Nikos Bikakis,George Papastefanatos,Panos K. Chrysanthis,Guoliang Li,Lingyun Yu
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Graphics (cs.GR); Multimedia (cs.MM)
备注:
Abstract:The rapid advancement of AI is transforming human-centered systems, with profound implications for human-AI interaction, human-data interaction, and visual analytics. In the AI era, data analysis increasingly involves large-scale, heterogeneous, and multimodal data that is predominantly unstructured, as well as foundation models such as LLMs and VLMs, which introduce additional uncertainty into analytical processes. These shifts expose persistent challenges for human-data interactive systems, including perceptually misaligned latency, scalability constraints, limitations of existing interaction and exploration paradigms, and growing uncertainty regarding the reliability and interpretability of AI-generated insights. Responding to these challenges requires moving beyond conventional efficiency and scalability metrics, redefining the roles of humans and machines in analytical workflows, and incorporating cognitive, perceptual, and design principles into every level of the human-data interaction stack. This paper investigates the challenges introduced by recent advances in AI and examines how these developments are reshaping the ways users engage with data, while outlining limitations and open research directions for building human-centered AI systems for interactive data analysis in the AI era.
[AI-54] JAWS: Enhancing Long-term Rollout of Neural Operators via Spatially-Adaptive Jacobian Regularization
【速读】:该论文旨在解决数据驱动的代理模型在模拟连续动力系统时,因自回归滚动(autoregressive rollout)导致的不稳定性和频谱爆炸问题。传统全局正则化方法虽能强制收缩动力学,但会均匀衰减高频特征,引发收缩-耗散困境;而显式校正漂移的长时程轨迹优化方法则受限于内存瓶颈。其解决方案的关键在于提出JAWS(Jacobian-Adaptive Weighting for Stability),一种基于概率正则化的策略,将算子学习建模为带有空间异方差不确定性的最大后验估计(Maximum A Posteriori, MAP),通过局部物理复杂度动态调节正则化强度:在平滑区域增强收缩以抑制噪声,在奇异特征附近放松约束以保留梯度,从而实现类似数值激波捕捉(shock-capturing)的行为。该方法有效降低基础算子对高频不稳定的负担,使短时程轨迹优化在内存效率下达到或超越长时程基线的长期精度。
链接: https://arxiv.org/abs/2603.05538
作者: Fengxiang Nie,Yasuhiro Suzuki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 11 pages, 16 figures
Abstract:Data-driven surrogate models improve the efficiency of simulating continuous dynamical systems, yet their autoregressive rollouts are often limited by instability and spectral blow-up. While global regularization techniques can enforce contractive dynamics, they uniformly damp high-frequency features, introducing a contraction-dissipation dilemma. Furthermore, long-horizon trajectory optimization methods that explicitly correct drift are bottlenecked by memory constraints. In this work, we propose Jacobian-Adaptive Weighting for Stability (JAWS), a probabilistic regularization strategy designed to mitigate these limitations. By framing operator learning as Maximum A Posteriori (MAP) estimation with spatially heteroscedastic uncertainty, JAWS dynamically modulates the regularization strength based on local physical complexity. This allows the model to enforce contraction in smooth regions to suppress noise, while relaxing constraints near singular features to preserve gradients, effectively realizing a behavior similar to numerical shock-capturing schemes. Experiments demonstrate that this spatially-adaptive prior serves as an effective spectral pre-conditioner, which reduces the base operator’s burden of handling high-frequency instabilities. This reduction enables memory-efficient, short-horizon trajectory optimization to match or exceed the long-term accuracy of long-horizon baselines. Evaluated on the 1D viscous Burgers’ equation, our hybrid approach improves long-term stability, shock fidelity, and out-of-distribution generalization while reducing training computational costs.
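JAWS 的空间自适应思想可以用一维玩具例子示意:在平滑区域施加强正则以抑噪,在梯度陡峭(类激波)处放松约束以保留梯度。下面 Python 草图中权重函数的具体形式(λ0 / (1 + |∇u| / 均值))是示意性假设,并非论文的实际公式:

```python
import numpy as np

def adaptive_weights(u, lam0=1.0, eps=1e-8):
    """Heteroscedastic regularisation strength: strong in smooth regions,
    relaxed where the local gradient (a shock proxy) is large.
    The exact functional form here is an illustrative assumption."""
    grad = np.abs(np.gradient(u))
    return lam0 / (1.0 + grad / (grad.mean() + eps))

x = np.linspace(-1.0, 1.0, 101)
u = np.tanh(20 * x)              # smooth field with a sharp front at x = 0
w = adaptive_weights(u)

print(w[50] < w[0])   # True: regularisation is weakest right at the shock
```

这种按局部物理复杂度调节正则强度的做法,行为上类似数值激波捕捉格式中的局部限制器。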
[AI-55] Towards Neural Graph Data Management
【速读】:该论文旨在解决当前神经网络模型在处理结构化数据(如图数据库中的图数据)方面能力不足的问题,尤其是在复杂查询理解、噪声鲁棒性和分析精度方面的短板。其解决方案的关键在于提出NGDBench——一个统一的基准测试平台,支持完整的Cypher查询语言,涵盖金融、医疗和AI代理工具等多个领域,能够评估模型在复杂模式匹配、可变长度路径查询及数值聚合等任务上的表现,并引入真实噪声注入和动态数据管理操作以增强评估的真实性与挑战性。
链接: https://arxiv.org/abs/2603.05529
作者: Yufei Li,Yisen Gao,Jiaxin Bai,Jiaxuan Xiong,Haoyu Huang,Zhongwei Xie,Hong Ting Tsang,Yangqiu Song
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:While AI systems have made remarkable progress in processing unstructured text, structured data such as graphs stored in databases, continues to grow rapidly yet remains difficult for neural models to effectively utilize. We introduce NGDBench, a unified benchmark for evaluating neural graph database capabilities across five diverse domains, including finance, medicine, and AI agent tooling. Unlike prior benchmarks limited to elementary logical operations, NGDBench supports the full Cypher query language, enabling complex pattern matching, variable-length paths, and numerical aggregations, while incorporating realistic noise injection and dynamic data management operations. Our evaluation of state-of-the-art LLMs and RAG methods reveals significant limitations in structured reasoning, noise robustness, and analytical precision, establishing NGDBench as a critical testbed for advancing neural graph data management. Our code and data are available at this https URL.
[AI-56] Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized Verifiable Policies for Safe Robust and Efficient Agents
【速读】:该论文旨在解决自主大语言模型(Large Language Model, LLM)代理在长时程任务中因策略隐含于模型权重和对话记录中、安全性需事后补救而导致的不可靠性问题。其解决方案的关键在于提出“遍历即策略”(Traversal-as-Policy)框架:将沙箱环境下的OpenHands执行日志蒸馏为单一可执行的门控行为树(Gated Behavior Tree, GBT),并将树的遍历而非无约束生成作为控制策略;每个节点编码从成功轨迹中挖掘并验证的状态条件动作宏(macro),不安全轨迹关联的宏则附加确定性预执行门控机制,基于经验驱动的单调性更新确保已被拒绝的危险上下文无法重新引入;运行时轻量遍历器匹配基础模型意图至子宏,逐个执行并在停滞时采用风险感知最短路径恢复至可行成功叶节点,所经路径构成紧凑脊柱记忆(spine memory)替代传统转录回放,从而显著提升成功率、趋近零违规并降低计算成本。
链接: https://arxiv.org/abs/2603.05517
作者: Peiran Li,Jiashuo Sun,Fangzhou Lin,Shuo Xing,Tianfu Fu,Suofei Feng,Chaoqun Ni,Zhengzhong Tu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
备注: 30 pages, 1 figure, 23 tables
Abstract:Autonomous LLM agents fail because long-horizon policy remains implicit in model weights and transcripts, while safety is retrofitted post hoc. We propose Traversal-as-Policy: distill sandboxed OpenHands execution logs into a single executable Gated Behavior Tree (GBT) and treat tree traversal – rather than unconstrained generation – as the control policy whenever a task is in coverage. Each node encodes a state-conditioned action macro mined and merge-checked from successful trajectories; macros implicated by unsafe traces attach deterministic pre-execution gates over structured tool context and bounded history, updated under experience-grounded monotonicity so previously rejected unsafe contexts cannot be re-admitted. At runtime, a lightweight traverser matches the base model’s intent to child macros, executes one macro at a time under global and node-local gating, and when stalled performs risk-aware shortest-path recovery to a feasible success leaf; the visited path forms a compact spine memory that replaces transcript replay. Evaluated in a unified OpenHands sandbox on 15+ software, web, reasoning, and safety/security benchmarks, GBT improves success while driving violations toward zero and reducing cost. On SWE-bench Verified (Protocol A, 500 issues), GBT-SE raises success from 34.6% to 73.6%, reduces violations from 2.8% to 0.2%, and cuts token/character usage from 208k/820k to 126k/490k; with the same distilled tree, 8B executors more than double success on SWE-bench Verified (14.0% → 58.8%) and WebArena (9.1% → 37.3%).
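门控行为树"遍历即策略"的核心机制——任何动作宏执行前必须通过确定性门控——可用如下极简 Python 草图示意(节点结构与门控谓词均为假设性简化,并非论文实现):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GBTNode:
    """One node of a (hypothetical) gated behavior tree: an action macro
    guarded by a deterministic pre-execution gate over the tool context."""
    name: str
    gate: Callable[[dict], bool]     # must pass before the macro may run
    action: Callable[[dict], str]
    children: List["GBTNode"] = field(default_factory=list)

def traverse(node: GBTNode, ctx: dict, trace: list) -> None:
    if not node.gate(ctx):
        return                       # gate rejects: macro never executes
    trace.append(node.action(ctx))
    for child in node.children:
        traverse(child, ctx, trace)

# Toy tree: a destructive "delete" macro is gated on an explicit confirmation flag
root = GBTNode("plan", lambda c: True, lambda c: "planned")
root.children.append(
    GBTNode("delete", lambda c: c.get("confirmed", False), lambda c: "deleted"))

trace: list = []
traverse(root, {"confirmed": False}, trace)
print(trace)  # ['planned']  -- the unsafe macro is blocked deterministically
```

关键点在于:安全性由遍历时的确定性门控保证,而不依赖模型生成时"自觉"避开危险动作。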
[AI-57] Can LLM Aid in Solving Constraints with Inductive Definitions?
【速读】:该论文旨在解决涉及归纳(inductive)定义的约束求解难题,这类问题在当前最先进的SMT/约束Horn逻辑(CHC)求解器和一阶逻辑定理证明器中仍缺乏有效支持,尤其是在处理抽象数据类型时。其解决方案的关键在于利用结构化提示(structured prompts)引导大型语言模型(LLM)生成辅助引理(auxiliary lemmas),以辅助推理归纳定义;进一步提出一种神经符号(neuro-symbolic)方法,通过迭代式交互实现LLM与约束求解器的协同:LLM生成假设,求解器验证其有效性与对目标证明的助益性,从而显著提升求解能力。实验表明,该方法可使现有SMT和CHC求解器在涉及归纳定义的证明任务上成功率提升约25%。
链接: https://arxiv.org/abs/2603.03668
作者: Weizhi Feng,Shidong Shen,Jiaxiang Liu,Taolue Chen,Fu Song,Zhilin Wu
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 22 pages, 4 figures, accepted by the 27th Symposium on Formal Methods (FM 2026)
Abstract:Solving constraints involving inductive (aka recursive) definitions is challenging. State-of-the-art SMT/CHC solvers and first-order logic provers provide only limited support for solving such constraints, especially when they involve, e.g., abstract data types. In this work, we leverage structured prompts to elicit Large Language Models (LLMs) to generate auxiliary lemmas that are necessary for reasoning about these inductive definitions. We further propose a neuro-symbolic approach, which synergistically integrates LLMs with constraint solvers: the LLM iteratively generates conjectures, while the solver checks their validity and usefulness for proving the goal. We evaluate our approach on a diverse benchmark suite comprising constraints originating from algebraic data types and recurrence relations. The experimental results show that our approach can improve the state-of-the-art SMT and CHC solvers, solving considerably more (around 25%) proof tasks involving inductive definitions, demonstrating its efficacy.
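该神经符号方法的"LLM 生成猜想、求解器验证"迭代循环可示意如下;其中 propose 与 solver_checks 均为假设性桩函数,仅代替真实的 LLM 与 SMT/CHC 求解器以展示控制流:

```python
def propose(history):
    """Stand-in for the LLM: proposes the next unseen candidate lemma."""
    candidates = ["n >= 0", "sum(0..n) == n*(n+1)//2", "n*n >= n"]
    for c in candidates:
        if c not in history:
            return c
    return None

def solver_checks(lemma):
    """Stand-in for the solver: accepts a lemma only if it is valid AND
    useful for the goal (here: the Gauss sum identity, spot-checked on
    a finite domain instead of a real validity proof)."""
    valid = all(sum(range(n + 1)) == n * (n + 1) // 2 for n in range(50))
    return valid and lemma == "sum(0..n) == n*(n+1)//2"

history, proof_lemma = [], None
while proof_lemma is None:
    lemma = propose(history)
    if lemma is None:
        break
    history.append(lemma)        # rejected conjectures feed back to the proposer
    if solver_checks(lemma):
        proof_lemma = lemma

print(proof_lemma)  # sum(0..n) == n*(n+1)//2
```

真实系统中,被拒绝的猜想连同求解器反馈会写回提示词,引导下一轮生成,这正是摘要所述的迭代协同。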
[AI-58] LiveSense: A Real-Time Wi-Fi Sensing Platform for Range-Doppler on COTS Laptop
【速读】:该论文旨在解决如何利用商用现成(COTS)Wi-Fi网卡在保持通信能力的同时,实现厘米级距离-多普勒感知的问题。其核心挑战在于从受限带宽(如160 MHz)的商用Wi-Fi信号中提取高精度的通道状态信息(CSI),并实时处理以获得目标的距离、径向速度及微动特征(如呼吸和手势)。解决方案的关键在于LiveSense系统的设计:它能够在Intel AX211(Wi-Fi 6E)或BE201(Wi-Fi 7)等COTS网卡上,以40 Hz频率同步提取CSI,并在设备端完成时间相位对齐与自干扰消除,从而实现实时输出范围、多普勒、子载波幅度/相位以及标注视频帧的Python/Qt图形界面交互,首次实现了基于商用Wi-Fi的精确目标测距能力。
链接: https://arxiv.org/abs/2603.06545
作者: Jessica Sanson,Rahul C. Shah,Maximilian Pinaroc,Cagri Tanriover,Valerio Frascolla
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:We present LiveSense - a cross-platform system that transforms a commercial off-the-shelf (COTS) Wi-Fi Network Interface Card (NIC) on a laptop into a centimeter-level Range-Doppler sensor while preserving simultaneous communication capability. The laptops are equipped with COTS Intel AX211 (Wi-Fi 6E) or Intel BE201 (Wi-Fi 7) NICs. LiveSense can (i) Extract fully-synchronized channel state information (CSI) at ≥ 40 Hz, (ii) Perform time-phase alignment and self-interference cancellation on-device, and (iii) Provide a real-time stream of range, Doppler, subcarrier magnitude/phase and annotated video frames to a Python/Qt Graphical User Interface (GUI). The demo will showcase the ability to detect (i) Distance and radial velocity of attendees within a few meters of the device, (ii) Micro-motion (respiration), and (iii) Hand-gesture ranging. To the best of our knowledge, this is the first-ever demo to obtain accurate range information of targets from commercial Wi-Fi, despite the limited 160 MHz bandwidth.
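从 CSI 得到距离-多普勒图的标准做法是:沿子载波维做 IFFT 得距离,沿时间帧维做 FFT 得多普勒。以下 Python 草图用合成 CSI 演示这一两步变换;目标所在的距离/多普勒 bin 均为任意设定的玩具参数,与 LiveSense 的实际信号处理链路无关:

```python
import numpy as np

# Synthetic CSI: n_frames x n_subcarriers, one point target at a fixed
# delay (range) bin moving with a constant Doppler shift.
n_frames, n_sub = 64, 128
range_bin, doppler_bin = 10, 5
t = np.arange(n_frames)[:, None]   # frame index (slow time)
f = np.arange(n_sub)[None, :]      # subcarrier index (frequency)
csi = np.exp(2j * np.pi * (doppler_bin * t / n_frames - range_bin * f / n_sub))

# Range: IFFT across subcarriers; Doppler: FFT across frames
rd_map = np.fft.fft(np.fft.ifft(csi, axis=1), axis=0)
d_hat, r_hat = np.unravel_index(np.argmax(np.abs(rd_map)), rd_map.shape)
print(d_hat, r_hat)  # 5 10
```

距离-多普勒图的峰值位置即恢复出设定的目标参数;真实系统还需先做摘要中提到的时相对齐与自干扰消除。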
[AI-59] AI End-to-End Radiation Treatment Planning Under One Second
【速读】:该论文旨在解决放射治疗(Radiation Therapy, RT)计划制定过程中耗时长、不同规划师间一致性差的问题,以提升临床流程的效率与标准化程度。其解决方案的关键在于提出了一种端到端的深度学习框架AIRT(Artificial Intelligence-based Radiotherapy),该框架可直接从CT图像和结构轮廓生成可执行的治疗计划,无需多次剂量评估与修正;其核心创新包括可微分的剂量反馈机制、对抗性调强射野形状优化以及计划生成增强策略,从而在单张Nvidia A100 GPU上实现<1秒的单弧VMAT前列腺计划生成,且在靶区覆盖均匀性和器官保护(OAR sparing)方面达到与RapidPlan Eclipse参考计划非劣效水平。
链接: https://arxiv.org/abs/2603.06338
作者: Simon Arberet,Riqiang Gao,Martin Kraus,Florin C. Ghesu,Wilko Verbakel,Mamadou Diallo,Anthony Magliari,Venkatesan Karuppusamy,Sushil Beriwal,REQUITE Consortium,Ali Kamen,Dorin Comaniciu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Medical Physics (physics.med-ph)
备注:
Abstract:Artificial intelligence-based radiation therapy (RT) planning has the potential to reduce planning time and inter-planner variability, improving efficiency and consistency in clinical workflows. Most existing automated approaches rely on multiple dose evaluations and corrections, resulting in plan generation times of several minutes. We introduce AIRT (Artificial Intelligence-based Radiotherapy), an end-to-end deep-learning framework that directly infers deliverable treatment plans from CT images and structure contours. AIRT generates single-arc VMAT prostate plans, from imaging and anatomical inputs to leaf sequencing, in under one second on a single Nvidia A100 GPU. The framework includes a differentiable dose feedback, an adversarial fluence map shaping, and a plan generation augmentation to improve plan quality and robustness. The model was trained on more than 10,000 intact prostate cases. Non-inferiority to RapidPlan Eclipse was demonstrated across target coverage and OAR sparing metrics. Target homogeneity (HI = 0.10 \pm 0.01) and OAR sparing were similar to reference plans when evaluated using AcurosXB. These results represent a significant step toward ultra-fast standardized RT planning and a streamlined clinical workflow.
[AI-60] StreamVoiceAnon: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation
【速读】:该论文旨在解决流式说话人匿名化(Streaming Speaker Anonymization, SA)过程中情感信息丢失的问题。现有基于神经音频编解码语言模型的音频续写方法在处理语音时,会因内容标记丢弃情感特征,且模型倾向于采用主导的声学模式而非保留语调等副语言属性,导致情绪保真度下降。其解决方案的关键在于:一是通过监督微调(supervised fine-tuning)引入来自同一说话人的中性情绪语音对,使模型学习保留情感特征;二是采用帧级情感蒸馏(frame-level emotion distillation)在声学标记隐藏状态上进行情感信息引导,从而在不增加推理延迟的前提下有效提升情绪保真度。该方法仅需4张GPU训练不足2小时,保持180ms流式延迟,并在VoicePrivacy 2024协议下实现49.2% UAR(情绪保真度),较基线提升24%,同时维持强隐私保护能力(EER 49.0%)。
链接: https://arxiv.org/abs/2603.06079
作者: Nikita Kuzmin,Kong Aik Lee,Eng Siong Chng
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7%-49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available: this https URL
[AI-61] Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec INTERSPEECH2026
【速读】:该论文旨在解决神经音频编解码器(Neural Audio Codec)在梅尔频谱重构过程中难以保持语音可懂度的问题。传统方法虽能优化梅尔频谱重建,但常导致语音内容信息丢失,影响听觉质量。其解决方案的关键在于引入自监督表示重建(Self-Supervised Representation Reconstruction, SSRR)损失函数:首先,SSRR显著加速模型收敛,使单GPU即可实现媲美多卡训练的性能;其次,通过重建编码输出中的自监督语义表示,有效提升语音可懂度;最后,无需额外的前瞻(lookahead)机制即可在流式Transformer架构中实现高可懂度,从而支持零前瞻部署,适用于实时场景。该方法最终使JHCodec达到当前最优性能,同时具备低延迟与低成本训练优势。
链接: https://arxiv.org/abs/2603.05887
作者: Junhyeok Lee,Xiluo He,Jihwan Lee,Helin Wang,Shrikanth Narayanan,Thomas Thebaud,Laureano Moro-Velazquez,Jesús Villalba,Najim Dehak
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026
Abstract:Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced training cost. We open-source the full implementation, training pipeline, and demo on Github this https URL.
[AI-62] The Rise of AI in Weather and Climate Information and its Impact on Global Inequality
【速读】:该论文旨在解决人工智能(AI)在地球系统科学中快速发展所加剧的全球南北不平等问题,特别是高计算能力和数据基础设施的分布不均导致的气候信息生成与应用中的结构性偏见。其核心问题在于:当前基础模型的开发高度集中于全球北方,这种不对称性贯穿于模型输入、处理和输出三个阶段,造成对脆弱地区气候预测精度不足、影响评估失真以及知识表达偏颇等后果。解决方案的关键在于重构AI气候建模的范式,具体包括:(i) 从以模型为中心转向以数据为中心的开发理念,确保数据代表性与公平性;(ii) 建立气候数字公共基础设施并引入以人为中心的评估指标,提升模型适用性和社会可接受度;(iii) 推动从单向生产者-消费者关系向知识协同共创模式转变,整合多元知识体系,从而实现计算主权的真正民主化,并保障AI革命促进系统韧性而非加深不平等。
链接: https://arxiv.org/abs/2603.05710
作者: Amirpasha Mozaffari,Amanda Duarte,Lina Teckentrup,Stefano Materia,Gina E. C. Charnley,Lluis Palma,Eulalia Baulenas Serra,Dragana Bojovic,Paula Checchia,Aude Carreric,Francisco Doblas-Reyes
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The rapid adoption of AI in Earth system science promises unprecedented speed and fidelity in the generation of climate information. However, this technological prowess rests on a fragile and unequal foundation: the current trajectory of AI development risks further automating and amplifying the North-South divide in the global climate information system. We outline the global asymmetry in High-Performance Computing and data infrastructure, demonstrating that the development of foundation models is almost exclusively concentrated in the Global North. Using three different domains, we show how this infrastructure inequality continues through models’ inputs, processes and outputs. As an example, in weather and climate modelling, the reliance on historically biased data leads to systematic performance gaps that disproportionately affect the most vulnerable regions. In climate impact modelling, data sparsity and unrepresentative validation risk driving misleading interventions and maladaptation. Finally, in large language models, dependence on dominant textualised forms of climate knowledge risks reinforcing existing biases. We conclude that addressing these disparities demands revisiting the three phases, i.e. models Input, Process and Output. This involves (i) a perspective shift from model-centric to data-centric development, (ii) the establishment of a Climate Digital Public Infrastructure and human-centric evaluation metrics, and (iii) a move from producer-consumer dynamics toward knowledge co-production. This integration of diverse knowledge systems would truly democratise compute sovereignty and ensure that the AI revolution fosters genuine systemic resilience rather than exacerbating inequity.
[AI-63] On the Reliability of AI Methods in Drug Discovery: Evaluation of Boltz-2 for Structure and Binding Affinity Prediction
【速读】:该论文旨在解决当前生成式 AI 在药物发现领域中“尚未有获批药物”的现实问题,聚焦于评估最新一代基于 AI 的分子基础模型 Boltz-2 在预测蛋白质-配体结构与结合亲和力方面的有效性与可靠性。其关键解决方案在于采用一种联合“共折叠”(co-folding)策略,试图在 AI 高效性与物理驱动的精确性之间建立桥梁。研究通过大规模数据集(3CLPro 和 TNKS2 分别包含 16,780 和 21,702 个化合物)对比 Boltz-2 的预测结果与基于物理机制的 ESMACS 方法计算的结合自由能,揭示了 Boltz-2 虽具备快速初筛能力,但其能量分辨率不足,难以支撑先导化合物识别,强调了在 AI 模型基础上引入物理机制方法以提升可靠性和精度的必要性。
链接: https://arxiv.org/abs/2603.05532
作者: Shunzhou Wan,Xibei Zhang,Xiao Xue,Peter V. Coveney
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite continuing hype about the role of AI in drug discovery, no “AI-discovered drugs” have so far received regulatory approval. Here we assess one of the latest AI based tools in this domain. The ability to rapidly predict protein-ligand structures and binding affinities is pivotal for accelerating drug discovery. Boltz-2, a recently developed biomolecular foundation model, aims to bridge the gap between AI efficiency and physics-based precision through a joint “co-folding” approach. In this study, we provide an extensive evaluation of Boltz-2 using two large-scale datasets: 16,780 compounds for 3CLPro and 21,702 compounds for TNKS2. We compare Boltz-2 predicted structures with traditional docking and binding affinities with binding free energies derived from the physics-based ESMACS protocol. Structural analysis reveals significant global RMSD variations, indicating that Boltz-2 predicts multiple protein conformations and ligand binding positions rather than a single converged pose. Energetic evaluations exhibit only weak to moderate correlations across the global datasets. Furthermore, a focused analysis of the top 100 compounds yields no significant correlation between the Boltz-2 predictions and the binding free energies from fine-grained ESMACS, alongside observed saturation difference in ligand structures. Our results show that while Boltz-2 offers substantial speed for initial screening, it lacks the energetic resolution required for lead identification. These findings highlight the necessity of employing physics-based methods for the reliability and refinement of AI-derived models.
[AI-64] Molecular Representations for AI in Chemistry and Materials Science: An NLP Perspective
【速读】:该论文试图解决的问题是:如何构建既适合机器处理又便于跨学科科学家理解的分子表示方法,以促进人工智能(AI)在化学和材料科学中的应用。解决方案的关键在于借鉴自然语言处理(Natural Language Processing, NLP)领域的技术,采用受NLP启发的数字分子表示方法,如将分子结构映射为序列或图结构,从而实现对复杂分子信息的有效编码与计算处理,同时保持其可解释性与跨领域适用性。
链接: https://arxiv.org/abs/2603.05525
作者: Sanjanasri JP,Pratiti Bhadra,N. Sukumar,Soman KP
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Deep learning, a subfield of machine learning, has gained importance in various application areas in recent years. Its growing popularity has led it to enter the natural sciences as well. This has created the need for molecular representations that are both machine-readable and understandable to scientists from different fields. Over the years, many chemical molecular representations have been constructed, and new ones continue to be developed as computer technology advances and knowledge of molecular complexity increases. This paper presents some of the most popular digital molecular representations inspired by natural language processing (NLP) and used in chemical informatics. In addition, the paper discusses some notable AI-based applications that use these representations. This paper aims to provide a guide to structural representations that are important for the application of AI in chemistry and materials science from the perspective of an NLP researcher. This review is a reference tool for researchers with little experience working with chemical representations who wish to work on projects at the interface of these fields.
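文中讨论的受 NLP 启发的分子表示里,最常见的做法是把 SMILES 字符串按化学符号切分成 token 序列,再交给序列模型处理。下面是一个最小的正则切分示例;切分规则参考常见做法,属示意性简化,未覆盖全部 SMILES 语法:

```python
import re

# Minimal regex tokenizer for SMILES strings: bracket atoms, two-letter
# elements, aromatic/organic-subset atoms, bonds, branches, ring closures.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops]|[=#$/\\().+-]|[0-9]|%[0-9]{2})")

def tokenize(smiles: str):
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: every character must be consumed by some token
    assert "".join(tokens) == smiles, "untokenized characters remain"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, 21 tokens
```

注意两字母元素(Br、Cl、Si)必须排在单字母原子之前匹配,否则 "Br" 会被错切成 "B" 加未知字符。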
Machine Learning
[LG-0] A recipe for scalable attention-based MLIPs: unlocking long-range accuracy with all-to-all node attention
Link: https://arxiv.org/abs/2603.06567
Authors: Eric Qu, Brandon M. Wood, Aditi S. Krishnapriyan, Zachary W. Ulissi
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Machine-learning interatomic potentials (MLIPs) have advanced rapidly, with many top models relying on strong physics-based inductive biases. However, as models scale to larger systems like biomolecules and electrolytes, they struggle to accurately capture long-range (LR) interactions, leading current approaches to rely on explicit physics-based terms or components. In this work, we propose AllScAIP, a straightforward, attention-based, and energy-conserving MLIP model that scales to O(100 million) training samples. It addresses the long-range challenge using an all-to-all node attention component that is data-driven. Extensive ablations reveal that in low-data/small-model regimes, inductive biases improve sample efficiency. However, as data and model size scale, these benefits diminish or even reverse, while all-to-all attention remains critical for capturing LR interactions. Our model achieves state-of-the-art energy/force accuracy on molecular systems, as well as a number of physics-based evaluations (OMol25), while being competitive on materials (OMat24) and catalysts (OC20). Furthermore, it enables stable, long-timescale MD simulations that accurately recover experimental observables, including density and heat of vaporization predictions.
[LG-1] Causal Interpretation of Neural Network Computations with Contribution Decomposition ICLR2026
Link: https://arxiv.org/abs/2603.06557
Authors: Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 32 pages, 19 figures. ICLR 2026 poster
Abstract:Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
[LG-2] Hierarchical Industrial Demand Forecasting with Temporal and Uncertainty Explanations
Link: https://arxiv.org/abs/2603.06555
Authors: Harshavardhan Kamarthi, Shangqing Xu, Xinjie Tong, Xingyu Zhou, James Peters, Joseph Czyzyk, B. Aditya Prakash
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Hierarchical time-series forecasting is essential for demand prediction across various industries. While machine learning models have obtained significant accuracy and scalability on such forecasting tasks, the interpretability of their predictions, informed by application, is still largely unexplored. To bridge this gap, we introduce a novel interpretability method for large hierarchical probabilistic time-series forecasting, adapting generic interpretability techniques while addressing challenges associated with hierarchical structures and uncertainty. Our approach offers valuable interpretative insights in response to real-world industrial supply chain scenarios, including 1) the significance of various time-series within the hierarchy and external variables at specific time points, 2) the impact of different variables on forecast uncertainty, and 3) explanations for forecast changes in response to modifications in the training dataset. To evaluate the explainability method, we generate semi-synthetic datasets based on real-world scenarios of explaining hierarchical demands for over ten thousand products at a large chemical company. The experiments showed that our explainability method successfully explained state-of-the-art industrial forecasting methods with significantly higher explainability accuracy. Furthermore, we provide multiple real-world case studies that show the efficacy of our approach in identifying important patterns and explanations that help stakeholders better understand the forecasts. Additionally, our method facilitates the identification of key drivers behind forecasted demand, enabling more informed decision-making and strategic planning. Our approach helps build trust and confidence among users, ultimately leading to better adoption and utilization of hierarchical forecasting models in practice.
[LG-3] When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models ICLR2026
Link: https://arxiv.org/abs/2603.06508
Authors: Qitong Wang, Haoran Dai, Haotian Zhang, Christopher Rasmussen, Binghui Wang
Categories: Machine Learning (cs.LG)
Comments: Accepted to the ICLR 2026 Workshop on Principled Design for Trustworthy AI. The first two authors contributed equally
Abstract:While diffusion models have revolutionized visual content generation, their rapid adoption has underscored the critical need to investigate vulnerabilities, e.g., to backdoor attacks. In multimodal diffusion models, it is natural to expect that attacking multiple modalities simultaneously (e.g., text and image) would yield complementary effects and strengthen the overall backdoor. In this paper, we challenge this assumption by investigating the phenomenon of Backdoor Modality Collapse, a scenario where the backdoor mechanism degenerates to rely predominantly on a subset of modalities, rendering others redundant. To rigorously quantify this behavior, we introduce two novel metrics: Trigger Modality Attribution (TMA) and Cross-Trigger Interaction (CTI). Through extensive experiments across diverse training configurations in multimodal conditional diffusion, we consistently observe a "winner-takes-all" dynamic in backdoor behavior. Our results reveal that (1) attacks often collapse into subset-modality dominance, and (2) cross-modal interaction is negligible or even negative, contradicting the intuition of synergistic vulnerability. These findings highlight a critical blind spot in current assessments, suggesting that high attack success rates often mask a fundamental reliance on a subset of modalities. This establishes a principled foundation for mechanistic analysis and future defense development.
[LG-4] Toward Generative Quantum Utility via Correlation-Complexity Map
Link: https://arxiv.org/abs/2603.06440
Authors: Chen-Yu Liu, Leonardo Placidi, Eric Brunner, Enrico Rinaldi
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 33 pages, 8 figures
Abstract:We propose a Correlation-Complexity Map as a practical diagnostic tool for determining when real-world data distributions are structurally aligned with IQP-type quantum generative models. The map is characterized by two complementary indicators: (i) a Quantum Correlation-Likeness Indicator (QCLI), computed from the dataset’s correlation-order (Walsh-Hadamard/Fourier) power spectrum aggregated by interaction order and quantified via Jensen-Shannon divergence from an i.i.d. binomial reference; and (ii) a Classical Correlation-Complexity Indicator (CCI), defined as the fraction of total correlation not captured by the optimal Chow-Liu tree approximation, normalized by total correlation. We provide theoretical support by relating QCLI to a support-mismatch mechanism: for fixed-architecture IQP families trained with an MMD objective, higher QCLI implies a smaller irreducible approximation floor. Using the map, we identify the classical turbulence data as both IQP-compatible and classically complex (high QCLI/high CCI). Guided by this placement, we use an invertible float-to-bitstring representation and a latent-parameter adaptation scheme that reuses a compact IQP circuit over a temporal sequence by learning and interpolating a low-dimensional latent trajectory. In comparative evaluations against classical models such as Restricted Boltzmann Machine (RBM) and Deep Convolutional Generative Adversarial Networks (DCGAN), the IQP approach achieves competitive distributional alignment while using substantially fewer training snapshots and a small latent block, supporting the use of QCLI/CCI as practical indicators for locating IQP-aligned domains and advancing generative quantum utility.
[LG-5] Certified and accurate computation of function space norms of deep neural networks
Link: https://arxiv.org/abs/2603.06431
Authors: Johannes Gründler, Moritz Maibaum, Philipp Petersen
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Neural network methods for PDEs require reliable error control in function space norms. However, trained neural networks can typically only be probed at a finite number of point values. Without strong assumptions, point evaluations alone do not provide enough information to derive tight deterministic and guaranteed bounds on function space norms. In this work, we move beyond a purely black-box setting and exploit the neural network structure directly. We present a framework for the certified and accurate computation of integral quantities of neural networks, including Lebesgue and Sobolev norms, by combining interval arithmetic enclosures on axis-aligned boxes with adaptive marking/refinement and quadrature-based aggregation. On each box, we compute guaranteed lower and upper bounds for function values and derivatives, and propagate these local certificates to global lower and upper bounds for the target integrals. Our analysis provides a general convergence theorem for such certified adaptive quadrature procedures and instantiates it for function values, Jacobians, and Hessians, yielding certified computation of L^p , W^{1,p} , and W^{2,p} norms. We further show how these ingredients lead to practical certified bounds for PINN interior residuals. Numerical experiments illustrate the accuracy and practical behavior of the proposed methods.
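The certified-quadrature idea in this abstract (interval enclosures on axis-aligned boxes plus adaptive refinement of the box with the largest enclosure gap) can be illustrated on a toy integrand. The sketch below is our own illustration, not the paper's framework: the integrand f(x) = x^2, its hand-written interval enclosure, and the tolerance are all invented for demonstration.

```python
def interval_f(lo, hi):
    """Guaranteed enclosure of f(x) = x**2 over the box [lo, hi]."""
    vals = (lo * lo, hi * hi)
    vlo, vhi = min(vals), max(vals)
    if lo <= 0.0 <= hi:          # x**2 attains its minimum 0 inside the box
        vlo = 0.0
    return vlo, vhi

def enclosure_gap(box):
    lo, hi = box
    vlo, vhi = interval_f(lo, hi)
    return (hi - lo) * (vhi - vlo)

def certified_integral(a, b, tol=1e-2, max_boxes=10000):
    """Certified lower/upper bounds on the integral of f over [a, b],
    adaptively bisecting the box with the largest enclosure gap."""
    boxes = [(a, b)]
    while len(boxes) < max_boxes:
        lower = sum((hi - lo) * interval_f(lo, hi)[0] for lo, hi in boxes)
        upper = sum((hi - lo) * interval_f(lo, hi)[1] for lo, hi in boxes)
        if upper - lower < tol:
            break
        worst = max(boxes, key=enclosure_gap)
        boxes.remove(worst)
        mid = 0.5 * (worst[0] + worst[1])
        boxes += [(worst[0], mid), (mid, worst[1])]
    return lower, upper

lower, upper = certified_integral(0.0, 1.0)   # true value is 1/3
```

By construction the returned pair brackets the true integral, which is the "certificate" property the paper propagates from local box enclosures to global norm bounds.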
[LG-6] Adapter-Augmented Bandits for Online Multi-Constrained Multi-Modal Inference Scheduling
Link: https://arxiv.org/abs/2603.06403
Authors: Xianzhi Zhang, Yue Xu, Yinlin Zhu, Di Wu, Yipeng Zhou, Miao Hu, Guocong Quan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multi-modal large language model (MLLM) inference scheduling enables strong response quality under practical and heterogeneous budgets, beyond what a homogeneous single-backend setting can offer. Yet online MLLM task scheduling is nontrivial, as requests vary sharply in modality composition and latent reasoning difficulty, while execution backends incur distinct, time-varying costs due to system jitter and network variation. These coupled uncertainties pose two core challenges: deriving semantically faithful yet scheduling-relevant multi-modal task representations, and making low-overhead online decisions over irreversible multi-dimensional budgets. Accordingly, we propose \emph{M-CMAB} (\underline{M}ulti-modal \underline{M}ulti-constraint \underline{C}ontextual \underline{M}ulti-\underline{A}rmed \underline{B}andit), a multi-adapter-enhanced MLLM inference scheduling framework with three components: (i) a CLS-attentive, frozen-backbone \emph{Predictor} that extracts compact task representations and updates only lightweight adapters for action-specific estimation; (ii) a primal-dual \emph{Constrainer} that maintains online Lagrange multipliers to enforce long-horizon constraints via per-round objectives; and (iii) a two-phase \emph{Scheduler} that balances exploration and exploitation under irreversible budgets. We establish a regret guarantee under multi-dimensional knapsack constraints. On a composite multimodal benchmark with heterogeneous backends, \emph{M-CMAB} consistently outperforms state-of-the-art baselines across budget regimes, achieving up to 14.18% higher reward and closely tracking an oracle-aided upper bound. Codes are available at this https URL.
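Of the three components, the primal-dual Constrainer is the easiest to illustrate in isolation. The following is a minimal sketch of ours (not the authors' code) showing how an online Lagrange multiplier turns a long-horizon average-cost budget into a per-round penalized objective; the two hypothetical backends, the budget, and the step size are invented for illustration.

```python
# Minimal primal-dual "Constrainer" sketch (our construction, not M-CMAB):
# a Lagrange multiplier lam penalizes costly arms, enforcing an average budget.

ARMS = [
    {"reward": 1.0, "cost": 1.0},   # strong but expensive backend
    {"reward": 0.4, "cost": 0.1},   # weaker but cheap backend
]
BUDGET_PER_ROUND = 0.4              # long-horizon average-cost budget
ETA = 0.05                          # dual (multiplier) step size

lam = 0.0                           # Lagrange multiplier for the cost constraint
total_cost = total_reward = 0.0
T = 5000
for _ in range(T):
    # primal step: pick the arm with the best penalized value reward - lam * cost
    arm = max(ARMS, key=lambda a: a["reward"] - lam * a["cost"])
    total_reward += arm["reward"]
    total_cost += arm["cost"]
    # dual step: raise lam when over budget, let it decay when under
    lam = max(0.0, lam + ETA * (arm["cost"] - BUDGET_PER_ROUND))

avg_cost = total_cost / T
avg_reward = total_reward / T
```

The multiplier oscillates around the value at which the two penalized arm values tie, and the long-run average cost settles at the budget, which is the behavior the per-round Lagrangian objective is designed to produce.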
[LG-7] Adaptive Lipschitz-Free Conditional Gradient Methods for Stochastic Composite Nonconvex Optimization
Link: https://arxiv.org/abs/2603.06369
Authors: Ganzhao Yuan
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments:
Abstract:We propose ALFCG (Adaptive Lipschitz-Free Conditional Gradient), the first \textit{adaptive} projection-free framework for stochastic composite nonconvex minimization that \textit{requires neither global smoothness constants nor line search}. Unlike prior conditional gradient methods that use open-loop diminishing stepsizes, conservative Lipschitz constants, or costly backtracking, ALFCG maintains a self-normalized accumulator of historical iterate differences to estimate local smoothness and minimize a quadratic surrogate model at each step. This retains the simplicity of Frank-Wolfe while adapting to unknown geometry. We study three variants. ALFCG-FS addresses finite-sum problems with a SPIDER estimator. ALFCG-MVR1 and ALFCG-MVR2 handle stochastic expectation problems by using momentum-based variance reduction with single-batch and two-batch updates, and operate under average and individual smoothness, respectively. To reach an \epsilon -stationary point, ALFCG-FS attains \mathcal{O}(N+\sqrt{N}\epsilon^{-2}) iteration complexity, while ALFCG-MVR1 and ALFCG-MVR2 achieve \tilde{\mathcal{O}}(\sigma^2\epsilon^{-4}+\epsilon^{-2}) and \tilde{\mathcal{O}}(\sigma\epsilon^{-3}+\epsilon^{-2}) , where N is the number of components and \sigma is the noise level. In contrast to typical \mathcal{O}(\epsilon^{-4}) or \mathcal{O}(\epsilon^{-3}) rates, our bounds reduce to the optimal rate up to logarithmic factors \tilde{\mathcal{O}}(\epsilon^{-2}) as the noise level \sigma \to 0 . Extensive experiments on multiclass classification over nuclear norm balls and \ell_p balls show that ALFCG generally outperforms state-of-the-art conditional gradient baselines.
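The core loop, a linear-minimization oracle plus a line-search-free step size derived from a locally estimated smoothness constant, can be sketched on a toy quadratic over an l1 ball. This is our own simplified conditional-gradient illustration (with a secant-style Lipschitz estimate built from iterate and gradient differences, in the spirit of the paper's accumulator), not the ALFCG algorithm itself.

```python
import math

# Frank-Wolfe over the l1 ball with an adaptive, line-search-free smoothness
# estimate (our illustration, not ALFCG): minimize f(x) = 0.5 * ||x - c||^2.

def lmo_l1(grad, radius=1.0):
    """Linear minimization oracle over the l1 ball: a signed vertex."""
    i = max(range(len(grad)), key=lambda j: abs(grad[j]))
    s = [0.0] * len(grad)
    s[i] = -radius if grad[i] > 0 else radius
    return s

c = [0.3, -0.2, 0.1]           # minimizer, strictly inside the unit l1 ball
x = [1.0, 0.0, 0.0]
L_est = 1e-3                   # running local-smoothness estimate
prev_x = prev_g = None
for _ in range(500):
    g = [xi - ci for xi, ci in zip(x, c)]            # gradient of f
    if prev_x is not None:
        dx = math.sqrt(sum((u - v) ** 2 for u, v in zip(x, prev_x)))
        dg = math.sqrt(sum((u - v) ** 2 for u, v in zip(g, prev_g)))
        if dx > 1e-12:
            L_est = max(L_est, dg / dx)              # secant curvature estimate
    s = lmo_l1(g)
    d = [si - xi for si, xi in zip(s, x)]            # Frank-Wolfe direction
    gap = -sum(gi * di for gi, di in zip(g, d))      # duality-gap term, >= 0
    dd = sum(di * di for di in d)
    gamma = min(1.0, gap / (L_est * dd)) if dd > 0 else 0.0
    prev_x, prev_g = x, g
    x = [xi + gamma * di for xi, di in zip(x, d)]

f_val = 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
```

Since every update is a convex combination of the current point and a ball vertex, the iterate remains feasible without any projection, which is the appeal of the conditional-gradient template.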
[LG-8] Tiny Hardware-Independent Compression-based Classification
Link: https://arxiv.org/abs/2603.06359
Authors: Charles Meyers, Aaron MacSween, Erik Elmroth, Tommy Löfstedt
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:The recent developments in machine learning have highlighted a conflict between online platforms and their users in terms of privacy. The importance of user privacy and the struggle for power over user data has been intensified as regulators and operators attempt to police online platforms. As users have become increasingly aware of privacy issues, client-side data storage, management, and analysis have become a favoured alternative to large-scale centralised machine learning. However, state-of-the-art machine learning methods require vast amounts of labelled user data, making them unsuitable for models that reside client-side and only have access to a single user’s data. State-of-the-art methods are also computationally expensive, which degrades the user experience on compute-limited hardware and also reduces battery life. A recent alternative approach has proven remarkably successful in classification tasks across a wide variety of data – using a compression-based distance measure (called normalised compression distance) to measure the distance between generic objects in classical distance-based machine learning methods. In this work, we demonstrate that the normalised compression distance is actually not a metric; develop it for the wider context of kernel methods to allow modelling of complex data; and present techniques to improve the training time of models that use this distance measure. We demonstrate that the normalised compression distance works as well as and sometimes better than other metrics and kernels – while requiring only marginally higher computational cost and in spite of the lack of formal metric properties. The end result is a simple model with remarkable accuracy even when trained on a very small number of samples, allowing for models that are small and effective enough to run entirely on a client device using only user-supplied data.
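The normalised compression distance itself is easy to state: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) for a compressor C. Below is a minimal sketch with zlib as the compressor and a 1-NN classifier; this wrapper is our own illustration (the paper builds kernel methods on top of the distance, not 1-NN), and the toy training strings are invented.

```python
import zlib

def c(data):
    """Approximate description length via zlib compressed size."""
    return len(zlib.compress(data, 9))

def ncd(x, y):
    """Normalised compression distance; note the paper shows it is not a true metric."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify_1nn(query, labelled):
    """Label of the training sample with the smallest NCD to the query."""
    return min(labelled, key=lambda pair: ncd(query, pair[0]))[1]

train = [
    (b"the cat sat on the mat and the cat purred at the cat", "cats"),
    (b"import numpy as np; x = np.zeros(10); print(x.sum())", "code"),
]
label = classify_1nn(b"the cat chased the cat off the mat", train)
```

Because the compressor exploits shared substrings, the distance is small for objects with overlapping structure, with no feature engineering and no model parameters to train, which is what makes the approach attractive client-side.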
[LG-9] Frequency-Separable Hamiltonian Neural Network for Multi-Timescale Dynamics
Link: https://arxiv.org/abs/2603.06354
Authors: Yaojun Li, Yulong Yang, Christine Allen-Blanchette
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:While Hamiltonian mechanics provides a powerful inductive bias for neural networks modeling dynamical systems, Hamiltonian Neural Networks and their variants often fail to capture complex temporal dynamics spanning multiple timescales. This limitation is commonly linked to the spectral bias of deep neural networks, which favors learning low-frequency, slow-varying dynamics. Prior approaches have sought to address this issue through symplectic integration schemes that enforce energy conservation or by incorporating geometric constraints to impose structure on the configuration-space. However, such methods either remain limited in their ability to fully capture multiscale dynamics or require substantial domain-specific assumptions. In this work, we exploit the observation that Hamiltonian functions admit decompositions into explicit fast and slow modes and can be reconstructed from these components. We introduce the Frequency-Separable Hamiltonian Neural Network (FS-HNN), which parameterizes the system Hamiltonian using multiple networks, each governed by Hamiltonian dynamics and trained on data sampled at distinct timescales. We further extend this framework to partial differential equations by learning state- and boundary-conditioned symplectic operators. Empirically, we show that FS-HNN improves long-horizon extrapolation performance on challenging dynamical systems and generalizes across a broad range of ODE and PDE problems.
[LG-10] Polarized Direct Cross-Attention Message Passing in GNNs for Machinery Fault Diagnosis
Link: https://arxiv.org/abs/2603.06303
Authors: Zongyu Shi, Laibin Zhang, Maoyin Chen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The reliability of safety-critical industrial systems hinges on accurate and robust fault diagnosis in rotating machinery. Conventional graph neural networks (GNNs) for machinery fault diagnosis face limitations in modeling complex dynamic interactions due to their reliance on predefined static graph structures and homogeneous aggregation schemes. To overcome these challenges, this paper introduces polarized direct cross-attention (PolaDCA), a novel relational learning framework that enables adaptive message passing through data-driven graph construction. Our approach builds upon a direct cross-attention (DCA) mechanism that dynamically infers attention weights from three semantically distinct node features (such as individual characteristics, neighborhood consensus, and neighborhood diversity) without requiring fixed adjacency matrices. Theoretical analysis establishes PolaDCA’s superior noise robustness over conventional GNNs. Extensive experiments on industrial datasets (i.e., XJTUSuprgear, CWRUBearing and Three-Phase Flow Facility datasets) demonstrate state-of-the-art diagnostic accuracy and enhanced generalization under varying noise conditions, outperforming seven competitive baseline methods. The proposed framework provides an effective solution for safety-critical industrial applications.
[LG-11] Synthetic Monitoring Environments for Reinforcement Learning
Link: https://arxiv.org/abs/2603.06252
Authors: Leonard Pleiss, Carolin Schmidt, Maximilian Schiffer
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Reinforcement Learning (RL) lacks benchmarks that enable precise, white-box diagnostics of agent behavior. Current environments often entangle complexity factors and lack ground-truth optimality metrics, making it difficult to isolate why algorithms fail. We introduce Synthetic Monitoring Environments (SMEs), an infinite suite of continuous control tasks. SMEs provide fully configurable task characteristics and known optimal policies. As such, SMEs allow for the exact calculation of instantaneous regret. Their rigorous geometric state space bounds allow for systematic within-distribution (WD) and out-of-distribution (OOD) evaluation. We demonstrate the framework’s benefit through multidimensional ablations of PPO, TD3, and SAC, revealing how specific environmental properties - such as action or state space size, reward sparsity and complexity of the optimal policy - impact WD and OOD performance. We thereby show that SMEs offer a standardized, transparent testbed for transitioning RL evaluation from empirical benchmarking toward rigorous scientific analysis.
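The defining feature of SMEs, a known optimal policy and hence exact instantaneous regret, is easy to illustrate. The toy one-step tracking environment below is our own sketch, not a task from the paper's suite.

```python
# Toy "white-box" diagnostic in the spirit of SMEs (our sketch): the optimal
# policy is known in closed form, so the exact instantaneous regret of any
# agent action can be computed, rather than only a noisy return estimate.

def step_reward(state, action, target=0.0):
    """One-step tracking reward: negative distance to target after acting."""
    return -abs((state + action) - target)

def optimal_action(state, target=0.0):
    return target - state            # known ground-truth optimal policy

def instantaneous_regret(state, action, target=0.0):
    r_star = step_reward(state, optimal_action(state, target), target)
    return r_star - step_reward(state, action, target)

# a deliberately suboptimal agent that always under-corrects by half
states = [2.0, -1.5, 0.5]
regrets = [instantaneous_regret(s, 0.5 * optimal_action(s)) for s in states]
```

Because the optimal policy is analytic, regret is zero exactly when the agent acts optimally and strictly positive otherwise, giving the ground-truth optimality metric the abstract argues current benchmarks lack.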
[LG-12] Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions
Link: https://arxiv.org/abs/2603.06248
Authors: Aditya Varre, Mark Rofin, Nicolas Flammarion
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 35 pages, 21 figures
Abstract:Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as L(\mathbf{V} \sigma(\mathbf{a})) , where \mathbf{V} and \mathbf{a} are a learnable value matrix and attention vector, respectively. As the matrix times softmax vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into transformer’s training dynamics. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
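The polarization effect is simple to reproduce numerically. The sketch below is our own toy check, not the paper's analysis: plain gradient descent on the value-softmax model with square loss, a frozen identity value matrix V, and a target reachable only at a one-hot softmax, while we track the entropy of softmax(a).

```python
import math

# Tiny numerical check (ours): gradient descent on L(V softmax(a)) drives the
# softmax output toward a low-entropy, near one-hot vector. V is frozen to the
# identity to isolate the dynamics of the attention vector a.

def softmax(a):
    m = max(a)
    e = [math.exp(v - m) for v in a]
    z = sum(e)
    return [v / z for v in e]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [1.0, 0.0, 0.0]        # target V e_0, attained only as softmax(a) -> one-hot
a = [0.0, 0.0, 0.0]
LR = 1.0
entropies = []
for _ in range(3000):
    p = softmax(a)
    entropies.append(entropy(p))
    out = [sum(V[i][j] * p[j] for j in range(3)) for i in range(3)]
    v = [out[i] - y[i] for i in range(3)]                  # dL/d(out), square loss
    dLdp = [sum(V[i][j] * v[i] for i in range(3)) for j in range(3)]
    pv = sum(p[j] * dLdp[j] for j in range(3))
    grad_a = [p[j] * (dLdp[j] - pv) for j in range(3)]     # softmax Jacobian applied
    a = [a[j] - LR * grad_a[j] for j in range(3)]
```

The entropy of the softmax output decays monotonically toward zero in this setup, matching the low-entropy polarization the paper proves for gradient flow.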
[LG-13] FedSCS-XGB – Federated Server-centric surrogate XGBoost for continual health monitoring
Link: https://arxiv.org/abs/2603.06224
Authors: Felix Walger, Mehdi Ejtehadi, Anke Schmeink, Diego Paez-Granados
Categories: Machine Learning (cs.LG)
Comments: Submitted to IEEE EMBC 2026
Abstract:Wearable sensors with local data processing can detect health threats early, enhance documentation, and support personalized therapy. In the context of spinal cord injury (SCI), which involves risks such as pressure injuries and blood pressure instability, continuous monitoring can help mitigate these by enabling early detection and intervention. In this work, we present a novel distributed machine learning (DML) protocol for human activity recognition (HAR) from wearable sensor data based on gradient-boosted decision trees (XGBoost). The proposed architecture is inspired by Party-Adaptive XGBoost (PAX) while explicitly preserving key structural and optimization properties of standard XGBoost, including histogram-based split construction and tree-ensemble dynamics. First, we provide a theoretical analysis showing that, under appropriate data conditions and suitable hyperparameter selection, the proposed distributed protocol can converge to solutions equivalent to centralized XGBoost training. Second, the protocol is empirically evaluated on a representative wearable-sensor HAR dataset, reflecting the heterogeneity and data fragmentation typical of remote monitoring scenarios. Benchmarking against centralized XGBoost and IBM PAX demonstrates that the theoretical convergence properties are reflected in practice. The results indicate that the proposed approach can match centralized performance to within a 1% gap while retaining the structural advantages of XGBoost in distributed wearable-based HAR settings.
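The histogram-based split construction that the protocol preserves can be sketched directly: each client bins its local gradient/hessian statistics over shared bin edges, and the server sums the histograms and scores candidate splits exactly as centralized XGBoost would. All names, toy data, and the regularization constant below are our own illustrative choices, not the paper's protocol.

```python
# Sketch (ours, not FedSCS-XGB) of federated histogram-based split search.

BIN_EDGES = [0.25, 0.5, 0.75]             # shared feature thresholds -> 4 bins
LAMBDA = 1.0                               # L2 regularization on leaf weights

def local_histograms(xs, grads, hess):
    """Per-client histograms of gradient and hessian sums per bin."""
    G = [0.0] * (len(BIN_EDGES) + 1)
    H = [0.0] * (len(BIN_EDGES) + 1)
    for x, g, h in zip(xs, grads, hess):
        b = sum(x > e for e in BIN_EDGES)  # index of the bin containing x
        G[b] += g
        H[b] += h
    return G, H

def best_split(G, H):
    """Server-side: scan bin boundaries, return (gain, split_bin_index)."""
    total_g, total_h = sum(G), sum(H)
    parent = total_g ** 2 / (total_h + LAMBDA)
    best = (0.0, None)
    gl = hl = 0.0
    for i in range(len(G) - 1):
        gl, hl = gl + G[i], hl + H[i]
        gr, hr = total_g - gl, total_h - hl
        gain = gl ** 2 / (hl + LAMBDA) + gr ** 2 / (hr + LAMBDA) - parent
        best = max(best, (gain, i), key=lambda t: t[0])
    return best

# two clients with disjoint data shards (hessian = 1 per sample, as for squared loss)
g1, h1 = local_histograms([0.1, 0.2, 0.3], [-1.0, -1.0, -1.0], [1.0] * 3)
g2, h2 = local_histograms([0.6, 0.8, 0.9], [1.0, 1.0, 1.0], [1.0] * 3)
G = [a + b for a, b in zip(g1, g2)]        # server aggregates client histograms
H = [a + b for a, b in zip(h1, h2)]
gain, split_bin = best_split(G, H)
```

Because histogram sums are additive across clients, the aggregated statistics are identical to those computed on pooled data, which is the mechanism behind the protocol's claimed equivalence to centralized training.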
[LG-14] Topological descriptors of foot clearance gait dynamics improve differential diagnosis of Parkinsonism
Link: https://arxiv.org/abs/2603.06212
Authors: Jhonathan Barrios, Wolfram Erlhagen, Miguel F. Gago, Estela Bicho, Flora Ferreira
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 17 pages, 12 figures, Under review
Abstract:Differential diagnosis among parkinsonian syndromes remains a clinical challenge due to overlapping motor symptoms and subtle gait abnormalities. Accurate differentiation is crucial for treatment planning and prognosis. While gait analysis is a well established approach for assessing motor impairments, conventional methods often overlook hidden nonlinear and structural features embedded in foot clearance patterns. We evaluated Topological Data Analysis (TDA) as a complementary tool for Parkinsonism classification using foot clearance time series. Persistent homology produced Betti curves, persistence landscapes, and silhouettes, which were used as features for a Random Forest classifier. The dataset comprised 15 controls (CO), 15 idiopathic Parkinson’s disease (IPD), and 14 vascular Parkinsonism (VaP). Models were assessed with leave-one-out cross-validation (LOOCV). Betti-curve descriptors consistently yielded the strongest results. For IPD vs VaP, foot clearance variables minimum toe clearance, maximum toe late swing, and maximum heel clearance achieved 83% accuracy and AUC=0.89 under LOOCV in the medicated (On) state. Performance improved in the On state and further when both Off and On states were considered, indicating sensitivity of the topological features to levodopa related gait changes. These findings support integrating TDA with machine learning to improve clinical gait analysis and aid differential diagnosis across parkinsonian disorders.
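As a flavor of the features involved, a Betti-0 curve for a 1D signal under the sublevel-set filtration simply counts connected components of {i : x[i] <= t} as the threshold t sweeps upward. The sketch below is our own illustration, not the paper's pipeline (which computes persistent homology descriptors and feeds them to a Random Forest).

```python
def betti0_curve(signal, thresholds):
    """Betti-0 (component count) of sublevel sets {i : signal[i] <= t}."""
    curve = []
    for t in thresholds:
        components, inside = 0, False
        for v in signal:
            below = v <= t
            if below and not inside:
                components += 1    # a new connected component is born
            inside = below
        curve.append(components)
    return curve

signal = [3, 1, 3, 3, 0, 3]        # two separated dips -> two sublevel components
thresholds = [0.5, 1.5, 2.5, 3.5]
curve = betti0_curve(signal, thresholds)
```

The curve rises as new local minima (e.g., toe-clearance dips) enter the sublevel set and falls as components merge, so its shape summarizes the multiscale structure of the clearance signal in a way point statistics miss.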
[LG-15] DQE: A Semantic-Aware Evaluation Metric for Time Series Anomaly Detection
Link: https://arxiv.org/abs/2603.06131
Authors: Yuewei Li, Dalin Zhang, Huan Li, Xinyi Gong, Hongjun Chu, Zhaohui Song
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time series anomaly detection has achieved remarkable progress in recent years. However, evaluation practices have received comparatively less attention, despite their critical importance. Existing metrics exhibit several limitations: (1) bias toward point-level coverage, (2) insensitivity or inconsistency in near-miss detections, (3) inadequate penalization of false alarms, and (4) inconsistency caused by threshold or threshold-interval selection. These limitations can produce unreliable or counterintuitive results, hindering objective progress. In this work, we revisit the evaluation of time series anomaly detection from the perspective of detection semantics and propose a novel metric for more comprehensive assessment. We first introduce a partitioning strategy grounded in detection semantics, which decomposes the local temporal region of each anomaly into three functionally distinct subregions. Using this partitioning, we evaluate overall detection behavior across events and design finer-grained scoring mechanisms for each subregion, enabling more reliable and interpretable assessment. Through a systematic study of existing metrics, we identify an evaluation bias associated with threshold-interval selection and adopt an approach that aggregates detection qualities across the full threshold spectrum, thereby eliminating evaluation inconsistency. Extensive experiments on synthetic and real-world data demonstrate that our metric provides stable, discriminative, and interpretable evaluation, while achieving robust assessment compared with ten widely used metrics.
[LG-16] Dynamic Momentum Recalibration in Online Gradient Learning CVPR2026
Link: https://arxiv.org/abs/2603.06120
Authors: Zhipeng Yao, Rui Yu, Guisong Chang, Ying Li, Yu Zhang, Dazhou Li
Categories: Machine Learning (cs.LG)
Comments: Accepted by CVPR 2026
Abstract:Stochastic Gradient Descent (SGD) and its momentum variants form the backbone of deep learning optimization, yet the underlying dynamics of their gradient behavior remain insufficiently understood. In this work, we reinterpret gradient updates through the lens of signal processing and reveal that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates. To address this, we propose SGDF (SGD with Filter), an optimizer inspired by the principles of Optimal Linear Filtering. SGDF computes an online, time-varying gain to dynamically refine gradient estimation by minimizing the mean-squared error, thereby achieving an optimal trade-off between noise suppression and signal preservation. Furthermore, our approach could extend to other optimizers, showcasing its broad applicability to optimization frameworks. Extensive experiments across diverse architectures and benchmarks demonstrate SGDF surpasses conventional momentum methods and achieves performance on par with or surpassing state-of-the-art optimizers.
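The optimal-linear-filtering intuition can be shown on a scalar toy problem: fuse the running gradient estimate with each new noisy gradient via an MSE-minimizing, time-varying (Kalman) gain rather than a fixed momentum coefficient. This is our own illustration with assumed noise variances, not the SGDF update rule.

```python
import random

# Toy scalar Kalman-style gradient filter (ours, not SGDF): the gain adapts
# over time, trading off noise suppression against responsiveness, instead of
# using a fixed momentum coefficient. Q and R are assumed known here.
random.seed(1)

TRUE_GRAD = 1.0                  # stationary "signal" part of the gradient
NOISE_STD = 0.5
Q, R = 1e-4, NOISE_STD ** 2      # process / observation noise variances

m, P = 0.0, 1.0                  # filtered estimate and its error variance
raw_err = filt_err = 0.0
STEPS = 2000
for _ in range(STEPS):
    g = TRUE_GRAD + random.gauss(0.0, NOISE_STD)    # noisy stochastic gradient
    P += Q                                           # predict: variance inflates
    k = P / (P + R)                                  # MSE-optimal gain
    m += k * (g - m)                                 # correct with adaptive gain
    P *= 1.0 - k
    raw_err += (g - TRUE_GRAD) ** 2
    filt_err += (m - TRUE_GRAD) ** 2

raw_mse = raw_err / STEPS
filt_mse = filt_err / STEPS
```

The gain starts large (fast adaptation when the estimate is uncertain) and shrinks toward a small steady-state value, so the filtered gradient tracks the signal with far lower mean-squared error than the raw samples.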
[LG-17] Latent Diffusion-Based 3D Molecular Recovery from Vibrational Spectra
Link: https://arxiv.org/abs/2603.06113
Authors: Wenjin Wu, Aleš Leonardis, Linjiang Chen, Jianbo Jiao
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 27 pages, 10 figures
Abstract:Infrared (IR) spectroscopy, a type of vibrational spectroscopy, is widely used for molecular structure determination and provides critical structural information for chemists. However, existing approaches for recovering molecular structures from IR spectra typically rely on one-dimensional SMILES strings or two-dimensional molecular graphs, which fail to capture the intricate relationship between spectral features and three-dimensional molecular geometry. Recent advances in diffusion models have greatly enhanced the ability to generate molecular structures in 3D space. Yet, no existing model has explored the distribution of 3D molecular geometries corresponding to a single IR spectrum. In this work, we introduce IR-GeoDiff, a latent diffusion model that recovers 3D molecular geometries from IR spectra by integrating spectral information into both node and edge representations of molecular structures. We evaluate IR-GeoDiff from both spectral and structural perspectives, demonstrating its ability to recover the molecular distribution corresponding to a given IR spectrum. Furthermore, an attention-based analysis reveals that the model is able to focus on characteristic functional group regions in IR spectra, qualitatively consistent with common chemical interpretation practices.
[LG-18] Improved high-dimensional estimation with Langevin dynamics and stochastic weight averaging
Link: https://arxiv.org/abs/2603.06028
Authors: Stanley Wei, Alex Damian, Jason D. Lee
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Significant recent work has studied the ability of gradient descent to recover a hidden planted direction \theta^\star \in S^{d-1} in different high-dimensional settings, including tensor PCA and single-index models. The key quantity that governs the ability of gradient descent to traverse these landscapes is the information exponent k^\star (Ben Arous et al., (2021)), which corresponds to the order of the saddle at initialization in the population landscape. Ben Arous et al., (2021) showed that n \gtrsim d^{\max(1, k^\star-1)} samples were necessary and sufficient for online SGD to recover \theta^\star , and Ben Arous et al., (2020) proved a similar lower bound for Langevin dynamics. More recently, Damian et al., (2023) showed it was possible to circumvent these lower bounds by running gradient descent on a smoothed landscape, and that this algorithm succeeds with n \gtrsim d^{\max(1, k^\star/2)} samples, which is optimal in the worst case. This raises the question of whether it is possible to achieve the same rate without explicit smoothing. In this paper, we show that Langevin dynamics can succeed with n \gtrsim d^{k^\star/2} samples if one considers the average iterate, rather than the last iterate. The key idea is that the combination of noise-injection and iterate averaging is able to emulate the effect of landscape smoothing. We apply this result to both the tensor PCA and single-index model settings. Finally, we conjecture that minibatch SGD can also achieve the same rate without adding any additional noise.
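The averaging mechanism can be seen already on a 1D quadratic, where the last Langevin iterate fluctuates at the stationary noise scale while the running average of the iterates typically concentrates much closer to the minimizer. The toy sketch below is ours and is far simpler than the paper's high-dimensional tensor PCA and single-index settings; step size, noise scale, and burn-in are invented for illustration.

```python
import random

# Langevin dynamics on f(x) = 0.5 * x**2 with a running iterate average (ours).
random.seed(0)

LR = 0.1
NOISE = 0.5                 # scale of the injected Gaussian (Langevin) noise
x = 5.0
avg, n = 0.0, 0
BURN_IN, STEPS = 200, 5000
for t in range(STEPS):
    grad = x                                          # gradient of 0.5 * x**2
    x += -LR * grad + NOISE * random.gauss(0.0, LR ** 0.5)
    if t >= BURN_IN:
        n += 1
        avg += (x - avg) / n                          # running iterate average

last_err = abs(x)                                     # error of the last iterate
avg_err = abs(avg)                                    # error of the averaged iterate
```

Averaging integrates out the injected noise over the trajectory, which is the intuition behind the paper's claim that noise injection plus iterate averaging emulates landscape smoothing.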
[LG-19] Agnostic learning in (almost) optimal time via Gaussian surface area
链接: https://arxiv.org/abs/2603.06027
作者: Lucas Pesenti,Lucas Slot,Manuel Wiedmer
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 20 pages
Abstract:The complexity of learning a concept class under Gaussian marginals in the difficult agnostic model is closely related to its L_1 -approximability by low-degree polynomials. For any concept class with Gaussian surface area at most \Gamma , Klivans et al. (2008) show that degree d = O(\Gamma^2 / \varepsilon^4) suffices to achieve an \varepsilon -approximation. This leads to the best-known bounds on the complexity of learning a variety of concept classes. In this note, we improve their analysis by showing that degree d = \tilde O (\Gamma^2 / \varepsilon^2) is enough. In light of lower bounds due to Diakonikolas et al. (2021), this yields (near) optimal bounds on the complexity of agnostically learning polynomial threshold functions in the statistical query model. Our proof relies on a direct analogue of a construction of Feldman et al. (2020), who considered L_1 -approximation on the Boolean hypercube.
[LG-20] Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments
链接: https://arxiv.org/abs/2603.06009
作者: Michael Beukman,Khimya Khetarpal,Zeyu Zheng,Will Dabney,Jakob Foerster,Michael Dennis,Clare Lyle
类目: Machine Learning (cs.LG)
*备注:
Abstract:Plateaus, where an agent’s performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that incorrectly doing so can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.
[LG-21] EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
链接: https://arxiv.org/abs/2603.06003
作者: Zongfang Liu,Shengkun Tang,Boyang Sun,Zhiqiang Shen,Xin Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains memory- and throughput-bound because the full expert pool must be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce the Expected Speculative Acceptance Proxy (ESAP), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model. ESAP is bounded and stable, enabling cheap comparison of many candidates without costly autoregressive decoding. Building on ESAP, we propose EvoESAP, an evolutionary search framework that optimizes a non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it a plug-and-play complement to criteria such as Frequency, EAN, SEER, and REAP. Across 7B–30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity.
[LG-22] Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence
链接: https://arxiv.org/abs/2603.05960
作者: Hui Yang,Tao Ren,Jinyang Jiang,Wan Tian,Yijie Peng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees, or only achieve the standard \mathcal{O}(\epsilon^{-4}) iteration complexity in nonconvex settings. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of \tilde{\mathcal{O}}(\epsilon^{-3}) for finding an \epsilon-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
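A rough illustration of mask-traversal training (the cyclic two-block schedule and quadratic objective below are hypothetical stand-ins; the paper's actual masking scheme and analysis are more involved). Only the currently masked-in coordinates are updated at each step, so optimizer state is needed only for the active block:

```python
def masked_step(params, grads, mask, lr=0.1):
    """Apply a gradient step only to coordinates selected by the mask."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]

def train_with_mask_traversal(grad_fn, params, blocks, epochs=50, lr=0.1):
    """Cycle through disjoint coordinate blocks, updating one block at a
    time (a hypothetical traversal schedule, not the paper's exact one)."""
    n = len(params)
    for _ in range(epochs):
        for block in blocks:
            mask = [i in block for i in range(n)]
            params = masked_step(params, grad_fn(params), mask, lr)
    return params

# Toy quadratic: minimize sum((p_i - target_i)^2) over two parameter blocks.
target = [1.0, -2.0, 3.0, 0.5]
grad_fn = lambda p: [2 * (pi - ti) for pi, ti in zip(p, target)]
blocks = [{0, 1}, {2, 3}]  # the masks together cover every coordinate
final = train_with_mask_traversal(grad_fn, [0.0] * 4, blocks)
```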
[LG-23] A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA
链接: https://arxiv.org/abs/2603.05931
作者: Neelesh Gupta,Peter Wang,Rajgopal Kannan,Viktor K. Prasanna
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures
Abstract:Gated DeltaNet (GDN) is a linear attention mechanism that replaces the growing KV cache with a fixed-size recurrent state. Hybrid LLMs like Qwen3-Next use 75% GDN layers and achieve competitive accuracy to attention-only models. However, at batch-1, GDN decode is memory-bound on GPUs since the full recurrent state must be round-tripped through HBM every token. We show that this bottleneck is architectural, not algorithmic, as all subquadratic sequence models exhibit arithmetic intensities below 1 FLOP/B at decode time, making them more memory-bound than standard Transformers. We present an FPGA accelerator that eliminates this bottleneck by holding the full 2 MB recurrent state persistently in on-chip BRAM, converting the workload from memory-bound to compute-bound. Our design fuses the GDN recurrence into a five-phase pipelined datapath that performs only one read and one write pass over each state matrix per token, exploits Grouped Value Attention for paired-head parallelism, and overlaps preparation, computation, and output storage via dataflow pipelining. We explore four design points on an AMD Alveo U55C using Vitis HLS, varying head-level parallelism from 2 to 16 value-heads per iteration. Our fastest configuration achieves 63 \mu s per token, 4.5 \times faster than the GPU reference on NVIDIA H100 PCIe. Post-implementation power analysis reports 9.96 W on-chip, yielding up to 60 \times greater energy efficiency per token decoded.
[LG-24] Weak-SIGReg: Covariance Regularization for Stable Deep Learning ICLR2026
链接: https://arxiv.org/abs/2603.05924
作者: Habibullah Akbar
类目: Machine Learning (cs.LG)
*备注: Accepted at GRaM workshop (ICLR 2026). Code supplementary: this https URL
Abstract:Modern neural network optimization relies heavily on architectural priors, such as Batch Normalization and residual connections, to stabilize training dynamics. Without these, or in low-data regimes with aggressive augmentation, low-bias architectures like Vision Transformers (ViTs) often suffer from optimization collapse. This work adopts Sketched Isotropic Gaussian Regularization (SIGReg), recently introduced in the LeJEPA self-supervised framework, and repurposes it as a general optimization stabilizer for supervised learning. While the original formulation targets the full characteristic function, a computationally efficient variant is derived, Weak-SIGReg, which targets the covariance matrix via random sketching. Inspired by interacting particle systems, representation collapse is viewed as stochastic drift; SIGReg constrains the representation density towards an isotropic Gaussian, mitigating this drift. Empirically, SIGReg recovers the training of a ViT on CIFAR-100 from a collapsed 20.73% to 72.02% accuracy without architectural hacks and significantly improves the convergence of deep vanilla MLPs trained with pure SGD. Code is available at this https URL.
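A minimal sketch of the sketched-covariance idea, assuming the penalty takes the form "random 1-D projections of the batch should look like N(0,1)"; the exact statistic used in the paper may differ. Collapsed (all-identical) features are penalized far more than an isotropic Gaussian batch:

```python
import random

def weak_sigreg_penalty(features, n_sketches=8, seed=0):
    """Project the feature batch onto random unit directions and penalize
    each projection's deviation from zero mean and unit variance."""
    rng = random.Random(seed)
    d = len(features[0])
    penalty = 0.0
    for _ in range(n_sketches):
        u = [rng.gauss(0, 1) for _ in range(d)]
        norm = sum(c * c for c in u) ** 0.5
        u = [c / norm for c in u]  # random unit sketching direction
        proj = [sum(ui * fi for ui, fi in zip(u, f)) for f in features]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / n_sketches

rng = random.Random(1)
iso = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]  # isotropic batch
collapsed = [[1.0, 0.0] for _ in range(200)]                    # collapsed batch
iso_penalty = weak_sigreg_penalty(iso)
collapsed_penalty = weak_sigreg_penalty(collapsed)
```

In training, such a penalty would be added to the supervised loss to counteract the stochastic drift toward collapse described above.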
[LG-25] Design Experiments to Compare Multi-armed Bandit Algorithms
链接: https://arxiv.org/abs/2603.05919
作者: Huiling Meng,Ningyuan Chen,Xuefeng Gao
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over T users produces only one dependent trajectory, because the algorithm’s decisions depend on all past interactions. Reliable inference therefore demands many independent restarts of the algorithm, making experimentation costly and delaying deployment decisions. We propose Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. We develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only T + o(T) user interactions instead of 2T for a run of the treatment and control policies, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in T, whereas the estimator from a naïve design has a linearly-growing variance. Numerical experiments with UCB, Thompson Sampling, and \epsilon-greedy policies confirm these theoretical gains.
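The AR design can be sketched directly from the abstract's description. The replay rule below (pop the earliest unused recorded reward for a matching action) is one natural instantiation and may differ in detail from the paper's:

```python
import random

def artificial_replay(env, policy_a, policy_b, T, rng):
    """Run policy A for T rounds against the real environment, recording
    (action, reward); then run policy B, reusing a recorded reward when B
    picks an action A already took, querying the environment otherwise."""
    records = {}  # action -> list of unused recorded rewards
    hist_a, hist_b = [], []
    real_calls = 0

    for _ in range(T):  # first run: policy A, always real queries
        a = policy_a(hist_a)
        r = env(a, rng)
        real_calls += 1
        records.setdefault(a, []).append(r)
        hist_a.append((a, r))

    for _ in range(T):  # second run: policy B reuses A's records
        a = policy_b(hist_b)
        if records.get(a):
            r = records[a].pop(0)  # replay a recorded reward
        else:
            r = env(a, rng)        # fall back to the real environment
            real_calls += 1
        hist_b.append((a, r))

    return hist_a, hist_b, real_calls

# Toy 2-armed Gaussian bandit: arm 1 has the higher mean reward.
env = lambda a, rng: rng.gauss([0.2, 0.8][a], 0.1)
explorer = lambda hist: len(hist) % 2  # alternates between the two arms
greedy = lambda hist: 1                # always pulls the better arm

rng = random.Random(0)
_, _, calls = artificial_replay(env, explorer, greedy, T=100, rng=rng)
```

In this toy, the first run costs 100 real interactions and the second only 50 (half of the greedy policy's pulls match recorded arm-1 rewards), illustrating the saving relative to the naïve 2T design.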
[LG-26] Stochastic Event Prediction via Temporal Motif Transitions
链接: https://arxiv.org/abs/2603.05874
作者: İbrahim Bahadır Altun,Ahmet Erdem Sarıyüce
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
Abstract:Networks of timestamped interactions arise across social, financial, and biological domains, where forecasting future events requires modeling both evolving topology and temporal ordering. Temporal link prediction methods typically frame the task as binary classification with negative sampling, discarding the sequential and correlated nature of real-world interactions. We introduce STEP (STochastic Event Predictor), a framework that reformulates temporal link prediction as a sequential forecasting problem in continuous time. STEP models event dynamics through discrete temporal motif transitions governed by Poisson processes, maintaining a set of open motif instances that evolve as new interactions arrive. At each step, the framework decides whether to initiate a new temporal motif or extend an existing one, selecting the most probable event via Bayesian scoring of temporal likelihoods and structural priors. STEP also produces compact, temporal motif-based feature vectors that can be concatenated with existing temporal graph neural network outputs, enriching their representations without architectural modifications. Experiments on five real-world datasets demonstrate up to 21% average precision gains over state-of-the-art baselines in classification and 0.99 precision in next k sequential forecasting, with consistently lower runtime than competing motif-aware methods.
[LG-27] Self-Auditing Parameter-Efficient Fine-Tuning for Few-Shot 3D Medical Image Segmentation
链接: https://arxiv.org/abs/2603.05822
作者: Son Thai Ly,Hien V. Nguyen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adapting foundation models to new clinical sites remains challenging in practice. Domain shift and scarce annotations must be handled by experts, yet many clinical groups do not have ready access to skilled AI engineers to tune adapter designs and training recipes. As a result, adaptation cycles can stretch from weeks to months, particularly in few-shot settings. Existing PEFT methods either require manual adapter configuration or automated searches that are computationally infeasible in few-shot 3D settings. We propose SEA-PEFT (SElf-Auditing Parameter-Efficient Fine-Tuning) to automate this process. SEA-PEFT treats adapter configuration as an online allocation problem solved during fine-tuning rather than through manual, fixed-topology choices. SEA-PEFT uses a search-audit-allocate loop that trains active adapters, estimates each adapter’s Dice utility by momentarily toggling it off, and then reselects the active set under a parameter budget using a greedy knapsack allocator. Exponential Moving Average and Interquartile Range smoothing, together with a Finite-State Ranking controller, stabilize the loop and improve reliability in high-noise few-shot regimes. On TotalSegmentator and FLARE’22, SEA-PEFT improves mean Dice by 2.4–2.8 points over the strongest fixed-topology PEFT baselines across 1/5/10-shot settings while training 1% of parameters. For reproducibility purposes, we made our code publicly available at this https URL
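The allocate step of the search-audit-allocate loop can be illustrated with a plain greedy knapsack over audited utilities. The adapter names, utilities, and costs below are invented; in SEA-PEFT the utilities would come from the toggle-off Dice audit:

```python
def greedy_knapsack_allocate(adapters, budget):
    """Select an active adapter set under a parameter budget, greedily by
    utility-per-parameter ratio (a sketch of the allocator, not the exact
    SEA-PEFT implementation). adapters: list of (name, utility, cost)."""
    chosen, used = [], 0
    ranked = sorted(adapters, key=lambda a: a[1] / a[2], reverse=True)
    for name, util, cost in ranked:
        if used + cost <= budget:  # take it only if it fits the budget
            chosen.append(name)
            used += cost
    return chosen, used

# Hypothetical audited adapters: (name, Dice utility, parameter cost).
adapters = [("a", 3.0, 10), ("b", 2.0, 5), ("c", 1.0, 20), ("d", 0.5, 1)]
chosen, used = greedy_knapsack_allocate(adapters, budget=16)
```

Adapter "c" has the worst utility-per-parameter ratio and is the only one excluded here, even though its raw utility beats "d"; that ratio-driven choice is what the greedy knapsack formalizes.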
[LG-28] MoE Lens – An Expert Is All You Need ICLR2025
链接: https://arxiv.org/abs/2603.05806
作者: Marmik Chaudhari,Idhant Gulati,Nishkal Hundia,Pranav Karra,Shivam Raval
类目: Machine Learning (cs.LG)
*备注: 15 pages, 10 figures, ICLR 2025 Workshop on Sparsity in LLMs (SLLM)
Abstract:Mixture of Experts (MoE) models enable parameter-efficient scaling through sparse expert activations, yet optimizing their inference and memory costs remains challenging due to limited understanding of their specialization behavior. We present a systematic analysis of expert specialization in MoEs through two complementary approaches: domain-specific routing patterns and an early decoding framework that tracks expert contributions to output representations. Our analysis of the DeepSeekMoE model reveals that despite having 64 routed experts with 6 active for each layer’s computation, the model predominantly relies on a few specialized experts, with the top-weighted expert’s output closely approximating the full ensemble prediction. We quantitatively validate these findings through a systematic analysis of the token routing distribution, demonstrating that very few experts handle over 50% of routing decisions across different specialized domains. Hidden state similarity between single and ensemble experts is extremely high at every layer, with some layers reaching cosine similarity as high as 0.95, and perplexity increases by only 5% when using a single expert across all three domains. Our results indicate that Mixture of Experts models exhibit concentrated expertise, highlighting potential opportunities for inference optimization through targeted expert pruning while maintaining model performance, and opening avenues towards studying localization of learned knowledge in these models.
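The top-expert observation is easy to probe numerically: compare the top-weighted expert's output vector with the gate-weighted ensemble. The three toy expert outputs and gate weights below are invented for illustration; when one gate weight dominates, the cosine similarity is close to 1:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top1_vs_ensemble(expert_outputs, gate_weights):
    """Cosine similarity between the top-weighted expert's output and the
    full gate-weighted ensemble (mirrors the early-decoding comparison)."""
    dim = len(expert_outputs[0])
    ensemble = [sum(w * e[i] for w, e in zip(gate_weights, expert_outputs))
                for i in range(dim)]
    top = expert_outputs[gate_weights.index(max(gate_weights))]
    return cosine(top, ensemble)

# Hypothetical outputs of three experts and a skewed gate distribution.
experts = [[1.0, 2.0, -1.0], [0.5, -0.3, 0.2], [-0.2, 0.1, 0.4]]
weights = [0.9, 0.05, 0.05]
sim = top1_vs_ensemble(experts, weights)
```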
[LG-29] Sparse Crosscoders for diffing MoEs and Dense models
链接: https://arxiv.org/abs/2603.05805
作者: Marmik Chaudhari,Nishkal Hundia,Idhant Gulati
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures
Abstract:Mixture of Experts (MoE) models achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to dense models. We present a systematic comparison of MoE and dense model internals using crosscoders, a variant of sparse autoencoders that jointly models multiple activation spaces. We train 5-layer dense and MoE models (equal active parameters) on 1B tokens across code, scientific text, and English stories. Using BatchTopK crosscoders with explicitly designated shared features, we achieve \sim 87% fractional variance explained and uncover concrete differences in feature organization. The MoE learns significantly fewer unique features compared to the dense model. MoE-specific features also exhibit higher activation density than shared features, whereas dense-specific features show lower density. Our analysis reveals that MoEs develop more specialized, focused representations while dense models distribute information across broader, more general-purpose features.
[LG-30] First-Order Softmax Weighted Switching Gradient Method for Distributed Stochastic Minimax Optimization with Stochastic Constraints
链接: https://arxiv.org/abs/2603.05774
作者: Zhankun Luo,Antesh Upadhyay,Sang Bin Moon,Abolfazl Hashemi
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:This paper addresses the distributed stochastic minimax optimization problem subject to stochastic constraints. We propose a novel first-order Softmax-Weighted Switching Gradient method tailored for federated learning. Under full client participation, our algorithm achieves the standard \mathcal{O}(\epsilon^{-4}) oracle complexity to satisfy a unified bound \epsilon for both the optimality gap and feasibility tolerance. We extend our theoretical analysis to the practical partial participation regime by quantifying client sampling noise through a stochastic superiority assumption. Furthermore, by relaxing standard boundedness assumptions on the objective functions, we establish a strictly tighter lower bound for the softmax hyperparameter. We provide a unified error decomposition and establish a sharp \mathcal{O}(\log\frac{1}{\delta}) high-probability convergence guarantee. Ultimately, our framework demonstrates that a single-loop primal-only switching mechanism provides a stable alternative for optimizing worst-case client performance, effectively bypassing the hyperparameter sensitivity and convergence oscillations often encountered in traditional primal-dual or penalty-based approaches. We verify the efficacy of our algorithm via experiments on the Neyman-Pearson (NP) classification and fair classification tasks.
[LG-31] Score-Guided Proximal Projection: A Unified Geometric Framework for Rectified Flow Editing
链接: https://arxiv.org/abs/2603.05761
作者: Vansh Bansal,James G Scott
类目: Machine Learning (cs.LG)
*备注:
Abstract:Rectified Flow (RF) models achieve state-of-the-art generation quality, yet controlling them for precise tasks – such as semantic editing or blind image recovery – remains a challenge. Current approaches bifurcate into inversion-based guidance, which suffers from “geometric locking” by rigidly adhering to the source trajectory, and posterior sampling approximations (e.g., DPS), which are computationally expensive and unstable. In this work, we propose Score-Guided Proximal Projection (SGPP), a unified framework that bridges the gap between deterministic optimization and stochastic sampling. We reformulate the recovery task as a proximal optimization problem, defining an energy landscape that balances fidelity to the input with realism from the pre-trained score field. We theoretically prove that this objective induces a normal contraction property, geometrically guaranteeing that out-of-distribution inputs are snapped onto the data manifold, and it effectively reaches the posterior mode constrained to the manifold. Crucially, we demonstrate that SGPP generalizes state-of-the-art editing methods: RF-inversion is effectively a limiting case of our framework. By relaxing the proximal variance, SGPP enables “soft guidance,” offering a continuous, training-free trade-off between strict identity preservation and generative freedom.
[LG-32] MIRACL: A Diverse Meta-Reinforcement Learning for Multi-Objective Multi-Echelon Combinatorial Supply Chain Optimisation
链接: https://arxiv.org/abs/2603.05760
作者: Rifny Rachman,Josh Tingey,Richard Allmendinger,Wei Pan,Pradyumn Shukla,Bahrul Ilmi Nasution
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-objective reinforcement learning (MORL) is effective for multi-echelon combinatorial supply chain optimisation, where tasks involve high dimensionality, uncertainty, and competing objectives. However, its deployment in dynamic environments is hindered by the need for task-specific retraining and substantial computational cost. We introduce MIRACL (Meta multI-objective Reinforcement leArning with Composite Learning), a hierarchical Meta-MORL framework that allows for few-shot generalisation across diverse tasks. MIRACL decomposes each task into structured subproblems for efficient policy adaptation and meta-learns a global policy across tasks using a Pareto-based adaptation strategy to encourage diversity in meta-training and fine-tuning. To our knowledge, this is the first integration of Meta-MORL with such mechanisms in combinatorial optimisation. Although validated in the supply chain domain, MIRACL is theoretically domain-agnostic and applicable to broader dynamic multi-objective decision-making problems. Empirical evaluations show that MIRACL outperforms conventional MORL baselines in simple to moderate tasks, achieving up to 10% higher hypervolume and 5% better expected utility. These results underscore the potential of MIRACL for robust, efficient adaptation in multi-objective problems.
[LG-33] Unsupervised domain adaptation for radioisotope identification in gamma spectroscopy
链接: https://arxiv.org/abs/2603.05719
作者: Peter Lalor,Ayush Panigrahy,Alex Hagen
类目: Machine Learning (cs.LG)
*备注: 32 pages, 3 figures, and 14 tables
Abstract:Training machine learning models for radioisotope identification using gamma spectroscopy remains an elusive challenge for many practical applications, largely stemming from the difficulty of acquiring and labeling large, diverse experimental datasets. Simulations can mitigate this challenge, but the accuracy of models trained on simulated data can deteriorate substantially when deployed to an out-of-distribution operational environment. In this study, we demonstrate that unsupervised domain adaptation (UDA) can improve the ability of a model trained on synthetic data to generalize to a new testing domain, provided unlabeled data from the target domain are available. Conventional supervised techniques are unable to utilize this data because the absence of isotope labels precludes defining a supervised classification loss. Instead, we first pretrain a spectral classifier using labeled synthetic data and subsequently leverage unlabeled target data to align the learned feature representations between the source and target domains. We compare a range of different UDA techniques, finding that minimizing the maximum mean discrepancy (MMD) between source and target feature vectors yields the most consistent improvement to testing scores. For instance, using a custom transformer-based neural network, we achieved a testing accuracy of 0.904 \pm 0.022 on an experimental LaBr_3 test set after performing unsupervised feature alignment via MMD minimization, compared to 0.754 \pm 0.014 before alignment. Overall, our results highlight the potential of using UDA to adapt a radioisotope classifier trained on synthetic data for real-world deployment.
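MMD minimization itself is a standard UDA tool. A self-contained sketch of the squared-MMD statistic with an RBF kernel (the bandwidth gamma and the toy feature sets are illustrative choices) shows the quantity that would be added to the training loss and driven toward zero to align source and target features:

```python
import math

def rbf(x, y, gamma):
    """RBF (Gaussian) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(source, target, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between two
    feature sets; zero when the two empirical distributions coincide."""
    def avg_k(xs, ys):
        return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))
    return avg_k(source, source) + avg_k(target, target) - 2 * avg_k(source, target)

# Toy 1-D features: target either matches the source or is shifted by 2.
source = [[0.0], [0.1], [-0.1]]
target_shifted = [[2.0], [2.1], [1.9]]
aligned_gap = mmd2(source, source)
shifted_gap = mmd2(source, target_shifted)
```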
[LG-34] Warm Starting State-Space Models with Automata Learning
链接: https://arxiv.org/abs/2603.05694
作者: William Fishell,Sam Nicholas Kouteili,Mark Santolucito
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
*备注:
Abstract:We prove that Moore machines can be exactly realized as state-space models (SSMs), establishing a formal correspondence between symbolic automata and these continuous machine learning architectures. These Moore-SSMs preserve both the complete symbolic structure and input-output behavior of the original Moore machine, but operate in Euclidean space. With this correspondence, we compare the training of SSMs with both passive and active automata learning. In recovering automata from the SYNTCOMP benchmark, we show that SSMs require orders of magnitude more data than symbolic methods and fail to learn state structure. This suggests that symbolic structure provides a strong inductive bias for learning these systems. We leverage this insight to combine the strengths of both automata learning and SSMs in order to learn complex systems efficiently. We learn an adaptive arbitration policy on a suite of arbiters from SYNTCOMP and show that initializing SSMs with symbolically-learned approximations learn both faster and better. We see 2-5 times faster convergence compared to randomly initialized models and better overall model accuracies on test data. Our work lifts automata learning out of purely discrete spaces, enabling principled exploitation of symbolic structure in continuous domains for efficiently learning in complex settings.
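The exact-realization claim can be made concrete: a Moore machine becomes a linear state-space model whose state is a one-hot vector, whose input symbol selects a 0/1 transition matrix, and whose output is a readout of the current state. The parity machine below is a sketch of this correspondence, not the paper's construction:

```python
def moore_as_ssm(transitions, outputs, init_state, inputs):
    """Simulate a Moore machine as a linear SSM: state x is one-hot over
    machine states, each input symbol sigma selects a 0/1 matrix A_sigma
    (x' = A_sigma x), and the output is read off the current state."""
    n = len(outputs)
    # A[sigma][i][j] = 1 iff reading sigma moves state j -> state i.
    A = {}
    for (src, sym), dst in transitions.items():
        A.setdefault(sym, [[0] * n for _ in range(n)])[dst][src] = 1
    x = [1 if i == init_state else 0 for i in range(n)]  # one-hot init
    trace = [outputs[init_state]]
    for sym in inputs:
        x = [sum(A[sym][i][j] * x[j] for j in range(n)) for i in range(n)]
        trace.append(outputs[x.index(1)])  # linear readout of the state
    return trace

# Parity Moore machine: state 0 = even number of 1s seen so far, state 1 = odd.
transitions = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
outputs = ["even", "odd"]
trace = moore_as_ssm(transitions, outputs, init_state=0, inputs="1101")
```

Because every update is a matrix-vector product, the same machine lives entirely in Euclidean space, which is the sense in which the symbolic structure embeds into the SSM architecture.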
[LG-35] Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks
链接: https://arxiv.org/abs/2603.05692
作者: Burak Topcu,Musa Oguzhan Cim,Poovaiah Palangappa,Meena Arunachalam,Mahmut Taylan Kandemir
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 17 pages, 8 figures, 3 tables
Abstract:Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly expanding ecosystem, dense LLMs–those that activate all model parameters for each token generation–form the foundation for advanced expert-based variants. Dense models continue to dominate because of their strong generalization ability, scalability, ease of fine-tuning, and versatility across diverse tasks. In LLM inference systems, performance is mainly characterized by latency, response time, and throughput (i.e., tokens generated per unit of time). Latency and throughput are inherently coupled: optimizing for one often comes at the expense of the other. Moreover, batching strategies and parallelism configurations, which are essential when dense model parameters exceed device memory capacity, can significantly affect both latency and overall system throughput. This paper (i) investigates the workloads of two representative dense LLMs–Llama-3.1-70B and Llama-3.1-405B, focusing in particular on intra-node parallelization schemes, (ii) analyzes how input characteristics, batching, and parallelism strategies influence latency flexibility and the latency-throughput tradeoff, and (iii) identifies key performance bottlenecks that inform design choices for meeting service-level agreements (SLAs) and sustaining inference quality. Our empirical evaluations reveal that Tensor Parallelism (TP) improves the latency objectives while Pipeline Parallelism (PP) is better-suited for throughput-oriented applications. We highlight that their hybrid usage by controlling the TP and PP degrees provides control over the latency-throughput interplay.
[LG-36] Improved Scaling Laws via Weak-to-Strong Generalization in Random Feature Ridge Regression
链接: https://arxiv.org/abs/2603.05691
作者: Diyuan Wu,Lehan Chen,Theodor Misiakiewicz,Marco Mondelli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:It is increasingly common in machine learning to use learned models to label data and then employ such data to train more capable models. The phenomenon of weak-to-strong generalization exemplifies the advantage of this two-stage procedure: a strong student is trained on imperfect labels obtained from a weak teacher, and yet the strong student outperforms the weak teacher. In this paper, we show that the potential improvement is substantial, in the sense that it affects the scaling law followed by the test error. Specifically, we consider students and teachers trained via random feature ridge regression (RFRR). Our main technical contribution is to derive a deterministic equivalent for the excess test error of the student trained on labels obtained via the teacher. Via this deterministic equivalent, we then identify regimes in which the scaling law of the student improves upon that of the teacher, unveiling that the improvement can be achieved both in bias-dominated and variance-dominated settings. Strikingly, the student may attain the minimax optimal rate regardless of the scaling law of the teacher, even when the teacher's test error does not decay with the sample size at all.
[LG-37] Reinforcement Learning for Power-Flow Network Analysis DATE
链接: https://arxiv.org/abs/2603.05673
作者: Alperen Ergur,Julia Lindberg,Vinny Miller
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC); Algebraic Geometry (math.AG)
*备注: more experiments will be added in a relatively soon date
Abstract:The power flow equations are non-linear multivariate equations that describe the relationship between power injections and bus voltages of electric power networks. Given a network topology, we are interested in finding network parameters with many equilibrium points. This corresponds to finding instances of the power flow equations with many real solutions. Current state-of-the-art algorithms in computational algebra are not capable of answering this question for networks involving more than a small number of variables. To remedy this, we design a probabilistic reward function that gives a good approximation to this root count, and a state-space that mimics the space of power flow equations. We derive the average root count for a Gaussian model, and use this as a baseline for our RL agents. The agents discover instances of the power flow equations with many more solutions than the average baseline. This demonstrates the potential of RL for power-flow network design and analysis, as well as the potential for RL to contribute meaningfully to problems that involve complex non-linear algebra or geometry. (Author order is alphabetical; all authors contributed equally.)
[LG-38] The Value of Graph-based Encoding in NBA Salary Prediction
链接: https://arxiv.org/abs/2603.05671
作者: Junhao Su,David Grimsman,Christopher Archibald
类目: Machine Learning (cs.LG)
*备注: 6 pages, IEEE conference template style. Submitted to IETC 2026; decision expected March 22
Abstract:Market valuation for professional athletes is a difficult problem, given the amount of variability in performance and location from year to year. In the National Basketball Association (NBA), a straightforward way to address this problem is to build a tabular data set and use supervised machine learning to predict a player’s salary based on the player’s performance in the previous year. For younger players, whose contracts are mostly determined by draft position, this approach works well; however, it can fail for veterans or those whose salaries are on the high tail of the distribution. In this paper, we show that building a knowledge graph with on- and off-court data, embedding that graph in a vector space, and including that vector in the tabular data allows the supervised learner to better understand the landscape of factors that affect salary. We compare several graph embedding algorithms and show that such a process is vital to NBA salary prediction.
[LG-39] Identifying Adversary Characteristics from an Observed Attack
链接: https://arxiv.org/abs/2603.05625
作者: Soyon Choi,Scott Alfeld,Meiyi Ma
类目: Machine Learning (cs.LG)
*备注:
Abstract:When used in automated decision-making systems, machine learning (ML) models are vulnerable to data-manipulation attacks. Some defense mechanisms (e.g., adversarial regularization) directly affect the ML models while others (e.g., anomaly detection) act within the broader system. In this paper we consider a different defensive task, focusing on the attacker rather than the attack. We present and demonstrate a framework for identifying characteristics of the attacker from an observed attack. We prove that, without additional knowledge, the attacker is non-identifiable (multiple potential attackers would perform the same observed attack). To address this challenge, we propose a domain-agnostic framework to identify the most probable attacker. This framework aids the defender in two ways. First, knowledge about the attacker can be leveraged for exogenous mitigation (i.e., addressing the vulnerability by altering the decision-making system outside the learning algorithm and/or limiting the attacker’s capability). Second, when implementing defense methods that directly affect the learning process (e.g., adversarial regularization), knowledge of the specific attacker improves performance. We present the details of our framework and illustrate its applicability through specific instantiations on a variety of learners.
[LG-40] A Novel Hybrid Heuristic-Reinforcement Learning Optimization Approach for a Class of Railcar Shunting Problems
链接: https://arxiv.org/abs/2603.05579
作者: Ruonan Zhao,Joseph Geunes
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Railcar shunting is a core planning task in freight railyards, where yard planners need to disassemble and reassemble groups of railcars to form outbound trains. Classification tracks with access from one side only can be considered as stack structures, where railcars are added and removed from only one end, leading to a last-in-first-out (LIFO) retrieval order. In contrast, two-sided tracks function like queue structures, allowing railcars to be added from one end and removed from the opposite end, following a first-in-first-out (FIFO) order. We consider a problem requiring assembly of multiple outbound trains using two locomotives in a railyard with two-sided classification track access. To address this combinatorially challenging problem class, we decompose the problem into two subproblems, each with one-sided classification track access and a locomotive on each side. We present a novel Hybrid Heuristic-Reinforcement Learning (HHRL) framework that integrates railway-specific heuristic solution approaches with a reinforcement learning method, specifically Q-learning. The proposed framework leverages methods to decrease the state-action space and guide exploration during reinforcement learning. The results of a series of numerical experiments demonstrate the efficiency and quality of the HHRL algorithm in both one-sided access, single-locomotive problems and two-sided access, two-locomotive problems.
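The stack/queue distinction between one-sided and two-sided classification tracks described above maps directly onto familiar data structures; a minimal sketch:

```python
from collections import deque

# One-sided classification track: a stack (LIFO retrieval).
one_sided = []
for car in ["A", "B", "C"]:
    one_sided.append(car)               # cars enter from the single open end
lifo_order = [one_sided.pop() for _ in range(3)]       # removed from the same end

# Two-sided classification track: a queue (FIFO retrieval).
two_sided = deque()
for car in ["A", "B", "C"]:
    two_sided.append(car)               # cars enter from one end...
fifo_order = [two_sided.popleft() for _ in range(3)]   # ...and leave from the other

print(lifo_order)  # ['C', 'B', 'A']
print(fifo_order)  # ['A', 'B', 'C']
```

The same inbound order thus yields opposite retrieval orders, which is exactly what makes the two-sided, two-locomotive problem decomposable into two one-sided subproblems.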
[LG-41] Koopman Regularized Deep Speech Disentanglement for Speaker Verification
链接: https://arxiv.org/abs/2603.05577
作者: Nikos Chazaridis,Mohammad Belal,Rafael Mestre,Timothy J. Norman,Christine Evers
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Human speech contains both linguistic content and speaker-dependent characteristics, making speaker verification a key technology in identity-critical applications. Modern deep learning speaker verification systems aim to learn speaker representations that are invariant to semantic content and nuisance factors such as ambient noise. However, many existing approaches depend on labelled data, textual supervision, or large pretrained models as feature extractors, limiting scalability and practical deployment and raising sustainability concerns. We propose Deep Koopman Speech Disentanglement Autoencoder (DKSD-AE), a structured autoencoder that combines a novel multi-step Koopman operator learning module with instance normalization to disentangle speaker and content dynamics. Quantitative experiments across multiple datasets demonstrate that DKSD-AE achieves improved or competitive speaker verification performance compared to state-of-the-art baselines while maintaining high content EER, confirming effective disentanglement. These results are obtained with substantially fewer parameters and without textual supervision. Moreover, performance remains stable under increased evaluation scale, highlighting representation robustness and generalization. Our findings suggest that Koopman-based temporal modelling, when combined with instance normalization, provides an efficient and principled solution for speaker-focused representation learning.
[LG-42] Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View
链接: https://arxiv.org/abs/2603.05573
作者: Gyuryang Heo,Timothy Ngotiaoco,Kazuki Irie,Samuel J. Gershman,Bernardo Sabatini
类目: Machine Learning (cs.LG)
*备注:
Abstract:Scalable sequence models, such as Transformer variants and structured state-space models, often trade expressive power for sequence-level parallelism, which enables efficient training. Here we examine the bounds on error and how error scales when models operate outside of their expressivity regimes using a Lie-algebraic control perspective. Our theory formulates a correspondence between the depth of a sequence model and a tower of Lie algebra extensions. Echoing recent theoretical studies, we characterize the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. Furthermore, we analytically derive an approximation error bound and show that error diminishes exponentially as the depth increases, consistent with the strong empirical performance of these models. We validate our theoretical predictions using experiments on symbolic word and continuous-valued state-tracking problems.
[LG-43] FuseDiff: Symmetry-Preserving Joint Diffusion for Dual-Target Structure-Based Drug Design
链接: https://arxiv.org/abs/2603.05567
作者: Jianliang Wu,Anjie Qiao,Zhen Wang,Zhewei Wei,Sheng Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dual-target structure-based drug design aims to generate a single ligand together with two pocket-specific binding poses, each compatible with a corresponding target pocket, enabling polypharmacological therapies with improved efficacy and reduced resistance. Existing approaches typically rely on staged pipelines, which either decouple the two poses via conditional-independence assumptions or enforce overly rigid correlations, and therefore fail to jointly generate two target-specific binding modes. To address this, we propose FuseDiff, an end-to-end diffusion model that jointly generates a ligand molecular graph and two pocket-specific binding poses conditioned on both pockets. FuseDiff features a message-passing backbone with Dual-target Local Context Fusion (DLCF), which fuses each ligand atom’s local context from both pockets to enable expressive joint modeling while preserving the desired symmetries. Together with explicit bond generation, FuseDiff enforces topological consistency across the two poses under a shared graph while allowing target-specific geometric adaptation in each pocket. To support principled training and evaluation, we derive a dual-target training set and use an independent held-out test set for evaluation. Experiments on the benchmark and a real-world dual-target system show that FuseDiff achieves state-of-the-art docking performance and enables the first systematic assessment of dual-target pose quality prior to docking-based pose search.
[LG-44] Autocorrelation effects in a stochastic-process model for decision making via time series
链接: https://arxiv.org/abs/2603.05559
作者: Tomoki Yamagami,Mikio Hasegawa,Takatomo Mihana,Ryoichi Horisaki,Atsushi Uchida
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Probability (math.PR); Optics (physics.optics)
*备注: 21 pages, 10 figures
Abstract:Decision makers exploiting photonic chaotic dynamics obtained by semiconductor lasers provide an ultrafast approach to solving multi-armed bandit problems by using a temporal optical signal as the driving source for sequential decisions. In such systems, the sampling interval of the chaotic waveform shapes the temporal correlation of the resulting time series, and experiments have reported that decision accuracy depends strongly on this autocorrelation property. However, it remains unclear whether the benefit of autocorrelation can be explained by a minimal mathematical model. Here, we analyze a stochastic-process model of the time-series-based decision making using the tug-of-war principle for solving the two-armed bandit problem, where the threshold and a two-valued Markov signal evolve jointly. Numerical results reveal an environment-dependent structure: negative (positive) autocorrelation is optimal in reward-rich (reward-poor) environments. These findings show that negative autocorrelation of the time series is advantageous when the sum of the winning probabilities is more than 1, whereas positive autocorrelation is useful when the sum of the winning probabilities is less than 1. Moreover, the performance is independent of autocorrelation if the sum of the winning probabilities equals 1, which is mathematically clarified. This study paves the way for improving the decision-making scheme for reinforcement learning applications in wireless communications and robotics.
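The driving-signal side of such a model can be mimicked with a symmetric two-valued Markov chain whose flip probability sets the sign of the lag-1 autocorrelation (a minimal sketch of the signal only, not the tug-of-war decision maker itself; for this chain the theoretical lag-1 autocorrelation is 1 - 2*p_flip):

```python
import numpy as np

def two_valued_markov(n, p_flip, rng):
    """Symmetric two-state (+1/-1) Markov chain.

    Lag-1 autocorrelation is 1 - 2*p_flip: p_flip < 0.5 gives a positively
    correlated signal, p_flip > 0.5 a negatively correlated (alternating) one.
    """
    x = np.empty(n)
    x[0] = 1.0
    flips = rng.random(n - 1) < p_flip
    for t in range(1, n):
        x[t] = -x[t - 1] if flips[t - 1] else x[t - 1]
    return x

def lag1_autocorr(x):
    x = x - x.mean()
    return (x[:-1] * x[1:]).sum() / (x * x).sum()

rng = np.random.default_rng(0)
pos = lag1_autocorr(two_valued_markov(20000, 0.1, rng))   # near +0.8
neg = lag1_autocorr(two_valued_markov(20000, 0.9, rng))   # near -0.8
print(round(pos, 2), round(neg, 2))
```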
[LG-45] IntSeqBERT: Learning Arithmetic Structure in OEIS via Modulo-Spectrum Embeddings
链接: https://arxiv.org/abs/2603.05556
作者: Kazuhisa Nakasho
类目: Machine Learning (cs.LG)
*备注:
Abstract:Integer sequences in the OEIS span values from single-digit constants to astronomical factorials and exponentials, making prediction challenging for standard tokenised models that cannot handle out-of-vocabulary values or exploit periodic arithmetic structure. We present IntSeqBERT, a dual-stream Transformer encoder for masked integer-sequence modelling on OEIS. Each sequence element is encoded along two complementary axes: a continuous log-scale magnitude embedding and sin/cos modulo embeddings for 100 residues (moduli 2–101), fused via FiLM. Three prediction heads (magnitude regression, sign classification, and modulo prediction for 100 moduli) are trained jointly on 274,705 OEIS sequences. At the Large scale (91.5M parameters), IntSeqBERT achieves 95.85% magnitude accuracy and 50.38% Mean Modulo Accuracy (MMA) on the test set, outperforming a standard tokenised Transformer baseline by +8.9 pt and +4.5 pt, respectively. An ablation removing the modulo stream confirms it accounts for +15.2 pt of the MMA gain and contributes an additional +6.2 pt to magnitude accuracy. A probabilistic Chinese Remainder Theorem (CRT)-based Solver converts the model’s predictions into concrete integers, yielding a 7.4-fold improvement in next-term prediction over the tokenised-Transformer baseline (Top-1: 19.09% vs. 2.59%). Modulo spectrum analysis reveals a strong negative correlation between Normalised Information Gain (NIG) and Euler’s totient ratio \varphi(m)/m (r = -0.851, p < 10^{-28}), providing empirical evidence that composite moduli capture OEIS arithmetic structure more efficiently via CRT aggregation.
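The solver step, turning predicted residues back into a concrete integer, can be illustrated with a standard Chinese Remainder Theorem reconstruction (a generic sketch, not the paper's probabilistic solver; the moduli below are illustrative):

```python
from math import prod

def crt(residues, moduli):
    """Chinese Remainder Theorem: recover n mod prod(moduli) from its
    residues modulo pairwise-coprime moduli."""
    M = prod(moduli)
    n = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        n += r * Mi * pow(Mi, -1, m)   # pow(., -1, m): modular inverse (Python 3.8+)
    return n % M

# If a model predicts the residues of the next term modulo small coprime
# moduli whose product exceeds the term, CRT pins the integer down exactly:
moduli = [7, 11, 13, 17]               # product 17017 > 1234
residues = [1234 % m for m in moduli]
print(crt(residues, moduli))  # 1234
```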
[LG-46] Predicting Atomistic Transitions with Transformers
链接: https://arxiv.org/abs/2603.06526
作者: Henry Tischler,Wenting Li,Qi Tang,Danny Perez,Thomas Vogel
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: Presented at the 2025 Conference on Data Analysis (CoDA), February 25-28, Santa Fe, New Mexico
Abstract:Accurate knowledge of the atomistic transition pathways in materials and material surfaces is crucial for many material science problems. However, conventional simulation techniques used to find these transitions are extremely computationally intensive. Even with large-scale, accelerated material simulations, the computational cost constrains the applicable domain in practice. Machine learning models, with the potential to learn the complex emergent behaviors governing atomistic transitions as a fast surrogate model, have great promise to predict transitions at a vastly reduced computational cost. Here, we demonstrate how transformers can be trained to predict atomistic transitions in nano-clusters. We show how we evaluate the physical validity of the predictions and how a multitude of additional, distinct microstates can be generated by slightly varying the data provided to the model.
[LG-47] Semantics-Aware Caching for Concept Learning
链接: https://arxiv.org/abs/2603.06506
作者: Louis Mozart Kamdem Teyou,Caglar Demir,Axel-Cyrille Ngonga Ngomo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Concept learning is a form of supervised machine learning that operates on knowledge bases in description logics. State-of-the-art concept learners often rely on an iterative search through a countably infinite concept space. In each iteration, they retrieve instances of candidate solutions to select the best concept for the next iteration. While simple learning problems might require a few dozen instance retrieval calls to find a fitting solution, complex learning problems might necessitate thousands of calls. We alleviate the resulting runtime challenge by presenting a semantics-aware caching approach. Our cache is essentially a subsumption-aware map that links concepts to a set of instances via crisp set operations. Our experiments on 5 datasets with 4 symbolic reasoners, a neuro-symbolic reasoner, and 5 popular pagination policies demonstrate that our cache can reduce the runtime of concept retrieval and concept learning by an order of magnitude while being effective for both symbolic and neuro-symbolic reasoners.
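The subsumption-aware map idea, answering retrieval for composite concepts via crisp set operations over cached instance sets, can be sketched as follows (a minimal illustration with hypothetical concepts and instances, not the paper's implementation):

```python
# Cache mapping atomic concepts to their instance sets. Once populated by a
# reasoner, composite-concept retrieval becomes pure set algebra, with no
# further reasoner calls during the learner's iterative search.
cache = {
    "Person":   {"alice", "bob", "carol"},
    "Student":  {"alice", "bob"},
    "Employee": {"bob", "carol"},
}

def instances(concept):
    """Retrieve instances of an atomic name, ('and', A, B), or ('or', A, B)."""
    if isinstance(concept, str):
        return cache[concept]
    op, a, b = concept
    if op == "and":
        return instances(a) & instances(b)   # conjunction = intersection
    if op == "or":
        return instances(a) | instances(b)   # disjunction = union
    raise ValueError(f"unknown operator: {op}")

print(sorted(instances(("and", "Student", "Employee"))))  # ['bob']
```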
[LG-48] Quantum Diffusion Models: Score Reversal Is Not Free in Gaussian Dynamics
链接: https://arxiv.org/abs/2603.06488
作者: Ammar Fayad
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:
Abstract:Diffusion-based generative modeling suggests reversing a noising semigroup by adding a score drift. For continuous-variable Gaussian Markov dynamics, complete positivity couples drift and diffusion at the generator level. For a quantum-limited attenuator with thermal parameter \nu and squeezing r , the fixed-diffusion Wigner-score (Bayes) reverse drift violates CP iff \cosh(2r) > \nu . Any Gaussian CP repair must inject extra diffusion, implying -2\ln F \ge c_{\mathrm{geom}}(\nu_{\min})\, I_{\mathrm{dec}}^{\mathrm{wc}} .
[LG-49] U6G XL-MIMO Radiomap Prediction: Multi-Config Dataset and Beam Map Approach
链接: https://arxiv.org/abs/2603.06401
作者: Xiaojie Li,Yu Han,Zhizheng Lu,Shi Jin,Chao-Kai Wen
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:The upper 6 GHz (U6G) band with XL-MIMO is a key enabler for sixth-generation wireless systems, yet intelligent radiomap prediction for such systems remains challenging. Existing datasets support only small-scale arrays (up to 8x8) with predominantly isotropic antennas, far from the 1024-element directional arrays envisioned for 6G. Moreover, current methods encode array configurations as scalar parameters, forcing neural networks to extrapolate array-specific radiation patterns, which fails when predicting radiomaps for configurations absent from training data. To jointly address data scarcity and generalization limitations, this paper advances XL-MIMO radiomap prediction from three aspects. To overcome data limitations, we construct the first XL-MIMO radiomap dataset containing 78400 radiomaps across 800 urban scenes, five frequency bands (1.8-6.7 GHz), and nine array configurations up to 32x32 uniform planar arrays with directional elements. To enable systematic evaluation, we establish a comprehensive benchmark framework covering practical scenarios from coverage estimation without field measurements to generalization across unseen configurations and environments. To enable generalization to arbitrary beam configurations without retraining, we propose the beam map, a physics-informed spatial feature that analytically computes array-specific coverage patterns. By decoupling deterministic array radiation from data-learned multipath propagation, beam maps shift generalization from neural network extrapolation to physics-based computation. Integrating beam maps into existing architectures reduces mean absolute error by up to 60.0% when generalizing to unseen configurations and up to 50.5% when transferring to unseen environments. The complete dataset and code are publicly available at this https URL.
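The physics side of the beam map, analytically computing an array's coverage pattern rather than learning it, can be illustrated with the textbook array factor of a uniform linear array (an illustrative sketch only; the paper's beam maps target 32x32 planar arrays with directional elements):

```python
import numpy as np

def array_factor(theta, n_elem=32, spacing=0.5, steer=0.0):
    """Normalized magnitude of a uniform-linear-array factor.

    `spacing` is the element spacing in wavelengths and `steer` the steering
    angle in radians: a deterministic, data-free coverage pattern.
    """
    n = np.arange(n_elem)
    phase = 2 * np.pi * spacing * n * (np.sin(theta) - np.sin(steer))
    return np.abs(np.exp(1j * phase).sum()) / n_elem

# The pattern peaks (value 1) exactly at the steering angle and falls off
# quickly away from it, which is the array-specific structure a beam map
# supplies to the network instead of asking it to extrapolate.
steer = np.deg2rad(20.0)
print(round(array_factor(steer, steer=steer), 6))          # 1.0
print(array_factor(np.deg2rad(-40.0), steer=steer) < 0.2)  # sidelobe region: True
```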
[LG-50] Robust support vector model based on bounded asymmetric elastic net loss for binary classification
链接: https://arxiv.org/abs/2603.06257
作者: Haiyan Du,Hu Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we propose a novel bounded asymmetric elastic net ( L_{\text{baen}} ) loss function and combine it with the support vector machine (SVM), resulting in the BAEN-SVM. The L_{\text{baen}} loss is bounded and asymmetric and can degrade to the asymmetric elastic net hinge loss, pinball loss, and asymmetric least squares loss. BAEN-SVM not only effectively handles noise-contaminated data but also addresses the geometric irrationalities in the traditional SVM. By proving the violation tolerance upper bound (VTUB) of BAEN-SVM, we show that the model is geometrically well-defined. Furthermore, we derive that the influence function of BAEN-SVM is bounded, providing a theoretical guarantee of its robustness to noise. The Fisher consistency of the model further ensures its generalization capability. Since the L_{\text{baen}} loss is non-convex, we design a clipping dual coordinate descent-based half-quadratic algorithm to solve the non-convex optimization problem efficiently. Experimental results on artificial and benchmark datasets indicate that the proposed method outperforms classical and advanced SVMs, particularly in noisy environments.
[LG-51] SPPCSO: Adaptive Penalized Estimation Method for High-Dimensional Correlated Data
链接: https://arxiv.org/abs/2603.06251
作者: Ying Hu,Hu Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:With the rise of high-dimensional correlated data, multicollinearity poses a significant challenge to model stability, often leading to unstable estimation and reduced predictive accuracy. This work proposes the Single-Parametric Principal Component Selection Operator (SPPCSO), an innovative penalized estimation method that integrates single-parametric principal component regression and L_1 regularization to adaptively adjust the shrinkage factor by incorporating principal component information. This approach achieves a balance between variable selection and coefficient estimation, ensuring model stability and robust estimation even in high-dimensional, high-noise environments. The primary contribution lies in addressing the instability of traditional variable selection methods when applied to high-noise, high-dimensional correlated data. Theoretically, our method exhibits selection consistency and achieves a smaller estimation error bound compared to traditional penalized estimation approaches. Extensive numerical experiments demonstrate that SPPCSO not only delivers stable and reliable estimation in high-noise settings but also accurately distinguishes signal variables from noise variables in group-effect structured data with highly correlated noise variables, effectively eliminating redundant variables and achieving more stable variable selection. Furthermore, SPPCSO successfully identifies disease-associated genes in gene expression data analysis, showcasing strong practical value. The results indicate that SPPCSO serves as an ideal tool for high-dimensional variable selection, offering an efficient and interpretable solution for modeling correlated data.
[LG-52] Random Quadratic Form on a Sphere: Synchronization by Common Noise
链接: https://arxiv.org/abs/2603.06187
作者: Maximilian Engel,Anna Shalova
类目: Probability (math.PR); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:We introduce the Random Quadratic Form (RQF): a stochastic differential equation which formally corresponds to the gradient flow of a random quadratic functional on a sphere. While the one-point dynamics of the system is a Brownian motion and thus has no preferred direction, the two-point motion exhibits nontrivial synchronizing behaviour. In this work we study synchronization of the RQF, namely we give both distributional and path-wise characterizations of the solutions by studying invariant measures and random attractors of the system. The RQF model is motivated by the study of the role of linear layers in transformers and illustrates the synchronization-by-common-noise phenomenon arising in simplified models of transformers. In particular, we provide an alternative (independent of self-attention) explanation of the clustering behaviour in deep transformers and show that tokens cluster even in the absence of the self-attention mechanism.
[LG-53] Random Dot Product Graphs as Dynamical Systems: Limitations and Opportunities
链接: https://arxiv.org/abs/2603.05703
作者: Giulio Valentino Dalla Riva
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 39 pages, 3 figures
Abstract:Can we learn the differential equations governing the evolution of a temporal network? We investigate this within Random Dot Product Graphs (RDPGs), where each network snapshot is generated from latent positions evolving under unknown dynamics. We identify three fundamental obstructions: gauge freedom from rotational ambiguity in latent positions, realizability constraints from the manifold structure of the probability matrix, and trajectory recovery artifacts from spectral embedding. We develop a geometric framework based on principal fiber bundles that formalizes these obstructions. We characterize invisible dynamics as exactly the skew-symmetric generators, and show the realizable tangent space has dimension nd - d(d-1)/2 . A holonomy dichotomy emerges: polynomial dynamics have commuting generators, stationary eigenvectors, and trivial holonomy, making gauge alignment purely statistical; Laplacian dynamics satisfy a non-commutativity criterion producing nontrivial holonomy, with curvature weighted by 1/(\lambda_\iota + \lambda_\gamma) linking gauge sensitivity to the spectral gap. In d=2 this yields full restricted holonomy \mathrm{SO}(2) ; for d \ge 3 generic full \mathrm{SO}(d) remains conjectural. Cramér–Rao lower bounds reveal that the same spectral gap controlling curvature and injectivity simultaneously controls Fisher information, so geometric and statistical difficulty are inextricable. We prove an identifiability principle: symmetric dynamics cannot absorb skew-symmetric gauge contamination, so dynamics structure can resolve gauge ambiguity. We demonstrate this constructively with anchor-based alignment and a UDE pipeline recovering vector fields from noisy graph sequences. Yet finite-sample interactions between noise, gauge, and dynamics expressiveness remain beyond the asymptotic theory. We frame this gap as an open challenge.
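The gauge-freedom obstruction is easy to see concretely: latent positions X and any rotated copy XQ generate the same edge-probability matrix, so a rotation of the latents is invisible in the observed graphs. A minimal numeric check:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((6, 2)) * 0.5           # 6 nodes, d = 2 latent dimensions

# Any planar rotation Q in SO(2).
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# P = X X^T is invariant under X -> X Q, since Q Q^T = I.
P1 = X @ X.T
P2 = (X @ Q) @ (X @ Q).T
print(np.allclose(P1, P2))  # True
```

This is exactly why latent-space dynamics along such rotations (the skew-symmetric generators) cannot be recovered from the graph sequence alone.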
[LG-54] Behavior-dLDS: A decomposed linear dynamical systems model for neural activity partially constrained by behavior
链接: https://arxiv.org/abs/2603.05612
作者: Eva Yezerets,En Yang,Misha B. Ahrens,Adam S. Charles
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Brain-wide recordings of large-scale networks of neurons now provide an unprecedented view into how the brain drives behavior. However, brain activity contains both information directly related to behavior as well as the potential for many internal computations. Moreover, observable behavior is executed not only by the brain, but also by the spinal cord and peripheral nervous system. Behavior is a coarse-grained product of neural activity, and we thus take the view that it can be best represented by lower-dimensional latent neural dynamics. Capturing this indirect relationship while disambiguating behavior-generating networks from internal computations running in parallel requires new modeling approaches that can embody the parallel and distributed nature of large-scale neural populations. We thus present behavior-decomposed linear dynamical systems (b-dLDS) to disentangle simultaneously recorded subsystems and identify how the latent neural subsystems relate to behavior. We demonstrate the ability of b-dLDS to decouple behavioral vs. internal computations on controlled, simulated data, showing improvements over a state-of-the-art model that uses behavior to supervise all dynamics. We then show that b-dLDS can further scale up to tens of thousands of neurons by applying our model to a large-scale recording of the zebrafish hindbrain during the complex positional homeostasis behavior, wherein b-dLDS highlights behavior-related dynamic connectivity networks.
[LG-55] Prediction-Powered Conditional Inference
链接: https://arxiv.org/abs/2603.05575
作者: Yang Sui,Jin Zhou,Hua Zhou,Xiaowu Dai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study prediction-powered conditional inference in the setting where labeled data are scarce, unlabeled covariates are abundant, and a black-box machine-learning predictor is available. The goal is to perform statistical inference on conditional functionals evaluated at a fixed test point, such as conditional means, without imposing a parametric model for the conditional relationship. Our approach combines localization with prediction-based variance reduction. First, we introduce a reproducing kernel-based localization method that learns a data-adaptive weight function from covariates and reformulates the target conditional moment at the test point as a weighted unconditional moment. Second, we incorporate machine-learning predictions through a correction-based decomposition of this localized moment, yielding a prediction-powered estimator and confidence interval that reduce variance when the predictor is informative while preserving validity regardless of predictor accuracy. We establish nonasymptotic error bounds and minimax-optimal convergence rates for the resulting estimator, prove pointwise asymptotic normality with consistent variance estimation, and provide an explicit variance decomposition that characterizes how machine-learning predictions and unlabeled covariates improve statistical efficiency. Numerical experiments on simulated and real datasets demonstrate valid conditional coverage and substantially sharper confidence intervals than alternative methods.
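The correction-based decomposition echoes the classic prediction-powered mean estimator: a plug-in mean over abundant unlabeled data plus a label-based correction for predictor bias. A sketch of that unconditional building block (not the paper's localized, kernel-weighted conditional version; the data and predictor below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setting: few labels, many unlabeled covariates, and a black-box
# predictor f that is informative but biased.
n_lab, n_unlab = 50, 50000
x_lab = rng.normal(size=n_lab)
x_unlab = rng.normal(size=n_unlab)
f = lambda x: 2.0 * x + 0.3                      # biased predictor
y_lab = 2.0 * x_lab + rng.normal(size=n_lab)     # true mean of Y is 0

# Prediction-powered estimate of E[Y]: plug-in term on unlabeled data,
# corrected by the average labeled residual. Valid regardless of f's
# accuracy; low-variance when f is informative.
theta_pp = f(x_unlab).mean() + (y_lab - f(x_lab)).mean()
theta_classical = y_lab.mean()
print(round(theta_pp, 3), round(theta_classical, 3))
```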
[LG-56] Machine Learning for analysis of Multiple Sclerosis cross-tissue bulk and single-cell transcriptomics data
链接: https://arxiv.org/abs/2603.05572
作者: Francesco Massafra,Samuele Punzo,Silvia Giulia Galfré,Alessandro Maglione,Simone Pernice,Stefano Forti,Simona Rolla,Marco Beccuti,Marinella Clerico,Corrado Priami,Alina Sîrbu
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:Multiple Sclerosis (MS) is a chronic autoimmune disease of the central nervous system whose molecular mechanisms remain incompletely understood. In this study, we developed an end-to-end machine learning pipeline to analyze transcriptomic data from peripheral blood mononuclear cells and cerebrospinal fluid, integrating both bulk microarray and single-cell RNA sequencing datasets (concentrating on CD4+ and B-cells). After rigorous preprocessing, batch correction, and gene declustering, XGBoost classifiers were trained to distinguish MS patients from healthy controls. Explainable AI tools, namely SHapley Additive exPlanations (SHAP), were employed to identify key genes driving classification, and results were compared with Differential Expression Analysis (DEA). SHAP-prioritized genes were further investigated through interaction networks and pathway enrichment analyses. The models achieved strong performance, particularly in CSF B-cells (AUC=0.94) and microarray (AUC=0.86). SHAP gene selection proved to be complementary to classical DEA. Gene clusters identified across multiple datasets highlighted immune activation, non-canonical immune checkpoints (ITK, CLEC2D, KLRG1, CEACAM1), ribosomal and translational programs, ubiquitin-proteasome regulation, lipid trafficking, and Epstein-Barr virus-related pathways. Our integrative and explainable framework reveals complementary insights beyond conventional analysis and provides novel mechanistic hypotheses and potential biomarkers for MS pathogenesis.
[LG-57] Learning Optimal Distributionally Robust Individualized Treatment Rules Integrating Multi-Source Data
链接: https://arxiv.org/abs/2603.05568
作者: Wenhai Cui,Wen Su,Xingqiu Zhao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Integrative analysis of multiple datasets for estimating optimal individualized treatment rules (ITRs) can enhance decision efficiency. A central challenge is posterior shift, wherein the conditional distribution of potential outcomes given covariates differs between source and target populations. We propose a prior information-based distributionally robust ITR (PDRO-ITR) that maximizes the worst-case policy value over a covariate-dependent distributional uncertainty set, ensuring robust performance under posterior shift. The uncertainty set is constructed as an individualized combination of source distributions, with weights combining prior source-membership probabilities and deviation terms constrained to the probability simplex to accommodate posterior shift. We derive a closed-form solution for the PDRO-ITR and develop an adaptive procedure to tune the uncertainty level. We establish risk bounds for the PDRO-ITR estimator, which guarantees robust performance under the worst case. Extensive simulations and two real-data applications demonstrate that the proposed method achieves superior performance compared to existing approaches.
[LG-58] An intuitive rearranging of the Yates covariance decomposition for probabilistic verification of forecasts with the Brier score
链接: https://arxiv.org/abs/2603.05544
作者: Bruno Hebling Vieira(Methods of Plasticity Research, Department of Psychology, University of Zurich, Zurich, Switzerland)
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 4 pages, 0 figures
Abstract:Proper scoring rules are essential for evaluating probabilistic forecasts. We propose a simple algebraic rearrangement of the Yates covariance decomposition of the Brier score into three independently non-negative terms: a variance mismatch term, a correlation deficit term, and a calibration-in-the-large term. This rearrangement makes the optimality conditions for perfect forecasting transparent: the optimal forecast must simultaneously match the variance of outcomes, achieve perfect positive correlation with outcomes, and match the mean of outcomes. Any deviation from these conditions results in a positive contribution to the Brier score.
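The rearrangement can be checked numerically on toy data; with population (ddof = 0) statistics the identity is exact:

```python
import numpy as np

f = np.array([0.9, 0.7, 0.2, 0.4, 0.8])   # forecast probabilities
o = np.array([1.0, 1.0, 0.0, 0.0, 1.0])   # binary outcomes

bs = np.mean((f - o) ** 2)                 # Brier score, computed directly

# Three independently non-negative terms:
sf, so = f.std(), o.std()                  # population standard deviations
rho = np.corrcoef(f, o)[0, 1]
variance_mismatch   = (sf - so) ** 2       # zero iff sd(f) = sd(o)
correlation_deficit = 2 * sf * so * (1 - rho)   # zero iff rho = 1
calibration         = (f.mean() - o.mean()) ** 2  # zero iff means match

print(np.isclose(bs, variance_mismatch + correlation_deficit + calibration))  # True
```

Each term vanishing corresponds to one of the three optimality conditions named in the abstract: matched variance, perfect positive correlation, and matched mean.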
[LG-59] A mixed-frequency approach for exchange rates predictions
链接: https://arxiv.org/abs/2106.00283
作者: Raffaele Mattera,Michelangelo Misuraca,Germana Scepi,Maria Spano
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Selecting an appropriate statistical model to forecast exchange rates is still a relevant issue today for policymakers and central bankers. The so-called Meese and Rogoff puzzle asserts that exchange rate fluctuations are unpredictable. In the literature, many studies have tried to solve the puzzle by finding alternative predictors and statistical models based on temporal aggregation. In this paper, we propose an approach based on mixed-frequency models to overcome the loss of information caused by temporal aggregation. We show the effectiveness of our approach in comparison with other proposed methods by performing CAD/USD exchange rate predictions.
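A standard ingredient of mixed-frequency (MIDAS) regression is a parsimonious weight function over high-frequency lags, commonly the exponential Almon polynomial sketched below (a generic illustration; the abstract does not specify the authors' exact weighting scheme, and the daily rates here are hypothetical):

```python
import numpy as np

def exp_almon_weights(n_lags, theta1, theta2):
    """Exponential Almon lag polynomial, a common MIDAS weighting scheme:
    a smooth, normalized weight profile over high-frequency lags governed
    by just two parameters."""
    k = np.arange(1, n_lags + 1)
    w = np.exp(theta1 * k + theta2 * k ** 2)
    return w / w.sum()

# Aggregate, e.g., 22 daily observations into one monthly regressor
# without discarding the within-month information.
w = exp_almon_weights(22, 0.05, -0.01)
daily = np.linspace(1.30, 1.35, 22)        # hypothetical daily CAD/USD rates
monthly_regressor = float(w @ daily)
print(round(w.sum(), 6))  # 1.0
```

The two theta parameters are estimated jointly with the regression coefficients, which is how MIDAS avoids both temporal aggregation and an unrestricted lag polynomial.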