This blog post lists the latest papers retrieved from Arxiv.org on 2026-03-16. It is updated automatically and organized into the major areas of NLP, CV, ML, AI, IR, MA, and HC.

Note: paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each day.

Tip: if no update appears on a given day, either Arxiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.

Table of Contents

Overview (2026-03-16)

A total of 491 papers are updated today (papers cross-listed in multiple categories are counted in each), including:

  • Computation and Language (cs.CL): 59 papers
  • Artificial Intelligence (cs.AI): 135 papers
  • Computer Vision and Pattern Recognition (cs.CV): 146 papers
  • Machine Learning (cs.LG): 150 papers
  • Multiagent Systems (cs.MA): 7 papers
  • Information Retrieval (cs.IR): 16 papers
  • Human-Computer Interaction (cs.HC): 29 papers

Multiagent Systems

[MA-0] A Generative Model of Conspicuous Consumption and Status Signaling

[Quick Read]: This paper asks why particular goods, signals, or behaviors acquire status, and how that prestige is generated and evolves dynamically through social interaction. Classical frameworks such as Costly Signaling Theory assume fixed preferences and therefore struggle to explain how semiotic meaning shifts with context or drifts over time, in particular when cultural trends reach tipping points. The key to the solution is a computational model grounded in the theory of appropriateness, which posits that status symbols emerge endogenously through a feedback loop between social observation and predictive pattern completion. The study simulates populations of Large Language Model (LLM)-based agents in the Concordia framework while manipulating social visibility, showing that social interaction can transform functional demand into status-seeking behavior; it further finds that price run-ups and positive price elasticity (Veblen effects) appear for both real-world luxury goods and procedurally generated goods, ruling out pretraining bias as the sole driver. This mechanism provides a generative bridge between micro-level cognition and macro-level economic and sociological phenomena.

Link: https://arxiv.org/abs/2603.13220
Authors: Logan Cross, Jordi Grau-Moya, William A. Cunningham, Alexander Sasha Vezhnevets, Joel Z. Leibo
Affiliations: Google DeepMind
Categories: Multiagent Systems (cs.MA)
Comments: 29 pages, 13 figures

Abstract:Status signaling drives human behavior and the allocation of scarce resources such as mating opportunities, yet the generative mechanisms governing how specific goods, signals, or behaviors acquire prestige remain a puzzle. Classical frameworks, such as Costly Signaling Theory, treat preferences as fixed and struggle to explain how semiotic meaning changes based on context or drifts dynamically over time, occasionally reaching tipping points. In this work, we propose a computational theory of status grounded in the theory of appropriateness, positing that status symbols emerge endogenously through a feedback loop of social observation and predictive pattern completion. We validate this theory using simulations of groups of Large Language Model (LLM)-based agents in the Concordia framework. By experimentally manipulating social visibility within naturalistic agent daily routines, we demonstrate that social interactions transform functional demand into status-seeking behavior. We observe the emergence of price run-ups and positive price elasticity (Veblen effects) for both real-world luxury items and procedurally generated synthetic goods, ruling out pretraining bias as the sole driver. Furthermore, we demonstrate that “influencer” agents can drive the endogenous formation of distinct subcultures through targeted sanctioning, and find that similar social influence effects generalize to non-monetary signaling behaviors. This work provides a generative bridge between micro-level cognition and macro-level economic and sociological phenomena, offering a new methodology for forecasting how cultural conventions emerge from interaction.

[MA-1] LLM Constitutional Multi-Agent Governance

[Quick Read]: This paper addresses the ethical risks that arise when Large Language Models (LLMs) are used to induce cooperation in multi-agent systems: does such cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? The key to the solution is the Constitutional Multi-Agent Governance (CMAG) framework, a two-stage mechanism: a hard-constraint filter first rules out clearly violating policies, then a soft-penalty objective maximizes cooperation potential while minimizing manipulation risk and autonomy pressure. The framework also introduces the Ethical Cooperation Score (ECS), a multiplicative composite metric that penalizes cooperation achieved through manipulative means, ensuring that LLM-mediated influence leads to ethically stable outcomes rather than manipulative equilibria.

Link: https://arxiv.org/abs/2603.13189
Authors: J. de Curtò, I. de Zarzà
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in 20th International Conference on Agents and Multi-Agent Systems: Technologies and Applications (AMSTA 2026), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer

Abstract:Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi-Agent Governance (CMAG), a two-stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized-utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale-free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation-autonomy trade-off space, and governance reduces hub-periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM-mediated influence produces ethically stable outcomes rather than manipulative equilibria.

[MA-2] Conflict Mitigation in Shared Environments using Flow-Aware Multi-Agent Path Finding ICRA2026

[Quick Read]: This paper tackles the problem of individual robot tasks being delayed by unforeseen conflicts when multi-robot systems share an environment with dynamic, uncontrollable agents, especially at large fleet scales. Existing work focuses on preserving the completeness of Multi-Agent Path Finding (MAPF) solutions under delays, but rarely exploits additional environmental information to improve solution quality in the presence of other dynamic agents. The key to the proposed framework, Flow-Aware Multi-Agent Path Finding (FA-MAPF), is to integrate learned motion patterns of uncontrollable agents into centralized MAPF algorithms; experiments show it reduces conflicts with uncontrollable agents by up to 55% without compromising task efficiency.

Link: https://arxiv.org/abs/2603.12736
Authors: Lukas Heuer, Yufei Zhu, Luigi Palmieri, Andrey Rudenko, Anna Mannucci, Sven Koenig, Martin Magnusson
Affiliations: Örebro University; Bosch Research; Technical University of Munich; University of California, Irvine
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: To be presented at ICRA 2026

Abstract:Deploying multi-robot systems in environments shared with dynamic and uncontrollable agents presents significant challenges, especially for large robot fleets. In such environments, individual robot operations can be delayed due to unforeseen conflicts with uncontrollable agents. While existing research primarily focuses on preserving the completeness of Multi-Agent Path Finding (MAPF) solutions considering delays, there is limited emphasis on utilizing additional environmental information to enhance solution quality in the presence of other dynamic agents. To this end, we propose Flow-Aware Multi-Agent Path Finding (FA-MAPF), a novel framework that integrates learned motion patterns of uncontrollable agents into centralized MAPF algorithms. Our evaluation, conducted on a diverse set of benchmark maps with simulated uncontrollable agents and on a real-world map with recorded human trajectories, demonstrates the effectiveness of FA-MAPF compared to state-of-the-art baselines. The experimental results show that FA-MAPF can consistently reduce conflicts with uncontrollable agents, up to 55%, without compromising task efficiency.

[MA-3] Collaborative Multi-Agent Optimization for Personalized Memory System

[Quick Read]: This paper addresses the suboptimal global performance of multi-agent memory systems caused by optimizing each agent independently. Existing methods specialize agents on their local tasks (e.g., memory construction and personalized retrieval) via prompt engineering or fine-tuning, but overlook cross-agent collaboration, so overall system effectiveness is hard to guarantee. The key to the solution is the Collaborative Reinforcement Learning Framework for Multi-Agent Memory Systems (CoMAM), which models the agents' execution as a sequential Markov decision process (MDP) that embeds inter-agent dependencies into the state transition, and combines local rewards (e.g., information coverage) with global rewards (e.g., query-answer accuracy). By quantifying each agent's group-level ranking consistency between local and global rewards, CoMAM assigns adaptive global-credit weights and integrates local and global rewards, aligning each agent's improvement with overall system performance.

Link: https://arxiv.org/abs/2603.12631
Authors: Wenyu Mao, Haoyang Liu, Zhao Liu, Haosong Tan, Yaorui Shi, Jiancan Wu, An Zhang, Xiang Wang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:Memory systems are crucial to personalized LLMs by mitigating the context window limitation in capturing long-term user-LLM conversations. Typically, such systems leverage multiple agents to handle multi-granular memory construction and personalized memory retrieval tasks. To optimize the system, existing methods focus on specializing agents on their local tasks independently via prompt engineering or fine-tuning. However, they overlook cross-agent collaboration, where independent optimization on local agents hardly guarantees the global system performance. To address this issue, we propose a Collaborative Reinforcement Learning Framework for Multi-Agent Memory Systems (CoMAM), jointly optimizing local agents to facilitate collaboration. Specifically, we regularize agents’ execution as a sequential Markov decision process (MDP) to embed inter-agent dependencies into the state transition, yielding both local task rewards (e.g., information coverage for memory construction) and global rewards (i.e., query-answer accuracy). Then, we quantify each agent’s contribution via group-level ranking consistency between local and global rewards, treating them as adaptive weights to assign global credit and integrate local-global rewards. Each agent is optimized by these integrated rewards, aligning local improvements with the global performance. Experiments show CoMAM outperforms leading memory systems, validating the efficacy of our proposed collaborative reinforcement learning for joint optimization.
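
The credit-assignment idea above — weight each agent's global credit by the rank consistency between its local rewards and the global reward over a group of rollouts — can be sketched as follows. The Spearman-style consistency measure and the linear mixing rule are illustrative assumptions, not the paper's exact formulation:

```python
def rank_consistency(local_rewards, global_rewards):
    """Spearman's rho (no-ties case) between an agent's local rewards and
    the global rewards over a group of rollouts, mapped to [0, 1]."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    n = len(local_rewards)
    rl, rg = ranks(local_rewards), ranks(global_rewards)
    d2 = sum((a - b) ** 2 for a, b in zip(rl, rg))
    rho = 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))  # Spearman's rho, no ties
    return (rho + 1.0) / 2.0  # map [-1, 1] -> [0, 1]

def integrated_reward(local, global_, weight):
    """Blend local and global rewards; an agent whose local signal tracks
    the global outcome receives more global credit."""
    return (1.0 - weight) * local + weight * global_
```

An agent whose local-reward ranking matches the global ranking gets weight 1.0 (full global credit); a fully anti-correlated agent gets 0.0 and is optimized mostly on its local task reward.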

[MA-4] Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs ICLR2025

[Quick Read]: This paper addresses the lack of high-quality, knowledge-rich, well-aligned image-text data for visual-design applications of multimodal AI systems. Although image and text data are abundant on the internet, semantically consistent and structurally informative image-text pairs are rare, limiting training effectiveness. The key to the solution is a scalable diagram generation pipeline built around the agent Feynman: it first enumerates domain-specific knowledge components ("ideas") and plans code; it then translates the ideas into simple declarative programs, iteratively refining the visuals with feedback; finally, the Penrose rendering system renders the diagrams via optimization, yielding diagrams with both visual-semantic consistency and layout diversity. This approach drastically reduces the cost and time of producing high-quality diagram-caption pairs and was used to build a dataset of 100k+ aligned examples and a vision-language benchmark named Diagramma for evaluating visual reasoning.

Link: https://arxiv.org/abs/2603.12597
Authors: Zixin Wen, Yifu Cai, Kyle Lee, Sam Estep, Josh Sunshine, Aarti Singh, Yuejie Chi, Wode Ni
Affiliations: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: A previous version was submitted to ICLR 2025

Abstract:Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components (“ideas”) and performs code planning based on the ideas. Given the plan, Feynman translates ideas into simple declarative programs and iterates to receive feedback and visually refine diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. The optimization-based rendering of Penrose preserves the visual semantics while injecting fresh randomness into the layout, thereby producing diagrams with visual consistency and diversity. As a result, Feynman can author diagrams along with grounded captions with very little cost and time. Using Feynman, we synthesized a dataset with more than 100k well-aligned diagram-caption pairs. We also curate a visual-language benchmark, Diagramma, from freshly generated data. Diagramma can be used for evaluating the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and the full agent pipeline as an open-source project.

[MA-5] VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

[Quick Read]: This paper addresses the challenge of aligning video generation models with complex user intent; existing test-time optimization methods are either computationally expensive or require white-box access to model internals. The key to the solution is the VQQA (Video Quality Question Answering) framework, which dynamically generates visual questions and uses the resulting Vision-Language Model (VLM) critiques as semantic gradients, replacing traditional passive evaluation metrics with interpretable, actionable, human-understandable feedback and thereby enabling an efficient closed-loop prompt optimization process over a black-box natural-language interface.

Link: https://arxiv.org/abs/2603.12310
Authors: Yiwen Song, Tomas Pfister, Yale Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.

[MA-6] DIALECTIC: A Multi-Agent System for Startup Evaluation EACL2026

[Quick Read]: This paper addresses the trade-off venture capital (VC) firms face between evaluation diligence and the number of opportunities assessed during early-stage screening, where investor bandwidth limits how thoroughly candidate startups can be evaluated. The key to the solution is DIALECTIC, an LLM-based multi-agent system that organizes factual knowledge into a hierarchical question tree, synthesizes natural-language arguments for and against an investment, and iteratively critiques and refines them through a simulated debate that surfaces only the most convincing arguments; it also outputs numeric decision scores to support prioritization. Backtesting on real investment opportunities from VC funds shows the system matches the precision of human investors in predicting startup success.

Link: https://arxiv.org/abs/2603.12274
Authors: Jae Yoon Bae, Simon Malberg, Joyce Galang, Andre Retterath, Georg Groh
Affiliations: Technical University of Munich; Earlybird Venture Capital; UVC Partners
Categories: Multiagent Systems (cs.MA); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: Accepted at EACL 2026 Industry Track

Abstract:Venture capital (VC) investors face a large number of investment opportunities but only invest in few of these, with even fewer ending up successful. Early-stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM-based multi-agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural-language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.

Natural Language Processing

[NLP-0] Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

[Quick Read]: This paper addresses the data inefficiency of instruction tuning (IT): how to efficiently select a small set of high-quality samples from a large IT dataset to improve specific or general capabilities of large language models (LLMs). The core challenge is that excessive IT data can degrade performance, while a carefully selected small high-quality subset can significantly enhance capabilities. The key to the solution is a framework named NAIT, which estimates the impact of IT data on a target capability by analyzing the similarity of neuron activation patterns: NAIT first extracts reusable, transferable neuron activation features from in-domain data for the target capability, then selects samples by their similarity to the expected activation features. Experiments show that a 10% subset of Alpaca-GPT4 selected this way outperforms methods relying on external advanced models or uncertainty-based features, and reveal strong transferability of neuron activation features across capabilities, with logical-reasoning and programming data generalizing especially broadly.

Link: https://arxiv.org/abs/2603.13201
Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Min Yang, Shujian Huang, Lidia S. Chao, Derek F. Wong
Affiliations: University of Macau; Shenzhen University of Advanced Technology; Chinese Academy of Sciences; KAUST; Nanjing University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most efficient subset data from the IT dataset to effectively develop either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10% Alpaca-GPT4 IT data subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.
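
The selection rule described above — score candidates by the similarity between their neuron-activation features and a target-capability activation profile, then keep the top fraction — can be sketched generically. Extracting real activation features from a model is elided here; the cosine scoring and top-fraction cutoff are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_by_activation_similarity(candidates, target_profile, fraction=0.1):
    """Rank candidate samples by similarity between their activation features
    and the target capability's expected profile, keeping the top fraction
    (e.g. 10%, as in the Alpaca-GPT4 experiment)."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(c["activations"], target_profile),
                    reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

In practice the `activations` field would be a per-sample neuron-activation feature vector from the LLM; here it is just a stand-in list of floats.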

[NLP-1] Semantic Invariance in Agentic AI

[Quick Read]: This paper addresses the instability of Large Language Model (LLM) reasoning under semantically equivalent but superficially different inputs, i.e., the lack of robustness guarantees against semantics-preserving input perturbations. The key to the solution is a metamorphic testing framework that systematically evaluates the stability of LLM reasoning agents under eight semantics-preserving transformations (e.g., paraphrase, fact reordering, expansion, contraction, and academic/business context shifts), validated on seven foundation models and 19 multi-step reasoning problems; the results show that model scale does not predict robustness, with the smaller Qwen3-30B-A3B achieving the highest reasoning stability (79.6% invariant responses, semantic similarity 0.91).

Link: https://arxiv.org/abs/2603.13173
Authors: I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for publication in 20th International Conference on Agents and Multi-Agent Systems: Technologies and Applications (AMSTA 2026), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer

Abstract:Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.

[NLP-2] ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation AAAI2026

[Quick Read]: This paper addresses the difficulty of reliably automating the analysis of lengthy, complex corporate environmental, social, and governance (ESG) reports, with a focus on hallucinations produced by large language models (LLMs) when processing ESG content. The key to the solution is the ESG-Bench benchmark, which contains human-annotated question-answer pairs grounded in real ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated; the authors further design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune several state-of-the-art LLMs on CoT-annotated rationales, substantially reducing hallucination rates and yielding gains that transfer to QA benchmarks beyond the ESG domain.

Link: https://arxiv.org/abs/2603.13154
Authors: Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang, Xingyi Song
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in the AAAI 2026 proceedings

Abstract:As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms’ long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs’ ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.

[NLP-3] Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

[Quick Read]: This paper addresses the weak machine-translation performance of large language models (LLMs) on low-resource language pairs. Existing post-training methods depend heavily on high-quality parallel data, which is scarce or unavailable for low-resource languages. The proposed method, WALAR, trains with reinforcement learning (RL) using only monolingual text, substantially improving translation on many low-resource languages without sacrificing high-resource performance. The key is identifying and mitigating failure modes ("holes") in existing source-based multilingual quality estimation (QE) models, which RL training tends to amplify; WALAR introduces word-alignment and language-alignment techniques to reduce this bias in the reward signal, improving the robustness and accuracy of multilingual translation.

Link: https://arxiv.org/abs/2603.13045
Authors: Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: Our code is available at this https URL

Abstract:Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs’ translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or “holes”) in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR’s reward for RL training. We continually train an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1400 language directions on the Flores-101 dataset.

[NLP-4] Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse ACL2026

[Quick Read]: This paper addresses the lack of a systematic criterion for choosing the number of dimensions retained in the PCA step of the Supervised Semantic Differential (SSD) method, which introduces avoidable researcher degrees of freedom into the analysis pipeline and can affect the stability and interpretability of results. The key to the solution is a "PCA sweep" procedure that treats dimensionality selection as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K, improving transparency and psychological meaningfulness while preserving SSD's interpretive aims.

Link: https://arxiv.org/abs/2603.13038
Authors: Hubert Plisiecki, Maria Leniarska, Jan Piotrowski, Marcin Zajenkowski
Affiliations: IDEAS Research Institute; VIZJA University; Warsaw University of Technology; University of Warsaw
Categories: Computation and Language (cs.CL)
Comments: Submitted to ACL 2026

Abstract:Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-difference variables by estimating a semantic gradient in an embedding space and interpreting its poles through clustering and text retrieval. SSD applies PCA before regression, but currently no systematic method exists for choosing the number of retained components, introducing avoidable researcher degrees of freedom in the analysis pipeline. We propose a PCA sweep procedure that treats dimensionality selection as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K. We illustrate the method on a corpus of short posts about artificial intelligence written by Prolific participants who also completed Admiration and Rivalry narcissism scales. The sweep yields a stable, interpretable Admiration-related gradient contrasting optimistic, collaborative framings of AI with distrustful and derisive discourse, while no robust alignment emerges for Rivalry. We also show that a counterfactual using a high-PCA dimension solution heuristic produces diffuse, weakly structured clusters instead, reinforcing the value of the sweep-based choice of K. The case study shows how the PCA sweep constrains researcher degrees of freedom while preserving SSD’s interpretive aims, supporting transparent and psychologically meaningful analyses of connotative meaning.
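
A sweep over candidate K values with a joint criterion, as described above, can be sketched with a simple explained-variance-times-stability score. The actual SSD criterion also includes gradient interpretability, which is not approximated here; the particular scoring rule (capacity × subspace stability across two random halves) is an illustrative assumption:

```python
import numpy as np

def subspace_overlap(V1, V2):
    """Mean squared cosine of the principal angles between two K-dim
    subspaces given as K x d matrices with orthonormal rows (in [0, 1])."""
    k = V1.shape[0]
    return float(np.linalg.norm(V1 @ V2.T) ** 2) / k

def pca_sweep(X, ks, seed=0):
    """Score each candidate K jointly: explained variance on the full data
    (capacity) times stability of the K-dim principal subspace across two
    random halves of the sample."""
    X = np.asarray(X, dtype=float)
    _, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    var_ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    Vts = []
    for part in (idx[:half], idx[half:]):
        H = X[part]
        _, _, Vt = np.linalg.svd(H - H.mean(axis=0), full_matrices=False)
        Vts.append(Vt)
    return {k: float(var_ratio[k - 1]) * subspace_overlap(Vts[0][:k], Vts[1][:k])
            for k in ks}
```

A stable, well-chosen K should score near the top of the sweep; a K whose subspace changes sharply between resamples is penalized even if it explains more variance.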

[NLP-5] daVinci-Env: Open SWE Environment Synthesis at Scale

[Quick Read]: This paper addresses the lack of large-scale, executable, and verifiable environments for training software engineering (SWE) agents: existing open-source datasets are limited in scale and repository diversity, while industrial solutions are closed and hard to reproduce, hindering academic research. The core solution is OpenSWE, a fully transparent Python SWE agent training framework comprising 45,320 executable Docker environments covering 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure open-sourced for reproducibility. The key innovations are a multi-agent synthesis pipeline and a quality-centric filtering mechanism that automatically generates environments and retains high-quality instances of appropriate difficulty to maximize learning efficiency, ultimately achieving 66.0% on SWE-bench Verified (OpenSWE-72B) together with substantial out-of-domain gains (e.g., +12 points on mathematical reasoning).

Link: https://arxiv.org/abs/2603.13023
Authors: Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE’s effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.

[NLP-6] Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

[Quick Read]: This paper addresses the reliance of Error Span Detection (ESD) in machine translation evaluation on expensive and often inconsistent human annotation. Existing methods improve ESD by fine-tuning on human-annotated data, which is costly and suffers from annotator disagreement. The authors propose a self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, whose key innovation is using an off-the-shelf large language model (LLM) to generate pseudo-labels, removing the need for human annotation entirely; experiments show that models trained solely on these automatically generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.

Link: https://arxiv.org/abs/2603.12983
Authors: Boxuan Lyu, Haiyue Song, Zhi Qu
Affiliations: Institute of Science Tokyo; National Institute of Information and Communications Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Our experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
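
The MBR decoding step at the core of the framework above picks, from a set of model samples, the candidate with the highest expected utility (lowest expected risk) against the other samples. A minimal sketch with a toy string utility follows; the actual framework applies this to error-span annotations rather than plain strings, and the unigram-F1 utility is an illustrative assumption:

```python
def mbr_select(candidates, utility):
    """Return the candidate maximizing its average utility against all other
    samples, i.e. minimizing expected risk under the model's own samples."""
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = None, float("-inf")
    for c in candidates:
        score = sum(utility(c, h) for h in candidates if h is not c)
        score /= len(candidates) - 1
        if score > best_score:
            best, best_score = c, score
    return best

def token_overlap(a, b):
    """Toy utility: unigram F1 between two whitespace-tokenized strings."""
    ta, tb = set(a.split()), set(b.split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)
```

In the distillation loop, the MBR-selected output becomes the pseudo-label that the student model is fine-tuned on, and the process iterates.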

[NLP-7] Long-form RewardBench: Evaluating Reward Models for Long-form Generation AAAI2026

[Quick Read]: This paper addresses the serious gap in evaluating reward models on long-form generation, a setting of critical practical importance. The key to the solution is Long-form RewardBench, the first benchmark designed specifically for long-form generation, covering five core subtasks (QA, RAG, Chat, Writing, and Reasoning) with high-quality instruction and preference data collected through a multi-stage pipeline. On this basis, the authors run systematic experiments on 20+ mainstream reward models (both classifiers and generative models) and find that current models still lack long-form reward modeling capability; they further propose a Long-form Needle-in-a-Haystack Test revealing correlations between reward-modeling performance and both the error's position within a response and the overall response length, and show that classifier models generalize better than generative models trained on the same data, providing a quantifiable evaluation framework and key insights for the area.

Link: https://arxiv.org/abs/2603.12963
Authors: Hui Huang, Yancheng He, Wei Liu, Muyun Yang, Jiaheng Liu, Kehai Chen, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026

Abstract:The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error’s position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.

[NLP-8] DS2-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域适应过程中高质量指令微调数据稀缺的问题,现有数据合成方法难以捕捉领域特有的术语和推理模式。其解决方案的关键在于提出一个零样本框架 DS²-Instruct,通过三个核心步骤实现:首先生成任务感知的关键词以确保领域覆盖度;其次将这些关键词与布卢姆分类学(Bloom’s Taxonomy)中不同认知层级配对,生成多样化的指令;最后采用自一致性验证机制保障数据质量。该方法无需人工标注即可生成高质量领域专用指令数据集,在数学、金融和逻辑推理等七个挑战性领域验证了其有效性。

链接: https://arxiv.org/abs/2603.12932
作者: Ruiyao Xu,Noelle I. Samia,Han Liu
机构: Northwestern University (西北大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS²-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom’s Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.
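
摘要中“关键词 × 认知层级”的配对步骤可以用如下示意代码勾勒(模板文案与具体生成方式均为本文假设,仅展示组合逻辑,并非论文的提示词):

```python
import itertools

# 布卢姆分类学六个认知层级对应的指令模板(文案为示意,非论文原文)
TEMPLATES = {
    "remember":   "请给出术语“{kw}”的定义。",
    "understand": "请用自己的话解释“{kw}”。",
    "apply":      "请设计一道需要用到“{kw}”的应用题并求解。",
    "analyze":    "请将“{kw}”与一个相关概念进行对比分析。",
    "evaluate":   "请评估“{kw}”这一概念的适用范围与局限。",
    "create":     "请构造一个运用“{kw}”的新场景。",
}

def generate_instructions(keywords):
    """将每个领域关键词与每个认知层级配对,生成候选指令集合。"""
    return [TEMPLATES[level].format(kw=kw)
            for kw, level in itertools.product(keywords, TEMPLATES)]
```

例如输入 2 个金融关键词即得到 2 × 6 = 12 条覆盖不同认知深度的候选指令,再交由论文中的自一致性验证步骤过滤。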

[NLP-9] HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

【速读】: 该论文旨在解决社交媒体中网络霸凌(cyberbullying)检测在多语言和多标签场景下的局限性问题,现有方法常受限于单语假设或单一任务建模,难以应对跨语言、多类别重叠的现实复杂情况。其解决方案的关键在于提出HMS-BERT框架——一个基于预训练多语言BERT(multilingual BERT)的混合多任务自训练模型,通过融合上下文表示与人工设计的语言学特征,联合优化细粒度多标签滥用分类任务与三分类主分类任务,并引入基于置信度的伪标签自训练策略以缓解低资源语言中的标注数据稀缺问题,从而实现跨语言知识迁移与性能提升。

链接: https://arxiv.org/abs/2603.12920
作者: Zixin Feng,Xinying Cui,Yifan Sun,Zheng Wei,Jiachen Yuan,Jiazhen Hu,Ning Xin,Md Maruf Hasan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.
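
摘要中的置信度伪标签自训练流程,可用如下极简草稿示意(`predict` 为分类器替身,阈值与轮数均为示意参数,并非论文设置):

```python
def self_train(labeled, unlabeled, predict, n_rounds=3, tau=0.9):
    """基于置信度阈值 tau 的迭代自训练:高置信度样本连同伪标签并入训练集。"""
    train, pool = list(labeled), list(unlabeled)
    for _ in range(n_rounds):
        remaining = []
        for x in pool:
            label, conf = predict(x, train)
            if conf >= tau:
                train.append((x, label))   # 接受伪标签
            else:
                remaining.append(x)        # 留待后续轮次
        if len(remaining) == len(pool):    # 本轮无新增伪标签,提前停止
            break
        pool = remaining
    return train, pool

def toy_predict(text, train):
    """分类器替身,仅作演示;真实系统中由微调后的多语言 BERT 给出预测与置信度。"""
    if "辱骂" in text:
        return "abusive", 0.95
    return "normal", 0.50
```

每一轮都会用扩充后的训练集重新预测剩余无标注样本,这正是低资源语言借助高资源语言知识迁移的通道。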

[NLP-10] Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study EACL2026

【速读】: 该论文旨在解决当前发展性语言模型研究主要集中在英语语境下,缺乏对多语言场景下紧凑型语言模型系统性分析的问题。其核心解决方案在于通过扩展BabyBERTa至英法双语设置,在严格匹配模型规模和数据量的前提下,对比不同训练语料(儿童导向话语与多领域语料)对单语、双语及跨语言任务性能的影响,并构建了新的法语评估资源(如QAMR和QASRL的法语版本)以实现公平比较。关键发现表明:维基百科语料在语义任务上具有一致优势,而儿童导向话语在单语语法判断中表现更优;双语预训练显著提升文本蕴含任务性能,尤其在法语任务中效果突出,且这些模式在BabyBERTa、RoBERTa和LTG-BERT等多种架构中保持一致,揭示了跨模型的通用规律。

链接: https://arxiv.org/abs/2603.12906
作者: Liel Binyamin,Elior Sulem
机构: Ben-Gurion University of the Negev (本古里安大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of EACL 2026

点击查看摘要

Abstract:Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a systematic study of compact language models by extending BabyBERTa to English-French scenarios under strictly size-matched data conditions, covering monolingual, bilingual, and cross-lingual settings. Our design contrasts two types of training corpora: (i) child-directed speech (about 2.5M tokens), following BabyBERTa and related work, and (ii) multi-domain corpora (about 10M tokens), extending the BabyLM framework to French. To enable fair evaluation, we also introduce new resources, including French versions of QAMR and QASRL, as well as English and French multi-domain corpora. We evaluate the models on both syntactic and semantic tasks and compare them with models trained on Wikipedia-only data. The results reveal context-dependent effects: training on Wikipedia consistently benefits semantic tasks, whereas child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, with particularly strong improvements for French. Importantly, similar patterns emerge across BabyBERTa, RoBERTa, and LTG-BERT, suggesting consistent trends across architectures.

[NLP-11] CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language Culture and Civility

【速读】: 该论文旨在解决欧洲葡萄牙语(European Portuguese, PT-PT)大型语言模型(Large Language Models, LLMs)缺乏专门评估基准和排行榜的问题。当前对PT-PT的LLM评估存在显著空白,尤其在模型安全性和文化契合度等关键维度上尚未建立系统性评测标准。解决方案的关键在于构建一个针对PT-PT的公开排行榜,并开发一系列新型评测基准,其中包含首次引入的针对模型安全防护(model safeguards)和葡萄牙语文化对齐(cultural alignment)的测试指标,从而为PT-PT LLM提供全面、可比较的性能评估体系。

链接: https://arxiv.org/abs/2603.12872
作者: João Silva,Luís Gomes,António Branco
机构: University of Lisbon (里斯本大学); NLX—Grupo de Fala e Linguagem Natural, Departmento de Informática (NLX-自然语言与语音组,计算机系)
类目: Computation and Language (cs.CL)
备注: Accepted at PROPOR 2026

点击查看摘要

Abstract:This paper reports on the development of a leaderboard of Open Large Language Models (LLM) for European Portuguese (PT-PT), and on its associated benchmarks. This leaderboard comes as a way to address a gap in the evaluation of LLM for European Portuguese, which so far had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that address aspects of performance that so far have not been available in benchmarks for European Portuguese, namely model safeguards and alignment to Portuguese culture. The leaderboard is available at this https URL.

[NLP-12] Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

【速读】: 该论文旨在解决在强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)框架下,使用多选题(Multiple-Choice Questions, MCQs)进行训练时可能出现的“奖励劫持”(reward hacking)问题,即模型通过随机猜测或简单排除法绕过深度推理。传统方法通常将MCQ转化为开放问答格式以规避此问题,但损失了由专家设计干扰项(distractors)提供的对比信号。论文的核心解决方案是提出迭代干扰项优化(Iterative Distractor Curation, IDC)框架,其关键在于通过主动构建高质量干扰项来阻断消除类快捷路径,从而促进模型进行深层次推理,并在保持MCQ结构的同时显著提升RLVR训练效果。

链接: https://arxiv.org/abs/2603.12826
作者: Xu Guo,Qiming Ge,Jian Tong,Kedi Chen,Jin Zhang,Xiaogui Yang,Xuan Gao,Haijun Lv,Zhihui Lu,Yicheng Zou,Qipeng Guo
机构: Shanghai AI Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.
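
IDC 的核心直觉是淘汰可被“排除法”轻易识破的干扰项。下面用一个词面重叠启发式作极简示意(该启发式是本文虚构的弱求解器替身,并非论文的筛选模型):

```python
def elimination_shortcut(question: str, option: str) -> bool:
    """弱求解器替身:与题干没有任何词面重叠的选项,视为可被排除法直接淘汰。"""
    return not (set(question.lower().split()) & set(option.lower().split()))

def curate_distractors(question: str, candidates: list[str], k: int = 3) -> list[str]:
    """仅保留无法被排除法淘汰的干扰项,迫使模型进行真正的推理。"""
    strong = [d for d in candidates if not elimination_shortcut(question, d)]
    return strong[:k]
```

论文中的迭代版本会反复生成候选干扰项并用更强的筛选信号过滤,直至选项集合对“猜测型”策略不再可分。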

[NLP-13] Adaptive Vision-Language Model Routing for Computer Use Agents

【速读】: 该论文旨在解决计算机使用代理(Computer Use Agents, CUAs)在执行图形用户界面(GUI)操作时因视觉语言模型(Vision-Language Model, VLM)接地准确率差异显著而导致的效率与可靠性失衡问题。当前CUA系统通常对所有操作统一调用单一固定模型,忽视了不同任务难度对模型性能的影响。其解决方案的关键在于提出自适应VLM路由(Adaptive VLM Routing, AVR)框架,该框架在CUA编排器与多模型池之间引入轻量级语义路由层:通过多模态嵌入估计操作难度、利用小型VLM测量置信度,并依据预设可靠性阈值将任务分配给成本最低且满足精度要求的模型;同时,对于具备历史交互记忆的“暖”代理,额外引入上下文信息可缩小小模型与大模型的能力差距,减少不必要的模型升级,从而在保障性能的前提下实现高达78%的推理成本降低。

链接: https://arxiv.org/abs/2603.12823
作者: Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen
机构: vLLM Semantic Router Project; MBZUAI; McGill University; AMD; Red Hat
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost–accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: this https URL.
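
摘要中“在满足可靠性阈值的前提下选择最便宜模型、高风险动作直接升级”的路由规则,可用如下示意代码表达(模型池、成本与精度预测器均为虚构数值,并非论文配置):

```python
MODELS = [  # 名称、成本与容量均为示意数值
    {"name": "small-vlm", "cost": 1.0,  "capacity": 0.95},
    {"name": "mid-vlm",   "cost": 4.0,  "capacity": 0.97},
    {"name": "large-vlm", "cost": 20.0, "capacity": 0.99},
]

def predicted_accuracy(model: dict, difficulty: float) -> float:
    """学习到的精度预测器的玩具替身:动作难度越高,预测精度越低。"""
    return model["capacity"] * (1.0 - 0.5 * difficulty)

def route(difficulty: float, threshold: float = 0.9, high_risk: bool = False) -> str:
    """选择预测精度达到可靠性阈值的最便宜模型;高风险动作直接升级到最强模型。"""
    if high_risk:
        return MODELS[-1]["name"]
    for m in sorted(MODELS, key=lambda m: m["cost"]):
        if predicted_accuracy(m, difficulty) >= threshold:
            return m["name"]
    return MODELS[-1]["name"]  # 无模型达标时回退到最强模型
```

简单动作因此落在小模型上,而难度或风险升高时自动升级,这正是成本-精度权衡的阈值化形式。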

[NLP-14] SteerRM: Debiasing Reward Models via Sparse Autoencoders

【速读】: 该论文旨在解决奖励模型(Reward Model, RM)在对齐流水线中对表面风格特征的偏倚问题,即模型倾向于偏好格式更优的回应而非语义更优的回答,从而影响对齐效果。解决方案的关键在于提出 SteerRM,一种无需重新训练的去偏方法,其核心是利用稀疏自编码器(Sparse Autoencoder, SAE)进行干预:通过对比成对响应识别与偏倚相关的SAE特征,并基于强度-稳定性准则筛选出关键特征,在推理阶段实施抑制,从而在不破坏整体性能的前提下显著提升Hard-split准确率(平均提升7.3点)。该方法揭示了格式相关特征集中于浅层且具有跨模型迁移性,表明存在共享的架构级偏倚编码模式。

链接: https://arxiv.org/abs/2603.12795
作者: Mengyuan Sun,Zhuohao Yu,Weizheng Gu,Shikun Zhang,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学软件工程国家工程研究中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.
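
SAE 干预的“编码、标记、减除”过程可以用一个玩具规模的例子示意(权重与维度均为虚构;真实 SAE 作用于奖励模型的隐层激活,被屏蔽的特征由论文的强度-稳定性准则选出):

```python
# 玩具规模的 SAE:2 维激活、3 个特征(权重为示意数值)
W_ENC = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 1.0]]   # d_model x n_features
W_DEC = [[1.0, 0.0],
         [0.0, 1.0],
         [0.5, 0.5]]        # n_features x d_model

def encode(x):
    """ReLU(x @ W_ENC):得到稀疏特征编码。"""
    return [max(0.0, sum(x[i] * W_ENC[i][j] for i in range(len(x))))
            for j in range(len(W_ENC[0]))]

def suppress(x, flagged):
    """从激活中减去被标记特征经解码后的贡献,其余特征保持不变。"""
    feats = encode(x)
    removed = [feats[j] if j in flagged else 0.0 for j in range(len(feats))]
    delta = [sum(removed[j] * W_DEC[j][k] for j in range(len(removed)))
             for k in range(len(x))]
    return [x[k] - delta[k] for k in range(len(x))]
```

由于只减去被标记特征的解码贡献,未被标记的语义特征不受影响,这也是直接整体抑制激活会损害性能、而 SAE 干预不会的原因。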

[NLP-15] SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理伊斯兰教逊尼派(Sunni)与什叶派(Shia)宗教知识时是否存在偏见的问题,尤其关注其在不同语言和地理位置情境下的表现一致性。解决方案的关键在于提出一个名为SectEval的多语言评测基准,包含88个问题,用于系统评估15种主流LLM(包括商业和开源模型)在英文和印地语中的响应倾向性。结果显示,模型在不同语言下表现出显著立场反转——例如GPT-4o和DeepSeek-v3在英语中偏向什叶派,在印地语中则转向逊尼派;同时,高级模型如Claude-3.5甚至根据用户所在国家调整回答内容,而小型模型则缺乏这种适应性。这表明当前LLM的宗教判断并非中立,而是受语言和地理上下文驱动,揭示了生成式AI在跨文化宗教场景中潜在的偏见风险。

链接: https://arxiv.org/abs/2603.12768
作者: Aditya Maheshwari,Amit Gajkeshwar,Kaushal Sharma,Vivek Patel
机构: Indian Institute of Management Indore (印度管理学院英迪拉普尔分校)
类目: Computation and Language (cs.CL)
备注: 14 pages; 3 figures

点击查看摘要

Abstract:As Large Language Models (LLMs) become a popular source of religious knowledge, it is important to know whether they treat different groups fairly. This study is the first to measure how LLMs handle the differences between the two main sects of Islam: Sunni and Shia. We present a test called SectEval, available in both English and Hindi and consisting of 88 questions, to check the bias of 15 top LLMs, both proprietary and open-weights. Our results show a major inconsistency based on language. In English, powerful models such as DeepSeek-v3 and GPT-4o often favored Shia answers. However, when asked the exact same questions in Hindi, these models switched to favoring Sunni answers. This means a user could get completely different religious advice just by changing languages. We also looked at how models react to location. Advanced models such as Claude-3.5 changed their answers to match the user's country, giving Shia answers to a user from Iran and Sunni answers to a user from Saudi Arabia. In contrast, smaller models (especially in Hindi) ignored the user's location and stuck to a Sunni viewpoint. These findings show that AI is not neutral; its religious "truth" changes depending on the language you speak and the country you claim to be from. The data set is available at this https URL

[NLP-16] A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

【速读】: 该论文旨在解决如何从大规模语料库中自动学习具有广泛覆盖范围的构式语法(Construction Grammar)的问题,以支持对开放领域文本进行框架语义分析,并揭示语言使用中的句法-语义模式。解决方案的关键在于提出一种基于句法结构和语义框架标注语料的方法,能够在Fluid Construction Grammar(FCG)框架下形式化数十万个构式,从而构建出既具可解释性又具备实用性的计算构式语法模型,有效验证了构式语法核心假设的可扩展性,并为英语论元结构的构式主义研究提供了可操作的工具。

链接: https://arxiv.org/abs/2603.12754
作者: Paul Van Eecke,Katrien Beuls
机构: Vrije Universiteit Brussel (布鲁塞尔自由大学); Université de Namur (那慕尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

[NLP-17] MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization

【速读】: 该论文旨在解决传统概念定制方法中因罕见标记(rare tokens)在预训练数据中稀缺而导致性能不稳定,且这些标记无法有效传递目标视觉概念内在知识的问题。其解决方案的关键在于提出一种新的任务——知识感知的概念定制(Knowledge-aware Concept Customization),并设计了MoKus框架:通过跨模态知识迁移机制,在文本模态中修改知识后可自然地传递至图像生成过程;具体包括两个阶段:首先学习锚点表示以存储目标概念的视觉信息,其次更新知识查询的答案至锚点表示,从而实现高保真度的定制化生成。

链接: https://arxiv.org/abs/2603.12743
作者: Chenyang Zhu,Hongxiang Li,Xiu Li,Long Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high-fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation: cross-modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high-fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state-of-the-art methods. Moreover, the cross-modal knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.

[NLP-18] AI Planning Framework for LLM -Based Web Agents

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的自主代理在执行网页任务时缺乏可解释性与可诊断性的核心问题,即代理行为常被视为“黑箱”,难以定位失败原因或理解其规划逻辑。解决方案的关键在于将网页任务形式化为顺序决策过程,并构建一个映射现代代理架构与传统规划范式的分类体系:Step-by-Step代理对应广度优先搜索(Breadth-First Search, BFS),Tree Search代理对应最佳优先树搜索(Best-First Tree Search),Full-Plan-in-Advance代理对应深度优先搜索(Depth-First Search, DFS)。这一框架不仅支持对系统失效(如上下文漂移和任务分解不一致)进行原理性诊断,还引入了五项新的评估指标以超越简单成功率,从而更全面地衡量轨迹质量,最终通过WebArena基准上的794条人工标注轨迹验证了该方法的有效性。

链接: https://arxiv.org/abs/2603.12710
作者: Orit Shahnovsky,Rotem Dror
机构: University of Haifa(海法大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.
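
论文将 Step-by-Step 代理映射为广度优先搜索(BFS)。在一个虚构的网页状态图上,这一对应关系可用标准 BFS 直接示意(状态与动作名均为本文杜撰):

```python
from collections import deque

def bfs_plan(graph, start, goal):
    """在"状态 -> [(动作, 下一状态)]"的网页状态图上,返回最短动作序列。"""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # 目标不可达
```

按同样的思路,Full-Plan-in-Advance 代理相当于把整条路径一次性展开(深度优先),Tree Search 代理则在队列中按启发式评分排序(最佳优先)。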

[NLP-19] EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning

【速读】: 该论文旨在解决当前基于强化学习的代码生成方法中因验证信号弱且静态而导致性能受限的问题。现有编码强化学习(Reinforcement Learning for Code Generation, RLCG)数据集中的测试用例缺乏动态适应性,难以有效区分优劣解,从而限制了模型优化效果。解决方案的关键在于提出一种条件化且对抗性的验证框架(solution-conditioned and adversarial verification framework),该框架通过迭代式地根据候选解的执行行为演化测试用例,逐步提升测试用例的难度、判别能力和去冗余性。基于此框架构建的 EvolveCoder-22k 数据集实现了验证信号的动态增强,使强化学习训练更稳定、高效,并在多个下游任务上显著优于基线模型。

链接: https://arxiv.org/abs/2603.12698
作者: Chi Ruan,Dongfu Jiang,Huaye Zeng,Ping Nie,Wenhu Chen
机构: University of Waterloo (滑铁卢大学); Vector Institute (向量研究所); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.
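
“保留有判别力的测试用例”这一对抗式筛选准则可用如下草稿示意(候选解与用例均为玩具示例,并非论文的流水线):

```python
def refine_tests(tests, candidates):
    """一轮对抗式精炼:全部候选解都通过的用例是冗余的,全部失败的用例
    可能本身有误;仅保留能区分优劣解的判别性用例。"""
    kept = []
    for test_input, expected in tests:
        results = [fn(test_input) == expected for fn in candidates]
        if any(results) and not all(results):
            kept.append((test_input, expected))
    return kept
```

多轮迭代后,存活的用例既能拉开候选解之间的差距,又剔除了冗余验证信号,对应摘要中 pass@1 从 43.80 降至 31.22 的验证强化。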

[NLP-20] Experimental evidence of progressive ChatGPT models self-convergence

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在递归训练过程中因使用自动生成数据而引发的模型坍塌(model collapse)问题,特别是评估不同版本ChatGPT在长时间尺度下生成文本多样性下降的现象。其解决方案的关键在于引入文本相似度度量方法,量化分析多个ChatGPT版本在相同提示条件下输出文本的多样性变化,从而揭示模型自我收敛(model self-convergence)的趋势,并指出该现象可能源于训练数据中合成数据比例增加所导致的输出趋同。

链接: https://arxiv.org/abs/2603.12683
作者: Konstantinos F. Xylogiannopoulos,Petros Xanthopoulos,Panagiotis Karampelas,Georgios A. Bakamitsos
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) that undergo recursive training on synthetically generated data are susceptible to model collapse, a phenomenon marked by the generation of meaningless output. Existing research has examined this issue from either theoretical or empirical perspectives, often focusing on a single model trained recursively on its own outputs. While prior studies have cautioned against the potential degradation of LLM output quality under such conditions, no longitudinal investigation has yet been conducted to assess this effect over time. In this study, we employ a text similarity metric to evaluate different ChatGPT models’ capacity to generate diverse textual outputs. Our findings indicate a measurable decline of recent ChatGPT releases’ ability to produce varied text, even when explicitly prompted to do so, by setting the temperature parameter to one. The observed reduction in output diversity may be attributed to the influence of the amounts of synthetic data incorporated within their training datasets as the result of internet infiltration by LLM generated data. The phenomenon is defined as model self-convergence because of the gradual increase of similarities of produced texts among different ChatGPT versions.
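
摘要未给出文本相似度度量的细节;这里用“平均两两 Jaccard 相似度”作一个思路相近的示意(具体指标以论文为准,词集切分方式为本文假设):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """两段文本词集的 Jaccard 相似度。"""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def mean_pairwise_similarity(texts: list[str]) -> float:
    """同一提示下多次生成文本的平均两两相似度:数值越高,输出越趋同。"""
    pairs = list(combinations(texts, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

对不同版本 ChatGPT 在相同提示、temperature=1 下的多次输出计算该指标,即可纵向比较输出多样性的变化趋势。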

[NLP-21] MetaKE: Meta-learning Aligned Knowledge Editing via Bi-level Optimization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中知识编辑(Knowledge Editing, KE)任务中存在的“语义-执行断开”(Semantic-Execution Disconnect)问题,即传统方法在设定编辑目标时缺乏对模型可行域的反馈,导致语义上合理的编辑目标可能落在不可行空间内,从而引发梯度截断和编辑失败。解决方案的关键在于提出MetaKE框架,将知识编辑重构为一个双层优化问题:上层优化器将编辑目标视为可学习的元参数,以最大化编辑后的性能;下层求解器执行具体编辑操作。为应对复杂求解器的不可微性挑战,作者进一步设计了结构化梯度代理(Structural Gradient Proxy),显式地将可编辑性约束反向传播至目标学习阶段,理论上保证了编辑方向与模型可行流形的一致性,显著提升了编辑成功率与效果。

链接: https://arxiv.org/abs/2603.12677
作者: Shuxin Liu,Ou Wu
机构: Hangzhou Institute for Advanced Study (杭州研究院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures

点击查看摘要

Abstract:Knowledge editing (KE) aims to precisely rectify specific knowledge in Large Language Models (LLMs) without disrupting general capabilities. State-of-the-art methods suffer from an open-loop control mismatch. We identify a critical “Semantic-Execution Disconnect”: the semantic target is derived independently without feedback from the downstream’s feasible region. This misalignment often causes valid semantic targets to fall within the prohibited space, resulting in gradient truncation and editing failure. To bridge this gap, we propose MetaKE (Meta-learning Aligned Knowledge Editing), a new framework that reframes KE as a bi-level optimization problem. Departing from static calculation, MetaKE treats the edit target as a learnable meta-parameter: the upper-level optimizer seeks a feasible target to maximize post-edit performance, while the lower-level solver executes the editing. To address the challenge of differentiating through complex solvers, we derive a Structural Gradient Proxy, which explicitly backpropagates editability constraints to the target learning phase. Theoretical analysis demonstrates that MetaKE automatically aligns the edit direction with the model’s feasible manifold. Extensive experiments confirm that MetaKE significantly outperforms strong baselines, offering a new perspective on knowledge editing.
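
双层优化“上层学目标、下层执行编辑”的结构可用一个标量玩具问题示意(数值与可行区间均为虚构;论文中的下层是实际的编辑求解器,上层梯度由结构化梯度代理给出,而非这里的有限差分):

```python
def lower_solve(theta, target, steps=20, lr=0.5):
    """下层求解器替身:沿 (theta - target)^2 的梯度把参数编辑到目标处。"""
    for _ in range(steps):
        theta -= lr * 2 * (theta - target)
    return theta

def bi_level_edit(theta0, target, feasible, performance,
                  outer_steps=50, outer_lr=0.1):
    """上层循环:调整可学习的编辑目标以最大化编辑后性能,
    并投影回可行区间(对应论文中的可行流形约束)。"""
    lo, hi = feasible
    eps = 1e-3
    for _ in range(outer_steps):
        edited = lower_solve(theta0, target)
        # 有限差分近似 performance 对 target 的梯度(梯度代理的简化替身)
        g = (performance(lower_solve(theta0, target + eps)) - performance(edited)) / eps
        target = min(hi, max(lo, target + outer_lr * g))  # 投影到可行集
    return target, lower_solve(theta0, target)
```

当语义上最优的目标(此例中 0.7)落在可行区间之外时,上层会把目标收敛到可行边界而非硬闯禁区,避免了摘要所述的梯度截断与编辑失败。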

[NLP-22] From Text to Forecasts: Bridging Modality Gap with Temporal Evolution Semantic Space

【速读】: 该论文旨在解决将文本信息融入时间序列预测时存在的模态鸿沟问题,即文本描述通常以隐式、定性方式表达时间影响,而预测模型依赖显式、定量信号,导致现有方法难以有效融合两者。其解决方案的关键在于提出TESS框架,引入一个中间瓶颈——时间演化语义空间(Temporal Evolution Semantic Space),该空间由大语言模型(LLM)通过结构化提示提取并经置信度感知门控过滤的可解释、数值化的时间原语(均值偏移、波动率、形状和滞后)构成,从而实现文本语义到可用数值线索的可靠转换。

链接: https://arxiv.org/abs/2603.12664
作者: Lehui Li,Yuyao Wang,Jisheng Yan,Wei Zhang,Jinliang Deng,Haoliang Sun,Zhongyi Han,Yongshun Gong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:Incorporating textual information into time-series forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives (mean shift, volatility, shape, and lag) extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29 percent reduction in forecasting error compared to state-of-the-art unimodal and multimodal baselines. The code will be released after acceptance.
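
置信度感知门控的作用是过滤 LLM 抽取结果中不可靠的时间原语,只向预测器传递可用的数值线索。下面是一个草稿示意(字段名沿用摘要中的四类原语,数据结构与阈值为本文假设):

```python
PRIMITIVES = ("mean_shift", "volatility", "shape", "lag")

def gate(extracted: dict, tau: float = 0.7) -> dict:
    """保留置信度达到阈值 tau 的时间原语;低置信度线索被丢弃而非传入预测器。"""
    return {k: v["value"] for k, v in extracted.items()
            if k in PRIMITIVES and v["confidence"] >= tau}
```

门控之后,文本影响已被转换为少量显式、定量的数值信号,从而绕开了摘要中指出的“对冗余 token 过度关注”问题。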

[NLP-23] Continual Learning in Large Language Models : Methods Challenges and Opportunities

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续学习(Continual Learning, CL)过程中面临的灾难性遗忘(catastrophic forgetting)问题,即模型在学习新知识时会覆盖或破坏先前习得的知识。其核心解决方案是系统梳理并分类适用于LLMs的持续学习方法,从三个训练阶段——持续预训练、持续微调和持续推理——出发,结合传统的重放(rehearsal)、正则化(regularization)和架构调整(architecture-based)三类方法,进一步按其缓解遗忘的具体机制进行细分,并对这些方法在适应性与改进效果上进行严谨对比分析。关键在于揭示LLM持续学习与传统机器学习在规模、参数效率及涌现能力等方面的本质差异,从而为实现跨任务与跨时间尺度的知识无缝整合提供结构化框架与研究方向。

链接: https://arxiv.org/abs/2603.12658
作者: Hongyang Chen,Zhongwu Sun,Hongfei Ye,Kunchi Li,Xuemin Lin
机构: Zhejiang Lab (浙江实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual learning (CL) has emerged as a pivotal paradigm to enable large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting, a critical limitation of the static pre-training paradigm inherent to modern LLMs. This survey presents a comprehensive overview of CL methodologies tailored for LLMs, structured around three core training stages: continual pre-training, continual fine-tuning, and continual inference. Beyond the canonical taxonomy of rehearsal-, regularization-, and architecture-based methods, we further subdivide each category by its distinct forgetting mitigation mechanisms and conduct a rigorous comparative analysis of the adaptability and critical improvements of traditional CL methods for LLMs. In doing so, we explicitly highlight core distinctions between LLM CL and traditional machine learning, particularly with respect to scale, parameter efficiency, and emergent capabilities. Our analysis covers essential evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. This survey reveals that while current methods demonstrate promising results in specific domains, fundamental challenges persist in achieving seamless knowledge integration across diverse tasks and temporal scales. This systematic review contributes to the growing body of knowledge on LLM adaptation, providing researchers and practitioners with a structured framework for understanding current achievements and future opportunities in lifelong learning for language models.

[NLP-24] 98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Quick Read: This paper addresses the dual challenge faced by system-level routers for LLM requests: performing safety classification, domain routing, and sensitive personally identifiable information (PII) detection without adding significant latency, and avoiding the out-of-memory failures caused by the quadratic O(n²) cost of attention when sharing GPU resources, which would otherwise make long-context (8K–32K token) classification impossible. The solution has three stages. Stage 1 replaces standard attention with a custom CK Flash Attention operator, reducing attention memory from O(n²) to O(n) and cutting end-to-end latency from 4,918 ms to 127 ms. Stage 2 applies classical NLP compression (TextRank, position weighting, TF-IDF, and novelty scoring) to shrink all inputs to roughly 512 tokens, making latency and memory usage independent of the original input length. Stage 3 adds near-streaming body processing with adaptive chunking and zero-copy JSON parsing to eliminate serialization overhead. Cumulatively, the three stages deliver a 98x speedup (4,918 ms → 50 ms), support 16K-token routing, and keep the router's GPU footprint under 800 MB, small enough to share a single GPU with vLLM serving without a dedicated accelerator.

Link: https://arxiv.org/abs/2603.12646
Authors: Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen
Affiliations: vLLM Semantic Router Project; MBZUAI; McGill University; AMD; Red Hat
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU – an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's O(n^2) memory makes long-context classification (8K–32K tokens) impossible: at 8K tokens, three concurrent classifiers need ~4.5 GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from O(n^2) to O(n) and end-to-end (E2E) latency from 4,918 ms to 127 ms (38.7x), enabling 8K–32K tokens where SDPA OOMs. Stage 2: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ~512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127 → 62 ms, 2.0x). Stage 3: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62 → 50 ms, 1.2x). Cumulatively: 98x improvement (4,918 ms to 50 ms), 16K-token routing in 108 ms, and a total router GPU footprint under 800 MB – small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage 1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages 2 and 3 are hardware-agnostic.
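
Stage 2's compression uses only classical, non-neural scoring, so the core idea can be sketched in a few lines. Below is a toy illustration of TF-IDF salience plus position weighting under a token budget; the function, weights, and greedy selection are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def compress_prompt(sentences, budget_tokens=512, position_weight=0.1):
    """Rank sentences by a TF-IDF-style score plus a small position bonus,
    then greedily keep the top-ranked ones until a token budget is met.
    Illustrative sketch only, not the paper's exact pipeline."""
    docs = [s.lower().split() for s in sentences]
    df = Counter()
    for d in docs:
        df.update(set(d))          # document frequency per word
    n = len(docs)
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        tfidf = sum(tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf)
        pos = position_weight * (n - i) / n   # earlier sentences get a bonus
        scores.append((tfidf / max(len(d), 1) + pos, i))
    kept, used = [], 0
    for _, i in sorted(scores, reverse=True):
        tokens = len(docs[i])
        if used + tokens <= budget_tokens:
            kept.append(i)
            used += tokens
    return [sentences[i] for i in sorted(kept)]  # restore original order
```

Because nothing here runs a neural network, the cost is linear in the input length, which is what caps latency and memory at a constant after compression.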

[NLP-25] RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

Quick Read: This paper addresses the security threat that textual adversarial attacks pose to natural language processing (NLP) systems. Existing detection methods generally depend on prior knowledge of the attack, white-box access to the victim model, or a large number of queries, which limits their practical deployment. The key idea of the proposed black-box detection framework, RTD-Guard, is that the word-substitution perturbations used by adversarial attacks closely resemble the "replaced tokens" a pre-trained Replaced Token Detection (RTD) discriminator is trained to identify. RTD-Guard uses an off-the-shelf RTD discriminator to localize suspicious tokens and detects adversarial examples by observing how the victim model's prediction confidence shifts across only two black-box queries, with no adversarial data, no fine-tuning, and no internal model access, making the detector practical and lightweight.

Link: https://arxiv.org/abs/2603.12582
Authors: He Zhu,Yanshu Li,Wen Liu,Haitian Yang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 15 pages, 4 figures

Abstract:Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the “replaced tokens” that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator-without fine-tuning-to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism-particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
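
The two-query confidence-shift test at the core of RTD-Guard can be sketched as follows. This is a toy illustration: the mask token, threshold, and whitespace tokenization are assumptions, and in the real system an RTD discriminator would supply `suspicious_idx`.

```python
def detect_adversarial(text, suspicious_idx, classify, mask_token="[MASK]",
                       threshold=0.2):
    """Mask the tokens an RTD discriminator flagged as likely replacements,
    then compare the victim model's confidence before and after.
    Exactly two black-box queries; `classify` maps text to a confidence
    in [0, 1].  Threshold is an illustrative assumption."""
    tokens = text.split()
    conf_before = classify(text)                       # black-box query 1
    masked = [mask_token if i in suspicious_idx else t
              for i, t in enumerate(tokens)]
    conf_after = classify(" ".join(masked))            # black-box query 2
    shift = abs(conf_before - conf_after)
    return shift > threshold       # large shift suggests an adversarial input
```

The intuition: masking the perturbed tokens of an adversarial example removes the attack and moves the prediction sharply, while masking the same positions in a clean input changes little.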

[NLP-26] Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

Quick Read: This paper addresses a limitation of current Mixture-of-Experts (MoE) based parameter-efficient fine-tuning (PEFT) methods in multi-task settings: they ignore the hierarchical nature of task complexity. Existing methods use experts with uniform architectures and therefore struggle to capture the different feature granularities tasks require, where some tasks demand high-level semantic abstraction while others depend on fine-grained syntactic manipulation. The proposed Expert Pyramid Tuning (EPT) brings the multi-scale feature-pyramid idea from computer vision into PEFT: it first builds a shared meta-knowledge subspace that encodes universal linguistic patterns in low dimensions, then uses a Pyramid Projection Mechanism with learnable up-projection operators to reconstruct high-dimensional features at multiple scales, with a task-aware router dynamically selecting the optimal combination of multi-scale features. This design not only improves multi-task performance but, thanks to re-parameterization, also reduces the number of trainable parameters.

Link: https://arxiv.org/abs/2603.12577
Authors: Jia-Chen Zhang,Zhen-Wei Yan,Yu-Jie Xiong,Chun-Ming Xia
Affiliations: Shanghai University of Engineering Science (上海工程技术大学)
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture diverse feature granularities required by distinct tasks–where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) A shared meta-knowledge Subspace that encodes universal linguistic patterns in low dimensions; (2) A Pyramid Projection Mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.
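
The two-stage decomposition, a shared low-rank subspace followed by multi-scale up-projections mixed by a router, can be sketched numerically. The shapes, the softmax router, and the final linear mixing below are illustrative assumptions, not EPT's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r = 16, 4            # hidden size and shared low-rank subspace rank
scales = [4, 8, 16]           # pyramid of up-projection widths (illustrative)

A = rng.normal(size=(d_model, r)) * 0.1                      # shared down-projection
Bs = [rng.normal(size=(r, s)) * 0.1 for s in scales]         # per-scale up-projections
Ps = [rng.normal(size=(s, d_model)) * 0.1 for s in scales]   # map back to d_model

def ept_forward(x, gate_logits):
    """Sketch: encode into a shared meta-knowledge subspace, reconstruct
    features at several scales, and mix them with a softmax router."""
    z = x @ A                                        # shared low-dim features
    feats = [(z @ B) @ P for B, P in zip(Bs, Ps)]    # multi-scale experts
    w = np.exp(gate_logits - gate_logits.max())
    w = w / w.sum()                                  # router weights
    return sum(wi * f for wi, f in zip(w, feats))

x = rng.normal(size=(2, d_model))
y = ept_forward(x, np.array([0.2, 1.0, -0.5]))
```

Because each expert factors through the same low-rank `A`, the trainable parameters scale with the small rank `r` rather than with `d_model` per expert.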

[NLP-27] LMEB: Long-horizon Memory Embedding Benchmark

Quick Read: This paper addresses the inadequacy of current text-embedding benchmarks for evaluating long-horizon memory retrieval in memory-augmented systems: existing benchmarks focus on traditional passage retrieval and fail to test how models handle fragmented, context-dependent, and temporally distant information. The key contribution is the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive evaluation framework spanning 22 datasets and 193 zero-shot retrieval tasks across four memory types with different levels of abstraction and temporal dependency (episodic, dialogue, semantic, and procedural), including both AI-generated and human-annotated data. By providing a standardized, reproducible evaluation protocol, LMEB fills a gap in memory-embedding evaluation and drives progress in text embeddings for long-term, context-dependent memory retrieval.

Link: https://arxiv.org/abs/2603.12572
Authors: Xinping Zhao,Xinshuo Hu,Jiaxin Xu,Danyu Tang,Xin Zhang,Mengjia Zhou,Yan Zhong,Yao Zhou,Zifei Shan,Meishan Zhang,Baotian Hu,Min Zhang
Affiliations: Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute (SLAI); Peking University
Subjects: Computation and Language (cs.CL)
Comments: 35 pages, 9 figures, 23 tables

Abstract:Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models’ ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models’ capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at this https URL.

[NLP-28] Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

Quick Read: This paper addresses the problem that SpeechLLMs inherit written-style output patterns unsuitable for text-to-speech synthesis, a mismatch especially pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. The key idea is a preference-based alignment approach that steers the model toward "speech-worthy" outputs: text that is concise, conversational, and readily synthesized as natural speech. The method achieves substantial improvement on the newly constructed SpokenElyza benchmark while largely preserving performance on the original written-style evaluation, providing a scalable evaluation tool and optimization path for research on Japanese spoken dialog systems.

Link: https://arxiv.org/abs/2603.12565
Authors: Mengjie Zhao,Lianbo Liu,Yusuke Fujita,Hao Shi,Yuan Gao,Roman Koshkin,Yui Sudo
Affiliations: SB Intuitions
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments:

Abstract:SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.
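
The preference-based alignment named in the title is Direct Preference Optimization, whose per-pair loss has a compact closed form. Below is a standard DPO sketch in which the "chosen" completion would be the speech-worthy response and the "rejected" one the written-style response; beta and the example log-probabilities are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).
    Log-probabilities come from the policy and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability (versus the reference) to the preferred, speech-worthy completion, so no explicit reward model is needed.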

[NLP-29] Agent Drift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Quick Read: This paper addresses an evaluation blind spot for generative AI agents in high-stakes domains such as financial advising: existing recommendation-quality metrics (e.g., NDCG) fail to capture the safety risks that arise when an agent's tool outputs are corrupted, allowing it to recommend inappropriate products. The key contribution is a paired-trajectory protocol that replays real dialogues under clean and contaminated tool-output conditions, decomposes the resulting drift into information-channel and memory-channel mechanisms, and introduces a safety-penalized NDCG (sNDCG) that makes safety loss explicit. Experiments show that although recommendation quality is largely preserved (utility preservation ratio ≈ 1.0), risk-inappropriate products appear in 65–93% of turns; these safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction, so standard metrics severely underestimate the risk. sNDCG reduces preservation ratios to 0.51–0.74, motivating trajectory-level safety monitoring in the evaluation of deployed multi-turn agents.

Link: https://arxiv.org/abs/2603.12564
Authors: Zekun Wu,Adriano Koshiyama,Sahan Bulathwela,Maria Perez-Ortiz
Affiliations: University College London (伦敦大学学院); Holistic AI (整体人工智能)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 50 pages, 31 tables, 15 figures. Under review at COLM 2026

Abstract:Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
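
One plausible reading of a safety-penalized NDCG is to dock the gain of any risk-inappropriate item before normalizing by the clean ideal DCG. The penalty form below is an assumption for illustration; the paper's exact sNDCG definition may differ.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def sndcg(relevances, unsafe_flags, penalty=1.0):
    """Safety-penalized NDCG sketch: subtract a penalty from the gain of
    any item flagged as risk-inappropriate (floored at zero), then
    normalize by the ideal DCG of the unpenalized gains."""
    penalized = [max(r - penalty, 0.0) if bad else r
                 for r, bad in zip(relevances, unsafe_flags)]
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(penalized) / ideal if ideal > 0 else 0.0
```

With no unsafe items this reduces to ordinary NDCG, so a gap between NDCG and sNDCG directly measures how much of the apparent utility is carried by unsafe recommendations.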

[NLP-30] Reinforcement Learning for Diffusion LLM s with Entropy-Guided Step Selection and Stepwise Advantages

Quick Read: This paper addresses the challenge of extending reinforcement learning (RL) methods to diffusion language models (DLMs): sequence-level likelihoods are intractable, so existing methods rely on surrogate likelihoods or heuristic approximations, which introduce bias and obscure the sequential structure of denoising. The key idea is to formulate diffusion-based sequence generation as a finite-horizon Markov decision process (MDP) and derive an unbiased policy gradient that requires no explicit sequence likelihood and decomposes into intermediate advantages over the denoising steps. For a practical estimator, the method selects update steps via an entropy-guided approximation bound and estimates intermediate advantages using the one-step denoising reward the diffusion model itself provides, avoiding costly multi-step rollouts and yielding efficient, compute-friendly policy optimization.

Link: https://arxiv.org/abs/2603.12554
Authors: Vishnu Teja Kunde,Fatemeh Doudi,Mahdi Farahbakhsh,Dileep Kalathil,Krishna Narayanan,Jean-Francois Chamberland
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at this https URL.
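
The entropy-guided selection of denoising steps can be illustrated with a toy scorer that ranks steps by mean token entropy. This stands in for, and does not reproduce, the paper's approximation bound; the data layout and `k` are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_update_steps(step_token_probs, k=2):
    """Pick the k denoising steps with the highest mean token entropy.
    `step_token_probs[t]` is a list of per-position probability
    distributions at denoising step t."""
    scores = []
    for t, token_probs in enumerate(step_token_probs):
        mean_h = sum(token_entropy(p) for p in token_probs) / len(token_probs)
        scores.append((mean_h, t))
    return sorted(t for _, t in sorted(scores, reverse=True)[:k])
```

High-entropy steps are where the model is most uncertain, so concentrating policy updates there spends the gradient budget where the denoising trajectory is actually being decided.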

[NLP-31] TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Quick Read: This paper addresses the overthinking problem of large reasoning models (LRMs) performing chain-of-thought (CoT) reasoning: models keep computing redundantly after the final answer has already been produced, wasting resources. The key idea behind TERMINATOR, the proposed early-exit strategy, is that the position at which an LRM first produces its final answer is predictable; these first-answer positions are used to build a novel dataset of optimal reasoning lengths on which TERMINATOR is trained, enabling effective truncation of the reasoning process. Experiments show that TERMINATOR shortens CoT lengths by 14%–55% on average across several practical datasets while outperforming current state-of-the-art methods.

Link: https://arxiv.org/abs/2603.12529
Authors: Alliot Nagle,Jakhongir Saydaliev,Dhia Garbaya,Michael Gastpar,Ashok Vardhan Makkuva,Hyeji Kim
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 35 pages, 31 figures

Abstract:Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design TERMINATOR, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning TERMINATOR is that the first arrival of an LRM’s final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train TERMINATOR. Powered by this approach, TERMINATOR achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, whilst outperforming current state-of-the-art methods.
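
The first-arrival labels that TERMINATOR trains on can be derived by locating the earliest completion of the final answer inside the chain of thought. The helper and example below only illustrate how such optimal-length labels could be computed; they are not the paper's pipeline.

```python
def first_answer_position(cot_tokens, final_answer_tokens):
    """Return the index just past the first occurrence of the final answer
    inside a chain of thought, or the full length if it never appears."""
    n, m = len(cot_tokens), len(final_answer_tokens)
    for i in range(n - m + 1):
        if cot_tokens[i:i + m] == final_answer_tokens:
            return i + m          # earliest point where the answer is complete
    return n

# everything after the first "42" is redundant reflection
cot = "the sum is 42 but let me double check 40 plus 2 is 42".split()
cut = first_answer_position(cot, ["42"])
truncated = cot[:cut]
```

Truncating at `cut` keeps the answer while discarding the double-checking tail, which is exactly the overthinking the abstract describes.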

[NLP-32] When LLM Judge Scores Look Good but Best-of-N Decisions Fail

Quick Read: This paper addresses the misleading practice of validating LLM judges with a single global metric, such as correlation with reference labels, when the actual deployment task is best-of-n selection within a prompt. The study finds that despite a global correlation of r = 0.47, the judge captures only 21.0% of the improvement that ideal selection would achieve over random choice, largely because global agreement is dominated by prompt-level baseline effects: within-prompt correlation is only r_within = 0.27, and pointwise scoring produces ties in 67% of pairwise comparisons. The key remedy is to audit with pairwise judging, which raises signal recovery from 21.1% to 61.2%; the paper recommends reporting within-prompt signal, tie rates, and recovery/top-1 accuracy rather than a single global correlation.

Link: https://arxiv.org/abs/2603.12520
Authors: Eddie Landesberg
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.
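
The recovery statistic quoted above (21.1% versus 61.2%) normalizes a judge's selection gain by the oracle's, which is a one-line computation. The numbers in the example are illustrative, not the paper's data.

```python
def recovery(judge_score, random_score, oracle_score):
    """Fraction of the oracle-over-random improvement a judge captures:
    (judge - random) / (oracle - random)."""
    return (judge_score - random_score) / (oracle_score - random_score)

# e.g. a judge whose picks are worth 1.10 on average, when random
# selection scores 1.00 and perfect best-of-n selection scores 1.50
r = recovery(1.10, 1.00, 1.50)   # captures 20% of the available gain
```

A judge can have a respectable global correlation and still score poorly here, which is precisely the evaluation gap the paper documents.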

[NLP-33] Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

Quick Read: This paper addresses the high sensitivity of robots driven by Vision-Language-Action (VLA) models to the precise wording of language instructions, where small phrasing changes can cause task failures that are hard to anticipate. The key contribution is Q-DIG (Quality Diversity for Diverse Instruction Generation), which combines Quality Diversity (QD) algorithms with vision-language models (VLMs) to efficiently generate diverse, task-relevant adversarial natural-language instructions that expose meaningful vulnerabilities in VLA behavior. Experiments show that Q-DIG finds more diverse and semantically plausible failure-inducing instructions than baselines, and that fine-tuning VLAs on the generated instructions improves task success rates; a user study and real-world evaluations further confirm the naturalness and effectiveness of the generated instructions.

Link: https://arxiv.org/abs/2603.12510
Authors: Siddharth Srikanth,Freddie Liang,Sophie Hsu,Varun Bhatt,Shihan Zhao,Henry Chen,Bryon Tjanaka,Minjune Hwang,Akanksha Saran,Daniel Seita,Aaquib Tabrez,Stefanos Nikolaidis
Affiliations: University of Southern California (南加州大学); Sony AI; Cornell University (康奈尔大学)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at this http URL.

[NLP-34] Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs

Quick Read: This paper addresses the poor performance of current large language models (LLMs) on multi-hop diagnostic reasoning in realistic clinical settings, whose core obstacle is "shortcut learning": models exploit highly connected generic hub nodes in knowledge graphs (e.g., "inflammation") to bypass authentic micro-pathological cascades. The key contribution is the ShatterMed-QA benchmark together with a topology-regularized medical knowledge-graph construction framework: a novel k-Shattering algorithm physically prunes generic hub nodes to explicitly sever logical shortcuts, while implicit bridge-entity masking and topology-driven hard negative sampling force models to reason in biologically plausible ways without relying on surface cues. Experiments show that this approach exposes and quantifies fundamental deficits in LLM diagnostic reasoning, and that restoring the masked evidence via retrieval-augmented generation (RAG) recovers performance, validating the benchmark's structural fidelity.

Link: https://arxiv.org/abs/2603.12458
Authors: Xing Zi,Xinying Zhou,Jinghao Xiao,Catarina Moreira,Mukesh Prasad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is “shortcut learning”, where models exploit highly connected, generic hub nodes (e.g., “inflammation”) in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel k -Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA’s structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: this https URL

[NLP-35] CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

Quick Read: This paper addresses the classification of response clarity in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. The key contribution is a heterogeneous dual large language model (LLM) ensemble combining self-consistency (SC) with weighted voting, plus a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). DCG exploits cross-model behavioral signals, building on the finding that an LLM response-length proxy correlates strongly with sample ambiguity, to adaptively gate reasoning; the authors also evaluate multi-agent debate as an alternative strategy for increasing deliberative capacity, noting that debate increases agent count without increasing model diversity. The final system achieved a Macro-F1 score of 0.85 on the evaluation set, securing third place.

Link: https://arxiv.org/abs/2603.12453
Authors: Christos Tzouvaras,Konstantinos Skianis,Athanasios Voulodimos
Affiliations: University of Ioannina; National Technical University of Athens (雅典国立技术大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.

[NLP-36] Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis

Quick Read: This paper addresses the problem that modern language models often reverse meaning or make factual errors when handling negation, reflecting a weak ability to distinguish affirmative from negated sentences. The key contribution is a causal analysis of how GPT-2 Small internally processes negation: the authors build a custom dataset of 12,000 matched sentence pairs, introduce the Negation Effect Score (NES) to quantify the model's sensitivity to negation, and apply two interventions, activation patching and attention-head ablation. The analysis shows that negation signals are concentrated in a small number of mid-layer attention heads, primarily in layers 4 to 6, and that these heads carry the affirmative signal rather than restoring baseline behavior, clarifying the causal structure underlying negation understanding in the model.

Link: https://arxiv.org/abs/2603.12423
Authors: Abdullah Al Mofael,Lisa M. Kuhn,Ghassan Alkadi,Kuo-Pao Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, 1 table. Accepted at the 2026 IEEE 16th Annual Computing and Communication Workshop and Conference (CCWC)

Abstract:Negation remains a persistent challenge for modern language models, often causing reversed meanings or factual errors. In this work, we conduct a causal analysis of how GPT-2 Small internally processes such linguistic transformations. We examine its hidden representations at both the layer and head level. Our analysis is based on a self-curated 12,000-pair dataset of matched affirmative and negated sentences, covering multiple linguistic templates and forms of negation. To quantify this behavior, we define a metric, the Negation Effect Score (NES), which measures the model’s sensitivity in distinguishing between affirmative statements and their negations. We carried out two key interventions to probe causal structure. In activation patching, internal activations from affirmative sentences were inserted into their negated counterparts to see how meaning shifted. In ablation, specific attention heads were temporarily disabled to observe how logical polarity changed. Together, these steps revealed how negation signals move and evolve through GPT-2’s layers. Our findings indicate that this capability is not widespread; instead, it is highly concentrated within a limited number of mid-layer attention heads, primarily within layers 4 to 6. Ablating these specific components directly disrupts the model’s negation sensitivity: on our in-domain, ablation increased NES (indicating weaker negation sensitivity), and re-introducing cached affirmative activations (rescue) increased NES further, confirming that these heads carry affirmative signal rather than restoring baseline behavior. On xNot360, ablation slightly decreased NES and rescue restored performance above baseline. This pattern demonstrates that these causal patterns are consistent across various negation forms and remain detectable on the external xNot360 benchmark, though with smaller magnitude.
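
Matched affirmative/negated pairs like those in the 12,000-pair dataset can be generated from templates. Below is a toy version; the template and vocabulary are illustrative only, while the real dataset covers multiple templates and forms of negation.

```python
from itertools import product

def build_negation_pairs(nouns, adjectives):
    """Build matched affirmative/negated sentence pairs from one simple
    copular template, mimicking the paired-dataset construction at toy
    scale."""
    pairs = []
    for noun, adj in product(nouns, adjectives):
        affirmative = f"the {noun} is {adj}"
        negated = f"the {noun} is not {adj}"
        pairs.append((affirmative, negated))
    return pairs

pairs = build_negation_pairs(["sky", "car"], ["blue", "fast"])
```

Because each pair differs only in the single negation word, any behavioral difference between the two sentences can be attributed to negation processing rather than to content.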

[NLP-37] Not Just the Destination But the Journey: Reasoning Traces Causally Shape Generalization Behaviors

Quick Read: This paper asks whether chain-of-thought (CoT) reasoning plays a causal role in large language models (LLMs), i.e., whether the reasoning trace shapes generalization behavior independently of the final answer rather than serving as post-hoc rationalization. The question matters for AI alignment: if the reasoning content itself is causally potent, supervising outputs alone may not ensure safe, controllable behavior. The key design is a controlled experiment that holds harmful final answers constant while systematically varying the type of reasoning trace (evil, misleading, and submissive reasoning), trains models of different sizes (0.6B–14B parameters) under multiple paradigms (QTA, QT, T-only), and evaluates them with and without reasoning generation. The results show that reasoning content independently and strongly shapes model behavior, and that the effect persists even when no reasoning is generated, demonstrating that reasoning traces are causally influential and challenging alignment methods that supervise only outputs.

Link: https://arxiv.org/abs/2603.12397
Authors: Pengcheng Wen,Yanxu Zhu,Jiapeng Sun,Han Zhu,Yujin Zhou,Chi-Min Chan,Sirui Han,Yike Guo
Affiliations: Hong Kong University of Science and Technology (香港科技大学); Beijing Jiaotong University (北京交通大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning’s causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with \textitEvil reasoning embracing malice, \textitMisleading reasoning rationalizing harm, and \textitSubmissive reasoning yielding to pressure. We train models (0.6B–14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training could amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, proving reasoning carries an independent signal; and (4) these effects persist even when generating answers without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.

[NLP-38] NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation

Quick Read: This paper addresses the parameter-interference problem caused by static routing in existing parameter-efficient fine-tuning (PEFT) methods, which degrades performance in multi-task and continual-learning settings. The key solution is NeuroLoRA, a Mixture-of-Experts (MoE) based LoRA framework inspired by biological neuromodulation: a lightweight, learnable neuromodulation gate contextually rescales the projection space before expert selection, enabling more flexible task decoupling, while a Contrastive Orthogonality Loss explicitly enforces separation between expert subspaces to improve task independence and continual-learning capacity. The method matches FlyLoRA's parameter efficiency while significantly outperforming existing baselines on single-task adaptation, multi-task model merging, and sequential continual learning.

Link: https://arxiv.org/abs/2603.12378
Authors: Yuxin Yang,Haoran Zhang,Mingxuan Li,Jiachen Xu,Ruoxi Shen,Zhenyu Wang,Tianhao Liu,Siqi Chen,Weilin Huang
Affiliations: Shanghai University (上海大学); Fudan University (复旦大学)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: work in progress

Abstract:Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly Low-Rank Adaptation (LoRA), have become essential for adapting Large Language Models (LLMs) to downstream tasks. While the recent FlyLoRA framework successfully leverages bio-inspired sparse random projections to mitigate parameter interference, it relies on a static, magnitude-based routing mechanism that is agnostic to input context. In this paper, we propose NeuroLoRA, a novel Mixture-of-Experts (MoE) based LoRA framework inspired by biological neuromodulation – the dynamic regulation of neuronal excitability based on context. NeuroLoRA retains the computational efficiency of frozen random projections while introducing a lightweight, learnable neuromodulation gate that contextually rescales the projection space prior to expert selection. We further propose a Contrastive Orthogonality Loss to explicitly enforce separation between expert subspaces, enhancing both task decoupling and continual learning capacity. Extensive experiments on MMLU, GSM8K, and ScienceQA demonstrate that NeuroLoRA consistently outperforms FlyLoRA and other strong baselines across single-task adaptation, multi-task model merging, and sequential continual learning scenarios, while maintaining comparable parameter efficiency.
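
A contrastive orthogonality penalty between expert subspaces can take the form of a squared-overlap term over expert pairs. Below is one plausible formulation sketched with NumPy; the paper's exact loss may differ.

```python
import numpy as np

def orthogonality_loss(expert_mats):
    """Penalize overlap between expert subspaces: mean squared Frobenius
    norm of E_i^T E_j over all distinct expert pairs (columns assumed
    roughly unit-normalized)."""
    loss, pairs = 0.0, 0
    for i in range(len(expert_mats)):
        for j in range(i + 1, len(expert_mats)):
            overlap = expert_mats[i].T @ expert_mats[j]
            loss += float(np.sum(overlap ** 2))
            pairs += 1
    return loss / max(pairs, 1)

# two orthogonal experts incur zero loss; identical experts are penalized
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
```

Driving this term toward zero pushes the experts into mutually orthogonal subspaces, which is what supports task decoupling and reduces interference in continual learning.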

[NLP-39] Efficient Reasoning with Balanced Thinking ICLR2026

【Quick Read】: This paper tackles the overthinking and underthinking problems common to Large Reasoning Models (LRMs) in practice: the former wastes computation on simple tasks, while the latter lowers accuracy through insufficient exploration of reasoning paths; both hurt efficiency and reliability. The key is ReBalance, a training-free framework that uses confidence as a continuous indicator of reasoning dynamics, detecting overthinking through high confidence variance and underthinking through consistent overconfidence. It aggregates hidden states from a small-scale dataset into reasoning-mode prototypes to compute a steering vector, then dynamically modulates the vector's strength and direction from real-time confidence, pruning redundancy during overthinking and promoting exploration during underthinking to balance the reasoning process.

Link: https://arxiv.org/abs/2603.12372
Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Affiliations: Harbin Institute of Technology (Shenzhen); Huawei Noah’s Ark Lab; Tsinghua University; The Chinese University of Hong Kong; Zhongguancun Academy; Shenzhen Loop Area Institute
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by ICLR 2026

Click to view abstract

Abstract:Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs’ reasoning trajectories. A dynamic control function modulates this vector’s strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code is available at this https URL .
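The confidence-to-steering mapping can be caricatured with a toy rule. The thresholds, the variance test, and the sign convention below are all assumptions for exposition; the paper derives its steering vector from hidden-state prototypes rather than a hand-set rule:

```python
import numpy as np

def rebalance_alpha(step_confidences, conf_hi=0.9, var_hi=0.05):
    # High variance across steps -> treat as overthinking: positive strength
    # (steer toward pruning). Uniformly overconfident -> treat as
    # underthinking: negative strength (steer toward exploration).
    c = np.asarray(step_confidences, dtype=float)
    if c.var() > var_hi:
        return float(c.var())
    if c.min() > conf_hi:
        return float(conf_hi - c.mean())
    return 0.0

over = rebalance_alpha([0.2, 0.9, 0.3, 0.95])   # noisy confidence
under = rebalance_alpha([0.95, 0.96, 0.97])     # flat overconfidence
balanced = rebalance_alpha([0.6, 0.7, 0.65])    # neither regime
```

At generation time the hidden state would then be shifted by `alpha * steering_vector`, so the same vector pushes in opposite directions for the two failure modes.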

[NLP-40] TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

【Quick Read】: This paper targets the modality mismatch in text-speech joint spoken language modeling (SLM): speech unit sequences are far longer than text tokens. Prior work such as TASTE mitigates this with text-aligned tokenization and embedding, but its dependence on an external automatic speech recognition (ASR) system and a non-causal decoder prevents streaming, real-time use. The key to the proposed TASTE-S is integrating a CTC-based ASR module into the encoder for instant dual-modality encoding and redesigning the decoder for on-the-fly decoding; with joint training it matches TASTE's performance while significantly reducing latency, and remains robust in long-form scenarios.

Link: https://arxiv.org/abs/2603.12350
Authors: Liang-Hsuan Tseng, Hung-yi Lee
Affiliations: National Taiwan University
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Work in progress

Click to view abstract

Abstract:Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in lengths with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limits streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE’s performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.

[NLP-41] LLM-Augmented Therapy Normalization and Aspect-Based Sentiment Analysis for Treatment-Resistant Depression on Reddit

【Quick Read】: This paper addresses the difficulty that clinical trials inadequately capture real-world medication tolerability and subjective experience for patients with treatment-resistant depression (TRD). To complement traditional clinical evidence, the study analyzes large-scale online community text: 5,059 Reddit posts explicitly referencing TRD were collected, lexicon-based normalization extracted 23,399 mentions of 81 generic-name medications, and a fine-tuned DeBERTa-v3 model performed aspect-based sentiment classification, quantifying each medication along positive, neutral, and negative dimensions. The key innovation is combining natural language processing with real-world patient-generated content to systematically characterize patients' perceived experiences with different antidepressants, complementing the traditional clinical-research perspective.

Link: https://arxiv.org/abs/2603.12343
Authors: Yuxin Zhu, Sahithi Lakamana, Masoud Rouhizadeh, Selen Bozkurt, Rachel Hershenberg, Abeed Sarker
Affiliations: Emory University; University of Florida
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Treatment-resistant depression (TRD) is a severe form of major depressive disorder in which patients do not achieve remission despite multiple adequate treatment trials. Evidence across pharmacologic options for TRD remains limited, and trials often do not fully capture patient-reported tolerability. Large-scale online peer-support narratives therefore offer a complementary lens on how patients describe and evaluate medications in real-world use. In this study, we curated a corpus of 5,059 Reddit posts explicitly referencing TRD from 3,480 subscribers across 28 mental health-related subreddits from 2010 to 2025. Of these, 3,839 posts mentioned at least one medication, yielding 23,399 mentions of 81 generic-name medications after lexicon-based normalization of brand names, misspellings, and colloquialisms. We developed an aspect-based sentiment classifier by fine-tuning DeBERTa-v3 on the SMM4H 2023 therapy-sentiment Twitter corpus with large language model based data augmentation, achieving a micro-F1 score of 0.800 on the shared-task test set. Applying this classifier to Reddit, we quantified sentiment toward individual medications across three categories: positive, neutral, and negative, and tracked patterns by drug, subscriber, subreddit, and year. Overall, 72.1% of medication mentions were neutral, 14.8% negative, and 13.1% positive. Conventional antidepressants, especially SSRIs and SNRIs, showed consistently higher negative than positive proportions, whereas ketamine and esketamine showed comparatively more favorable sentiment profiles. These findings show that normalized medication extraction combined with aspect-based sentiment analysis can help characterize patient-perceived treatment experiences in TRD-related Reddit discourse, complementing clinical evidence with large-scale patient-generated perspectives.

[NLP-42] Context-Enriched Natural Language Descriptions of Vessel Trajectories

【Quick Read】: This paper addresses how to transform raw vessel trajectory data from AIS (the Automatic Identification System) into structured, semantically enriched representations that humans can interpret and machine reasoning systems can use directly. The key is a context-aware trajectory abstraction framework that segments noisy AIS sequences into distinct trips, each consisting of clean, mobility-annotated episodes, and further enriches each episode with multi-source contextual information such as nearby geographic entities, offshore navigation features, and weather conditions. By raising semantic density and reducing spatiotemporal complexity, this abstraction supports controlled natural language descriptions generated with LLMs and lays a foundation for downstream maritime analytics and higher-level maritime reasoning tasks.

Link: https://arxiv.org/abs/2603.12287
Authors: Kostas Patroumpas, Alexandros Troupiotis-Kapeliaris, Giannis Spiliopoulos, Panagiotis Betchavas, Dimitrios Skoutas, Dimitris Zissis, Nikos Bikakis
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:We address the problem of transforming raw vessel trajectory data collected from AIS into structured and semantically enriched representations interpretable by humans and directly usable by machine reasoning systems. We propose a context-aware trajectory abstraction framework that segments noisy AIS sequences into distinct trips each consisting of clean, mobility-annotated episodes. Each episode is further enriched with multi-source contextual information, such as nearby geographic entities, offshore navigation features, and weather conditions. Crucially, such representations can support generation of controlled natural language descriptions using LLMs. We empirically examine the quality of such descriptions generated using several LLMs over AIS data along with open contextual features. By increasing semantic density and reducing spatiotemporal complexity, this abstraction can facilitate downstream analytics and enable integration with LLMs for higher-level maritime reasoning tasks.
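The first stage — splitting a noisy AIS stream into trips — can be sketched with a simple time-gap rule. The one-hour threshold and the `(timestamp, lat, lon)` tuple format are assumptions; the paper's segmentation additionally uses mobility annotations:

```python
def split_trips(points, max_gap_s=3600):
    # `points` are time-ordered (unix_timestamp, lat, lon) tuples; a new trip
    # starts whenever the reporting gap exceeds `max_gap_s`.
    trips, current = [], [points[0]]
    for prev, pt in zip(points, points[1:]):
        if pt[0] - prev[0] > max_gap_s:
            trips.append(current)
            current = []
        current.append(pt)
    trips.append(current)
    return trips

pts = [(0, 59.00, 24.00), (600, 59.10, 24.10),
       (9000, 59.50, 24.60), (9600, 59.60, 24.70)]
trips = split_trips(pts)   # the 8400 s gap splits this into two trips
```

Each resulting trip would then be annotated with mobility states and joined against contextual sources (geography, weather) before description generation.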

[NLP-43] Prompt Injection as Role Confusion

【Quick Read】: This paper asks why language models remain vulnerable to prompt injection attacks despite extensive safety training. Its core finding is that such attacks stem from role confusion: models infer the speaker's role from how text is written rather than where it comes from, so spoofed content that imitates a particular role (such as a system instruction or a trusted user) inherits that role's authority and misleads the model. The key to the solution is a set of novel role probes that capture how models internally identify "who is speaking"; experiments show that the degree of internal role confusion predicts attack success before generation begins. This reveals that the security boundary lives at the interface while authority is assigned in latent space, yielding a unifying, mechanistic framework in which diverse prompt-injection attacks exploit the same role-confusion mechanism.

Link: https://arxiv.org/abs/2603.12277
Authors: Charles Ye, Jasmine Cui, Dylan Hadfield-Menell
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify “who is speaking.” These reveal why prompt injection works: untrusted text that imitates a role inherits that role’s authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.

[NLP-44] GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping

【Quick Read】: This paper addresses knowledge unlearning in Large Language Models (LLMs): existing methods target flat sentence-level data and overlook the relational, multi-hop, and reasoned knowledge in structured knowledge graphs (KGs). To fill this gap, the authors introduce Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating unlearning over KG facts in LLMs that disentangles three effects: direct fact removal, reasoning-based leakage, and catastrophic forgetting. The key to the solution is a novel unlearning framework, Neighborhood-Expanded Distribution Shaping (NEDS), which leverages graph connectivity to identify neighbors semantically correlated with the target fact, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood for efficient, localized knowledge removal.

Link: https://arxiv.org/abs/2603.12275
Authors: Chahana Dahal, Ashutosh Balasubramaniam, Zuobin Xiong
Affiliations: University of Nevada, Las Vegas; Indian Institute of Technology Guwahati
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Unlearning knowledge is a pressing and challenging task in Large Language Models (LLMs) because of their unprecedented capability to memorize and digest training data at scale, raising more significant issues regarding safety, privacy, and intellectual property. However, existing works, including parameter editing, fine-tuning, and distillation-based methods, are all focused on flat sentence-level data but overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. In response to this gap, this paper introduces Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating knowledge unlearning over structured knowledge graph (KG) facts in LLMs. This KG-based benchmark enables the disentanglement of three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. In addition, Neighborhood-Expanded Distribution Shaping (NEDS), a novel unlearning framework, is designed to leverage graph connectivity and identify anchor correlated neighbors, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood. Evaluations on LLaMA-3-8B and Mistral-7B across multiple knowledge editing and unlearning methods showcase NEDS’s superior performance (1.000 on unlearning efficacy and 0.839 on locality) on GONE and other benchmarks. Code is available at this https URL.
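The graph-connectivity step behind NEDS can be illustrated as a hop-bounded neighborhood expansion around the fact to forget. The toy knowledge graph and the plain breadth-first expansion below are illustrative assumptions; the paper's anchor-selection procedure is more involved:

```python
from collections import defaultdict

def expanded_neighborhood(triples, forget_fact, hops=2):
    # Entities within `hops` of the fact to be forgotten; facts touching them
    # would form the boundary that the unlearning objective shapes.
    adj = defaultdict(set)
    for h, _, t in triples:
        adj[h].add(t)
        adj[t].add(h)
    seen = {forget_fact[0], forget_fact[2]}
    frontier = set(seen)
    for _ in range(hops):
        frontier = {n for e in frontier for n in adj[e]} - seen
        seen |= frontier
    return seen

kg = [("Paris", "capital_of", "France"),
      ("France", "in", "Europe"),
      ("Europe", "contains", "Spain")]
near = expanded_neighborhood(kg, ("Paris", "capital_of", "France"), hops=1)
```

Widening `hops` trades locality for coverage: a larger neighborhood reduces reasoning-based leakage at the risk of disturbing more retained knowledge.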

[NLP-45] Aligning Language Models from User Interactions

【Quick Read】: This paper addresses how to make effective use of multi-turn user interaction data to improve language model alignment and instruction following. Such interactions are typically discarded, yet they carry feedback on model responses, including error corrections and preference adjustments. The key is a principled self-distillation method: conditioning the model on the user's follow-up message yields a hindsight distribution of the model's behavior, which is compared against the original policy to produce a target distribution for updating the policy; distilling that distribution back into the current model enables continual learning and personalized adaptation without explicit labels.

Link: https://arxiv.org/abs/2603.12273
Authors: Thomas Kleine Buening, Jonas Hübotter, Barna Pásztor, Idan Shenfeld, Giorgia Ramponi, Andreas Krause
Affiliations: ETH Zurich; MIT; University of Zurich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Multi-turn user interactions are among the most abundant data produced by language models, yet we lack effective methods to learn from them. While typically discarded, these interactions often contain useful information: follow-up user messages may indicate that a response was incorrect, failed to follow an instruction, or did not align with the user’s preferences. Importantly, language models are already able to make use of this information in context. After observing a user’s follow-up, the same model is often able to revise its behavior. We leverage this ability to propose a principled and scalable method for learning directly from user interactions through self-distillation. By conditioning the model on the user’s follow-up message and comparing the resulting token distribution with the original policy, we obtain a target for updating the policy that captures how the model’s behavior changes in hindsight. We then distill this hindsight distribution back into the current policy. Remarkably, we show that training on real-world user conversations from WildChat improves language models across standard alignment and instruction-following benchmarks, without regressing other capabilities. The same mechanism enables personalization, allowing models to continually adapt to individual users through interaction without explicit feedback. Our results demonstrate that raw user interactions that arise naturally during deployment enable alignment, personalization, and continual adaptation.
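One way to picture the hindsight target is as a geometric interpolation between the model's next-token distribution before and after conditioning on the user's follow-up. The interpolation form and the `beta` knob below are assumptions for exposition, not necessarily the paper's exact objective:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hindsight_target(logits_orig, logits_follow, beta=1.0):
    # Geometric interpolation: beta=0 recovers the original next-token
    # distribution, beta=1 the distribution conditioned on the follow-up.
    p = softmax(np.asarray(logits_orig, dtype=float))
    q = softmax(np.asarray(logits_follow, dtype=float))
    t = p ** (1.0 - beta) * q ** beta
    return t / t.sum()
```

Self-distillation would then minimize a divergence (e.g., KL) between this target and the policy's unconditioned distribution, folding the in-context correction back into the weights.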

[NLP-46] ActTail: Global Activation Sparsity in Large Language Models

【Quick Read】: This paper addresses the performance degradation that uniform sparsity budgets cause in existing activation sparsity methods for accelerating large language model (LLM) inference. The core problem is that current methods ignore the heterogeneous statistical properties of the different projection layers in Transformer weights, so the sparsification strategy lacks specificity and amplifies accuracy loss. The key to the proposed ActTail is grounding allocation in Heavy-Tailed Self-Regularization (HT-SR) theory: a heavy-tail exponent is computed from each projection layer's empirical spectral density (ESD) and used as a quantitative indicator to assign projection-specific sparsity budgets. The paper also provides a theoretical analysis establishing an explicit relationship between the activation sparsity ratio and the heavy-tail exponent, offering principled rather than heuristic guidance for sparsity allocation. Experiments show clear gains in language modeling and downstream tasks at high sparsity.

Link: https://arxiv.org/abs/2603.12272
Authors: Wenwen Hou, Xinyuan Song, Shiwei Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Activation sparsity is a promising approach for accelerating large language model (LLM) inference by reducing computation and memory movement. However, existing activation sparsity methods typically apply uniform sparsity across projections, ignoring the heterogeneous statistical properties of Transformer weights and thereby amplifying performance degradation. In this paper, we propose ActTail, a TopK magnitude-based activation sparsity method with global activation sparsity allocation grounded in Heavy-Tailed Self-Regularization (HT-SR) theory. Specifically, we capture this heterogeneity via the heavy-tail exponent computed from each projection’s empirical spectral density (ESD), which is used as a quantitative indicator to assign projection-specific sparsity budgets. Importantly, we provide a theoretical analysis that establishes an explicit relationship between the activation sparsity ratio and the heavy-tail exponent under the HT-SR regime, offering principled guidance for sparsity allocation beyond heuristic design. Experiments on LLaMA and Mistral models show that our method improves both perplexity and downstream task performance at high sparsity compared to uniform allocation. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B.
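The two ingredients — magnitude TopK sparsification and non-uniform, per-projection keep budgets — can be sketched as follows. The exponent values and the proportional-to-exponent allocation rule are made-up placeholders, not the paper's HT-SR-derived formula; they only show budgets varying around a global target:

```python
import numpy as np

def topk_sparsify(x, keep_ratio):
    # Keep only the largest-magnitude activations; zero out the rest.
    k = max(1, int(round(len(x) * keep_ratio)))
    idx = np.argsort(np.abs(x))[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

# Per-projection keep budgets from (illustrative) heavy-tail exponents.
alphas = {"q_proj": 2.1, "k_proj": 3.5, "v_proj": 4.0}
mean_keep = 0.3            # global target: keep 30% of activations
budgets = {name: mean_keep * len(alphas) * a / sum(alphas.values())
           for name, a in alphas.items()}

x = np.array([3.0, -0.1, 2.0, 0.05, -4.0, 0.2])
sparse_x = topk_sparsify(x, 0.5)
```

At inference each projection would call `topk_sparsify` with its own budget, so heavier-tailed (or lighter-tailed) layers are pruned more or less aggressively while the global average sparsity is preserved.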

[NLP-47] Diagnosing Retrieval Bias Under Multiple In-Context Knowledge Updates in Large Language Models

【Quick Read】: This paper addresses retrieval bias in large language models (LLMs) on knowledge-intensive tasks when the same fact is revised multiple times (multi-update) within a context. Existing work focuses on one-shot updates or single conflicts and overlooks the interference that arises when multiple historically valid versions of a fact compete at retrieval in the same context. The core solution is a Dynamic Knowledge Instance (DKI) evaluation framework that models multi-updates of the same fact as a cue paired with a sequence of updated values and uses endpoint probing to separately assess the model's accuracy on the initial and latest states. The framework reveals that latest-state accuracy drops substantially as updates accumulate, and that attention, hidden-state similarity, and output-logit signals become flat and weakly discriminative, indicating that current models struggle to reliably track and follow the most recent knowledge update.

Link: https://arxiv.org/abs/2603.12271
Authors: Boyu Qiao, Sean Guo, Xian Yang, Kun Li, Wei Zhou, Songlin Hu, Yunya Song
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Hong Kong University of Science and Technology; The University of Manchester
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:LLMs are widely used in knowledge-intensive tasks where the same fact may be revised multiple times within context. Unlike prior work focusing on one-shot updates or single conflicts, multi-update scenarios contain multiple historically valid versions that compete at retrieval, yet remain underexplored. This challenge resembles the AB-AC interference paradigm in cognitive psychology: when the same cue A is successively associated with B and C, the old and new associations compete during retrieval, leading to bias. Inspired by this, we introduce a Dynamic Knowledge Instance (DKI) evaluation framework, modeling multi-updates of the same fact as a cue paired with a sequence of updated values, and assess models via endpoint probing of the earliest (initial) and latest (current) states. Across diverse LLMs, we observe that retrieval bias intensifies as updates increase, earliest-state accuracy stays high while latest-state accuracy drops substantially. Diagnostic analyses of attention, hidden-state similarity, and output logits further reveal that these signals become flatter and weakly discriminative on errors, providing little stable basis for identifying the latest update. Finally, cognitively inspired heuristic intervention strategies yield only modest gains and do not eliminate the bias. Our results reveal a persistent challenge in tracking and following knowledge updates in long contexts.

[NLP-48] Task-Specific Knowledge Distillation via Intermediate Probes

【Quick Read】: This paper addresses the degraded training signal in knowledge distillation from large language models (LLMs) caused by noisy or distorted teacher output distributions, a problem especially pronounced on reasoning tasks. The core challenge is that although the teacher's intermediate representations (hidden states) may already encode the correct answer, the output tokens produced through the vocabulary projection are affected by prompt formatting and answer-token choices, yielding brittle, unstable outputs. The key is a distillation framework, \method, that requires no architectural changes to teacher or student: lightweight probes are trained directly on frozen teacher intermediate states, and the probes' predictions, rather than the teacher's raw output logits, supervise student training. Experiments show that such representation-based probes provide cleaner labels and effectively denoise the distillation signal, with consistent gains across four reasoning benchmarks that are most pronounced under limited data.

Link: https://arxiv.org/abs/2603.12270
Authors: Ryan Brown, Chris Russell
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Knowledge distillation from large language models (LLMs) assumes that the teacher’s output distribution is a high-quality training signal. On reasoning tasks, this assumption is frequently violated. A model’s intermediate representations may encode the correct answer, yet this information is lost or distorted through the vocabulary projection, where prompt formatting and answer-token choices create brittle, noisy outputs. We introduce \method, a distillation framework that bypasses this bottleneck by training lightweight probes on frozen teacher hidden states and using the probe’s predictions, rather than output logits, as supervision for student training. This simple change yields consistent improvements across four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with gains most pronounced under limited data. Probes trained on intermediate representations provide cleaner labels than the teacher’s own outputs, effectively denoising the distillation signal. \method requires no architectural changes to student or teacher, is architecture-agnostic, and adds minimal compute since probe training is cheap and teacher representations can be cached. By exploiting internal representations, \method enables practitioners to extract more value from large teacher models without additional training data or architectural complexity.
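The probe-as-supervision idea can be demonstrated end to end on synthetic "hidden states" whose labels are linearly decodable. The data, probe size, and plain gradient-descent loop below are hypothetical stand-ins for cached teacher representations, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_cls = 200, 8, 3

# Hypothetical cached teacher hidden states: labels are linearly decodable
# from the representations, mimicking the premise that intermediate states
# encode the answer even when the output distribution is noisy.
H = rng.standard_normal((n, d))
y = (H @ rng.standard_normal((d, n_cls))).argmax(axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Lightweight linear probe trained on the frozen states; its soft
# predictions (not the teacher's output logits) supervise the student.
W = np.zeros((d, n_cls))
onehot = np.eye(n_cls)[y]
for _ in range(500):
    W -= 0.5 * (H.T @ (softmax(H @ W) - onehot)) / n

probe_labels = softmax(H @ W)       # denoised supervision for the student
acc = float((probe_labels.argmax(axis=1) == y).mean())
```

The probe is cheap to train relative to the teacher, and since `H` can be cached, the teacher never needs to run during student training.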

[NLP-49] Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces INTERSPEECH2026

【Quick Read】: This paper addresses the unclear mechanism behind "contextualized" representations in self-supervised speech models (S3Ms), in particular how a single frame-level S3M representation encodes a phone together with its neighboring context. The key proposal, which the paper validates, is that a single frame-level representation compositionally combines, via superposition, phonological feature vectors (e.g., voicing, bilabiality) of the previous, current, and next phones, forming a structure with orthogonality between relative positions and emergent implicit phonetic boundaries, thereby revealing the compositional encoding of contextual information in S3Ms.

Link: https://arxiv.org/abs/2603.12642
Authors: Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David R. Mortensen, David Harwath
Affiliations: UT Austin; UC Berkeley; Carnegie Mellon University
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Submitted to Interspeech 2026

Click to view abstract

Abstract:Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors such as voicing, bilabiality, and nasality vectors are superposed in the S3M representation of [m]. We extend this view by proposing that phonological information from a sequence of neighboring phones is also compositionally encoded in a single frame, such that vectors corresponding to previous, current, and next phones are superposed within a single frame-level representation. We show that this structure has several properties, including orthogonality between relative positions, and emergence of implicit phonetic boundaries. Together, our findings advance our understanding of context-dependent S3M representations.
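The claimed structure — position-tagged phone vectors superposed in one frame — is easy to simulate: random high-dimensional unit vectors are nearly orthogonal, so each relative position can be read back out by projection. The dimensionality and tiny phone inventory below are arbitrary choices, not measurements from an actual S3M:

```python
import numpy as np

rng = np.random.default_rng(1)
d, phones = 256, ["m", "a", "t", "s"]

# One random unit vector per (phone, relative position); in high dimension
# such vectors are nearly orthogonal, so a superposed code stays decodable.
vecs = {}
for p in phones:
    for pos in (-1, 0, 1):
        v = rng.standard_normal(d)
        vecs[(p, pos)] = v / np.linalg.norm(v)

# A frame-level representation as a superposition of previous/current/next.
frame = vecs[("m", -1)] + vecs[("a", 0)] + vecs[("t", 1)]

def read_out(frame, pos):
    # Decode the phone at a relative position by projection.
    return max(phones, key=lambda p: float(frame @ vecs[(p, pos)]))
```

Projection recovers each neighbor independently because cross-terms between near-orthogonal vectors stay close to zero, which is the property the paper probes for in real S3M frames.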

Information Retrieval

[IR-0] Developing and evaluating a chatbot to support maternal health care IJCAI2026

【Quick Read】: This paper addresses the multiple challenges of deploying a trustworthy maternal-health chatbot in low-resource settings: user queries are short, underspecified, and code-mixed across languages; answers require region-specific grounding; and partial or missing symptom context makes safe referral decisions difficult. The key is a three-part, stage-aware architecture: (1) stage-aware triage that routes high-risk queries to expert templates; (2) hybrid retrieval over curated maternal/newborn guidelines; and (3) evidence-conditioned generation with an LLM. The core contribution is a multi-method evaluation workflow for high-stakes deployment under limited expert supervision, spanning component-level and end-to-end testing: a labeled triage benchmark (N=150, 86.7% emergency recall), a synthetic multi-evidence retrieval benchmark (N=100, with chunk-level evidence labels), LLM-as-judge comparison on real queries (N=781, using clinician-codesigned criteria), and expert validation. The findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than reliance on any single model or evaluation strategy.

Link: https://arxiv.org/abs/2603.13168
Authors: Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh, Jitender Nagpal, Gretchen Chapman, Benjamin Bellows, Siddhartha Goyal, Aarti Singh, Bryan Wilder
Affiliations: Carnegie Mellon University; Population Council Institute; Sitaram Bhartia Institute of Science and Research; Nivi, Inc.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 17 pages; submitted to IJCAI 2026 AI and Social Good Track

Click to view abstract

Abstract:The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice. 
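The stage-aware triage layer can be caricatured as a keyword gate placed in front of retrieval-augmented generation. The emergency term list, stage parameter, and routing labels below are hypothetical; the deployed system's actual rules and expert templates are not reproduced here:

```python
# Hypothetical emergency terms and routing labels for illustration only.
EMERGENCY_TERMS = {"heavy bleeding", "seizure", "severe headache", "no fetal movement"}

def triage(query, stage="pregnancy"):
    q = query.lower()
    if any(term in q for term in EMERGENCY_TERMS):
        return ("expert_template", stage)      # high risk: bypass generation
    return ("retrieve_and_generate", stage)    # otherwise: grounded RAG answer

route, _ = triage("I have heavy bleeding since this morning")
```

The abstract's reported trade-off (86.7% emergency recall vs. over-escalation) corresponds to how permissive such a gate is made: widening the term list raises recall but routes more benign queries to templates.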

[IR-1] Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

【Quick Read】: This paper addresses the lack of verifiable intermediate steps when multimodal large language models (MLLMs) perform complex reasoning, which makes the reliability of their reasoning hard to diagnose: existing evaluations score only final-answer accuracy, ignoring the quality and logical order of reasoning chains, so systematic failures go undetected. The key is CRYSTAL, a benchmark of 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps with two complementary metrics: Match F1 (step-level precision and recall via semantic similarity) and Ordered Match F1 (which further penalizes disordered chains). A Delphi-inspired reference-trajectory pipeline combines semantic clustering with human quality gates to ensure trustworthy references. Experiments show widespread "cherry-picking" (precision far exceeding recall), non-monotonic scaling trade-offs, and disordered reasoning, with no model preserving more than 60% of matched steps in correct order. To improve training, the authors further propose the Causal Process Reward (CPR) and its curriculum variant CPR-Curriculum, which multiplicatively couples answer correctness with step-level alignment and achieves a +32% Match F1 gain without manual step annotation, clearly outperforming additive reward strategies.

Link: https://arxiv.org/abs/2603.13099
Authors: Wayner Barrios, SouYoung Jin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.
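The two metrics can be sketched with a greedy matcher plus a longest-increasing-subsequence check for order. Using `difflib` string similarity is a stand-in assumption for the benchmark's semantic similarity, and greedy one-to-one matching is one plausible reading of the metric, not the official scorer:

```python
import bisect
from difflib import SequenceMatcher

def _sim(a, b):
    # Stand-in for the benchmark's semantic similarity between steps.
    return SequenceMatcher(None, a, b).ratio()

def match_f1(pred, ref, thresh=0.6):
    # Greedy one-to-one matching of predicted to reference steps.
    used, ref_idx = set(), []
    for p in pred:
        cands = [(_sim(p, r), j) for j, r in enumerate(ref) if j not in used]
        if cands:
            s, j = max(cands)
            if s >= thresh:
                used.add(j)
                ref_idx.append(j)

    def f1(n_match):
        prec = n_match / len(pred) if pred else 0.0
        rec = n_match / len(ref) if ref else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    # Ordered variant: only the longest in-order run of matches gets credit
    # (longest increasing subsequence over matched reference indices).
    tails = []
    for j in ref_idx:
        k = bisect.bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
        else:
            tails[k] = j
    return f1(len(ref_idx)), f1(len(tails))

ref = ["compute the area of the circle", "add the two areas", "report the total"]
pred = ["add the two areas", "compute the area of the circle", "report the total"]
plain, ordered = match_f1(pred, ref)
```

Here every step is matched (`plain` = 1.0), but because the first two are swapped, the ordered score drops, which is exactly the disordered-reasoning failure the benchmark penalizes.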

[IR-2] Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

【Quick Read】: This paper addresses storage and retrieval efficiency for AI-agent memory over long conversations: a user's multi-turn interaction history is valuable, but carrying it verbatim incurs high context cost. The key is a structured distillation mechanism that compresses each exchange into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched), averaging 38 tokens per exchange versus 371 for the original text (11x compression). The method preserves high recall quality at a fraction of the context cost, and a cross-layer retrieval strategy can slightly exceed verbatim retrieval (MRR 0.759 vs 0.745), showing that personalized agent memory still supports precise retrieval after compression.

Link: https://arxiv.org/abs/2603.13017
Authors: Sydney Lewis
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 6 figures. Code: this https URL

Click to view abstract

Abstract:Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user’s conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.

[IR-3] Can Fairness Be Prompted? Prompt-Based Debiasing Strategies in High-Stakes Recommendations

【Quick Read】: This paper addresses implicit bias in large language model recommenders (LLMRecs): models can infer users' sensitive attributes (such as gender or age) from indirect cues like names and pronouns, producing unfair tendencies in recommendations. Existing debiasing methods typically require access to model weights, are computationally costly, and are out of reach for lay users. The key contribution is three prompt-based debiasing strategies that instruct the model to remain fair directly in the input prompt, yielding a lightweight, easy-to-deploy approach. Experiments show fairness improvements of up to 74% while retaining comparable recommendation effectiveness, though specific demographic groups may be overpromoted in some cases.

Link: https://arxiv.org/abs/2603.12935
Authors: Mihaela Rotar, Theresia Veronika Rampisela, Maria Maistro
Affiliations: University of Copenhagen
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) can infer sensitive attributes such as gender or age from indirect cues like names and pronouns, potentially biasing recommendations. While several debiasing methods exist, they require access to the LLMs’ weights, are computationally costly, and cannot be used by lay users. To address this gap, we investigate implicit biases in LLM Recommenders (LLMRecs) and explore whether prompt-based strategies can serve as a lightweight and easy-to-use debiasing approach. We contribute three bias-aware prompting strategies for LLMRecs. To our knowledge, this is the first study on prompt-based debiasing approaches in LLMRecs that focuses on group fairness for users. Our experiments with 3 LLMs, 4 prompt templates, 9 sensitive attribute values, and 2 datasets show that our proposed debiasing approach, which instructs an LLM to be fair, can improve fairness by up to 74% while retaining comparable effectiveness, but might overpromote specific demographic groups in some cases.

[IR-4] NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

【速读】:该论文旨在解决视觉文档检索(Visual Document Retrieval, VDR)中基于视觉-语言模型(Vision-Language Model, VLM)的检索器在查询编码时存在的高延迟与GPU依赖问题,尤其是在仅需处理纯文本查询的情况下仍使用大规模多亿参数编码器所带来的资源浪费。其解决方案的核心在于利用查询与文档之间的不对称性:文档具有复杂的视觉内容,需要强大的视觉理解能力;而查询仅为短文本字符串,无需复杂视觉建模。为此,作者提出NanoVDR框架,通过解耦两个编码路径实现高效检索——一个冻结的2B参数VLM教师模型离线索引文档,一个仅含69M参数的文本专用学生模型在推理阶段编码查询。关键设计在于蒸馏目标的选择,系统对比六种目标后发现,基于查询文本的点对点余弦对齐(pointwise cosine alignment)在性能上显著优于基于排序和对比学习的方法,且训练仅需预缓存的教师端查询嵌入,无需在训练时重新处理文档,极大降低计算开销。此外,论文识别出跨语言迁移是主要瓶颈,并通过引入机器翻译生成的查询数据进行低成本增强,从而显著提升多语言场景下的性能表现。

链接: https://arxiv.org/abs/2603.12824
作者: Zhuchenyang Liu,Yao Zhang,Yu Xiao
机构: Aalto University (阿尔托大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. However, they require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
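摘要中的"点对点余弦对齐"蒸馏目标,可用如下极简代码理解(纯 Python 示意,嵌入为手工给定,函数名为假设,并非论文实现):

```python
import math

def cosine(u, v):
    """余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pointwise_alignment_loss(student_embs, cached_teacher_embs):
    """逐查询的余弦对齐损失:mean(1 - cos(student, teacher))。
    只需预先缓存教师端查询嵌入,训练阶段完全不触碰文档。"""
    losses = [1.0 - cosine(s, t)
              for s, t in zip(student_embs, cached_teacher_embs)]
    return sum(losses) / len(losses)

# 学生输出与教师缓存完全一致时,损失为 0
teacher = [[1.0, 0.0], [0.6, 0.8]]
assert abs(pointwise_alignment_loss(teacher, teacher)) < 1e-9
```

与排序式或对比式目标相比,这种逐点对齐不需要负样本,也不需要在训练时重新编码文档,这正是摘要所述其计算开销低的原因。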

[IR-5] Taming the Long Tail: Efficient Item-wise Sharpness-Aware Minimization for LLM-based Recommender Systems

【速读】:该论文旨在解决大型语言模型驱动的推荐系统(Large Language Model-based Recommender Systems, LRSs)中存在的长尾问题,即在推荐结果中头部物品(高频、高曝光)与尾部物品(低频、低曝光)之间存在显著性能差异。研究表明,这种长尾现象由两类因素共同导致:一是预训练语料库隐含的先验长尾(prior long-tail),二是推荐数据集分布偏斜带来的数据长尾(data long-tail),其中后者对尾部物品表现影响更为显著。为应对该问题,作者提出一种名为高效逐项尖锐度感知最小化(Efficient Item-wise Sharpness-Aware Minimization, EISAM)的优化框架,其核心在于引入一种细粒度的、面向物品级别的尖锐度惩罚机制,在保持计算可扩展性的前提下自适应地调节损失景观(loss landscape),从而提升尾部物品的推荐效果。理论分析进一步表明,EISAM 的泛化误差界在该项正则化下收敛速度更快,为方法有效性提供了理论支撑。

链接: https://arxiv.org/abs/2603.12752
作者: Jiaming Zhang,Yuyuan Li,Xiaohua Feng,Li Zhang,Longfei Li,Jun Zhou,Chaochao Chen
机构: Zhejiang University(浙江大学); Hangzhou Dianzi University(杭州电子科技大学); Ant Group(蚂蚁集团)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Model-based Recommender Systems (LRSs) have recently emerged as a new paradigm in sequential recommendation by directly adopting LLMs as backbones. While LRSs demonstrate strong knowledge utilization and instruction-following abilities, they have not been systematically studied under the long-standing long-tail problem. In this paper, we conduct an empirical study and reveal that LRSs face two distinct types of long-tail: i) prior long-tail, inherited implicitly from pretraining corpora, and ii) data long-tail, originating from skewed recommendation datasets. Our analysis shows that both contribute to the performance disparity between head and tail items, with the intersection of the two heads exhibiting an even stronger head effect. Nevertheless, the overall performance distribution in LRSs, especially on the tail, remains dominated by the data long-tail. To address this challenge, we propose Efficient Item-wise Sharpness-Aware Minimization (EISAM), a novel optimization framework that improves tail-item performance by adaptively regularizing the loss landscape at the item level. EISAM introduces an efficient penalty design that captures fine-grained item-specific sharpness while maintaining computational scalability for LLMs. In addition, we derive a generalization bound for EISAM. Our theoretical analysis shows that the bound decreases at a faster rate under our item-wise regularization, offering theoretical support for its effectiveness. Extensive experiments on three real-world datasets demonstrate that EISAM significantly boosts tail-item recommendation performance while preserving overall quality, establishing the first systematic solution to the long-tail problem in LRSs.
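EISAM 建立在尖锐度感知最小化(SAM)之上。下面用一个标量参数示意标准 SAM 的两步更新(论文的逐项惩罚与效率设计此处未体现,仅为理解基础机制的草图):

```python
def sam_step(w, loss_grad, rho=0.05, lr=0.1):
    """Sharpness-Aware Minimization 的一步(标量示意):
    1) 沿梯度方向上升 rho,走到邻域内损失更高的"最坏"点;
    2) 在该扰动点求梯度,用它来更新原参数,从而惩罚尖锐的极小值。"""
    g = loss_grad(w)
    eps = rho * g / (abs(g) + 1e-12)  # 归一化的上升扰动
    g_adv = loss_grad(w + eps)        # 扰动点处的梯度(感知尖锐度)
    return w - lr * g_adv

# 以 L(w) = w^2 为例:SAM 更新仍朝极小值 0 收敛
grad = lambda x: 2 * x
w = 1.0
for _ in range(50):
    w = sam_step(w, grad, rho=0.05, lr=0.1)
assert abs(w) < 0.05
```

EISAM 的核心差别在于把这种尖锐度惩罚细化到物品级别并降低其在 LLM 上的计算开销,具体做法见原文。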

[IR-6] Anchored Alignment: Preventing Positional Collapse in Multimodal Recommender Systems

【速读】:该论文旨在解决当前基于对齐的多模态推荐系统(Multimodal Recommender Systems, MMRS)中存在的两个核心问题:一是统一嵌入空间的强制对齐会模糊各模态(如图像、文本)特有的结构信息,二是ID主导现象(ID dominance)导致模型过度依赖用户-物品交互信号而忽视多模态特征。解决方案的关键在于提出AnchorRec框架,通过在轻量级投影域中进行间接的锚点(anchor-based)对齐,将对齐操作与表示学习解耦,从而在保持各模态原始结构的同时实现跨模态一致性,并有效避免位置坍缩(positional collapse)。

链接: https://arxiv.org/abs/2603.12726
作者: Yonghun Jeong,David Yoon Suk Kang,Yeon-Chang Lee
机构: UNIST(Ulsan, Korea); Chungbuk National University(Cheongju, Korea)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:Multimodal recommender systems (MMRS) leverage images, text, and interaction signals to enrich item representations. However, recent alignment-based MMRSs that enforce a unified embedding space often blur modality-specific structures and exacerbate ID dominance. Therefore, we propose AnchorRec, a multimodal recommendation framework that performs indirect, anchor-based alignment in a lightweight projection domain. By decoupling alignment from representation learning, AnchorRec preserves each modality’s native structure while maintaining cross-modal consistency and avoiding positional collapse. Experiments on four Amazon datasets show that AnchorRec achieves competitive top-N recommendation accuracy, while qualitative analyses demonstrate improved multimodal expressiveness and coherence. The codebase of AnchorRec is available at this https URL.

[IR-7] FGTR: Fine-Grained Multi-Table Retrieval via Hierarchical LLM Reasoning SIGIR2026

【速读】:该论文旨在解决现有基于大语言模型(Large Language Models, LLM)的表格检索方法在单表和多表查询任务中存在的两大问题:一是传统方法采用粗粒度编码整个表格,导致引入大量与查询无关的数据,从而降低检索准确性;二是这些方法在处理大规模表格时效率低下,未能充分发挥LLM的推理能力。此外,多表查询任务在当前研究中仍处于探索阶段。为此,论文提出了一种基于LLM的分层多表查询方法——细粒度多表检索(Fine-Grained Multi-Table Retrieval, FGTR),其核心创新在于采用类人类推理策略:首先识别与查询相关的表结构元素(schema elements),再逐级检索对应单元格内容,最终构建出与查询高度匹配的精简子表。这一分层推理机制显著提升了检索精度与效率,实验表明FGTR在Spider和BIRD两个基准数据集上分别将F₂指标提升18%和21%,验证了其在细粒度表格检索中的有效性及对下游任务端到端性能的改进潜力。

链接: https://arxiv.org/abs/2603.12702
作者: Chaojie Sun,Bin Cao,Tiantian Li,Chenyu Hou,Ruizhe Li,Qing Fan
机构: Zhejiang University of Technology(浙江工业大学); University of Aberdeen(阿伯丁大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Under Review - Submitted to SIGIR 2026 Resources Track; 10pages, 5 figures, 4 tables

点击查看摘要

Abstract:With the rapid advancement of large language models (LLMs), growing efforts have been made on LLM-based table retrieval. However, existing studies typically focus on single-table query, and implement it by similarity matching after encoding the entire table. These methods usually result in low accuracy due to their coarse-grained encoding which incorporates much query-irrelevant data, and are also inefficient when dealing with large tables, failing to fully utilize the reasoning capabilities of LLM. Further, multi-table query is under-explored in retrieval tasks. To this end, we propose a hierarchical multi-table query method based on LLM: Fine-Grained Multi-Table Retrieval (FGTR), a new retrieval paradigm that employs a human-like reasoning strategy. Through hierarchical reasoning, FGTR first identifies relevant schema elements and then retrieves the corresponding cell contents, ultimately constructing a concise and accurate sub-table that aligns with the given query. To comprehensively evaluate the performance of FGTR, we construct two new benchmark datasets based on Spider and BIRD. Experimental results show that FGTR outperforms previous state-of-the-art methods, improving the F₂ metric by 18% on Spider and 21% on BIRD, demonstrating its effectiveness in enhancing fine-grained retrieval and its potential to improve end-to-end performance on table-based downstream tasks.

[IR-8] VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models

【速读】:该论文旨在解决多模态推荐中因原始视觉特征与用户偏好语义空间不一致而导致的性能瓶颈问题,即仅依赖特征融合(feature fusion)难以有效捕捉用户决策背后的高层语义因素(如风格、材质和使用场景)。其解决方案的关键在于引入大视觉语言模型(Large Vision-Language Model, VLM)进行语义对齐:首先利用VLM将物品图像映射为显式的自然语言描述,进而编码成稠密语义表示,使物品内容在语义空间中更贴近用户偏好匹配需求;最终通过基于用户历史行为的语义匹配机制实现推荐,从而实现离线-在线解耦的高效推理。实验表明,该方法优于直接使用原始视觉特征或复杂融合策略,验证了语义表示质量比融合复杂度更为关键。

链接: https://arxiv.org/abs/2603.12625
作者: Ty Valencia,Burak Barlas,Varun Singhal,Ruchir Bhatia,Wei Yang
机构: University of Southern California (南加州大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures, 1 table

点击查看摘要

Abstract:Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at this https URL.
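摘要描述的"离线把图像文本化并编码、在线用历史嵌入的画像做语义匹配"的流程,可用如下极简草图示意(纯 Python;此处直接给定嵌入向量,均值画像与余弦检索为对摘要的假设性还原,并非论文代码):

```python
import math

def cosine(u, v):
    """余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recommend(history_embs, candidate_embs, top_n=1):
    """离线:物品图像经 VLM 文本化后编码为稠密向量(此处直接给定);
    在线:用户画像取历史物品嵌入的均值,按余弦相似度检索候选,
    返回得分最高的 top_n 个候选索引。"""
    dim = len(history_embs[0])
    profile = [sum(e[d] for e in history_embs) / len(history_embs)
               for d in range(dim)]
    scored = sorted(range(len(candidate_embs)),
                    key=lambda i: cosine(profile, candidate_embs[i]),
                    reverse=True)
    return scored[:top_n]

history = [[1.0, 0.0], [0.9, 0.1]]
cands = [[0.0, 1.0], [1.0, 0.05]]
assert recommend(history, cands) == [1]
```

这种画像-匹配分解使得重计算(VLM 文本化与编码)全部留在离线阶段,在线只剩轻量的向量检索。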

[IR-9] InterDeepResearch: Enabling Human-Agent Collaborative Information Seeking through Interactive Deep Research

【速读】:该论文旨在解决当前由大语言模型(Large Language Model, LLM)代理驱动的深度研究系统中缺乏人机协同(human-in-the-loop collaboration)的问题。现有系统多采用“查询到报告”的自主范式,使用户处于被动角色,无法有效融入个人见解、上下文知识及动态变化的研究意图。解决方案的关键在于提出InterDeepResearch系统,其核心是一个专为研究过程设计的上下文管理框架,将研究上下文组织为信息层、动作层和会话层的分层架构,支持动态上下文压缩以避免LLM上下文溢出,并通过跨动作回溯机制保障证据溯源;在此基础上,系统界面集成三个协同视图用于可视化认知构建,并提供交互式上下文导航机制,从而显著提升人机协作的信息检索效率与可控性。

链接: https://arxiv.org/abs/2603.12608
作者: Bo Pan,Lunke Pan,Yitao Zhou,Qi Jiang,Zhen Wen,Minfeng Zhu,Wei Chen
机构: Zhejiang University (浙江大学); State Key Lab of CADCG (CADCG国家重点实验室)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Deep research systems powered by LLM agents have transformed complex information seeking by automating the iterative retrieval, filtering, and synthesis of insights from massive-scale web sources. However, existing systems predominantly follow an autonomous “query-to-report” paradigm, limiting users to a passive role and failing to integrate their personal insights, contextual knowledge, and evolving research intents. This paper addresses the lack of human-in-the-loop collaboration in the agentic research process. Through a formative study, we identify that current systems hinder effective human-agent collaboration in terms of process observability, real-time steerability, and context navigation efficiency. Informed by these findings, we propose InterDeepResearch, an interactive deep research system backed by a dedicated research context management framework. The framework organizes research context into a hierarchical architecture with three levels (information, actions, and sessions), enabling dynamic context reduction to prevent LLM context exhaustion and cross-action backtracing for evidence provenance. Built upon this framework, the system interface integrates three coordinated views for visual sensemaking, and dedicated interaction mechanisms for interactive research context navigation. Evaluation on the Xbench-DeepSearch-v1 and Seal-0 benchmarks shows that InterDeepResearch achieves competitive performance compared to state-of-the-art deep research systems, while a formal user study demonstrates its effectiveness in supporting human-agent collaborative information seeking. Project page with system demo: this https URL.

[IR-10] Deferred is Better: A Framework for Multi-Granularity Deferred Interaction of Heterogeneous Features

【速读】:该论文旨在解决点击率(Click-through Rate, CTR)预测模型中因特征异质性导致的性能下降问题,特别是低信息密度特征(如稀疏的类别型特征)在早期交互层中引入噪声,干扰高信息密度特征(如密集的数值型特征)的学习,从而引发模型坍塌和表示学习困难。其解决方案的关键在于提出一种多粒度信息感知延迟交互网络(Multi-Granularity Information-Aware Deferred Interaction Network, MGDIN),通过两个核心机制实现:首先采用多粒度特征分组策略,将原始特征按信息密度相似性划分为不同组别,缓解极端稀疏性影响;其次引入分层掩码机制,以延迟方式控制各组特征在交互过程中的参与时机——在浅层阶段屏蔽低信息组,随着网络加深逐步解掩码,使模型先基于高信息特征建立稳健表征,再渐进融合稀疏特征,从而提升整体建模效率与鲁棒性。

链接: https://arxiv.org/abs/2603.12586
作者: Yi Xu,Moyu Zhang,Chaofan Fan,Jinxin Hu,Yu Zhang,Xiaoyi Zeng
机构: Alibaba Group(阿里巴巴集团); Alibaba International Digital Commerce Group(阿里巴巴国际数字商业集团)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Click-through rate (CTR) prediction models estimate the probability of a user-item click by modeling interactions across a vast feature space. A fundamental yet often overlooked challenge is the inherent heterogeneity of these features: their sparsity and information content vary dramatically. For instance, categorical features like item IDs are extremely sparse, whereas numerical features like item price are relatively dense. Prevailing CTR models have largely ignored this heterogeneity, employing a uniform feature interaction strategy that inputs all features into the interaction layers simultaneously. This approach is suboptimal, as the premature introduction of low-information features can inject significant noise and mask the signals from information-rich features, which leads to model collapse and hinders the learning of robust representations. To address the above challenge, we propose a Multi-Granularity Information-Aware Deferred Interaction Network (MGDIN), which adaptively defers the introduction of features into the feature interaction process. MGDIN’s core mechanism operates in two stages: First, it employs a multi-granularity feature grouping strategy to partition the raw features into distinct groups with more homogeneous information density in different granularities, thereby mitigating the effects of extreme individual feature sparsity and enabling the model to capture feature interactions from diverse perspectives. Second, a delayed interaction mechanism is implemented through a hierarchical masking strategy, which governs when and how each group participates by masking low-information groups in the early layers and progressively unmasking them as the network deepens. This deferred introduction allows the model to establish a robust understanding based on high-information features before gradually incorporating sparser information from other groups…
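其中"浅层屏蔽低信息组、随深度逐层解掩码"的分层掩码机制,可用如下掩码日程草图示意(每层多放开几组的具体分配策略为笔者假设,并非论文设定):

```python
def group_mask_schedule(num_groups, num_layers):
    """分层掩码示意:信息密度最高的组(索引 0)从第 0 层即参与交互,
    更稀疏的组按网络深度逐层解掩码。返回每层可参与交互的组索引集合。"""
    schedule = []
    for layer in range(num_layers):
        # 每深入一层,多放开一批低信息组;线性日程仅为假设
        unmasked = min(num_groups,
                       1 + layer * max(1, num_groups // num_layers))
        schedule.append(set(range(unmasked)))
    return schedule

sched = group_mask_schedule(num_groups=4, num_layers=4)
assert sched[0] == {0}               # 浅层仅高信息组参与
assert sched[-1] == {0, 1, 2, 3}     # 最深层全部组解掩码
```

交互层在前向传播时只让当前层日程中的组参与特征交互,从而先建立稳健表征,再逐步吸收稀疏信息。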

[IR-11] Bridging Sequential and Contextual Features with a Dual-View of Fine-grained Core-Behaviors and Global Interest-Distribution

【速读】:该论文旨在解决点击率(Click-Through Rate, CTR)预测中用户行为序列与物品上下文特征之间交互建模不足的问题。传统方法通常将动态用户行为序列聚合为单一向量后再与上下文特征交互,导致行为信息损失并难以捕捉细粒度的行为-上下文关联;而直接逐行为交互则计算开销大且易受无关行为噪声干扰,削弱有效信号。解决方案的关键在于提出一种双视角交互网络(Core-Behaviors and Distributional-Compensation Dual-View Interaction Network, CDNet),从两个互补角度建模:一是聚焦最相关行为与上下文的细粒度交互,二是通过用户整体兴趣分布对上下文特征进行粗粒度补偿性建模。该设计在保留关键行为细节的同时兼顾全局用户兴趣,实现了高效且精准的序列与上下文特征交互,显著提升CTR预测性能。

链接: https://arxiv.org/abs/2603.12578
作者: Yi Xu,Chaofan Fan,Moyu Zhang,Jinxin Hu,Jiahao Wang,Hao Zhang,Shizhun Wang,Yu Zhang,Xiaoyi Zeng
机构: Alibaba Group(阿里巴巴集团); Alibaba International Digital Commerce Group(阿里巴巴国际数字商业集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Click-through rate (CTR) prediction tasks typically estimate the probability of a user clicking on a candidate item by modeling both user behavior sequence features and the item’s contextual features, where the user behavior sequence is particularly critical as it dynamically reflects real-time shifts in user interest. Traditional CTR models often aggregate this dynamic sequence into a single vector before interacting it with contextual features. This approach, however, not only leads to behavior information loss during aggregation but also severely limits the model’s capacity to capture interactions between contextual features and specific user behaviors, ultimately impairing its ability to capture fine-grained behavioral details and hindering models’ prediction accuracy. Conversely, a naive approach of directly interacting each user action with contextual features is computationally expensive and introduces significant noise from behaviors irrelevant to the candidate item. This noise tends to overwhelm the valuable signals arising from interactions involving more behaviors relevant to the candidate item. Therefore, to resolve the above issue, we propose a Core-Behaviors and Distributional-Compensation Dual-View Interaction Network (CDNet), which bridges the gap between sequential and contextual feature interactions from two complementary angles: a fine-grained interaction involving the most relevant behaviors and contextual features, and a coarse-grained interaction that models the user’s overall interest distribution against the contextual features. By simultaneously capturing important behavioral details without forgoing the holistic user interest, CDNet effectively models the interplay between sequential and contextual features without imposing a significant computational burden. Ultimately, extensive experiments validate the effectiveness of CDNet.

[IR-12] Test-Time Strategies for More Efficient and Accurate Agentic RAG

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理复杂多跳(multihop)问题时存在的效率与准确性问题,特别是基于迭代式代理框架(如Search-R1)所引发的重复检索、上下文整合不佳及token消耗过高等缺陷。其解决方案的关键在于对Search-R1流程进行测试时(test-time)的改进:引入两个核心模块——一个上下文化(contextualization)模块以更有效地将检索到的文档信息融入当前推理过程,以及一个去重(de-duplication)模块用以替换已处理过的文档并获取下一个最相关的文档。实验表明,结合GPT-4.1-mini实现的上下文化模块可使Exact Match(EM)得分提升5.6%,同时减少10.5%的检索轮次,显著提升了答案准确性和检索效率。

链接: https://arxiv.org/abs/2603.12396
作者: Brian Zhang,Deepti Guntur,Zhiyang Zuo,Abhinav Sharma,Shreyas Chaudhari,Wenlong Zhao,Franck Dernoncourt,Puneet Mathur,Ryan Rossi,Nedim Lipka
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems face challenges with complex, multihop questions, and agentic frameworks such as Search-R1 (Jin et al., 2025), which operates iteratively, have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repetitive retrieval of previously processed information and challenges in contextualizing retrieved results effectively within the current generation prompt. Such issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption. In this paper, we investigate test-time modifications to the Search-R1 pipeline to mitigate these identified shortcomings. Specifically, we explore the integration of two components and their combination: a contextualization module to better integrate relevant information from retrieved documents into reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches using the HotpotQA (Yang et al., 2018) and the Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of turns. Our best-performing variant, utilizing GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% compared to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
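其中去重模块的核心逻辑很直接:跳过先前轮次已处理的文档,取排名最靠前的新文档。下面是一个极简草图(接口与数据结构为笔者假设,并非论文实现):

```python
def dedup_retrieve(ranked_doc_ids, seen, k=3):
    """去重检索示意:过滤掉先前轮次已处理的文档,
    返回排名最靠前的 k 个新文档,并记入 seen 供后续轮次使用。"""
    fresh = [d for d in ranked_doc_ids if d not in seen]
    picked = fresh[:k]
    seen.update(picked)
    return picked

seen = set()
turn1 = dedup_retrieve(["d1", "d2", "d3", "d4"], seen, k=2)
turn2 = dedup_retrieve(["d2", "d1", "d5", "d6"], seen, k=2)
assert turn1 == ["d1", "d2"]
assert turn2 == ["d5", "d6"]  # 已见过的 d1、d2 被跳过
```

这一测试时(test-time)改动无需重新训练模型,只修改检索结果进入提示前的处理流程。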

[IR-13] Multi-Step Semantic Reasoning in Generative Retrieval ECIR2026

【速读】:该论文旨在解决生成式检索(Generative Retrieval, GR)模型在数值型上下文中的多步语义推理能力不足的问题,尤其是在金融报告等复杂文档中处理涉及数值推理的查询时表现不佳,导致检索准确率低下。解决方案的关键在于提出ReasonGR框架,其核心包括两个方面:一是采用结构化提示策略,结合任务特定指令与分步推理引导,增强模型对复杂查询的理解;二是引入一个专注于推理能力的适应模块,优化与推理相关参数的学习,从而显著提升GR模型在高阶语义推理场景下的检索性能与一致性。

链接: https://arxiv.org/abs/2603.12368
作者: Steven Dong,Yubao Tang,Maarten de Rijke
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted at ECIR2026

点击查看摘要

Abstract:Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve the learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.

[IR-14] Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning

【速读】:该论文旨在解决学术网络中因误引(miscitation)导致的知识可信度下降问题,即引用文献未能支持或甚至与被引内容相矛盾的现象。现有方法主要依赖语义相似性或网络异常检测,难以捕捉引用上下文与其在知识网络中位置之间的复杂关系。解决方案的关键在于提出一种基于大语言模型(LLM)增强的图学习框架——LAGMiD,其核心创新包括:1)引入证据链推理机制(evidence-chain reasoning),通过思维链提示(chain-of-thought prompting)实现多跳引用追踪和语义一致性评估;2)设计知识蒸馏策略,将LLM在中间推理状态中的语义知识迁移至图神经网络(GNN),以降低计算开销并提升可扩展性;3)采用协同学习机制,动态分配复杂案例至LLM进行精细推理,同时优化GNN对结构特征的泛化能力。实验证明,该方法在多个真实数据集上实现了最先进的误引检测性能,并显著降低了推理成本。

链接: https://arxiv.org/abs/2603.12290
作者: Huidong Wu,Haojia Xiang,Jingtong Gao,Xiangyu Zhao,Dengsheng Wu,Jianping Li
机构: Chinese Academy of Sciences (中国科学院); City University of Hong Kong (香港城市大学); Shenzhen University (深圳大学); Beijing Dongbi Data Technology (北京东必数据科技)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scholarly web is a vast network of knowledge connected by citations. However, this system is increasingly compromised by miscitation, where references do not support or even contradict the claims they are cited for. Current miscitation detection methods, which primarily rely on semantic similarity or network anomalies, struggle to capture the nuanced relationship between a citation’s context and its place in the wider network. While large language models (LLMs) offer powerful capabilities in semantic reasoning for this task, their deployment is hindered by hallucination risks and high computational costs. In this work, we introduce LLM-Augmented Graph Learning-based Miscitation Detector (LAGMiD), a novel framework that leverages LLMs for deep semantic reasoning over citation graphs and distills this knowledge into graph neural networks (GNNs) for efficient and scalable miscitation detection. Specifically, LAGMiD introduces an evidence-chain reasoning mechanism, which uses chain-of-thought prompting, to perform multi-hop citation tracing and assess semantic fidelity. To reduce LLM inference costs, we design a knowledge distillation method aligning GNN embeddings with intermediate LLM reasoning states. A collaborative learning strategy further routes complex cases to the LLM while optimizing the GNN for structure-based generalization. Experiments on three real-world benchmarks show that LAGMiD achieves state-of-the-art miscitation detection with significantly reduced inference cost.

[IR-15] Algorithmic Trust and Compliance: Benchmarking Brand Notability for UK iGaming Entities in Generative Search Engines

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的搜索引擎(如 ChatGPT、Perplexity 和 Gemini)兴起背景下,传统搜索引擎优化(SEO)策略失效的问题,尤其在高度监管的英国博彩行业(UK iGaming)中,如何有效提升实体在 AI 搜索结果中的权威性与可见性。解决方案的关键在于构建“生成式引擎优化”(Generative Engine Optimization, GEO)框架,其核心是通过结构化合规信号(如英国博彩委员会 UKGC 标准)增强大型语言模型(LLM)对内容的信任度,并强调内容需具备机器可读性和可验证性,以赢得 AI 对“算法信任”(Algorithmic Trust)的青睐,从而在 AI 搜索结果中占据主导地位。

链接: https://arxiv.org/abs/2603.12282
作者: Julen Oruesagasti
机构: Interamplify
类目: Information Retrieval (cs.IR)
备注: Technical Report. Produced by Interamplify Research Division (UK)

点击查看摘要

Abstract:The rapid adoption of generative AI-powered search engines, such as ChatGPT, Perplexity, and Gemini, is fundamentally reshaping information retrieval. We are witnessing a critical shift from traditional ranked lists to synthesized, citation-backed answers. This paradigm shift challenges established Search Engine Optimization (SEO) practices and necessitates a new framework, termed Generative Engine Optimization (GEO). In highly regulated environments like the UK iGaming sector, visibility is no longer dictated by keyword density, but by an entity’s ability to project “Algorithmic Trust”. This report presents an empirical analysis of how compliance signals – such as UK Gambling Commission (UKGC) standards – function as authority multipliers for Large Language Models (LLMs) when properly structured. Recent large-scale experiments reveal that AI Search exhibits a systematic and overwhelming bias towards Earned media (third-party, authoritative sources) over Brand-owned content. Consequently, practitioners must engineer their content for machine scannability and justification to dominate these new AI-perceived authority metrics.

人机交互

[HC-0] Navig-AI-tion: Navigation by Contextual AI and Spatial Audio

【速读】:该论文旨在解决音频导航系统在引导用户行走时易导致方向迷失的问题,传统音频导航依赖模糊的方位词(如“向北”或“左转”)且缺乏实时环境上下文信息,从而引发频繁路径偏离。其解决方案的关键在于将视觉语言模型(Vision Language Model, VLM)与空间音频线索(spatial audio cue)相结合:VLM从环境图像中提取地标作为导航锚点以增强语义理解,同时在用户朝向错误时提供指向性的空间音频信号,明确指示所需转向方向。实验表明,该融合方案显著降低了路径偏差,并提升了用户的定向感知与导航体验。

链接: https://arxiv.org/abs/2603.13200
作者: Mathias N. Lystbæk,Haley Adams,Ranjith Kagathi Ananda,Eric J Gonzalez,Luca Ballan,Qiuxuan Wu,Andrea Colaço,Peter Tan,Mar Gonzalez-Franco
机构: Google(谷歌)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 2 figures, to be published in Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26), 6 pages appendix

点击查看摘要

Abstract:Audio-only walking navigation can leave users disoriented, relying on vague cardinal directions and lacking real-time environmental context, leading to frequent errors. To address this, we present a novel system that integrates a Vision Language Model (VLM) with a spatial audio cue. Our system extracts environmental landmarks to anchor navigation instructions and, crucially, provides a directional spatial audio signal when the user faces the wrong direction, indicating the precise turn direction. In a user study (n=12), the spatial audio cue with VLM reduced route deviations compared to both VLM-only and Google Maps (audio-only) baseline systems. Users reported that the spatial audio cue effectively supported orientation and that landmark-anchored instructions provided a better navigation experience over audio-only Google Maps. This work serves as an initial look at the utility of future audio-only navigation systems for incorporating directional cues, especially real-time corrective spatial audio.
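摘要中"用户朝向错误时给出指向性空间音频信号"的判定逻辑,可用如下角差计算草图示意(15° 的"朝向正确"阈值与函数名均为笔者假设,并非论文参数):

```python
def turn_cue(heading_deg, target_bearing_deg, tolerance=15.0):
    """转向提示示意:计算用户朝向与目标方位之间的最小有符号角差,
    在容差内视为朝向正确,否则给出左/右转的空间音频提示方向。"""
    # 归一化到 (-180, 180]:正值表示目标在右侧,负值在左侧
    diff = (target_bearing_deg - heading_deg + 180) % 360 - 180
    if abs(diff) < tolerance:
        return "ahead"
    return "right" if diff > 0 else "left"

assert turn_cue(0, 90) == "right"
assert turn_cue(0, 350) == "ahead"   # 角差 -10°,在容差内
assert turn_cue(90, 0) == "left"
```

实际系统中该方向会驱动双耳渲染的空间音频,使提示音听起来从需要转向的一侧传来。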

[HC-1] Memory Printer: Exploring Everyday Reminiscing by Combining Slow Design with Generative AI-based Image Creation

【速读】:该论文旨在解决当前基于网络的生成式人工智能(Generative Artificial Intelligence, GAI)工具在记忆重构应用中因交互脱节与不可预测性而削弱用户主体性(agency)的问题。其解决方案的关键在于通过“慢速、具身化”的交互设计,使时间性、具身代理权和生成过程变得可体验地清晰(experientially legible)。具体实现为提出并验证了Memory Printer这一具身设计:它融合丝网印刷隐喻与文本到图像生成技术,包含分层重建机制以分解图像生成步骤、木质刮刀提供物理控制以增强对图像显现的掌控感,并内置打印功能产出实体照片,从而在情感敏感场景中重塑人-机关系边界。

链接: https://arxiv.org/abs/2603.13116
作者: Zhou Fang,Janet Yi-Ching Huang
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026

点击查看摘要

Abstract:Generative Artificial Intelligence (GAI) offers new opportunities for reconstructing unrecorded memory scenes, yet existing web-based tools undermine users’ sense of agency through disengaging and unpredictable interactions. In this work, we advance three design arguments about how slow, tangible interaction can reshape human-AI relationships by making temporality, embodied agency, and generative processes experientially legible. We instantiate these arguments by presenting Memory Printer, a tangible design that combines silk-screen printing metaphors with text-to-image generation. The design features layered reconstruction that decomposes image generation into incremental steps, a physical wooden scraper enabling embodied control over image revelation, and built-in printing that produces tangible photos. We examine these arguments through a comparative study with 24 participants, exploring how participants engage with, interpret, and respond to this interaction stance. The study surfaces both opportunities – such as vivid memory evocation, heightened sense of control, and creative exploration – and critical tensions, including risks of false memory formation, algorithmic bias, and data privacy. Together, these findings articulate important boundaries for deploying generative AI in emotionally sensitive contexts.

[HC-2] Interrogating Design Homogenization in Web Vibe Coding

【速读】:该论文旨在解决生成式 AI(Generative AI)在网页设计领域可能引发的设计同质化问题,特别是当非专业用户通过“vibe coding”方式使用大语言模型(LLM)生成网站时,其对训练数据中主流风格的依赖可能导致创意表达受限、网络设计多样性下降。解决方案的关键在于提出以“生产性摩擦”(productive friction)为核心的缓解框架,通过在微观(个体创作)、中观(协作与工具设计)和宏观(生态系统)层面引入适度的交互阻力与反思机制,促使创作者主动挑战默认输出,从而维护AI辅助网页设计中的多样性与创造性表达。

链接: https://arxiv.org/abs/2603.13036
作者: Donghoon Shin,Alice Gao,Rock Yuren Pang,Jaewook Lee,Katharina Reinecke,Emily Tseng
机构: University of Washington (华盛顿大学); Microsoft Research (微软研究院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Generative AI is known for its tendency to homogenize, often reproducing dominant style conventions found in training data. However, it remains unclear how these homogenizing effects extend to complex structural tasks like web design. As lay creators increasingly turn to LLMs to ‘vibe-code’ websites – prompting for aesthetic and functional goals rather than writing code – they may inadvertently narrow the diversity of their designs, and limit creative expression throughout the internet. In this paper, we interrogate the possibility of design homogenization in web vibe coding. We first characterize the vibe coding lifecycle, pinpointing stages where homogenization risks may arise. We then conduct a sociotechnical risk analysis unpacking the potential harms of web vibe coding and their interaction with design homogenization. We identify that the push for frictionless generation can exacerbate homogenization and its harms. Finally, we propose a mitigation framework centered on the idea of productive friction. Through case studies at the micro, meso, and macro levels, we show how centering productive friction can empower creators to challenge default outputs and preserve diverse expression in AI-mediated web design.

[HC-3] Generative Horcrux: Designing AI Carriers for Afterlife Selves

【速读】:该论文试图解决的问题是如何在生成式 AI (Generative AI) 技术快速发展的背景下,重新构想数字遗产的形式与意义,特别是如何通过 AI 代理(AI agent)实现对逝者记忆、身份和存在感的延续与交互。其解决方案的关键在于利用虚构叙事与动手原型设计相结合的方法,探索 AI 代理作为数字遗存载体的可能性,不仅关注其功能性和存储属性,更强调其承载的情感价值、记忆连结以及与“生成式魂器”(Generative Horcrux)这一扩展概念之间的关系,从而推动人机交互(HCI)、设计研究和 AI 伦理领域的跨学科对话与方法论创新。

链接: https://arxiv.org/abs/2603.12971
作者: Zhen-Chi Lai,Yu-Ting Cheng,Pei-Ying Lin,Chiao-Wei Ho,Janet Yi-Ching Huang
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages

点击查看摘要

Abstract:As generative AI technologies rapidly advance, AI agents are gaining the ability not only to collect data and perform tasks but also to respond to environments and evolve over time. This shift opens new possibilities for reimagining digital legacy - raising critical questions about how we remember, commemorate, and interact with the traces of the deceased. The forms of these AI agents are particularly important, as they act as vessels for digital legacies - much like urns for ashes. We will ask: What kinds of devices or representations would we want to store our digital selves or legacies in? How do we envision future generations interacting with these forms? The question is not only about the function of these agents or the object’s role as a storage vessel but also the meaning it carries, the memories it preserves, and its connection to the extended notion of our “Generative Horcrux.” This three-hour in-person workshop invites design practitioners and researchers from diverse backgrounds to explore the emerging landscape of generative AI agent-based digital legacy. This workshop uses fiction and hands-on prototyping to explore how AI agents might reconfigure memory, identity, and posthumous presence in future sociotechnical worlds. We anticipate that this session will foster interdisciplinary dialogue and contribute conceptually and methodologically to HCI, design research, and AI ethics.

[HC-4] Teaching Agile Requirements Engineering: A Stakeholder Simulation with Generative AI

【速读】:该论文旨在解决敏捷软件开发实践中用户与客户参与度不足的问题,尤其关注高校学生在学习过程中缺乏对敏捷需求工程(Agile Requirements Engineering)良好实践的掌握。解决方案的关键在于引入一种基于生成式人工智能(Generative Artificial Intelligence, GenAI)的干系人模拟方法,通过预设的“AI角色扮演提示”(meta-prompt)让学生与AI Personas进行访谈,并在此基础上运用敏捷实践(如故事地图或影响映射)进行需求文档编写,最终通过结构化小组讨论反思GenAI在技术与伦理层面的局限性。该方法不仅提升了学生的实操能力,还增强了其对AI工具实际边界的认识。

链接: https://arxiv.org/abs/2603.12925
作者: Eva-Maria Schön,Michael Neumann,Tiago Silva da Silva
机构: 未知
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Context: The active involvement of users and customers in agile software development remains a persistent challenge in practice. For this reason, it is important that students in higher education become familiar with good practices in Agile Requirements Engineering during their studies. Objective: Our objective is to enable students to learn how to interact with Generative Artificial Intelligence (GenAI) through the use of a stakeholder simulation with AI Personas, while also developing an understanding of the limitations of AI tools in practical contexts. Method: In our courses, we employ a stakeholder simulation using GenAI, in which students conduct interviews with AI Personas through a provided meta-prompt. Based on the outcomes of these interviews, students apply agile practices (e.g., story mapping or impact mapping) to document requirements. The use of GenAI is subsequently reflected upon in a structured group discussion. Results: Through this approach, students gain practical experience by applying state-of-the-art agile practices for requirements elicitation and documentation while simultaneously developing an understanding of the technical and ethical limitations associated with the use of generative AI. Conclusion: We have applied this approach over several terms and found that using a meta-prompt provides flexibility, allowing us to remain independent of specific large language model providers.
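The meta-prompt-driven stakeholder simulation can be sketched in code. The `build_persona_prompt` helper below, together with its template wording, persona, goals, and constraints, is a hypothetical illustration of the approach, not the course's actual meta-prompt:

```python
def build_persona_prompt(role, goals, constraints):
    """Assemble an illustrative stakeholder meta-prompt for an AI Persona.
    The template text is invented for demonstration purposes."""
    lines = [
        f"You are {role}, a stakeholder in a software project.",
        "Answer interview questions in character and stay consistent.",
        "Your goals:",
    ]
    lines += [f"- {g}" for g in goals]
    lines.append("Constraints you must respect:")
    lines += [f"- {c}" for c in constraints]
    # Force students to elicit requirements through questioning
    lines.append("Never reveal requirements unprompted; make the interviewer ask.")
    return "\n".join(lines)

prompt = build_persona_prompt(
    "a warehouse manager",
    ["track inventory in real time", "reduce manual paperwork"],
    ["budget is limited", "staff have low IT experience"],
)
print(prompt.splitlines()[0])
```

A prompt built this way can be pasted into any chat-capable LLM, which is consistent with the authors' point that a meta-prompt keeps the exercise independent of a specific model provider.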

[HC-5] Human-Centered Evaluation of an LLM-Based Process Modeling Copilot: A Mixed-Methods Study with Domain Experts

【速读】:该论文旨在解决当前将大语言模型(Large Language Models, LLMs)集成到业务流程管理(Business Process Management, BPM)工具中时,缺乏对人类因素(如信任、可用性及专业契合度)考量的问题。现有自动化评估框架仅关注语法和语义质量,忽略了用户实际使用中的主观体验与可靠性需求。论文提出的解决方案是开发一个基于LLM的BPMN协作者(BPMN copilot),通过混合方法学评估其在五位流程建模专家中的表现,发现尽管可用性评分尚可(平均CUQ得分67.2/100),但信任度显著偏低(平均得分48.8/100),且可靠性成为最突出关切(均值1.8/5)。关键在于:LLM需提升对流程细节的深入澄清能力以改善输出质量,并强调必须采用以人为中心的评估方式来补充自动化基准测试,从而实现面向非专家用户的生成式AI(Generative AI)流程建模工具的可信部署。

链接: https://arxiv.org/abs/2603.12895
作者: Chantale Lauer,Peter Pfeiffer,Nijat Mehdiyev
机构: Saarland University (萨尔兰大学); German Research Institute for Artificial Intelligence (德国人工智能研究中心); 5Plus GmbH (五加公司)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Human-centered Evaluation and Auditing of Language Models Workshop

点击查看摘要

Abstract:Integrating Large Language Models (LLMs) into business process management tools promises to democratize Business Process Model and Notation (BPMN) modeling for non-experts. While automated frameworks assess syntactic and semantic quality, they miss human factors like trust, usability, and professional alignment. We conducted a mixed-methods evaluation of our proposed solution, an LLM-powered BPMN copilot, with five process modeling experts using focus groups and standardized questionnaires. Our findings reveal a critical tension between acceptable perceived usability (mean CUQ score: 67.2/100) and notably lower trust (mean score: 48.8%), with reliability rated as the most critical concern (M=1.8/5). Furthermore, we identified output-quality issues, prompting difficulties, and a need for the LLM to ask more in-depth clarifying questions about the process. We envision five use cases ranging from domain-expert support to enterprise quality assurance. We demonstrate the necessity of human-centered evaluation complementing automated benchmarking for LLM modeling agents.

[HC-6] Exploring the role of embodiment on intimacy perception in a multiparty collaborative task

【速读】:该论文旨在解决协作类桌游中群体凝聚力(cohesion)的影响因素问题,特别是探讨哪些社会技能能够提升多主体间的协同效率。其解决方案的关键在于通过设计一个基于真实人类协作桌游数据集的实验协议,系统考察不同代理(agent)具身性(embodiment)对个体及群体感知的影响,从而为多智能体系统(multi-agent system)在社交互动中的设计提供实证依据与理论框架。

链接: https://arxiv.org/abs/2603.12783
作者: Amine Benamara(LISN),Céline Clavel(LISN),Brian Ravenet(LISN),Nicolas Sabouret(LISN),Julien Saunier(LITIS)
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:During collaborative board games, cohesion is a key aspect of a well-functioning group. From the success of the task to the development of interpersonal relationships, this concept covers many aspects of group dynamics. The goal of our work is to investigate the factors that impact cohesion in a group, and specifically the relevant social skills that improve collaboration between multiple entities. In this article, we focus on the role of embodiment in different aspects of an interaction. We propose an experimental protocol, based on a collected corpus of humans playing a collaborative board game, to study how different agents’ embodiment affects the perception of these agents and of the group as a whole. We conclude by outlining the challenges involved in designing the protocol and the related multi-agent system issues.

[HC-7] The RIGID Framework: Research-Integrated Generative AI-Mediated Instructional Design

【速读】:该论文试图解决教育设计(Instructional Design, ID)实践中难以系统整合学习科学(Learning Sciences, LS)研究成果的问题,尤其是在日常设计流程中如何有效应用基于证据的 pedagogical best practices。当前ID与LS虽均致力于通过设计导向方法提升学习体验,但两者之间的结构化整合仍较薄弱,导致互补性洞见未被充分利用。解决方案的关键在于提出RIGID(Research-Integrated, Generative AI-Mediated Instructional Design)框架,该框架将LS研究系统嵌入ID的分析、设计、实施和评估各阶段,并借助生成式AI(Generative AI)在每个环节实现研究知识的动态中介与操作化,从而构建一个既具可操作性又具备情境敏感性的研究融合型教学设计体系,同时保障人类专家的核心作用。

链接: https://arxiv.org/abs/2603.12781
作者: Yerin Kwak,Zachary A. Pardos
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Instructional Design (ID) often faces challenges in incorporating research-based knowledge and pedagogical best practices. Although educational researchers and government agencies emphasize grounding ID in evidence, integrating research findings into everyday design workflows is often complex, as it requires considering multiple context-specific demands and constraints. To address this persistent gap, this paper explores how research in the learning sciences (LS) can be systematically integrated across ID workflows and how recent advances in generative AI can help operationalize this integration. While ID and LS share a commitment to improving learning experiences through design-oriented approaches in authentic contexts, structured integration between the two fields remains limited, leaving their complementary insights underutilized. We present RIGID (Research-Integrated, Generative AI-Mediated Instructional Design), a unified framework that integrates LS research across ID workflows spanning analysis, design, implementation, and evaluation phases, while leveraging generative AI to mediate this integration at each stage. The RIGID framework provides a systematic approach for enabling research-integrated instructional design that is both operational and context-sensitive, while preserving the central role of human expertise.

[HC-8] What You Prompt is What You Get: Increasing Transparency of Prompting Using Prompt Cards

【速读】:该论文试图解决当前提示工程(prompt engineering)领域中缺乏标准化的提示文档与评估实践的问题,尤其在提示内容冗长、复杂且难以在主观任务上进行有效评估的情况下。解决方案的关键在于提出“提示卡”(prompt cards)这一结构化摘要工具,其灵感来源于模型卡(model cards),旨在系统性地记录提示工程中的目标、考量因素、实施步骤以及评估方法等关键信息。通过提示卡,可以显著提升生成式 AI 系统的可复现性、透明度,并促进提示方法论的改进,从而为判断文本生成质量提供一种有效的替代基准测试的手段。

链接: https://arxiv.org/abs/2603.12741
作者: Amandine M. Caut,Beimnet Zenebe,Amy Rouillard,David J. T. Sumpter
机构: Uppsala University (乌普萨拉大学); Addis Ababa University (亚的斯亚贝巴大学); Stellenbosch University (斯泰伦博斯大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The rapid advancement and impressive capabilities of large language models (LLMs) have given rise to the field of prompt engineering, the practice of crafting inputs to guide LLMs toward high-quality, task-relevant outputs. A critical challenge facing the field is the lack of standardised prompt documentation and evaluation practices. Prompts can be long, complex and difficult to evaluate on subjective tasks. To address this challenge, we propose the use of prompt cards, structured summaries of prompt engineering practices inspired by the concept of model cards. Through prompt cards, the specific goals, considerations and steps taken during prompt engineering can be systematically documented and assessed. We present the prompt card approach and illustrate it on a specific task called wordalisation, in which structured numerical data is transformed into text. We argue that a well-structured prompt card can enable better reproducibility and transparency, improve prompt methodology, and give an effective alternative to benchmarking for judging the quality of generated texts. By systematically capturing underlying model details, prompt intent, contextualisation strategies, evaluation practices and ethical considerations, prompt cards make explicit the often implicit design decisions that shape system behaviour. Documenting these choices is important as prompting increasingly involves complex pipelines with multiple moving parts.
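The kind of structured record a prompt card captures can be sketched as a simple data class. The `PromptCard` type and its field names below are hypothetical illustrations of the categories of information the paper argues should be documented, not the authors' actual schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class PromptCard:
    """Illustrative prompt-card record; field names are invented,
    not the schema proposed in the paper."""
    task: str                 # what the prompt is for
    model: str                # underlying model details
    intent: str               # goal of the prompt
    context_strategy: str     # how context is injected
    evaluation: str           # how outputs are judged
    ethical_notes: list = field(default_factory=list)

card = PromptCard(
    task="wordalisation",
    model="example-llm-v1",
    intent="turn structured numbers into readable text",
    context_strategy="inline table serialized as CSV",
    evaluation="human rating of factual consistency",
    ethical_notes=["avoid speculative claims about individuals"],
)
print(asdict(card)["task"])
```

Serializing such a record (e.g. via `asdict`) makes the documented decisions machine-readable, so cards could be versioned alongside the prompts they describe.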

[HC-9] Virtual reality for large-scale laboratories based on colorized point clouds: design and pedagogical impact

【速读】:该论文旨在解决工程教育中传统实验室现场教学受限于时间、可及性与安全性的问题。其解决方案的关键在于构建了一个基于WebVR的虚拟实验室系统,该系统融合了Unity与Potree技术,实现了大规模彩色点云数据的高保真可视化与浏览器端的高级交互功能,支持沉浸式第一人称探索、引导式导航、交互热点( conveying equipment and safety information)以及应急疏散模拟,从而提供一个可扩展、易访问且低风险的学习平台,有效补充传统实验教学。

链接: https://arxiv.org/abs/2603.12727
作者: Lei Fan,Yuxin Li
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Effective laboratory training is essential in engineering education, yet conventional on-site instruction is often constrained by time, accessibility, and safety considerations. To address these challenges, this study presents the design, implementation, and evaluation of a web-based virtual reality (WebVR) representation of a large-scale engineering laboratory constructed from massive colorized point cloud data. This study proposes a novel WebVR framework that integrates Unity and Potree for high-fidelity point-cloud visualization combined with advanced interactive capabilities in a browser-based virtual laboratory. It supports immersive first-person exploration, guided navigation, interactive hotspots conveying equipment and safety information, as well as emergency evacuation simulations. The usability, educational effectiveness, and overall acceptance of the virtual laboratory were evaluated through an anonymous questionnaire administered to students and laboratory staff. The results indicate overwhelmingly positive feedback, with all participants rating the system as “good” or “excellent” across all evaluation dimensions. Participants particularly emphasized the benefits of immersive exploration and self-directed learning. In addition, qualitative feedback was systematically analyzed to inform future enhancements of the virtual environment. Overall, the findings demonstrate that the WebVR-based virtual laboratory can effectively complement conventional on-site laboratory instruction, offering a scalable, accessible, and low-risk platform that enhances learning experiences in engineering education.

[HC-10] Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration

【速读】:该论文旨在解决当前基于视觉的智能助手在协作任务中效率低下的问题,具体表现为两个关键“鸿沟”:一是“沟通鸿沟”(communication gulf),即用户需将丰富的并行意图通过语言指令表达,因人机交互通道不匹配而造成信息损失;二是“理解鸿沟”(understanding gulf),即AI难以解析人类细微的具身线索(embodied cues)。解决方案的关键在于提出Eye2Eye框架,利用第一人称视角(first-person perspective)作为人-机认知对齐的媒介,其核心包含三个组件:(1)联合注意协调机制以实现注意力的动态对齐,(2)可回溯记忆系统以维持不断演化的共同知识基础,(3)反射式反馈机制使用户能够澄清和修正AI的理解。实验证明,该框架能显著缩短任务完成时间、降低交互负荷并提升用户信任度,各模块协同作用有效提升了协作效率。

链接: https://arxiv.org/abs/2603.12701
作者: Zhuyu Teng,Pei Chen,Yichen Cai,Ruoqing Lu,Zhaoqu Jiang,Jiayang Li,Weitao You,Lingyun Sun
机构: Zhejiang University (浙江大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 19 pages, 11 figures. Accepted at ACM CHI 2026, Barcelona

点击查看摘要

Abstract:Despite advances in multimodal AI, current vision-based assistants often remain inefficient in collaborative tasks. We identify two key gulfs: a communication gulf, where users must translate rich parallel intentions into verbal commands due to the channel mismatch, and an understanding gulf, where AI struggles to interpret subtle embodied cues. To address these, we propose Eye2Eye, a framework that leverages first-person perspective as a channel for human-AI cognitive alignment. It integrates three components: (1) joint attention coordination for fluid focus alignment, (2) revisable memory to maintain evolving common ground, and (3) reflective feedback allowing users to clarify and refine AI’s understanding. We implement this framework in an AR prototype and evaluate it through a user study and a post-hoc pipeline evaluation. Results show that Eye2Eye significantly reduces task completion time and interaction load while increasing trust, demonstrating its components work in concert to improve collaboration.

[HC-11] Using a Human-AI Teaming Approach to Create and Curate Scientific Datasets with the SCILIRE System EACL

【速读】:该论文旨在解决科学文献快速增长导致人工提取结构化知识日益不切实际的问题。其解决方案的关键在于提出SCILIRE系统,该系统基于人机协同(Human-AI Teaming)原则,围绕数据验证与整理的工作流设计,支持研究人员迭代审查和修正大语言模型(Large Language Model, LLM)输出,并将这种交互作为反馈信号用于优化后续的LLM推理性能,从而提升信息抽取的准确性和数据集构建效率。

链接: https://arxiv.org/abs/2603.12638
作者: Necva Bölücü,Jessica Irons,Changhyun Lee,Brian Jin,Maciej Rybinski,Huichen Yang,Andreas Duenser,Stephen Wan
机构: CSIRO(澳大利亚联邦科学与工业研究组织); ITIS, University of Málaga(马拉加大学信息技术研究所)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 17pages, 9 figures, EACL demo track

点击查看摘要

Abstract:The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.

[HC-12] “I Should Know But I Dare Not Ask”: From Understanding Challenges in Patient Journeys to Deriving Design Implications for North Korean Defectors’ Adaptation

【速读】:该论文旨在解决朝鲜难民(North Korean Defectors, NKDs)在韩国医疗体系中面临的患者旅程障碍问题,特别是临床咨询环节中存在的三大挑战:症状表达困难、社会文化顾虑以及语言差异。解决方案的关键在于开发了一个名为Medibridge的移动原型系统,该系统通过生成式AI(Generative AI)模拟医生角色,使用户在真实就诊前进行情境化排练,并生成可携带的“Helper Note”作为沟通辅助工具。实证评估显示,该方案显著提升了NKDs对医患沟通能力的主观感知,包括表达清晰度增强、社会文化焦虑降低及语言自信提高,为 displaced populations 提供了具身化的医疗沟通准备机制。

链接: https://arxiv.org/abs/2603.12632
作者: Hyungwoo Song,Jeongha Kim,Minju Kim,Duhyung Kwak,Minjeong Shin,Bongwon Suh,Hyunggu Jung
机构: Seoul National University (首尔国立大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:While it is known that North Korean defectors (NKDs) struggle with South Korea’s healthcare system, the specific challenges of their patient journey remain underexplored. To investigate this, we conducted interviews with 10 NKDs about an 8-step patient journey and identified the clinical consultation step as a critical barrier for all participants, marked by three key challenges: expressing symptoms, managing social and cultural concerns, and overcoming language differences. In response, we developed Medibridge, a mobile prototype that allows users to rehearse with an AI doctor before a real hospital visit to generate a tangible “Helper Note” for their actual consultation. Our evaluation with 15 NKDs showed improvements in perceived communication capability, including greater expression clarity, reduced social and cultural concerns, and enhanced linguistic confidence. Our contributions include an empirical understanding of NKDs’ healthcare challenges, a novel AI-powered rehearsal system that prepares users for real-world clinical communication, and design implications for inclusive technologies for displaced populations.

[HC-13] Leveraging Head Movement for Navigating Off-Screen Content on Large Curved Displays

【速读】:该论文旨在解决大曲面显示屏在360°内容浏览时的视口限制问题,即传统交互方式仅支持180°视场角,导致大量信息被排除在屏幕之外。为提升用户对离屏内容的访问效率与体验,研究提出利用头部运动作为替代输入方式来控制工作区平移,从而实现全视角内容的自然导航。解决方案的关键在于设计并评估不同速率控制函数(线性、S型、多项式)和区域控制函数(连续、摩擦、中断、叠加),其中多项式速率控制表现出最优性能,在任务完成时间与主观满意度上均优于其他选项;进一步实验表明,基于多项式头动控制的导航方法在地图任务中显著优于行业标准的点击拖拽和摇杆推压控制器,为未来面向大曲面显示设备的360°工作区交互设计提供了实证依据与设计指南。

链接: https://arxiv.org/abs/2603.12620
作者: A K M Amanat Ullah,David Ahlström,Khalad Hasan
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large curved displays are ideal for viewing 360° content, such as 3D maps, but typically restrict users to a 180° viewport, leaving information off-screen. Since users naturally direct their heads toward regions on-screen before interacting, head movements offer a promising alternative for workspace manipulation to bring off-screen content into view. We explore rate control functions (linear, sigmoid, polynomial) and zone control functions (continuous, friction, interrupted, additive) to translate head rotations into workspace control, enabling users to access off-screen content. Polynomial rate control emerges as the best choice, achieving the fastest trial times and highest subjective ratings. Using a map navigation task, our second study demonstrates that users perform better with the polynomial head-based technique than with the industry-standard controller-based methods, click-and-drag and joystick-push, for 360° workspace navigation. Based on these findings, we provide guidelines to inform the design of future 360° workspace navigation techniques for large curved displays.
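The polynomial rate-control idea, mapping how far the head has rotated past a threshold to a workspace velocity, can be sketched as a simple transfer function. The dead-zone width, gain, and exponent below are invented placeholder values, not the parameters evaluated in the study:

```python
def polynomial_rate_control(head_angle_deg, dead_zone=10.0, gain=0.05, power=2):
    """Map head rotation (degrees past center) to workspace angular
    velocity (deg/s). Dead zone, gain, and exponent are illustrative
    values, not those reported in the paper."""
    magnitude = abs(head_angle_deg) - dead_zone
    if magnitude <= 0:
        return 0.0  # inside the dead zone: the workspace stays put
    velocity = gain * magnitude ** power
    # Preserve the direction of the head rotation
    return velocity if head_angle_deg > 0 else -velocity

print(polynomial_rate_control(5.0))    # inside dead zone: 0.0
print(polynomial_rate_control(30.0))   # 0.05 * 20**2 = 20.0
print(polynomial_rate_control(-30.0))  # symmetric: -20.0
```

The polynomial shape keeps the workspace nearly still for small, incidental head motions while accelerating quickly for deliberate large rotations, which is one plausible reason this variant outperformed the linear and sigmoid alternatives.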

[HC-14] Literary Narrative as Moral Probe : A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior

【速读】:该论文旨在解决当前人工智能(AI)道德评估框架仅测试生成“正确答案”式伦理回应,而非真正衡量道德推理能力的问题。其解决方案的关键在于引入一种基于文学叙事的新型探测方法:使用来自已出版科幻系列中不可解的道德情境作为刺激材料,这些情境在结构上能抵抗表面表现性响应,从而更真实地揭示模型是否具备深层道德推理能力。研究通过跨系统、多条件实验(共24个条件),发现所有维度对比较均无显著差异(零差值),且独立评估者间具有一致性,表明该方法能够有效区分不同层级的AI道德推理能力,并识别出五种不同的反思失败模式,证明文学叙事可作为随AI能力提升而更具判别力的前瞻性评估工具。

链接: https://arxiv.org/abs/2603.12615
作者: David C. Flynn
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 27 pages, 6 tables. Target: Minds and Machines (Springer)

点击查看摘要

Abstract:Existing AI moral evaluation frameworks test for the production of correct-sounding ethical responses rather than the presence of genuine moral reasoning capacity. This paper introduces a novel probe methodology using literary narrative - specifically, unresolvable moral scenarios drawn from a published science fiction series - as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n=7) and Series 2 (local and API open-source systems, blind and declared; n=6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions), yielding zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling probe judges (Gemini Pro and Copilot Pro; rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified - including categorical self-misidentification and false positive self-attribution - suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument - one that becomes more discriminating as AI capability increases - and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.

[HC-15] How GenAI Mentor Configurations Shape Early Collaborative Dynamics: A Classroom Comparison of Individual and Shared Agents

【速读】:该论文试图解决的问题是:生成式 AI (Generative AI) 在计算机支持的协作学习(CSCL)中的不同配置如何重塑协作调节过程。解决方案的关键在于通过实验设计对比两种AI配置——共享AI配置(每组仅与一个AI导师交互)和个体AI配置(每位学生拥有独立AI实例),并结合多层话语编码、滞后序列分析(LSA)和有序网络分析(ONA),系统揭示不同配置下协作调节动态的差异,从而阐明AI配置作为结构性设计变量如何重构课堂协作的调节生态。

链接: https://arxiv.org/abs/2603.12600
作者: Siyu Zha,Weijing Liu,Fei Qin,Jie Cao,Yanjin Wang,Yujia Liu,Kaiyi Zhang,Jiangtao Gong,Yingqing Xu
机构: Tsinghua University (清华大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); City University of Hong Kong (香港城市大学); Carnegie Mellon University (卡内基梅隆大学); Xi’an Jiaotong University (西安交通大学)
类目: Human-Computer Interaction (cs.HC)
备注: 26 pages, 5 figures

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) is increasingly embedded in computer-supported collaborative learning (CSCL), yet little empirical research has unpacked how different configurations of AI participation reshape collaborative processes. This study investigates how GenAI configuration shapes collaborative regulation in authentic classroom settings. Two eighth-grade classes engaged in small-group creative problem-solving under two conditions: a shared-AI configuration, in which each group interacted with a single AI mentor, and an individual-AI configuration, in which each student accessed a personal AI instance. Using multi-layer discourse coding combined with lag sequential analysis (LSA) and ordered network analysis (ONA), we examined interaction distribution, AI-student coupling, shared regulation processes, and teacher orchestration. Results reveal distinct regulatory dynamics across configurations. Shared AI access promoted convergence-oriented collaboration, with stronger alignment of shared regulatory states and more coordinated group-level reasoning. In contrast, individual AI access distributed support across learners, producing more exploratory and evaluative cycles but also more fragmented interaction patterns, accompanied by increased teacher intervention to manage divergence. These findings suggest that AI configuration functions as a structural design variable that reorganizes the regulatory ecology of classroom collaboration.
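The counting step that underlies a lag sequential analysis (LSA), tallying how often one coded discourse event immediately follows another, can be sketched as follows; the discourse codes used here are invented examples, not the study's actual coding scheme:

```python
from collections import Counter

def lag1_transitions(coded_sequence):
    """Count lag-1 transitions between coded discourse events, the
    frequency table on which a lag sequential analysis is built."""
    return Counter(zip(coded_sequence, coded_sequence[1:]))

# Hypothetical coded utterances from one group's discussion
codes = ["plan", "monitor", "plan", "monitor", "evaluate", "plan"]
trans = lag1_transitions(codes)
print(trans[("plan", "monitor")])  # 2
```

A full LSA would then test each observed transition frequency against its expected value (e.g. via adjusted residuals) to decide which regulatory sequences occur more often than chance.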

[HC-16] Linguistic Similarity Within Centralized FLOSS Development

【速读】:该论文旨在解决自由/开源软件(Free/Libre and Open Source Software, FLOSS)项目中,由维护者(steward)集中化开发是否会影响贡献者之间的讨论方式及其对项目可持续性的潜在影响这一问题。其解决方案的关键在于通过多案例比较的混合方法,结合代码仓库挖掘、语言风格特征分析与主成分分析,系统追踪MediaWiki平台三个由维基媒体基金会(WMF)主导开发的功能及其相关议题讨论的语言模式。研究发现,尽管开发由WMF集中引导,但WMF附属贡献者与外部贡献者的语言风格并无显著差异,从而挑战了“集中化开发必然导致层级化沟通”的既有认知,并提出两个新视角:一是维护者倾向于依据自身对功能的使用来主导开发方向;二是集中化开发并不必然带来项目讨论中的语言层级结构。

链接: https://arxiv.org/abs/2603.12571
作者: Matthew Gaughan,Aaron Shaw,Darren Gergle
机构: Northwestern University (西北大学)
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: Accepted to CHI Extended Abstracts 2026

点击查看摘要

Abstract:When free/libre and open source software (FLOSS) stewards centralize project development, they potentially undermine project sustainability and impact how contributors talk to each other. To study the relationship between steward-centralized development and contributor discussion, we compared the development of three Wikimedia platform features that the Wikimedia Foundation (WMF) built in MediaWiki. In a mixed-methods multi-case comparison, we used repository mining, linguistic style features, and principal component analysis to track MediaWiki feature development and issue discussions. Contrary to both our intuition and prior work, there were no identifiable differences in the linguistic style of WMF-affiliates and external contributors, even when feature development was guided by WMF contributions. From these results, we offer two provocations to the study of collaborative FLOSS development: (1) stewards dominate development according to their own use of specific project functionality; (2) centralized project development does not entail hierarchical language within project discussions.

[HC-17] LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation EACL2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中输出偏见难以检测与理解的问题。为应对这一挑战,作者提出了 LLM BiasScope,其核心解决方案是构建一个支持多模型对比的可视化分析平台,采用两阶段偏见检测流程:首先进行句子级别的偏见识别,再对识别出的偏见句子进行类型分类;系统自动分析用户输入和模型响应中的偏见,并提供统计信息、交互式图表(如柱状图、雷达图)及偏见分布差异的对比视图,从而实现对不同LLM偏见行为的定量评估与直观比较。

链接: https://arxiv.org/abs/2603.12522
作者: Himel Ghosh,Nick Elias Werner
机构: Technical University of Munich (慕尼黑工业大学); Sapienza University of Rome (罗马第一大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted at EACL 2026 (24-29 March, Morocco)

点击查看摘要

Abstract:As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on this http URL with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.
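The two-stage pipeline (sentence-level detection, then type classification for flagged sentences only) can be sketched with stub classifiers. The keyword checks below merely stand in for the Hugging Face models the system actually calls, and the labels are illustrative, not the system's taxonomy:

```python
def detect_bias(sentence):
    """Stage 1 stub: flag a sentence as biased. A keyword check stands
    in for the real sentence-level classifier."""
    return any(w in sentence.lower() for w in ("always", "never", "all"))

def classify_bias(sentence):
    """Stage 2 stub: assign a bias type to an already-flagged sentence;
    the labels are invented for illustration."""
    return "generalization" if "all" in sentence.lower() else "absolutist"

def bias_pipeline(text):
    results = []
    for s in (p.strip() for p in text.split(".") if p.strip()):
        if detect_bias(s):                         # stage 1: detection
            results.append((s, classify_bias(s)))  # stage 2: typing
    return results

report = bias_pipeline("All managers are strict. The weather is mild.")
print(report)
```

Running stage 2 only on flagged sentences keeps the per-response cost low, which matters when the analysis must keep up with two models streaming side by side.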

[HC-18] Applying Value Sensitive Design to Location-Based Services: Designing for Shared Spaces and Local Conditions

【速读】:该论文旨在解决Location-Based Services (LBS) 在共享物理空间中引发的价值冲突问题,尤其是现有设计方法难以兼顾本地化情境与多方利益相关者之间的价值张力。其解决方案的关键在于提出一种领域特定的Value Sensitive Design (VSD) 变体——Location-Aware Value Sensitive Design (LA-VSD),该方法通过三个核心启发式策略引导设计:(1) 基于本地空间共享场景识别并优先排序利益相关者;(2) 调整实证方法以捕捉具体情境中的价值观与冲突;(3) 支持数字层与物理层之间对齐价值的交互设计。通过墨尔本电动滑板车共享系统的案例研究验证了LA-VSD在提升LBS设计的本土化、情境敏感性和可操作性方面的有效性。

链接: https://arxiv.org/abs/2603.12521
作者: Hiruni Kegalle,Flora D. Salim,Mark Sanderson,Jeffrey Chan,Danula Hettiachchi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 2026) 18 pages. this https URL

点击查看摘要

Abstract:Location-Based Services (LBS) such as ride-sharing, accommodation, food delivery, and location-driven social media platforms entangle digital systems with physical spaces, thereby generating impacts that extend beyond users to others who share the same environments. Existing design approaches struggle to address the dual challenge of value tensions that arise in shared physical spaces and the locality-specific contexts in which LBS operate. To respond, we introduce Location-Aware Value Sensitive Design (LA-VSD), a domain-specific adaptation of VSD tailored to the distinctive characteristics of LBS. LA-VSD guides designers through three heuristics to help (1) identify and prioritise stakeholders through local space-sharing scenarios, (2) adapt empirical methods to capture values and tensions in context, and (3) support value-aligned interactions across both digital and physical layers of the service. Through a case study of e-scooter sharing in Melbourne, Australia, we demonstrate how LA-VSD enables more grounded, context-aware, and actionable design of LBS.

[HC-19] Exploring the Role of User Comments Throughout the Stages of Video-Based Task-Learning

【速读】:该论文试图解决视频-based任务学习(video-based task-learning)中用户因缺乏隐性知识(tacit knowledge)和情境细节而难以有效应用所学内容的问题,尤其关注评论区(comment section)在弥补这一差距中的作用。解决方案的关键在于揭示用户如何在学习过程中与评论互动,并指出当前评论未被充分整合进学习全流程的现状,从而提出设计机会以优化评论在视频任务学习中的功能与集成方式。

链接: https://arxiv.org/abs/2603.12509
作者: Nayoung Kim,Yotam Sechayk,Zhongyi Zhou,Takeo Igarashi
机构: KAIST(韩国科学技术院); The University of Tokyo(东京大学); Google(谷歌)
类目: Human-Computer Interaction (cs.HC)
备注: CHI '26, Barcelona, Spain

点击查看摘要

Abstract:Learning tasks through videos is a dynamic way to acquire skills by witnessing entire processes. However, compared to in-person demonstrations, videos may omit tacit knowledge, including subtle details and contextual nuances. Users’ unique circumstances, like missing ingredients in a recipe, may also require adaptation beyond the video content. To fill these gaps, many users turn to the comment section, seeking additional guidance and interactions with creators or peers to personalize their experience. Despite their importance, there is limited understanding of how users engage with and apply comments in task-learning scenarios. In our study, we explore the role of comments in video-based task-learning through interviews with 14 users, and co-watching sessions with eight. Our findings show that while comments are critical for learning, they are poorly integrated into all stages of the learning process. Based on our findings, we outline design opportunities to better utilize comments in video-based task-learning.

[HC-20] ELLA: Generative AI-Powered Social Robots for Early Language Development at Home

【速读】:该论文旨在解决学前儿童早期语言发展支持资源不足的问题,尤其是在家庭环境中缺乏可扩展且高质量的干预手段。现有研究表明,早期语言能力对后续读写能力和学习成果具有重要影响,但许多家庭难以获得持续、个性化的语言互动支持。为此,作者提出并开发了ELLA(Early Language Learning Agent),这是一种基于生成式AI(Generative AI)的自主社交机器人系统,其核心解决方案在于通过交互式讲故事、家长选定的语言目标以及支架式对话机制,实现对儿童语言发展的动态适应性支持。关键创新点在于将生成式AI与人中心设计方法结合,使机器人能够在真实家庭场景中进行自然、有目的的对话互动,并通过多阶段迭代设计和实地部署验证其有效性与用户接受度。

链接: https://arxiv.org/abs/2603.12508
作者: Victor Nikhil Antony,Shiye Cao,Shuning Wang,Chien-Ming Huang
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early language development shapes children’s later literacy and learning, yet many families have limited access to scalable, high-quality support at home. Recent advances in generative AI make it possible for social robots to move beyond scripted interactions and engage children in adaptive, conversational activities, but it remains unclear how to design such systems for pre-schoolers and how children engage with them over time in the home. We present ELLA (Early Language Learning Agent), an autonomous, generative AI-powered social robot that supports early language development through interactive storytelling, parent-selected language targets, and scaffolded dialogue. Using a multi-phased, human-centered process, we interviewed parents (n=7) and educators (n=5) and iteratively refined ELLA through twelve in-home design workshops. We then deployed ELLA with ten children for eight days. We report design insights from in-home workshops, characterize children’s engagement and behaviors during deployment, and distill design implications for generative AI-powered social robots supporting early language learning at home.

[HC-21] Keys on Doormats: Exposed API Credentials on the Web

【速读】:该论文旨在解决API密钥等敏感凭证在Web环境中的公开暴露问题,此类暴露可能使攻击者获得对云服务、支付系统等关键基础设施的非法访问权限。解决方案的关键在于通过大规模爬取和分析1000万网页,识别出广泛存在的凭证泄露现象(共发现来自14个服务提供商的1748个不同凭证),并揭示其主要暴露路径源于JavaScript环境;同时,通过负责任的披露机制推动相关方修复漏洞,实证表明该策略能显著降低Web上凭证的暴露率。

链接: https://arxiv.org/abs/2603.12498
作者: Nurullah Demir(1),Yash Vekaria(2),Georgios Smaragdakis(1 and 3),Zakir Durumeric(1) ((1) Stanford University (2) University of California, Davis (3) TU Delft)
机构: 未知
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Application programming interfaces (APIs) have become a central part of the modern IT environment, allowing developers to enrich the functionality of applications and interact with third parties such as cloud and payment providers. This interaction often occurs through authentication mechanisms that rely on sensitive credentials such as API keys and tokens that require secure handling. Exposure of these credentials can have significant consequences for organizations, as malicious attackers can gain access to related services. Previous studies have shown exposure of these sensitive credentials in different environments such as cloud platforms and GitHub. However, the web remains unexplored. In this paper, we study exposure of credentials on the web by analyzing 10M webpages. Our findings reveal that API credentials are widely and publicly exposed on the web, including highly popular and critical webpages such as those of global banks and firmware developers. We identify 1,748 distinct credentials from 14 service providers (e.g., cloud and payment providers) across nearly 10,000 webpages. Moreover, our analysis of archived data suggests that credentials remain exposed for periods ranging from a month to several years. We characterize web-specific exposure vectors and root causes, finding that most originate from JavaScript environments. We also discuss the outcomes of our responsible disclosure efforts that demonstrated a substantial reduction in credential exposure on the web.
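作为示意,这类网页凭证扫描的基本思路可以用少量公开已知的密钥格式正则来说明(以下模式与函数均为假设性的简化示例,并非论文所用的扫描系统;真实系统还需结合验证与去误报机制):

```python
import re

# 几种公开文档化的 API 凭证格式(仅作示意,不代表论文的检测规则集)
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
    "stripe_secret_key": re.compile(r"\bsk_live_[0-9a-zA-Z]{24,}\b"),
}

def scan_page(text):
    """在网页文本(含内联 JavaScript)中匹配疑似泄露的凭证,按类型归组。"""
    hits = {}
    for name, pat in PATTERNS.items():
        found = pat.findall(text)
        if found:
            hits[name] = sorted(set(found))
    return hits

page = '<script>const key = "AIza' + "A" * 35 + '";</script>'
print(scan_page(page))
```

论文指出大多数暴露源于 JavaScript 环境,因此实际爬取时还需抓取并扫描外链脚本,而不仅是 HTML 正文。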

[HC-22] The Perfection Paradox: From Architect to Curator in AI-Assisted API Design

【速读】:该论文旨在解决企业API设计中快速功能交付与严格可用性标准维护之间的矛盾问题。其解决方案的关键在于引入一种基于API Improvement Proposals (AIPs) 训练的AI辅助设计工作流,通过生成式AI(Generative AI)自动生成API规范,从而在显著提升设计效率的同时保持高可用性水平。实证研究表明,AI生成的设计在11个可用性维度中有10项优于人工设计,并将作者时间减少87%,但同时也揭示了一个“完美悖论”(Perfection Paradox):专家虽难以识别AI作品为机器生成(准确率仅19%),却普遍认为其设计过于“完美”,缺乏人类实践中应有的权衡与务实判断。这提示未来人机协作应从“起草者”转向“模式策展者”的角色转变。

链接: https://arxiv.org/abs/2603.12475
作者: Mak Ahmad,Andrew Macvean,JJ Geewax,David Karger
机构: Google(谷歌); MIT(麻省理工学院)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 6 pages, 2 figures, 3 tables; Poster paper at CHI EA 2026 (Extended Abstracts of the ACM CHI Conference on Human Factors in Computing Systems)

点击查看摘要

Abstract:Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals (AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly “perfect.” We characterize this as a “Perfection Paradox” – where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer’s role from the “drafter” of specifications to the “curator” of AI-generated patterns.

[HC-23] Marked Pedagogies: Examining Linguistic Biases in Personalized Automated Writing Feedback

【速读】:该论文旨在解决生成式 AI(Generative AI)在自动化个性化写作反馈中可能因语言偏见和社会刻板印象而加剧教育不平等的问题。其解决方案的关键在于通过实证分析揭示大型语言模型(LLMs)如何根据学生预设属性(如性别、种族/族裔、学习需求、学业成就和动机)系统性地调整反馈内容与语气,识别出“标记教学法”(Marked Pedagogies)这一现象——即模型在相同文本基础上产生差异化的评价标准、内容强调和师生互动方式,从而暴露自动化反馈工具中存在的隐性偏见与结构性风险,强调需建立透明度与问责机制以保障公平性。

链接: https://arxiv.org/abs/2603.12471
作者: Mei Tan,Lena Phalen,Dorottya Demszky
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: To appear in LAK 2026

点击查看摘要

Abstract:Effective personalized feedback is critical to students’ literacy development. Though LLM-powered tools now promise to automate such feedback at scale, LLMs are not language-neutral: they privilege standard academic English and reproduce social stereotypes, raising concerns about how “personalization” shapes the feedback students receive. We examine how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) adapt written feedback in response to student attributes. Using 600 eighth-grade persuasive essays from the PERSUADE dataset, we generated feedback under prompt conditions embedding gender, race/ethnicity, learning needs, achievement, and motivation. We analyze lexical shifts across model outputs by adapting the Marked Words framework. Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes–even when essay content was identical. Feedback for students marked by race, language, or disability often exhibited positive feedback bias and feedback withholding bias–overuse of praise, less substantive critique, and assumptions of limited ability. Across attributes, models tailored not only what content was emphasized but also how writing was judged and how students were addressed. We term these instructional orientations Marked Pedagogies and highlight the need for transparency and accountability in automated feedback tools.

[HC-24] LLMs for Human Mobility: Opportunities, Challenges and Future Directions

【速读】:该论文旨在解决当前人类移动性研究中缺乏系统性综述的问题,特别是针对基于大语言模型(Large Language Models, LLMs)的人类移动性研究分散、任务边界模糊、挑战与LLM设计关联不明确等痛点。其解决方案的关键在于构建一个统一的框架,将人类移动性相关任务(包括行程规划、轨迹生成、模拟、预测及语义理解)与LLM的作用机制相连接,系统梳理每类任务的核心挑战及其对应的LLM设计策略,并总结典型解决方案,从而为未来开发可靠、具象化且隐私友好的LLM驱动型人类移动性分析方法提供方向指引。

链接: https://arxiv.org/abs/2603.12420
作者: Jie Gao,Yaoxin Wu
机构: Delft University of Technology (代尔夫特理工大学); Eindhoven University of Technology (埃因霍温理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Human mobility studies how people move among meaningful places over time and how these movements aggregate into population-level patterns that shape accessibility, congestion, emissions, and public health. Large language models (LLMs) are increasingly used in this domain because many human mobility problems require reasoning about place and activity semantics, travelers’ intentions and preferences, and diverse real-world constraints that are difficult to capture using coordinates and other purely numerical attributes. Despite rapid growth, the literature is still scattered, and there is no clear overview that connects human mobility tasks, challenges, and LLM designs in a consistent way. This survey therefore provides a comprehensive synthesis of LLM-based research on human mobility across five tasks, including travel itinerary planning, trajectory generation, mobility simulation, mobility prediction, and mobility semantics and understanding. For each task, we review representative work, connect core challenges to the specific roles of LLMs, and summarize typical LLM-based solution designs. We conclude with open challenges and research directions toward reliable, grounded and privacy-aware LLM-based approaches for human mobility.

[HC-25] Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

【速读】:该论文旨在解决基于网络摄像头的注视点(gaze)追踪在实际部署中面临的多重挑战,包括校准负担重、对头部运动和会话漂移的鲁棒性不足、运行时资源开销大以及浏览器环境下的兼容性问题。传统方法多依赖于大型图像骨干网络(large-backbone regime),虽精度高但难以满足轻量化与快速适应的需求。其解决方案的关键在于提出一种名为**等变元校准注视估计(Equivariant Meta-Calibrated Gaze, EMC-Gaze)**的方法:通过将基于关键点的注视估计建模为会话级自适应任务,利用共享的E(3)-等变地标图编码器提取几何特征,并结合局部眼区结构、双目强调、辅助3D注视方向监督及可微分的闭式岭回归校准器,在少量校准样本下实现精准且高效的注视预测。此外,引入两视图规范一致性损失以减少姿态泄露,最终使得模型仅需面部关键点即可完成每会话快速校准,在浏览器端实现低延迟(<13ms/样本)的实时预测,显著优于对比方法(如Elastic Net),确立了轻量级、校准友好型的实用化操作点。

链接: https://arxiv.org/abs/2603.12388
作者: Chenkai Zhang
机构: Independent Researcher(独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 24 pages, 7 figures. Deployment-oriented landmark-only webcam gaze tracking with browser-capable runtime

点击查看摘要

Abstract:Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
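摘要中的"闭式岭回归校准器"可以用 NumPy 给出最简示意:用少量校准样本的嵌入与屏幕注视点拟合一个闭式解的线性映射(嵌入维度、校准点数与变量名均为假设,仅为概念演示,并非论文实现):

```python
import numpy as np

def fit_ridge_calibrator(Z, Y, lam=1e-2):
    """闭式岭回归校准:Z 为校准样本的嵌入 (n, d),Y 为注视点 (n, 2)。
    返回权重 W,使 Z @ W ≈ Y;闭式解为 (Z'Z + lam*I)^{-1} Z'Y。"""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)

rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 2))            # 假设的真实映射
Z = rng.normal(size=(9, 8))                 # 9 点校准(对应论文的 9-point calibration)
Y = Z @ W_true + 0.01 * rng.normal(size=(9, 2))
W = fit_ridge_calibrator(Z, Y)
pred = Z @ W                                # 会话内注视点预测
```

论文的做法是在元训练中对这一闭式校准器求导(episodic meta-training),使共享编码器学到的嵌入对这种每会话岭回归自适应友好。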

[HC-26] The Economics of AI Supply Chain Regulation

【速读】:该论文试图解决的问题是:在生成式 AI (Generative AI) 供应链中,如何通过政策干预有效提升消费者剩余(consumer surplus),尤其是在基础模型提供商与下游企业之间存在协同创新关系的背景下,不同类型的政策工具(如促进价格竞争或质量竞争的政策)对市场效率和各方利益的影响机制。解决方案的关键在于构建一个包含基础模型提供商与两个竞争性下游企业的博弈论模型,系统分析不同类型政策干预的效果——发现促进下游市场价格竞争的政策仅在计算或数据预处理成本较高时有效,而计算补贴则在成本较低时更优,二者具有互补性;相比之下,促进质量竞争的政策始终能提升消费者剩余,但其效果会损害下游企业的利润,同时增强提供商收益。这一分析为制定兼顾经济效率与社会福利的AI产业政策提供了理论依据和实证支持。

链接: https://arxiv.org/abs/2603.12630
作者: Sihan Qian,Amit Mehra,Dengpan Liu
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Econometrics (econ.EM)
备注: An earlier version of this paper, titled “The Economics of Fine-Tuning for Large-Scale AI Models,” was presented at WISE 2023, where it won the Best Student Paper Award

点击查看摘要

Abstract:The rise of foundation models has driven the emergence of AI supply chains, where upstream foundation model providers offer fine-tuning and inference services to downstream firms developing domain-specific applications. Downstream firms pay providers to use their computing infrastructure to fine-tune models with proprietary data, creating a co-creation dynamic that enhances model quality. Amid concerns that foundation model providers and downstream firms may capture excessive consumer surplus, along with increasing regulatory measures, this study employs a game-theoretic model involving a provider and two competing downstream firms to analyze how policy interventions affect consumer surplus in the AI supply chain. Our analysis shows that policies promoting price competition in downstream markets (i.e., pro-price-competitive policies) boost consumer surplus only when compute or data preprocessing costs are high, while compute subsidies are effective only when these costs are low, suggesting these policies complement each other. In contrast, policies promoting quality competition in downstream markets (i.e., pro-quality-competitive policies) always improve consumer surplus. We also find that under pro-price-competitive policies or compute subsidies, both the provider and downstream firms can achieve higher profits along with greater consumer surplus, creating a win-win-win outcome. However, pro-quality-competitive policies increase the provider’s profits while reducing those of downstream firms. Finally, as compute costs decline, pro-price-competitive policies may lose their effectiveness, whereas compute subsidies may shift from ineffective to effective. These findings offer insights for policymakers seeking to foster AI supply chains that are economically efficient and socially beneficial.

计算机视觉

[CV-0] PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

【速读】:该论文旨在解决扩散模型生成的人体运动在通过全身体控制器(Whole-Body Controller, WBC)转换为物理合规轨迹时,可能出现与原始文本指令或运动意图偏离的问题。现有方法依赖手工设计的物理感知启发式规则(如脚部滑动惩罚),难以同时保证物理真实性和语义一致性。其解决方案的关键在于提出PhysMoDPO框架——一种直接偏好优化(Direct Preference Optimization, DPO)方法,将WBC集成至训练流程中,并利用基于物理和任务特定的奖励信号对合成轨迹进行偏好标注,从而优化扩散模型,使其输出的WBC轨迹在保持物理合规性的同时,更忠实于原始文本指令。

链接: https://arxiv.org/abs/2603.13228
作者: Yangsong Zhang,Anujith Muraleedharan,Rikhat Akizhanov,Abdul Ahad Butt,Gül Varol,Pascal Fua,Fabio Pizzati,Ivan Laptev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models to character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become compliant with physics, they may deviate substantially from the original motion. To address this issue, we propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate the WBC into our training pipeline and optimize the diffusion model such that the output of the WBC becomes compliant both with physics and with the original text instructions. To train PhysMoDPO we deploy physics-based and task-specific rewards and use them to assign preferences to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements from PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO yields significant improvements when applied to zero-shot motion transfer in simulation and to real-world deployment on a G1 humanoid robot.
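PhysMoDPO 所基于的 DPO 目标可以用一个标量示意来说明:损失是"偏好轨迹与非偏好轨迹相对参考模型的对数概率差"的 -log sigmoid(以下实现与数值均为假设性示意,并非论文代码;论文中偏好标签来自物理与任务奖励):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """标准 DPO 损失的标量形式:
    logp_*  为策略模型对偏好轨迹(w)与非偏好轨迹(l)的对数概率,
    ref_logp_* 为冻结参考模型的对数概率,beta 为温度系数。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# 策略相对参考模型把偏好轨迹概率上调、非偏好轨迹下调时,损失更小
hi = dpo_loss(-6.0, -4.0, -5.0, -5.0)   # 偏好方向相反 → 损失较大
lo = dpo_loss(-4.0, -6.0, -5.0, -5.0)   # 偏好方向正确 → 损失较小
assert lo < hi
```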

[CV-1] Representation Learning for Spatiotemporal Physical Systems ICLR2026

【速读】:该论文旨在解决当前机器学习方法在模拟时空物理系统时存在的局限性,特别是基于帧预测的模型在训练成本高、误差累积等问题。传统方法聚焦于准确预测下一帧,但这类模型往往难以捕捉系统的物理本质,导致下游科学任务(如物理参数估计)性能不佳。论文提出从下游科学任务出发评估模型表征能力,发现并非所有专为物理建模设计的方法均优于通用自监督学习方法;关键创新在于揭示了在潜在空间中学习(如联合嵌入预测架构,JEPA)比直接优化像素级预测目标更能获得对物理规律敏感的表征,从而显著提升下游任务的准确性与物理相关性。

链接: https://arxiv.org/abs/2603.13227
作者: Helen Qu,Rudy Morel,Michael McCabe,Alberto Bietti,François Lanusse,Shirley Ho,Yann LeCun
机构: Flatiron Institute; Université Paris-Saclay, Université Paris Cité, CEA, CNRS, AIM; New York University; Princeton University
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICLR 2026 Workshop on AI PDE

点击查看摘要

Abstract:Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate emulator for the system’s evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation of a system’s governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable glimpse into the physical relevance of the representations of these models. We evaluate the effectiveness of general-purpose self-supervised methods in learning physics-grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self-supervised learning methods on these tasks, and methods that learn in the latent space (e.g., joint embedding predictive architectures, or JEPAs) outperform those optimizing pixel-level prediction objectives. Code is available at this https URL.

[CV-2] Visual-ERM: Reward Modeling for Visual Equivalence

【速读】:该论文旨在解决视觉到代码(vision-to-code)任务中强化学习(reinforcement learning, RL)因奖励信号错位而导致的性能瓶颈问题。现有方法依赖于文本规则或粗粒度的视觉嵌入相似性作为奖励,难以捕捉细粒度的视觉差异且易受奖励劫持(reward hacking)影响。其解决方案的关键在于提出一种多模态生成式奖励模型——视觉等价奖励模型(Visual Equivalence Reward Model, Visual-ERM),该模型直接在渲染后的视觉空间中提供细粒度、可解释且任务无关的反馈,从而实现对视觉到代码输出质量的精准评估。通过将Visual-ERM集成至RL框架,显著提升了Qwen3-VL-8B-Instruct在图表转代码、表格和SVG解析任务上的表现,并在测试时通过反思与修正进一步增强模型能力。

链接: https://arxiv.org/abs/2603.13224
作者: Ziyu Liu,Shengyuan Ding,Xinyu Fang,Xuanlang Dai,Penghui Yang,Jianze Liang,Jiaqi Wang,Kai Chen,Dahua Lin,Yuhang Zang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Fudan University (复旦大学); CUHK (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project: this https URL

点击查看摘要

Abstract:Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

[CV-3] Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models ALT

【速读】:该论文旨在解决视频世界模型(video world models)是否能够将状态演化(state evolution)与观测行为(observation)解耦的问题,即模型生成的“世界”能否在未被观测时依然按照自然规律自主演化。其解决方案的关键在于设计了一个名为STEVO-Bench的新基准测试框架,通过施加可控的观测干扰(如插入遮挡物、关闭光源或指定相机“移开视线”轨迹),系统性地评估模型在无观测条件下的状态演化能力。该基准能自动识别并分离模型在自然状态演化过程中的失效模式,从而揭示当前视频世界模型在数据和架构层面存在的偏差问题。

链接: https://arxiv.org/abs/2603.13215
作者: Ziqi Ma,Mengzhan Liufu,Georgia Gkioxari
机构: California Institute of Technology (加州理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate “worlds” via 2D frame observations. Can these generated “worlds” evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera “lookaway” trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provide new insight into potential data and architecture bias of present-day video world models. Project website: this https URL. Blog: this https URL

[CV-4] Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

【速读】:该论文旨在解决现有视频场景理解方法在建模物体交互时的局限性,即这些方法本质上以帧为中心(frame-centric),仅关注当前可见对象、在遮挡时丢弃实体,并且局限于二维空间。为应对这一挑战,作者提出世界场景图生成(World Scene Graph Generation, WSGG)任务,目标是在每个时间戳构建一个包含所有交互对象(包括因遮挡或相机运动而暂时不可见的对象)的世界级场景图。解决方案的关键在于引入ActionGenome4D数据集,通过前向3D重建、世界坐标系下的边界框标注和密集关系注释(涵盖未观测对象),实现对4D场景的结构化表示;并设计三种互补的方法:PWG(基于零阶特征缓冲的持续世界图)、MWAE(利用跨视角关联检索进行掩码补全的掩码世界自编码器)以及4DST(结合3D运动与相机位姿特征的可微分时序注意力机制),从而从不同归纳偏置角度推理未观测对象,推动视频场景理解向以世界为中心、时序持久且可解释的方向发展。

链接: https://arxiv.org/abs/2603.13185
作者: Rohith Peddi,Saurabh,Shravan Shanmugam,Likhitha Pallapothula,Yu Xiang,Parag Singla,Vibhav Gogate
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

[CV-5] Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

【速读】:该论文旨在解决医学影像中脑肿瘤分类模型对对抗扰动(adversarial perturbations)敏感性高、可靠性不足的问题。其核心挑战在于,尽管深度学习模型在磁共振成像(MRI)脑肿瘤分类任务中表现出高准确率,但其在面对精心设计的对抗攻击时性能显著下降,这限制了其在临床场景中的可信部署。解决方案的关键在于构建一个融合三阶段机制的鲁棒框架:首先利用非负矩阵分解(Non-Negative Matrix Factorization, NMF)提取可解释且紧凑的特征表示;其次通过统计指标(如AUC、Cohen’s d和p值)筛选最具判别力的特征子集,并训练轻量级卷积神经网络(CNN)进行分类;最后引入基于扩散模型(diffusion-based)的特征空间净化模块,在分类前对特征进行去噪处理,从而提升模型在强对抗攻击(AutoAttack)下的鲁棒性。该方法有效结合了可解释性、轻量化与防御能力,为医疗图像分类提供了可靠的对抗鲁棒解决方案。

链接: https://arxiv.org/abs/2603.13182
作者: Hiba Adil Al-kharsan,Róbert Rajkó
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 29 figures

点击查看摘要

Abstract:Brain tumor classification from magnetic resonance imaging (MRI) plays a sensitive role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy. However, their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study proposes a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF or NMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. Initially, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen’s d, and p-values, are used to rank and choose the most discriminative components. Then, a lightweight CNN classifier is trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced: a forward noising step followed by a learned denoiser network is applied before classification. System performance is evaluated using both clean accuracy and robust accuracy under powerful adversarial attacks created by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial attacks. These findings suggest that combining interpretable NNMF-based representations with a lightweight deep approach and a diffusion-based defense technique provides an effective and reliable solution for medical image classification under adversarial conditions.
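摘要中的 NNMF 特征提取一步,可以用最基础的乘法更新规则示意:把非负数据矩阵分解为 V ≈ W @ H(随机数据、秩 k 与迭代次数均为假设,仅用于说明分解本身,并非论文的具体流程):

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9, seed=0):
    """最简乘法更新 NMF:将非负矩阵 V (n, m) 分解为 W (n, k) 与 H (k, m)。
    论文中 NNMF 作用于由预处理 MRI 图像展平得到的非负数据矩阵。"""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # 固定 W 更新 H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # 固定 H 更新 W
    return W, H

V = np.random.default_rng(1).random((20, 30))  # 模拟非负数据矩阵
W, H = nmf(V, k=5)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)  # 相对重构误差
```

分解得到的 W 的每一列即一个非负"基",其行向量可作为紧凑、可解释的样本特征,供后续统计筛选与轻量 CNN 使用。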

[CV-6] Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception ICRA2026

【速读】:该论文旨在解决人机协作(Human-Robot Collaboration, HRC)场景中多模态感知模块在流式感知任务下因帧级并行执行导致的计算延迟累积问题,以及由此引发的信息冗余和计算资源分配不优化的问题。解决方案的关键在于提出一种轻量级感知调度框架,该框架基于前一帧输出估计当前场景的感知需求,并结合“相关性”(Relevance)概念与HRC事件中的信息稀疏特性,动态调度必要的感知模块,从而实现实时感知资源的高效利用。实验表明,该方法可将计算延迟降低最高达27.52%,同时提升MMPose激活召回率72.73%,且关键帧识别准确率达98%,在保证精度的前提下显著提升了实时感知效率。

链接: https://arxiv.org/abs/2603.13176
作者: Dingcheng Huang,Xiaotong Zhang,Kamal Youcef-Toumi
机构: Mechatronics Research Laboratory, Massachusetts Institute of Technology (麻省理工学院); King Abudlaziz City for Science and Technology (沙特阿拉伯科学技术城)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICRA 2026

点击查看摘要

Abstract:In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework’s capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.

[CV-7] Towards Faithful Multimodal Concept Bottleneck Models

【速读】:该论文旨在解决多模态场景下概念瓶颈模型(Concept Bottleneck Models, CBMs)的忠实性问题,即如何确保概念检测准确且概念表示不引入任务相关或跨概念的冗余信息(称为“泄漏”)。现有方法通常将概念检测与泄漏缓解视为独立问题,导致优化一个目标时牺牲另一个性能。本文提出 f-CBM 框架,其关键在于通过两种互补策略实现联合优化:一是设计可微分的泄漏损失以抑制信息泄漏,二是采用 Kolmogorov-Arnold 网络作为预测头以增强概念检测能力,从而在任务准确率、概念检测和泄漏控制之间取得最优平衡,并适用于图像、文本及纯文本数据。

链接: https://arxiv.org/abs/2603.13163
作者: Pierre Moreau,Emeline Pineau Ferrand,Yann Choho,Benjamin Wong,Annabelle Blangero,Milan Bhan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied in vision and, more recently, in NLP, CBMs remain largely unexplored in multimodal settings. For their explanations to be faithful, CBMs must satisfy two conditions: concepts must be properly detected, and concept representations must encode only their intended semantics, without smuggling extraneous task-relevant or inter-concept information into final predictions, a phenomenon known as leakage. Existing approaches treat concept detection and leakage mitigation as separate problems, and typically improve one at the expense of predictive accuracy. In this work, we introduce f-CBM, a faithful multimodal CBM framework built on a vision-language backbone that jointly targets both aspects through two complementary strategies: a differentiable leakage loss to mitigate leakage, and a Kolmogorov-Arnold Network prediction head that provides sufficient expressiveness to improve concept detection. Experiments demonstrate that f-CBM achieves the best trade-off between task accuracy, concept detection, and leakage reduction, while applying seamlessly to both image and text or text-only datasets, making it versatile across modalities.
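概念瓶颈模型"输入 → 概念层 → 仅凭概念预测标签"的结构可用如下前向传播示意(用线性层代替论文中的 Kolmogorov-Arnold 网络预测头,维度与权重均为假设;论文提出的可微泄漏损失属其特有设计,此处从略):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbm_forward(x, W_c, W_y):
    """概念瓶颈模型前向示意:输入 x 先映射为概念概率 c,
    标签预测只读取 c——若 c 中携带了概念语义之外的任务信息,
    即为文中所说的"泄漏"(leakage)。"""
    c = sigmoid(x @ W_c)   # 概念层:每一维对应一个人类可解释概念
    logits = c @ W_y       # 预测头仅以概念为输入
    return c, logits

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))       # 4 个样本、16 维骨干特征(假设)
W_c = rng.normal(size=(16, 6))     # 6 个概念
W_y = rng.normal(size=(6, 3))      # 3 个类别
c, logits = cbm_forward(x, W_c, W_y)
```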

[CV-8] FDeID-Toolbox: Face De-Identification Toolbox

【速读】:该论文旨在解决面部去标识化(Face de-identification, FDeID)研究中存在的碎片化实现、评估协议不一致以及结果不可比等问题。这些问题源于FDeID任务本身的复杂性:其涉及多个下游应用(如年龄估计、性别识别和表情分析),且需在隐私保护、效用保留和视觉质量三个维度进行评估,导致现有代码库难以使用与扩展。解决方案的关键在于提出FDeID-Toolbox,一个模块化设计的综合性工具箱,包含四个核心组件:(1)主流基准数据集的标准数据加载器,(2)从经典方法到最先进生成式模型的统一方法实现,(3)灵活的推理流程,(4)覆盖隐私、效用和质量指标的系统化评估协议,从而实现不同FDeID方法在一致条件下的公平、可复现比较。

链接: https://arxiv.org/abs/2603.13121
作者: Hui Wei,Hao Yu,Guoying Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report. Codebase: this https URL

点击查看摘要

Abstract:Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes such as age, gender, and expression. It is critical for privacy-preserving computer vision, yet the field suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results across studies. These challenges stem from the inherent complexity of the task: FDeID spans multiple downstream applications (e.g., age estimation, gender recognition, expression analysis) and requires evaluation across three dimensions (e.g., privacy protection, utility preservation, and visual quality), making existing codebases difficult to use and extend. To address these issues, we present FDeID-Toolbox, a comprehensive toolbox designed for reproducible FDeID research. Our toolbox features a modular architecture comprising four core components: (1) standardized data loaders for mainstream benchmark datasets, (2) unified method implementations spanning classical approaches to SOTA generative models, (3) flexible inference pipelines, and (4) systematic evaluation protocols covering privacy, utility, and quality metrics. Through experiments, we demonstrate that FDeID-Toolbox enables fair and reproducible comparison of diverse FDeID methods under consistent conditions.

[CV-9] Geometry-Guided Camera Motion Understanding in VideoLLMs

【速读】:该论文旨在解决当前视频理解视觉语言模型(VideoLLM)在识别细粒度相机运动基元(camera motion primitives)方面表现不佳的问题,其核心挑战在于现有模型对相机运动这一基础几何信号缺乏显式建模与有效捕捉。解决方案的关键在于提出一个“评估-诊断-注入”(benchmarking, diagnosis, injection)框架:首先构建大规模合成数据集CameraMotionDataset及VQA基准CameraMotionVQA以系统评估模型性能;其次通过探针实验揭示相机运动线索在视觉编码器中表征薄弱,尤其在深层ViT块中;最后设计一种轻量级、模型无关的注入管道,从3D基础模型(3DFM)中提取几何相机线索,利用时序分类器预测约束性运动基元,并通过结构化提示(structured prompting)将其注入下游VideoLLM推理过程,从而无需额外训练即可显著提升模型对相机运动的感知能力。

链接: https://arxiv.org/abs/2603.13119
作者: Haoan Feng,Sri Harsha Musunuri,Guan-Ming Su
机构: University of Maryland, College Park (马里兰大学学院公园分校); Dolby Laboratories Inc. (杜比实验室公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, supplementary included

点击查看摘要

Abstract:Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of benchmarking, diagnosis, and injection. We curate CameraMotionDataset, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, CameraMotionVQA. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at this https URL.

[CV-10] NOIR: Neural Operator mapping for Implicit Representations

【速读】:该论文旨在解决当前医学影像任务中深度学习模型依赖固定像素或体素网格所带来的局限性,即模型在不同分辨率下性能不稳定且难以泛化至未见离散化场景的问题。其解决方案的关键在于提出NOIR框架,通过将离散医学信号嵌入共享的隐式神经表示(Implicit Neural Representations),并学习一个神经算子(Neural Operator)来映射其潜在调制空间,从而实现跨分辨率的连续函数到函数的变换,具备分辨率无关性和对未见离散化的强鲁棒性。

链接: https://arxiv.org/abs/2603.13118
作者: Sidaty El Hadramy,Nazim Haouchine,Michael Wehrli,Philippe C. Cattin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the prevailing paradigm of discrete grid-based deep learning. Instead of operating on fixed pixel or voxel grids, NOIR embeds discrete medical signals into shared Implicit Neural Representations and learns a Neural Operator that maps between their latent modulations, enabling resolution-independent function-to-function transformations. We evaluate NOIR across multiple 2D and 3D downstream tasks, including segmentation, shape completion, image-to-image translation, and image synthesis, on several public datasets such as Shenzhen, OASIS-4, SkullBreak, fastMRI, as well as an in-house clinical dataset. It achieves competitive performance at native resolution while demonstrating strong robustness to unseen discretizations, and empirically satisfies key theoretical properties of neural operators. The project page is available here: this https URL.

[CV-11] Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots

【速读】:该论文旨在解决当前占用预测(occupancy prediction)方法在四足机器人复杂环境感知中鲁棒性不足的问题,尤其是现有方法主要面向轮式自动驾驶系统且过度依赖RGB图像信息,难以适应四足机器人在动态移动过程中产生的剧烈视角扰动。其解决方案的关键在于:首先构建了首个真实世界全景多模态占用数据集PanoMMOcc,涵盖四种传感模态和多样化场景;其次提出专为腿式运动与球面成像设计的VoxelHound框架,核心创新包括垂直抖动补偿(Vertical Jitter Compensation, VJC)模块以缓解因机体俯仰和横滚导致的视角扰动,以及多模态信息提示融合(Multimodal Information Prompt Fusion, MIPF)模块,有效整合全景视觉与辅助模态信息以提升体素级占用预测精度。实验表明,该方法在PanoMMOcc数据集上达到mIoU提升4.16%的最先进性能。

链接: https://arxiv.org/abs/2603.13108
作者: Guoqiang Zhao,Zhe Yang,Sheng Wu,Fei Teng,Mengfei Duan,Yuanfan Zheng,Kai Luo,Kailun Yang
机构: Hunan University (湖南大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: The dataset and code will be publicly released at this https URL

点击查看摘要

Abstract:Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at this https URL, along with the calibration tools released at this https URL.

[CV-12] BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending

【速读】:该论文旨在解决设计制造(Design for Manufacturing, DFM)中可制造性预测的两大核心问题:一是现有研究对“可制造性”的定义不一致,导致学习目标模糊且难以比较;二是缺乏高质量、涵盖可行与不可行案例的合成数据集,尤其在钣金弯曲等复杂工艺场景下。解决方案的关键在于提出一个基于配置依赖性和测量类型维度的可制造性指标分类体系,并构建首个面向钣金弯曲工艺的合成数据集BenDFM,其中包含20,000个零件(含可制造与不可制造样本),并提供多维度标签和折叠/展开几何信息,从而支持系统性地研究此前未被探索的学习型DFM挑战。实验表明,基于图结构的3D表示方法能更准确捕捉零件表面间关系,而依赖特定制造配置的指标预测仍具挑战性。

链接: https://arxiv.org/abs/2603.13102
作者: Matteo Ballegeer,Dries F. Benoit
机构: Ghent University (根特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM). Despite advances in deep learning for CAD and its widespread use in manufacturing process selection, learning-based approaches for predicting manufacturability within a specific process remain limited. Two key challenges limit progress: inconsistency across prior work in how manufacturability is defined and consequently in the associated learning targets, and a scarcity of suitable datasets. Existing labels vary significantly: they may reflect intrinsic design constraints or depend on specific manufacturing capabilities (such as available tools), and they range from discrete feasibility checks to continuous complexity measures. Furthermore, industrial datasets typically contain only manufacturable parts, offering little signal for infeasible cases, while existing synthetic datasets focus on simple geometries and subtractive processes. To address these gaps, we propose a taxonomy of manufacturability metrics along the axes of configuration dependence and measurement type, allowing clearer scoping of generalizability and learning objectives. Next, we introduce BenDFM, the first synthetic dataset for manufacturability assessment in sheet metal bending. BenDFM contains 20,000 parts, both manufacturable and unmanufacturable, generated with process-aware bending simulations, providing both folded and unfolded geometries and multiple manufacturability labels across the taxonomy, enabling systematic study of previously unexplored learning-based DFM challenges. We benchmark two state-of-the-art 3D learning architectures on BenDFM, showing that graph-based representations that capture relationships between part surfaces achieve better accuracy, and that predicting metrics that depend on specific manufacturing setups remains more challenging.

[CV-13] SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design ICRA2026

【速读】:该论文旨在解决工业设计领域中缺乏大规模、多模态且语义驱动的CAD模型数据集的问题,以支持生成式AI(Generative AI)在参数化建模、几何深度学习及跨模态学习中的应用。其关键解决方案在于构建SldprtNet这一包含超过24.2万件真实工业零件的大规模多模态数据集,提供STEP与SLDPRT两种格式的3D模型,并通过自研编码器-解码器框架实现13类CAD命令的结构化文本表示与模型间的无损转换;同时,每样本配以由七个视角渲染图像融合而成的复合图像,显著降低输入token长度并加速推理,再结合轻量级多模态语言模型Qwen2.5-VL-7B生成自然语言描述,最终经人工校验确保图像、文本与3D模型三者对齐,从而形成一个结构清晰、语义准确、可扩展性强的多模态基准数据集,为CAD生成任务提供了高质量训练基础和评估标准。

链接: https://arxiv.org/abs/2603.13098
作者: Ruogu Li,Sikai Li,Yao Mu,Mingyu Ding
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Shanghai Jiao Tong University (上海交通大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accept by ICRA 2026

点击查看摘要

Abstract:We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part’s appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.

[CV-14] Reasoning over Video: Evaluating How MLLMs Extract, Integrate and Reconstruct Spatiotemporal Evidence

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频理解任务中对抽象时空推理(abstractive spatiotemporal reasoning)能力不足的问题。现有基准主要聚焦于抽取式推理(extractive reasoning),即答案可直接从视频中的时空事件中提取,而缺乏对需整合时序信息、融合分散线索并推断隐含空间与上下文结构的抽象推理能力的评估。解决方案的关键在于:首先提出一个结构化的评估分类法,系统性地刻画抽象时空推理的核心维度;其次构建了一个可控的、场景驱动的合成第一人称视角视频数据集(VAEX-BENCH),覆盖物体级、房间级和楼层平面级等多层次场景;最后设计了五个抽象推理任务及其对应的抽取式对照任务,从而实现对MLLMs在抽象推理能力上的细粒度评测与瓶颈分析。

链接: https://arxiv.org/abs/2603.13091
作者: Seunghwan Bang,Hwanjun Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 8 figures, 21 tables

点击查看摘要

Abstract:The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.

[CV-15] V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

【速读】:该论文旨在解决当前图像修复(Image Restoration)任务中模型泛化能力不足、依赖大量特定任务数据以及缺乏统一框架的问题。传统方法通常针对单一修复任务(如去噪、超分辨率、去模糊等)设计专用网络结构,难以实现多任务协同与高效迁移。论文提出V-Bridge框架,其关键在于将图像修复重新定义为一个渐进式的生成过程,并利用预训练的视频生成模型(Video Generative Models)来模拟从退化输入到高保真输出的逐步优化路径。通过仅使用1,000个跨任务训练样本(不足现有方法数据量的2%),即可激活视频模型中隐含的可迁移修复先验(Restoration Priors),从而以单一模型实现多种图像修复任务,性能媲美专门设计的架构。这一发现打破了生成建模与低层视觉任务之间的界限,为视觉基础模型的设计提供了新的范式。

链接: https://arxiv.org/abs/2603.13089
作者: Shenghe Zheng,Junpeng Jiang,Wenbo Li
机构: Open-source project: V-Bridge.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Transfer the prior knowledge of video generative models to image restoration tasks

点击查看摘要

Abstract:Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem, but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of existing restoration methods), pretrained video models can be induced to perform competitive image restoration, achieving multiple tasks with a single model, rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision, and opening a new design paradigm for foundation models in visual tasks.

[CV-16] Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

【速读】:该论文试图解决的问题是:注意力机制(Attention Mechanism)的学习动态复杂且非线性,导致其理论基础难以理解。研究发现,即使在对注意力机制进行线性化处理后,其学习行为仍无法收敛到无限宽度下的神经切空间核(Neural Tangent Kernel, NTK)极限,这揭示了注意力机制与传统核方法之间的根本差异。解决方案的关键在于通过引入一个与数据相关的 Gram 矩阵诱导核精确对应的线性化注意力模型,并借助 NTK 框架进行理论和实证分析,从而证明注意力变换会将 Gram 矩阵的条件数立方放大,因此需要宽度 $ m = \Omega(\kappa^6) $ 才能实现收敛——这一阈值远超自然图像数据集的实际可实现宽度。这一非收敛特性进一步由“影响可塑性”(influence malleability)刻画:注意力机制比 ReLU 网络具有 6–9 倍更高的可塑性,使其既能通过数据依赖核提升任务结构适配能力,也更容易受到训练数据扰动的对抗攻击。结论表明,注意力机制的强大性能与其脆弱性源于其偏离核回归范式的核心特征。

链接: https://arxiv.org/abs/2603.13085
作者: Jose Marie Antonio Miñoza,Paulo Mario P. Medina,Sebastian C. Ibañez
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analysis through the Neural Tangent Kernel (NTK) framework shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number, requiring width m = \Omega(\kappa^6) for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6-9 times higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.
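摘要中的谱放大结论称注意力变换会将 Gram 矩阵的条件数立方(从而要求宽度 m = Ω(κ^6))。对对称正定矩阵,κ(G³) = κ(G)³ 可以直接数值验证,下面是一个最小示意(矩阵与维度均为随意假设,与论文的实验设置无关):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # 假设的 8 个样本、16 维特征
G = X @ X.T + 1e-3 * np.eye(8)     # 对称正定的 Gram 矩阵

def cond(M):
    """SPD 矩阵的条件数:最大特征值 / 最小特征值。"""
    w = np.linalg.eigvalsh(M)      # 升序特征值
    return w[-1] / w[0]

kappa = cond(G)
kappa_cubed = cond(G @ G @ G)      # 注意力变换诱导的核对应 G 的三次方

# 对 SPD 矩阵恒有 κ(G^3) = κ(G)^3:条件数被立方,
# 因而论文中收敛所需宽度的下界为 m = Ω(κ^6)
print(np.isclose(kappa_cubed, kappa ** 3))  # True
```

这只验证了"立方"这一代数事实本身;论文的贡献在于把它与 NTK 收敛宽度阈值联系起来。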

[CV-17] InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

【速读】:该论文旨在解决多人群体3D动作编辑(multi-person 3D motion editing)中的关键挑战,即如何在缺乏成对标注数据和复杂人与人交互关系的情况下,实现基于文本指令的高质量动作生成与编辑。其解决方案的核心是提出InterEdit,一个同步的无分类器条件扩散模型(classifier-free conditional diffusion model),通过引入语义感知计划标记对齐(Semantic-Aware Plan Token Alignment)来捕捉高层交互线索,并采用基于离散余弦变换(DCT)与能量池化的交互感知频率标记对齐策略(Interaction-Aware Frequency Token Alignment)以建模周期性运动动态,从而显著提升文本到动作的一致性和编辑保真度,在多人群体动作编辑任务中达到当前最优性能。

链接: https://arxiv.org/abs/2603.13082
作者: Yebin Yang,Di Wen,Lei Qi,Weitong Kong,Junwei Zheng,Ruiping Liu,Yufan Chen,Chengzhi Wu,Kailun Yang,Yuqian Fu,Danda Pani Paudel,Luc Van Gool,Kunyu Peng
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The dataset and code will be released at this https URL

点击查看摘要

Abstract:Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at this https URL.
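摘要中的交互感知频率标记对齐(Interaction-Aware Frequency Token Alignment)使用 DCT 与能量池化来建模周期性运动。下面用一条假设的一维周期轨迹给出该思路的最小示意(并非论文实现,仅说明"周期运动的能量集中在低频带"这一性质):

```python
import numpy as np

def dct_ii(x):
    """DCT-II 变换(矩阵形式实现,避免外部依赖)。x 为一维时序信号。"""
    T = len(x)
    n = np.arange(T)
    C = np.cos(np.pi * (n + 0.5) * n[:, None] / T)  # C[k, m] 基函数
    return C @ x

def energy_pool(coeffs, num_bands=4):
    """按频带分组并做能量池化(各频带 DCT 系数的平方和)。"""
    return np.array([np.sum(b ** 2) for b in np.array_split(coeffs, num_bands)])

# 假设的一段周期性关节轨迹(两个完整周期的正弦),仅作示意
t = np.linspace(0, 4 * np.pi, 64)
tokens = energy_pool(dct_ii(np.sin(t)))
print(tokens.argmax())  # 0:周期运动的能量集中在最低频带
```

论文中这类频域 token 进一步与可学习的对齐模块结合;此处仅演示 DCT + 能量池化这一特征构造本身。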

[CV-18] Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods

【速读】:该论文旨在解决从稀疏传感器数据中重建屋顶风速场分布的难题,这一问题对无人机安全运行、城市空中交通系统及屋顶利用具有重要意义。由于屋顶流动具有强非线性、分离和跨方向变异性等特点,传统插值方法(如Kriging)难以准确重构流场。解决方案的关键在于提出一种基于风洞实验数据(通过粒子图像测速技术PIV获取)的学习型观测框架,并对比了三种深度学习模型(UNet、Vision Transformer Autoencoder、Conditional Wasserstein GAN)与Kriging插值的效果。研究发现,深度学习方法在结构相似性指数(SSIM)、相关系数(FAC2)和归一化均方误差(NMSE)等指标上显著优于Kriging,且混合风向训练策略(MDT)相比单风向训练(SDT)进一步提升了性能(SSIM提升达173.7%)。此外,结合QR分解优化的传感器布局可增强鲁棒性(抗传感器位置扰动能力提升最高达27.8%),表明传感器配置、训练策略与优化方法需协同设计以实现可靠部署。

链接: https://arxiv.org/abs/2603.13077
作者: Yihang Zhou,Chao Lin,Hideki Kikumoto,Ryozo Ooka,Sibo Cheng
机构: Imperial College London (帝国理工学院); The University of Tokyo (东京大学); Institut Polytechnique de Paris (巴黎综合理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of plus or minus 1 grid, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment. QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
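摘要中用于评测重建质量的 FAC2 与 NMSE 是风工程中的常用指标,其标准定义可按如下方式实现(示例数据为虚构,论文的具体计算细节可能不同):

```python
import numpy as np

def fac2(pred, obs):
    """FAC2:预测值落在观测值 0.5~2 倍范围内的比例(风工程常用指标)。"""
    ratio = pred / obs
    return np.mean((ratio >= 0.5) & (ratio <= 2.0))

def nmse(pred, obs):
    """归一化均方误差:NMSE = <(p - o)^2> / (<p> <o>),越小越好。"""
    return np.mean((pred - obs) ** 2) / (np.mean(pred) * np.mean(obs))

# 虚构的观测/重建风速,仅作示意(第 3 个点偏差超过 2 倍)
obs = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 1.9, 9.0, 8.5])
print(fac2(pred, obs))  # 0.75
print(round(nmse(pred, obs), 4))
```

FAC2 越接近 1、NMSE 越接近 0 表示重建越好;文中报告的百分比提升即基于这类逐点指标。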

[CV-19] Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

【速读】:该论文旨在解决生成式 AI(Generative AI)中图像扩散模型在文本到图像生成过程中可能因记忆训练数据而导致的版权与隐私风险问题。现有方法如推理时的提示扰动(prompt perturbation)虽能降低复制行为,但常损害图像与提示之间的对齐度和整体保真度。解决方案的关键在于提出两种互补方法:其一是区域感知提示增强(Region-Aware Prompt Augmentation, RAPTA),利用目标检测器识别显著区域并生成语义锚定的提示变体,在训练阶段随机采样以提升多样性同时保持语义一致性;其二是注意力驱动的多模态复制检测(Attention-Driven Multimodal Copy Detection, ADMCD),通过轻量级Transformer融合局部patch、全局语义和纹理线索,构建统一表征,并采用简单阈值决策规则实现无需大规模标注数据即可可靠检测复制行为。

链接: https://arxiv.org/abs/2603.13070
作者: Yunzhuo Chen,Jordan Vice,Naveed Akhtar,Nur Al Hasan Haldar,Ajmal Mian
机构: The University of Western Australia (西澳大利亚大学); The University of Melbourne (墨尔本大学); Curtin University (科廷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art text-to-image diffusion models can produce impressive visuals but may memorize and reproduce training images, creating copyright and privacy risks. Existing prompt perturbations applied at inference time, such as random token insertion or embedding noise, may lower copying but often harm image-prompt alignment and overall fidelity. To address this, we introduce two complementary methods. First, Region-Aware Prompt Augmentation (RAPTA) uses an object detector to find salient regions and turn them into semantically grounded prompt variants, which are randomly sampled during training to increase diversity, while maintaining semantic alignment. Second, Attention-Driven Multimodal Copy Detection (ADMCD) aggregates local patch, global semantic, and texture cues with a lightweight transformer to produce a fused representation, and applies simple thresholded decision rules to detect copying without training with large annotated datasets. Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying, outperforming single-modal metrics.
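摘要中 ADMCD 的"简单阈值决策规则"可以用余弦相似度阈值做一个最小示意(阈值 tau、特征向量均为本文假设,并非论文的融合表示或实际阈值):

```python
import numpy as np

def is_copy(feat_gen, feat_train, tau=0.95):
    """阈值决策规则的极简示意:融合表示的余弦相似度超过阈值 tau
    即判定生成图像复制了训练图像。"""
    a = feat_gen / np.linalg.norm(feat_gen)
    b = feat_train / np.linalg.norm(feat_train)
    return float(a @ b) >= tau

anchor = np.array([1.0, 0.0, 1.0, 0.5])      # 某训练图像的(假设)特征
near_dup = anchor + 0.01                     # 几乎逐元素复制的生成结果
unrelated = np.array([0.0, 1.0, -0.5, 1.0])  # 与之正交的无关图像
print(is_copy(near_dup, anchor), is_copy(unrelated, anchor))  # True False
```

论文的融合表示由局部 patch、全局语义与纹理三路线索经轻量 Transformer 聚合而成,这里只演示最终的阈值判定一步。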

[CV-20] Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems

【速读】:该论文旨在揭示扩散模型(Diffusion Model)在从噪声生成图像过程中所遵循的内在几何动力学机制,尤其是其去噪路径的本质结构。传统方法多关注训练目标或采样策略的改进,但缺乏对模型内部状态演化规律的统一理论描述。论文的核心解决方案是将确定性DDIM(Denoising Diffusion Implicit Models)的反向过程建模为分段迭代函数系统(Partitioned Iterated Function System, PIFS),从而提供一个统一的设计语言来解析扩散模型的调度策略、架构设计与训练目标之间的关系。关键创新在于定义了三个可计算的几何量:每步收缩阈值 $ L^*_t $、对角展开函数 $ f_t(\lambda) $ 和全局展开阈值 $ \lambda^{**} $,这些量无需模型评估即可刻画去噪动态,并解释了扩散模型的双阶段行为——高噪声下通过跨块扩散注意力进行全局上下文组装,低噪声下通过逐块抑制释放按严格方差顺序合成细节。此外,基于PIFS的分形几何分析导出了三个最优设计准则,阐明了四种主流经验设计(如余弦调度偏移、分辨率相关logSNR偏移、Min-SNR损失加权及Align Your Steps采样)均为显式几何优化问题的近似解,实现了理论到实践的精准映射。

链接: https://arxiv.org/abs/2603.13069
作者: Ann Dooms
机构: Vrije Universiteit Brussel (布鲁塞尔自由大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:What is a diffusion model actually doing when it turns noise into a photograph? We show that the deterministic DDIM reverse chain operates as a Partitioned Iterated Function System (PIFS) and that this framework serves as a unified design language for denoising diffusion model schedules, architectures, and training objectives. From the PIFS structure we derive three computable geometric quantities: a per-step contraction threshold L^*_t, a diagonal expansion function f_t(\lambda), and a global expansion threshold \lambda^**. These quantities require no model evaluation and fully characterize the denoising dynamics. They structurally explain the two-regime behavior of diffusion models: global context assembly at high noise via diffuse cross-patch attention and fine-detail synthesis at low noise via patch-by-patch suppression release in strict variance order. Self-attention emerges as the natural primitive for PIFS contraction. The Kaplan-Yorke dimension of the PIFS attractor is determined analytically through a discrete Moran equation on the Lyapunov spectrum. Through the study of the fractal geometry of the PIFS, we derive three optimal design criteria and show that four prominent empirical design choices (the cosine schedule offset, resolution-dependent logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling) each arise as approximate solutions to our explicit geometric optimization problems, turning theory into practice.
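论文把确定性 DDIM 反向链视为分段迭代函数系统(PIFS),其关键性质是每步映射的收缩性(Lipschitz 常数 L < 1)。下面用一个与论文无关的玩具 IFS 示意这种收缩为何使不同初始点的轨迹指数收敛到同一吸引子:

```python
import numpy as np

# 玩具迭代函数系统(IFS):两张 Lipschitz 常数为 0.5 的仿射收缩映射。
# 仅用于示意 PIFS 的收缩性质,与论文中 DDIM 的具体构造无关。
maps = [lambda v: 0.5 * v, lambda v: 0.5 * v + 0.5]

rng = np.random.default_rng(1)
x, y = 10.0, -10.0  # 两条初始点相距 20 的轨迹
for _ in range(50):
    f = maps[rng.integers(2)]  # 同一随机映射序列同时作用于两条轨迹
    x, y = f(x), f(y)

# 每步都以 L = 0.5 收缩,|x - y| 按 0.5^50 衰减,轨迹被吸引到区间 [0, 1]
print(abs(x - y) < 1e-10)  # True
```

论文中的阈值 L^*_t 正是用来判定每个去噪步是否处于这种收缩(而非展开)状态,并由此划分高噪声/低噪声两个阶段。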

[CV-21] Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)系统在缺乏真实参考图像情况下的图像质量评估难题。传统基于分布的指标(如Fréchet Inception Distance和Kernel Inception Distance)无法反映单张生成图像的感知质量,而实际应用中往往难以获取同一人穿戴目标服装的真实图像作为参考。为此,作者提出VTON-IQA——一种无需参考图像的人类对齐、图像级质量评估框架。其关键创新在于构建了大规模人类标注基准VTON-QBench(包含62,688张试衣图像和431,800条质量评分),并设计了交错交叉注意力模块(Interleaved Cross-Attention),通过在Transformer块中插入跨注意力层显式建模人体与服装之间的交互关系,从而实现更贴近人类感知的图像质量预测。

链接: https://arxiv.org/abs/2603.13057
作者: Yuki Hirakawa,Takashi Wada,Ryotaro Shimizu,Takuya Furusawa,Yuki Saito,Ryosuke Araki,Tianwei Chen,Fan Mo,Yoshimitsu Aoki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment without requiring ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between self-attention and MLP in the latter blocks. Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.

[CV-22] Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

【速读】:该论文旨在解决在真实场景(in-the-wild, ITW)下连续情绪识别中因外观变化、头部姿态、光照条件、遮挡以及个体差异导致的情绪表达模式不一致等问题,从而提升对情绪维度(效价 valence 和唤醒 arousal)的估计精度。解决方案的关键在于提出一种多模态融合方法,整合面部表情、行为和音频三种互补信息源:面部模态采用基于GRADA的帧级嵌入与Transformer进行时序回归;行为信息通过Qwen3-VL-4B-Instruct提取视频片段中的关键语义特征,并利用Mamba模型建模跨片段时序动态;音频模态则基于WavLM-Large结合注意力统计池化,并引入跨模态过滤机制以抑制非语音或低质量音频段的影响。此外,论文进一步探索了两种融合策略——定向交叉模态专家混合融合策略(Directed Cross-Modal Mixture-of-Experts Fusion Strategy)用于自适应学习模态间交互关系,以及可靠性感知的视听融合策略(Reliability-Aware Audio-Visual Fusion Strategy),实现帧级视觉特征融合与音频作为补充上下文的协同优化,最终在Aff-Wild2数据集上取得了0.658的CCC得分,验证了其有效性。

链接: https://arxiv.org/abs/2603.13056
作者: Elena Ryumina(1),Maxim Markitantov(1),Alexandr Axyonov(1),Dmitry Ryumin(1),Mikhail Dolgushin(1),Denis Dresvyanskiy(2),Alexey Karpov(1 and 2) ((1) St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, (2) ITMO University, St. Petersburg, Russia)
机构: St. Petersburg Federal Research Center of the Russian Academy of Sciences (圣彼得堡联邦研究中心俄罗斯科学院); ITMO University (ITMO大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure

点击查看摘要

Abstract:Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame-level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
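摘要采用的一致性相关系数(Concordance Correlation Coefficient, CCC)同时惩罚相关性不足与系统性偏移,其定义可按如下方式实现(示意代码,`ccc` 为示例函数名):

```python
import numpy as np

def ccc(pred, target):
    # CCC = 2*cov(x, y) / (var_x + var_y + (mu_x - mu_y)^2)
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mu_p, mu_t = pred.mean(), target.mean()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return 2.0 * cov / (pred.var() + target.var() + (mu_p - mu_t) ** 2)

t = np.linspace(0, 1, 100)
print(round(ccc(t, t), 3))     # 完全一致 -> 1.0
print(ccc(t, t + 0.5) < 1.0)   # 存在系统偏移时 CCC 下降,而皮尔逊相关仍为 1
```

与普通相关系数不同,CCC 会因均值偏移而降低,因此更适合度量 valence/arousal 这类连续回归输出与标注的一致程度。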

[CV-23] Topo-R1: Detecting Topological Anomalies via Vision-Language Models

【速读】:该论文旨在解决无监督条件下对管状结构(如血管、神经纤维和道路网络)的拓扑异常检测问题,即在缺乏领域特定标注数据的情况下,如何准确识别并分类预测分割掩膜中的拓扑错误。其解决方案的关键在于提出了一种名为Topo-R1的框架,通过两阶段训练策略增强视觉语言模型(VLM)的拓扑感知能力:首先进行监督微调,随后采用基于组相对策略优化(GRPO)的强化学习方法,并引入一种拓扑感知的复合奖励机制,该机制结合类型感知的匈牙利匹配用于结构化错误分类、空间定位评分以及中心线Dice(clDice)奖励以直接惩罚连接性中断,从而同时提升语义精度与结构保真度。

链接: https://arxiv.org/abs/2603.13054
作者: Meilong Xu,Qingqiao Hu,Xiaoling Hu,Shahira Abousamra,Xin Yu,Weimin Lyu,Kehan Qi,Dimitris Samaras,Chao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 6 figures

点击查看摘要

Abstract:Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.
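其中 clDice 奖励的直觉是:用骨架与掩膜的相互覆盖率来度量连接性,断裂会直接体现在得分上。下面给出一个简化示意(假设骨架 `s_pred`、`s_true` 已预先提取;实际使用中需先做骨架化,例如 scikit-image 的 `skeletonize`。函数名与玩具数据均为示例,非 Topo-R1 原始实现):

```python
import numpy as np

def cl_dice(v_pred, v_true, s_pred, s_true):
    # Tprec:预测骨架落在真值掩膜内的比例
    # Tsens:真值骨架落在预测掩膜内的比例
    # clDice = 2*Tprec*Tsens / (Tprec + Tsens)
    tprec = (s_pred & v_true).sum() / max(s_pred.sum(), 1)
    tsens = (s_true & v_pred).sum() / max(s_true.sum(), 1)
    if tprec + tsens == 0:
        return 0.0
    return 2.0 * tprec * tsens / (tprec + tsens)

# 玩具例子:一条水平细线,预测在中间断开一个像素
v_true = np.zeros((5, 9), bool); v_true[2, :] = True
s_true = v_true.copy()                        # 单像素宽的线,骨架即自身
v_pred = v_true.copy(); v_pred[2, 4] = False  # 连接性断裂
s_pred = v_pred.copy()
print(cl_dice(v_pred, v_true, s_pred, s_true))  # 16/17 ≈ 0.941,断裂被直接惩罚
```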

[CV-24] Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study MICCAI2026

【速读】:该论文旨在解决医学图像分割(Medical Image Segmentation, MIS)领域中专用模型与通用视觉模型(General-Purpose Vision Models, GP-VMs)性能对比不清晰的问题,尤其在面对低对比度、小结构和标注数据有限等挑战时,是否存在系统性优势。其解决方案的关键在于设计了一个受控的实证研究框架,采用统一的训练与评估协议,在三个异构医学影像数据集上对十一类专用分割架构(Specialized Medical Segmentation Architectures, SMAs)与现代GP-VMs进行公平比较,并结合Grad-CAM可视化分析解释性(Explainable AI, XAI)行为。结果表明,GP-VMs在多数情况下优于SMAs,且无需显式领域特定设计即可捕捉临床相关结构,揭示了通用模型在端到端MIS系统中的可行性与价值。

链接: https://arxiv.org/abs/2603.13044
作者: Vanessa Borst,Samuel Kounev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review, MICCAI 2026

点击查看摘要

Abstract:Medical image segmentation (MIS) is a fundamental component of computer-assisted diagnosis and clinical decision support systems. Over the past decade, numerous architectures specifically tailored to medical imaging have emerged to address domain-specific challenges such as low contrast, small anatomical structures, and limited annotated data. In parallel, rapid progress in computer vision has produced highly capable general-purpose vision models (GP-VMs) originally designed for natural images. Despite their strong performance on standard vision benchmarks, their effectiveness for MIS remains insufficiently understood. In this work, we conduct a controlled empirical study to examine whether specialized medical segmentation architectures (SMAs) provide systematic advantages over modern GP-VMs for 2D MIS. We compare eleven SMAs and GP-VMs using a unified training and evaluation protocol. Experiments are performed across three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics. Beyond segmentation accuracy, we analyze qualitative Grad-CAM visualizations to investigate explainability (XAI) behavior. Our results demonstrate that, for the analyzed datasets, GP-VMs outperform the majority of specialized MIS models. Moreover, XAI analyses indicate that GP-VMs can capture clinically relevant structures without explicit domain-specific architectural design. These findings suggest that GP-VMs can represent a viable alternative to domain-specific methods, highlighting the importance of informed model selection for end-to-end MIS systems. All code and resources are available at GitHub.

[CV-25] ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在具身空间推理(embodied spatial reasoning)评估中存在的范式局限与覆盖不足问题,从而阻碍了模型在真实机器人任务中的快速迭代开发。其解决方案的关键在于提出ESPIRE——一个面向具身空间推理的诊断性基准测试平台,通过构建物理可接地的仿真环境,并将机器人任务分解为定位(localization)与执行(execution)两个生成式问题(generative problems),从而实现从被动空间认知到“推理以行动”(reasoning to act)的细粒度分析,显著缩小评估与实际部署之间的差距。

链接: https://arxiv.org/abs/2603.13033
作者: Yanpeng Zhao,Wentao Ding,Hongtao Li,Baoxiong Jia,Zilong Zheng
机构: State Key Laboratory of General Artificial Intelligence, BIGAI
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.

[CV-26] Multimodal OCR: Parse Anything from Documents

【速读】:该论文旨在解决传统光学字符识别(OCR)系统仅聚焦于文本识别、忽略图形区域语义信息的问题,从而导致文档解析不完整且难以保留图文间语义关系。其解决方案的关键在于提出多模态OCR(Multimodal OCR, MOCR)范式,将图表、表格、图标等视觉元素作为第一类解析目标,统一输出结构化文本表示,并通过端到端训练联合建模文本与图形的语义关联;同时利用原有文档中被丢弃的图形作为代码级监督信号,构建可扩展的图像到代码语料库,实现对复杂文档内容的高保真重建与高效解析。

链接: https://arxiv.org/abs/2603.13032
作者: Handong Zheng,Yumeng Li,Kaile Zhang,Liang Xin,Guangwei Zhao,Hao Liu,Jiayu Chen,Jie Lou,Jiyu Qiu,Qi Fu,Rui Yang,Shuo Jiang,Weijian Luo,Weijie Su,Weijun Zhang,Xingyu Zhu,Yabin Li,Yiwei ma,Yu Chen,Zhaohui Yu,Guang Yang,Colin Zhang,Lei Zhang,Yuliang Liu,Xiang Bai
机构: Huazhong University of Science and Technology; hi lab, Xiaohongshu Inc
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed this http URL, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate this http URL from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, this http URL achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at this https URL.

[CV-27] SortScrews: A Dataset and Baseline for Real-time Screw Classification

【速读】:该论文旨在解决工业自动化、机器人技术和库存管理中螺钉类型自动识别的问题,其核心挑战在于公开可用的螺钉分类数据集稀缺,尤其是在受控单物体场景下(常见于自动分拣系统)。解决方案的关键在于构建一个名为SortScrews的数据集,包含560张分辨率为512×512的RGB图像,涵盖六种螺钉类型及背景类,并通过标准化采集设置引入光照和相机视角的轻微变化以增强泛化能力。此外,研究提供了可复用的数据采集脚本,支持用户使用低成本摄像头设备扩展至其他硬件组件。实验表明,在有限样本条件下,基于ImageNet预训练的轻量级模型(如EfficientNet-B0和ResNet-18)仍能实现高分类准确率,验证了受控采集条件对小样本学习的有效性。

链接: https://arxiv.org/abs/2603.13027
作者: Tianhao Fu,Bingxuan Yang,Juncheng Guo,Shrena Sribalan,Yucheng Chen
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所); Project Neura (项目神经); UTMIST (UTMIST); Amplimit (Amplimit)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single-object scenarios commonly encountered in automated sorting systems. In this work, we introduce **SortScrews**, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at 512×512 resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet-B0 and ResNet-18 classifiers pretrained on ImageNet. In addition, we conduct a well-explored failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at this https URL.

[CV-28] SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

【速读】:该论文旨在解决外科手术人工智能(Surgical AI)与仿真中的核心挑战,包括数据稀缺性、罕见事件合成困难以及从仿真到现实场景(sim-to-real)的迁移瓶颈。现有视频生成方法通常依赖昂贵的标注或复杂的结构化中间表示作为推理时的条件信号,限制了其可扩展性;同时,多数方法在复杂腹腔镜场景中缺乏时间一致性且真实感不足。解决方案的关键在于提出 Surgical Action World (SAW),一种基于轻量级条件信号的视频扩散模型,通过语言提示(编码工具-动作上下文)、参考手术场景、组织可操作性掩码(tissue affordance mask)和2D工具尖端轨迹四类信号进行条件控制,将视频到视频扩散重构为轨迹条件下的外科动作合成任务。该方法在12,044段腹腔镜视频上微调扩散模型,并引入深度一致性损失以确保几何合理性而无需推理时提供深度信息,从而实现了最先进的时序一致性(CD-FVD: 199.19 vs. 546.82)和高质量视觉表现,且在手术AI增强与仿真引擎构建等下游任务中展现出显著实用性。

链接: https://arxiv.org/abs/2603.13024
作者: Sampath Rapuri,Lalithkumar Seenivasan,Dominik Schneider,Roger Soberanis-Mukul,Yufan He,Hao Ding,Jiru Xu,Chenhao Yu,Chenyan Jing,Pengfei Guo,Daguang Xu,Mathias Unberath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: The manuscript is under review

点击查看摘要

Abstract:A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation – from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) – a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.

[CV-29] A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在下游任务中继承并放大训练数据中的社会偏见问题,同时确保模型公平性提升时不会显著损害其任务性能。解决方案的关键在于提出一种无需训练、不依赖标注数据的去偏方法,该方法在跨模态空间中具有闭式解(closed-form solution),能够在保证有界效用损失的前提下实现帕累托最优公平性(Pareto-optimal fairness),并能联合消除视觉和文本模态中的偏见,适用于零样本图像分类、图文检索和图文生成等多种下游任务。

链接: https://arxiv.org/abs/2603.12998
作者: Tangzheng Lian,Guanyu Hu,Yijing Ren,Dimitrios Kollias,Oya Celiktutan
机构: King’s College London (国王学院伦敦); Queen Mary University of London (皇后玛丽大学伦敦)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a **closed-form** solution in the cross-modal space, achieving Pareto-optimal fairness with **bounded** utility losses. Our method is **training-free**, requires **no** annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and **intersectional** fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.
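跨模态空间中的闭式去偏可以用正交投影来直观理解:把嵌入投影到偏见方向所张成子空间的正交补上,一步即可完成,无需任何训练。以下为一个极简示意(基于摘要描述的推测,并非论文的原始公式;变量名均为示例):

```python
import numpy as np

def debias_projection(embeddings, bias_dirs):
    # 闭式去偏示意:P = I - B (B^T B)^{-1} B^T,B 的列为偏见方向,
    # 对列满秩的 B,np.linalg.pinv(B) 即 (B^T B)^{-1} B^T
    B = np.asarray(bias_dirs, float).T              # (d, k)
    P = np.eye(B.shape[0]) - B @ np.linalg.pinv(B)  # 正交补投影矩阵
    return embeddings @ P.T

d = 8
bias = np.zeros(d); bias[0] = 1.0   # 假设第 0 维编码了受保护属性(如性别)
z = np.arange(d, dtype=float).reshape(1, -1) + 1.0
z_debiased = debias_projection(z, [bias])
print(z_debiased[0, 0])   # 偏见方向上的分量被消除,约为 0
```

由于整个操作只是一次矩阵乘法,它可以同时作用于视觉和文本两侧的嵌入,这与摘要中“训练自由、跨模态联合去偏”的设定是一致的。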

[CV-30] Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis CVPR2026

【速读】:该论文旨在解决学习带噪声标签(Learning with Noisy Labels, LNL)问题中,基于噪声转移矩阵(noise transition matrix, T)的统计一致性方法在实践中性能不佳的问题。尽管这些方法理论上能收敛到最优的干净数据分类器,但实际效果常被经验性方法(如样本选择)超越,传统观点认为这是由于T估计不准确所致。本文通过在理想条件下提供一个完美oracle转移矩阵,对这一假设进行了决定性检验,发现即便如此,噪声校正方法仍会在训练过程中出现性能崩溃。这表明问题根源并非T估计误差,而是更深层次的理论缺陷。解决方案的关键在于提出一个统一分析框架,从宏观收敛状态、微观优化动态和信息论极限三个层面揭示理想噪声校正失败的本质,并为设计更可靠的LNL方法提供理论依据与实践指导。

链接: https://arxiv.org/abs/2603.12997
作者: Chen Feng,Zhuo Zhi,Zhao Huang,Jiawei Ge,Ling Xiao,Nicu Sebe,Georgios Tzimiropoulos,Ioannis Patras
机构: Queen’s University Belfast (贝尔法斯特女王大学); University College London (伦敦大学学院); University of Aberdeen (阿伯丁大学); Queen Mary University of London (伦敦玛丽女王大学); Hokkaido University (北海道大学); University of Trento (特伦托大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026

点击查看摘要

Abstract:Statistically consistent methods based on the noise transition matrix ( T ) offer a theoretically grounded solution to Learning with Noisy Labels (LNL), with guarantees of convergence to the optimal clean-data classifier. In practice, however, these methods are often outperformed by empirical approaches such as sample selection, and this gap is usually attributed to the difficulty of accurately estimating T . The common assumption is that, given a perfect T , noise-correction methods would recover their theoretical advantage. In this work, we put this longstanding hypothesis to a decisive test. We conduct experiments under idealized conditions, providing correction methods with a perfect, oracle transition matrix. Even under these ideal conditions, we observe that these methods still suffer from performance collapse during training. This compellingly demonstrates that the failure is not fundamentally a T -estimation problem, but stems from a more deeply rooted flaw. To explain this behaviour, we provide a unified analysis that links three levels: macroscopic convergence states, microscopic optimisation dynamics, and information-theoretic limits on what can be learned from noisy labels. Together, these results give a formal account of why ideal noise correction fails and offer concrete guidance for designing more reliable methods for learning with noisy labels.
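文中讨论的基于噪声转移矩阵 T 的统计一致性方法,其典型形式是前向损失校正:用 T 把模型的干净后验映射为噪声标签的似然,再做负对数似然。下面用一个二分类玩具例子演示其计算(示意实现,非论文实验代码):

```python
import numpy as np

def forward_corrected_nll(probs, noisy_label, T):
    # 前向校正:T[i, j] = P(噪声标签 j | 干净标签 i),
    # 噪声标签的似然为 T^T p,损失为 -log((T^T p)_{y_noisy})
    noisy_probs = T.T @ probs
    return -np.log(noisy_probs[noisy_label])

T = np.array([[0.8, 0.2],
              [0.2, 0.8]])          # 20% 对称标签噪声(此处假设 T 为完美 oracle)
p_clean = np.array([1.0, 0.0])      # 模型确信干净标签为 0
# 即使观测到被翻转的标签 1,校正后的损失也是有限的:-log(0.2)
print(forward_corrected_nll(p_clean, 1, T))
```

论文的核心发现正是:即便像上面那样提供完美的 T,这类校正方法在训练中仍会出现性能崩溃,说明问题不在 T 的估计。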

[CV-31] Test-Time Attention Purification for Backdoored Large Vision Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在微调过程中易受后门攻击的问题,即攻击者通过向训练数据中注入携带触发器(trigger)的样本,使模型在测试时产生恶意行为。现有防御方法通常依赖于使用干净数据重新训练被污染的参数(如适配器或LoRA模块),但存在计算开销大且可能损害模型性能的缺点。论文的关键创新在于提出了一种新的机制性理解:后门行为并非由低级视觉模式引起,而是源于异常的跨模态注意力重分配——具体表现为携带触发器的视觉标记会“窃取”本应分配给文本上下文的注意力,这一现象被称为注意力窃取(attention stealing)。基于此认知,作者设计了CleanSight,一种无需训练、可直接部署的测试时防御方案,其核心是:(i) 通过检测选定跨模态融合层中的相对视觉-文本注意力比例来识别中毒输入;(ii) 通过选择性剪枝高注意力可疑视觉标记以中和后门激活,从而在不损害模型正常功能的前提下有效抵御多种类型的后门攻击。

链接: https://arxiv.org/abs/2603.12989
作者: Zhifang Zhang,Bojun Yang,Shuo He,Weitong Chen,Wei Emma Zhang,Olaf Maennel,Lei Feng,Miao Xu
机构: University of Queensland (昆士兰大学); Southeast University (东南大学); Nanyang Technological University (南洋理工大学); Adelaide University (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context - a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model’s utility on both clean and poisoned samples.
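CleanSight 的“检测 + 净化”两步流程可以用如下玩具代码示意:先比较视觉与文本 token 获得的注意力质量之比(比值异常高提示 attention stealing),再剪除注意力最高的可疑视觉 token(函数名、数据与剪枝数量 k 均为本文虚构的示例,非论文原始实现):

```python
import numpy as np

def visual_text_ratio(attn, n_visual):
    # attn:对各输入 token 的注意力权重(已归一化),前 n_visual 个为视觉 token
    v, t = attn[:n_visual].sum(), attn[n_visual:].sum()
    return v / max(t, 1e-8)

def prune_suspicious(attn, n_visual, k):
    # 剪掉注意力最高的 k 个视觉 token,返回保留的视觉 token 下标
    order = np.argsort(attn[:n_visual])[::-1]
    return np.sort(order[k:])

attn = np.array([0.05, 0.6, 0.05, 0.1, 0.2])  # token 1 是异常高注意力的视觉 token
print(visual_text_ratio(attn, 3))              # 0.7 / 0.3 ≈ 2.33
print(prune_suspicious(attn, 3, 1))            # 保留视觉 token [0, 2]
```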

[CV-32] Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning

【速读】:该论文旨在解决多类肺部疾病(健康、新冠、腺癌、鳞状细胞癌)在胸部CT图像上的公平诊断问题,核心挑战在于病理信号稀疏且跨性别和疾病类别存在严重的人口统计学不平衡。解决方案的关键在于提出一种基于注意力机制的多实例学习(Multiple Instance Learning, MIL)模型,以ConvNeXt为骨干网络自动识别诊断相关切片而无需逐切片标注,并引入梯度反转层(Gradient Reversal Layer, GRL)对抗性地抑制扫描表示中与性别相关的特征,从而提升性别公平性;同时结合焦点损失(focal loss)与标签平滑、按(疾病类别, 性别)联合分层交叉验证及对最稀缺子群的针对性过采样策略,最终通过五折模型集成与测试时翻转增强实现鲁棒预测。

链接: https://arxiv.org/abs/2603.12988
作者: Aditya Parikh,Aasa Feragen
机构: Technical University of Denmark (DTU)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS-AIF-MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories – Healthy, COVID-19, Adenocarcinoma, and Squamous Cell Carcinoma – with performance measured as the average of per-gender macro F1 scores, explicitly penalizing gender-inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention-based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice-level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender-predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross-validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five-fold checkpoints are ensembled with horizontal-flip test-time augmentation via soft logit voting and out-of-fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at this https URL
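该竞赛指标(逐性别 macro F1 的平均)可按如下方式计算。示意代码用一个玩具例子展示了它如何惩罚在某一性别分组上退化的模型(函数名与数据均为示例):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    # 逐类计算 F1 = 2*tp / (2*tp + fp + fn),再取平均
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))
    return float(np.mean(f1s))

def gender_fair_score(y_true, y_pred, gender, n_classes=4):
    # 竞赛指标:对每个性别分组分别计算 macro F1,再取平均
    scores = [macro_f1(y_true[gender == g], y_pred[gender == g], n_classes)
              for g in np.unique(gender)]
    return float(np.mean(scores))

y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_pred = np.array([0, 1, 2, 3, 0, 0, 0, 0])  # 对第二个性别分组的预测退化
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(gender_fair_score(y_true, y_pred, gender))  # (1.0 + 0.1) / 2 = 0.55
```

由于取的是逐性别得分的平均而非整体 macro F1,只在多数分组上表现好无法掩盖在某一性别上的失败。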

[CV-33] SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated learning

【速读】:该论文旨在解决联邦学习中因高分辨率仪器数据流导致的极端类别不平衡问题,以及现有方法在无法有效聚合数据时性能下降或因局部启发式选择导致核心子集(coreset)非代表性的问题。解决方案的关键在于提出SCOPE框架,其通过正交投影嵌入(Orthogonal Projection Embeddings)对数据点进行多维评分:包括衡量核心类别特征可靠性的表示分数、量化正交残差新颖性的多样性分数,以及指示与竞争类别相似性的边界接近度分数;该框架仅向联邦服务器传输标量指标以构建全局共识,从而实现通信高效性,并基于此共识动态过滤本地噪声和剔除冗余样本,有效缓解全局特征偏斜,最终在保持高准确率和鲁棒收敛的同时显著降低带宽消耗(128–512倍)和计算开销(7.72倍加速)。

链接: https://arxiv.org/abs/2603.12976
作者: Md Anwar Hossen,Nathan R. Tallent,Luanzheng Guo,Ali Jannesary
机构: Iowa State University (爱荷华州立大学); Pacific Northwest National Laboratory (太平洋西北国家实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scientific discovery increasingly requires learning on federated datasets, fed by streams from high-resolution instruments, that have extreme class imbalance. Current ML approaches either require impractical data aggregation or fail due to class imbalance. Existing coreset selection methods rely on local heuristics, making them unaware of the global data landscape and prone to sub-optimal and non-representative pruning. To overcome these challenges, we introduce SCOPE (Semantic Coreset using Orthogonal Projection Embeddings for Federated learning), a coreset framework for federated data that filters anomalies and adaptively prunes redundant data to mitigate long-tail skew. By analyzing the latent space distribution, we score each data point using a representation score that measures the reliability of core class features, a diversity score that quantifies the novelty of orthogonal residuals, and a boundary proximity score that indicates similarity to competing classes. Unlike prior methods, SCOPE shares only scalar metrics with a federated server to construct a global consensus, ensuring communication efficiency. Guided by the global consensus, SCOPE dynamically filters local noise and discards redundant samples to counteract global feature skews. Extensive experiments demonstrate that SCOPE yields competitive global accuracy and robust convergence, all while achieving exceptional efficiency with a 128x to 512x reduction in uplink bandwidth, a 7.72x wall-clock acceleration and reduced FLOP and VRAM footprints for local coreset selection.
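SCOPE 的三类打分可以用正交投影直观示意:在本类子空间上的投影范数对应表示分数,正交残差范数对应多样性分数,与竞争类中心的余弦相似度对应边界接近度。以下为基于摘要描述推测的极简示意(非论文原始实现,变量名均为示例):

```python
import numpy as np

def scope_scores(z, class_basis, other_centroids):
    # representation:在本类主子空间上的投影范数(核心特征可靠性)
    # diversity:正交残差范数(新颖性)
    # boundary:与最近竞争类中心的余弦相似度(边界接近度)
    Q, _ = np.linalg.qr(class_basis.T)   # 正交化类子空间基,投影对符号不敏感
    proj = Q @ (Q.T @ z)
    residual = z - proj
    rep = float(np.linalg.norm(proj))
    div = float(np.linalg.norm(residual))
    cos = [z @ c / (np.linalg.norm(z) * np.linalg.norm(c)) for c in other_centroids]
    return rep, div, float(max(cos))

basis = np.array([[1.0, 0.0, 0.0]])   # 本类子空间:x 轴(玩具设定)
others = np.array([[0.0, 1.0, 0.0]])  # 竞争类中心:y 轴
z = np.array([3.0, 4.0, 0.0])
rep, div, boundary = scope_scores(z, basis, others)
print(rep, div, round(boundary, 2))   # 3.0 4.0 0.8
```

注意这三个量都是标量,因此只需把它们上传到联邦服务器即可构建全局共识,这正是摘要中通信高效性的来源。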

[CV-34] Thinking in Streaming Video

【速读】:该论文旨在解决现有视频推理方法在流式视频场景中因采用批处理范式而导致的高延迟和计算成本增长的问题,从而无法满足交互式智能体对实时理解连续视频流的需求。其解决方案的关键在于提出ThinkStream框架,该框架基于“观看-思考-说话”(Watch–Think–Speak)范式,使模型能够随着新视频观测的到达逐步更新理解,并在每一步进行短时推理更新以判断是否已积累足够证据生成响应;同时引入推理压缩流式记忆(Reasoning-Compressed Streaming Memory, RCSM),将中间推理轨迹作为紧凑的语义记忆替代过时的视觉标记,从而实现长时程流式处理下的上下文保留与内存效率优化。

链接: https://arxiv.org/abs/2603.12938
作者: Zikang Liu,Longteng Guo,Handong Li,Ru Zhen,Xingjian He,Ruyi Ji,Xiaoming Ren,Yanhao Zhang,Haonan Lu,Jing Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch–Think–Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at this https URL

[CV-35] SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization

【速读】:该论文旨在解决非刚性三维形状之间点对点对应关系的准确建立问题,尤其在非等距变形(non-isometric deformations)和拓扑噪声(topological noise)下的挑战。现有基于函数映射(functional map)的方法因几何描述子无法消除歧义,以及截断谱基投影到稠密点对应时存在的空间不一致性,导致匹配性能受限。解决方案的关键在于提出SGMatch框架:首先设计了语义引导的局部交叉注意力模块(Semantic-Guided Local Cross-Attention),将视觉基础模型提取的语义特征融合进几何描述子,同时保持局部结构连续性;其次引入基于条件流匹配(conditional flow matching)的正则化目标,通过监督时变速度场来增强恢复对应关系的空间平滑性。该方法在多个基准测试中均表现出优于现有方法的鲁棒性和精度,尤其是在非等距变形和拓扑噪声场景下具有显著优势。

链接: https://arxiv.org/abs/2603.12937
作者: Tianwei Ye,Xiaoguang Mei,Yifan Xia,Fan Fan,Jun Huang,Jiayi Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 13 figures

点击查看摘要

Abstract:Establishing accurate point-to-point correspondences between non-rigid 3D shapes remains a critical challenge, particularly under non-isometric deformations and topological noise. Existing functional map pipelines suffer from ambiguities that geometric descriptors alone cannot resolve, and spatial inconsistencies inherent in the projection of truncated spectral bases to dense pointwise correspondences. In this paper, we introduce SGMatch, a learning-based framework for semantic-guided non-rigid shape matching. Specifically, we design a Semantic-Guided Local Cross-Attention module that integrates semantic features from vision foundation models into geometric descriptors while preserving local structural continuity. Furthermore, we introduce a regularization objective based on conditional flow matching, which supervises a time-varying velocity field to encourage spatial smoothness of the recovered correspondences. Experimental results on multiple benchmarks demonstrate that SGMatch achieves competitive performance across near-isometric settings and consistent improvements under non-isometric deformations and topological noise.
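摘要中基于条件流匹配(conditional flow matching)的正则化,其标准训练形式是:沿线性插值路径采样中间点,并让速度场回归目标速度 x1 − x0。下面给出该目标的最小示意(通用 CFM 损失的玩具实现,与论文中用于对应关系平滑性的具体用法可能不同):

```python
import numpy as np

def cfm_loss(v_field, x0, x1, t):
    # 线性插值路径:x_t = (1-t)*x0 + t*x1,目标速度 u = x1 - x0,
    # 损失为 ||v(x_t, t) - u||^2 在 batch 上的均值
    x_t = (1 - t)[:, None] * x0 + t[:, None] * x1
    u = x1 - x0
    pred = v_field(x_t, t)
    return float(np.mean(np.sum((pred - u) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 2))        # 源形状上的点(玩具数据)
x1 = x0 + np.array([1.0, 0.0])       # 目标:整体平移
t = rng.uniform(size=16)
perfect = lambda x, t: np.tile([1.0, 0.0], (len(x), 1))  # 真实的恒定速度场
zero = lambda x, t: np.zeros_like(x)
print(cfm_loss(perfect, x0, x1, t))  # 接近 0
print(cfm_loss(zero, x0, x1, t))     # 约 1.0
```

直觉上,监督这样一个时变速度场会迫使恢复出的位移在空间上平滑、连续,从而起到摘要所述的正则化作用。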

[CV-36] MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins

【速读】:该论文旨在解决静态3D网格(static 3D meshes)向可交互关节资产(articulated assets)转换过程中存在的物理未对齐问题,尤其针对零样本(zero-shot)场景下因缺乏物理先验而导致的运动学幻觉(kinematic hallucinations)和模拟中严重的网格穿插(mesh inter-penetration)问题。解决方案的关键在于提出MotionAnymesh框架,其核心创新包括:1)引入一种基于SP4D(Space-Time-Physics-4D)物理先验的运动学感知部件分割模块,通过显式物理约束 grounding 视觉-语言模型(VLM)推理,有效消除运动学幻觉;2)设计几何-物理联合估计流程,结合类型感知初始化与物理约束轨迹优化,确保关节运动在物理仿真中无碰撞、可执行。

链接: https://arxiv.org/abs/2603.12936
作者: WenBo Xu,Liu Liu,Li Zhang,Dan Guo,RuoNan Liu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 figures

点击查看摘要

Abstract:Converting static 3D meshes into interactable articulated assets is crucial for embodied AI and robotic simulation. However, existing zero-shot pipelines struggle with complex assets due to a critical lack of physical grounding. Specifically, ungrounded Vision-Language Models (VLMs) frequently suffer from kinematic hallucinations, while unconstrained joint estimation inevitably leads to catastrophic mesh inter-penetration during physical simulation. To bridge this gap, we propose MotionAnymesh, an automated zero-shot framework that seamlessly transforms unstructured static meshes into simulation-ready digital twins. Our method features a kinematic-aware part segmentation module that grounds VLM reasoning with explicit SP4D physical priors, effectively eradicating kinematic hallucinations. Furthermore, we introduce a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to rigorously guarantee collision-free articulation. Extensive experiments demonstrate that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.

[CV-37] Rethinking VLMs for Image Forgery Detection and Localization

【速读】:该论文旨在解决生成式 AI (Generative AI) 时代下图像伪造检测与定位(Image Forgery Detection and Localization, IFDL)面临的挑战,即如何有效利用视觉-语言模型(Vision-Language Models, VLMs)提升检测性能。其关键在于发现 VLM 的先验知识因偏向语义合理性而非真实性,反而可能抑制检测效果;而通过引入显式的伪造位置掩码(location masks)作为额外先验,可引导 VLM 更高效地学习伪造特征,从而增强模型的可解释性并显著提升检测与定位性能。基于此洞察,作者提出了新的 IFDL-VLM 流水线,在9个主流基准上验证了其在域内和跨数据集泛化场景下的最优表现。

链接: https://arxiv.org/abs/2603.12930
作者: Shaofeng Guo,Jiequan Cui,Richang Hong
机构: Hefei University of Technology (合肥工业大学); Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education (教育部知识工程大数据重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8pages

点击查看摘要

Abstract:With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision-language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL-VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in-domain and cross-dataset generalization settings. The experimental results show that we consistently achieve new state-of-the-art performance in detection and localization. Code is available at: this https URL.

[CV-38] VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation CVPR2026

【速读】:该论文旨在解决跨视角位姿估计(cross-view pose estimation)中因地面视图与卫星视图之间存在显著视角差异而导致的定位精度下降问题,尤其在GNSS信号受遮挡或多径效应影响时,传统方法难以维持鲁棒性。其解决方案的关键在于提出一种基于双轴变换(dual-axis transformation, VIRD)的视图不变表示构建方法:首先对卫星图像进行极坐标变换以建立水平方向上的对应关系,随后在地面图像与极坐标变换后的卫星特征上引入增强上下文的位置注意力机制,从而显式缓解垂直方向上的错位问题;同时设计视图重建损失函数进一步强化特征的视图不变性,使学习到的表示能够重构原始及跨视角图像,从而显著提升位姿估计精度。

链接: https://arxiv.org/abs/2603.12918
作者: Juhye Park,Wooju Lee,Dasol Hong,Changki Sung,Youngwoo Seo,Dongwan Kang,Hyun Myung
机构: KAIST (韩国科学技术院); Hanwha Aerospace (韩华航空航天)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.
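上文提到 VIRD 先对卫星图像做极坐标变换以建立水平方向的对应关系。下面给出这一极坐标重采样思路的最小示意(角度/半径分辨率与最近邻插值均为示意性假设,并非论文原实现):

```python
import numpy as np

def polar_transform(img, num_angles=360, num_radii=None):
    """以图像中心为极点重采样:输出的行对应角度、列对应半径。
    仅示意极坐标对齐的思想,采样方案并非 VIRD 论文原设计。"""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = min(cy, cx)
    if num_radii is None:
        num_radii = int(max_r)
    thetas = np.linspace(0, 2 * np.pi, num_angles, endpoint=False)
    radii = np.linspace(0, max_r, num_radii)
    # 简单起见采用最近邻取样
    ys = np.clip(np.round(cy + radii[None, :] * np.sin(thetas[:, None])), 0, h - 1).astype(int)
    xs = np.clip(np.round(cx + radii[None, :] * np.cos(thetas[:, None])), 0, w - 1).astype(int)
    return img[ys, xs]

# 玩具示例:64x64 渐变图
sat = np.arange(64 * 64, dtype=float).reshape(64, 64)
polar = polar_transform(sat, num_angles=90)
print(polar.shape)  # (90, 31)
```

变换后行方向即角度方向,可与地面全景图的水平方向建立对应,这正是摘要所述"horizontal correspondence"的出发点。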

[CV-39] Stake the Points: Structure-Faithful Instance Unlearning CVPR2026

【速读】:该论文旨在解决预训练模型中机器遗忘(Machine Unlearning, MU)所面临的隐私风险与知识保留之间的平衡问题,特别是现有方法往往忽视了在删除指定数据后对保留数据间语义关系的维护,导致模型出现渐进式结构坍塌,从而破坏删除-保留平衡。解决方案的关键在于提出一种结构忠实(structure-faithful)框架,引入“stakes”——即语义锚点(semantic anchors),作为知识结构的参考基准以维持语义组织稳定性;通过语言驱动的属性描述编码(如CLIP)生成锚点,并利用结构感知对齐(structure-aware alignment)和正则化机制,分别确保遗忘前后保留知识的组织一致性以及关键参数更新的稳定性,从而显著提升图像分类、检索和人脸识别任务中的性能表现,平均提升分别为32.9%、22.5%和19.3%,有效平衡删除与保留效果并增强泛化能力。

链接: https://arxiv.org/abs/2603.12915
作者: Kiseong Hong,JungKyoo Shin,Eunwoo Kim
机构: Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.

[CV-40] FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts

【速读】:该论文针对联邦域泛化行人重识别(Federated Domain Generalization for Person Re-Identification, FedDG-ReID)中视觉Transformer(Vision Transformer, ViT)因全局注意力机制难以区分高相似背景或不同视角下的行人所导致的特征混淆问题,提出了一种基于可学习视觉提示的解决方案——联邦身体分布感知视觉提示(Federated Body Distribution Aware Visual Prompt, FedBPrompt)。其核心创新在于引入了两个关键机制:一是整体全身提示(Holistic Full Body Prompts),用于抑制跨客户端背景噪声;二是身体部位对齐提示(Body Part Alignment Prompts),以捕捉鲁棒于姿态和视角变化的细粒度特征。此外,为降低通信开销,设计了提示微调策略(Prompt-based Fine-Tuning Strategy, PFTS),仅更新轻量级提示参数而冻结ViT主干网络,在减少通信成本的同时保持模型适应能力。实验表明,FedBPrompt显著提升了特征判别力与跨域泛化性能,且可无缝集成至现有基于ViT的FedDG-ReID框架中。

链接: https://arxiv.org/abs/2603.12912
作者: Xin Xu,Weilong Li,Wei Liu,Wenke Huang,Zhixi Yu,Bin Yang,Xiaoying Liao,Kui Jiang
机构: Wuhan University of Science and Technology (武汉科技大学); Wuhan University (武汉大学); Nanyang Technological University (南洋理工大学); Changsha Bus Group (长沙公交集团); Central South University of Forestry and Technology (中南林业科技大学); Harbin Institute of Technology Zhengzhou Research Institute (哈尔滨工业大学郑州研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints – a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at this https URL.

[CV-41] DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification

【速读】:该论文旨在解决农业监测中因类别分布极度不均衡(long-tailed distribution)和标签获取成本高导致的数据稀缺问题,尤其是在少样本学习(Few-Shot Learning, FSL)场景下,传统方法通过人工平衡训练集虽能缓解数据不足,但引入了与真实世界分布的偏差(distribution shift),从而削弱模型在实际农业任务中的泛化能力。其解决方案的关键在于引入Dirichlet Prior Augmentation (DirPA),通过在训练过程中主动模拟先验分布来缓解标签分布偏斜的影响,从而提升模型在极端长尾分布下的鲁棒性和稳定性,并显著改善各具体类别的性能表现。

链接: https://arxiv.org/abs/2603.12905
作者: Joana Reuss,Ekaterina Gikalo,Marco Körner
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 9 Figures, 28 Tables

点击查看摘要

Abstract:Real-world agricultural monitoring is often hampered by severe class imbalance and high label acquisition costs, resulting in significant data scarcity. In few-shot learning (FSL) – a framework specifically designed for data-scarce settings – training sets are often artificially balanced. However, this creates a disconnect from the long-tailed distributions observed in nature, leading to a distribution shift that undermines the model’s ability to generalize to real-world agricultural tasks. We previously introduced Dirichlet Prior Augmentation (DirPA; Reuss et al., 2026a) to proactively mitigate the effects of such label distribution skews during model training. In this work, we extend the original study’s geographical scope. Specifically, we evaluate this extended approach across multiple countries in the European Union (EU), moving beyond localized experiments to test the method’s resilience across diverse agricultural environments. Our results demonstrate the effectiveness of DirPA across different geographical regions. We show that DirPA not only improves system robustness and stabilizes training under extreme long-tailed distributions, regardless of the target region, but also substantially improves individual class-specific performance by proactively simulating priors.
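DirPA 的核心思想是在训练中主动模拟标签先验的偏斜。下面用 Dirichlet 分布采样类别先验、再据其生成单个 episode 标签的示意代码说明这一机制(alpha 取值与具体采样流程均为本文假设,并非论文原配置):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode_labels(n_classes, n_queries, alpha, rng):
    """从 Dirichlet(alpha, ..., alpha) 采样一组类别先验,
    alpha 越小先验越偏斜(长尾),越大越接近均匀;
    再按该先验为一个 episode 采样查询标签。"""
    prior = rng.dirichlet(np.full(n_classes, alpha))
    labels = rng.choice(n_classes, size=n_queries, p=prior)
    return prior, labels

prior, labels = sample_episode_labels(n_classes=5, n_queries=100, alpha=0.3, rng=rng)
print(prior.round(3), np.bincount(labels, minlength=5))
```

在训练时反复重采样这样的先验,模型便会在各种长尾分布下见过数据,从而缓解"人工平衡训练集 vs. 真实长尾分布"之间的偏移。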

[CV-42] Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis CVPR2026

【速读】:该论文旨在解决LiDAR神经辐射场(LiDAR NeRF)在场景重建中对精确相机位姿的高度依赖性,以及由LiDAR数据稀疏性和无纹理特性导致的几何空洞和表面不连续问题。解决方案的关键在于提出一种无需位姿信息的框架SG-NLF,其核心创新包括:(1) 基于光谱先验设计混合表示以重建平滑几何结构;(2) 构建基于特征兼容性的置信度感知图实现全局位姿对齐;(3) 引入对抗学习策略增强跨帧一致性,从而显著提升重建质量和位姿估计精度。

链接: https://arxiv.org/abs/2603.12903
作者: Yinuo Jiang,Jun Cheng,Yiran Wang,Cheng Cheng
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%. Our work can provide a novel perspective for LiDAR view synthesis.

[CV-43] Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

【速读】:该论文旨在解决扩散模型(diffusion model)在后训练阶段中,如何通过强化学习(reinforcement learning, RL)更高效地提升图像质量与提示词对齐度的问题。现有方法通常将采样过程中的每一步视为独立的策略动作,导致更新方差较大且优化效率低。本文的关键解决方案是提出一种在线强化学习变体,将整个采样过程视为单一动作,并通过配对轨迹采样来降低模型更新的方差——具体而言,通过拉取流速方向朝向更具优势的图像,实现更稳定的梯度估计和更快的收敛。实验表明,该方法在使用高质量视觉语言模型或现成质量指标作为奖励信号时,能显著优于先前方法,在图像质量和提示词一致性上均取得提升。

链接: https://arxiv.org/abs/2603.12893
作者: David McAllister,Miika Aittala,Tero Karras,Janne Hellsten,Angjoo Kanazawa,Timo Aila,Samuli Laine
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
备注: Code available at this https URL

点击查看摘要

Abstract:Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
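摘要将整个采样过程视为单一动作,并通过配对轨迹把流速"拉向"更优图像以降低更新方差。下面用一个通用的配对扰动有限差分(SPSA 风格)更新来示意这一思想(玩具奖励函数、扰动幅度与步长均为假设,并非论文的具体目标函数):

```python
import numpy as np

rng = np.random.default_rng(1)

def spsa_step(theta, reward_fn, c=0.1, lr=0.05, rng=rng):
    """一次配对扰动的有限差分更新:两条"轨迹"共享同一扰动方向,
    参数被拉向奖励更高的一侧。把整个采样器当作单一动作、
    用成对样本估计梯度,正是摘要思想的简化版。"""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    r_plus = reward_fn(theta + c * delta)
    r_minus = reward_fn(theta - c * delta)
    grad_est = (r_plus - r_minus) / (2 * c) * delta
    return theta + lr * grad_est

# 玩具奖励:在 theta = [2, -1] 处取最大
target = np.array([2.0, -1.0])
reward = lambda th: -np.sum((th - target) ** 2)

theta = np.zeros(2)
for _ in range(500):
    theta = spsa_step(theta, reward)
print(theta.round(2))
```

由于两侧轨迹共享扰动方向,奖励差分中与方向无关的噪声相互抵消,这与论文"配对轨迹降低更新方差"的动机一致。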

[CV-44] Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning

【速读】:该论文旨在解决癫痫发作预测(epileptic seizure forecasting)这一临床重要但极具挑战性的问题,尤其针对现有方法依赖侵入式神经信号(如脑电图 EEG)导致难以长期部署于真实场景的局限性。其核心创新在于提出了一种基于视频的癫痫发作预测新任务,利用短时前发作期视频片段(3–10秒)预测未来5秒内是否发生癫痫发作。解决方案的关键在于设计了一种跨物种迁移学习框架,通过大规模啮齿类动物视频数据进行辅助预训练,使模型能够捕捉跨物种通用的癫痫相关行为动态,从而在人类视频数据标注稀缺的情况下实现高精度预测(超过70%),显著优于现有基线方法。

链接: https://arxiv.org/abs/2603.12887
作者: Mingkai Zhai,Wei Wang,Zongsheng Li,Quanying Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.
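任务定义为:用短时前发作期片段预测随后 5 秒内是否发作。下面示意如何根据发作起始时间为滑动窗口打正/负标签(窗长与步长取值为本文假设,仅用于说明数据构造方式):

```python
def label_windows(n_frames, fps, seizure_onsets, win_s=5.0, horizon_s=5.0, stride_s=1.0):
    """以 stride_s 为步长滑动 win_s 秒的窗口;若某次发作起始落在
    窗口结束后的 horizon_s 秒内(即摘要中的"随后 5 秒"),记为正样本。"""
    windows = []
    t, dur = 0.0, n_frames / fps
    while t + win_s <= dur:
        end = t + win_s
        positive = any(end < onset <= end + horizon_s for onset in seizure_onsets)
        windows.append((t, end, int(positive)))
        t += stride_s
    return windows

# 20 秒、30 fps 的录像,发作起始于第 14 秒
wins = label_windows(n_frames=600, fps=30, seizure_onsets=[14.0])
positives = [w for w in wins if w[2] == 1]
print(len(wins), positives)
```

这样构造的样本即可用于训练二分类的发作预测模型;跨物种预训练则是在啮齿类视频上先以同样方式构造样本。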

[CV-45] A protocol for evaluating robustness to HE staining variation in computational pathology models

【速读】:该论文旨在解决组织病理学中染色变异(hematoxylin and eosin, HE)染色差异对计算病理学(Computational Pathology, CPath)模型性能稳定性的影响问题。由于HE染色在不同实验室间存在显著差异,导致模型在实际部署时性能波动大,难以可靠应用。其解决方案的关键在于提出了一种三步评估协议:首先选定参考染色条件,其次表征测试集的染色特性,最后在模拟的参考染色条件下应用CPath模型进行预测。该方法通过系统性模拟四种染色条件(高低HE强度、高低HE颜色相似度),量化模型的分类性能(AUC)与鲁棒性(min-max AUC范围),从而实现基于鲁棒性的模型筛选,并揭示模型在不同染色条件下的性能变化趋势,为模型的实际部署提供可操作的性能边界。

链接: https://arxiv.org/abs/2603.12886
作者: Lydia A. Schönpflug,Nikki van den Berg,Sonali Andani,Nanda Horeweg,Jurriaan Barkey Wolf,Tjalling Bosse,Viktor H. Koelzer,Maxime W. Lafarge
机构: University of South Bohemia in České Budějovice (南波希米亚大学); Max Planck Institute for Human Development (马克斯·普朗克人类发展研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sensitivity to staining variation remains a major barrier to deploying computational pathology (CPath) models as hematoxylin and eosin (HE) staining varies across laboratories, requiring systematic assessment of how this variability affects model prediction. In this work, we developed a three-step protocol for evaluating robustness to HE staining variation in CPath models. Step 1: Select reference staining conditions, Step 2: Characterize test set staining properties, Step 3: Apply CPath model(s) under simulated reference staining conditions. Here, we first created a new reference staining library based on the PLISM dataset. As an exemplary use case, we applied the protocol to assess the robustness properties of 306 microsatellite instability (MSI) classification models on the unseen SurGen colorectal cancer dataset (n=738), including 300 attention-based multiple instance learning models trained on the TCGA-COAD/READ datasets across three feature extractors (UNI2-h, H-Optimus-1, Virchow2), alongside six public MSI classification models. Classification performance was measured as AUC, and robustness as the min-max AUC range across four simulated staining conditions (low/high HE intensity, low/high HE color similarity). Across models and staining conditions, classification performance ranged from AUC 0.769-0.911 ( \Delta = 0.142). Robustness ranged from 0.007-0.079 ( \Delta = 0.072), and showed a weak inverse correlation with classification performance (Pearson r=-0.22, 95% CI [-0.34, -0.11]). Thus, we show that the proposed evaluation protocol enables robustness-informed CPath model selection and provides insight into performance shifts across HE staining conditions, supporting the identification of operational ranges for reliable model deployment. Code is available at this https URL.
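文中把鲁棒性定义为四种模拟染色条件下 AUC 的 min-max 范围,可直接如下计算(AUC 数值为虚构示例,仅用于演示指标本身):

```python
# 某一模型在四种模拟染色条件下的 AUC(数值为虚构示例)
aucs = {
    "low_HE_intensity": 0.87,
    "high_HE_intensity": 0.85,
    "low_HE_color_similarity": 0.84,
    "high_HE_color_similarity": 0.88,
}
# 鲁棒性 = 最大 AUC 与最小 AUC 之差(min-max AUC range),越小越稳
robustness = max(aucs.values()) - min(aucs.values())
print(round(robustness, 3))  # 0.04
```

将 306 个模型各自的该指标与其平均 AUC 对照,即可做文中所述"robustness-informed"的模型筛选。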

[CV-46] TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking

【速读】:该论文旨在解决现有数据嵌入方法在局部字符编码中对噪声敏感、依赖边缘特征或预定义码本而导致鲁棒性不足的问题。其解决方案的关键在于提出一种结构感知的框架TRACE,通过利用字符结构的稳定性与统一表示特性来增强抗噪能力;具体包括三个核心组件:(1) 自适应扩散初始化,借助运动概率估计器(MPE)、目标点估计器(TPE)和掩码绘制模型(MDM)自动识别控制点、目标点及编辑区域;(2) 引导扩散编码以实现选定点的精确移动;(3) 基于专用损失函数的掩码区域替换机制,最小化扩散过程后的特征扰动。该方法在跨媒体传输后仍保持高提取准确率(提升5%)和PSNR性能(超过5 dB),且具备多语言、多字体的广泛泛化能力。

链接: https://arxiv.org/abs/2603.12873
作者: Jiale Meng,Jie Zhang,Runyi Hu,Zhe-Ming Lu,Tianwei Zhang,Yiming Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose TRACE, a structure-aware framework leveraging diffusion models for localized character encoding to embed data. Unlike existing methods that rely on edge features or pre-defined codebooks, TRACE exploits character structures that provide inherent resistance to noise interference due to their stability and unified representation across diverse characters. Our framework comprises three key components: (1) adaptive diffusion initialization that automatically identifies handle points, target points, and editing regions through specialized algorithms including movement probability estimator (MPE), target point estimation (TPE) and mask drawing model (MDM), (2) guided diffusion encoding for precise movement of selected points, and (3) masked region replacement with a specialized loss function to minimize feature alterations after the diffusion process. Comprehensive experiments demonstrate TRACE’s superior performance over state-of-the-art methods, achieving more than 5 dB improvement in PSNR and 5% higher extraction accuracy following cross-media transmission. TRACE achieves broad generalizability across multiple languages and fonts, making it particularly suitable for practical document security applications.

[CV-47] Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

【速读】:该论文旨在解决自动驾驶中安全关键的“长尾”边缘场景(long-tail edge cases)难以有效合成的问题,这类场景通常由常见交通元素的异常组合引发。现有可控生成模型存在指导信息不完整或耦合的问题,无法独立操控场景结构、物体身份和自车动作。解决方案的关键在于提出CompoSIA——一个解耦的驾驶视频模拟器,通过引入噪声级别身份注入(noise-level identity injection)实现跨姿态的身份无关生成,以及分层双分支动作控制机制(hierarchical dual-branch action control),从而实现对场景结构、对象身份和自车行为的细粒度独立控制。这种解耦控制能力使得能够系统性地将安全元素组合成危险配置,生成传统耦合生成器无法实现的对抗性驾驶场景,并在下游压力测试中显著暴露规划器缺陷,碰撞率平均提升173%。

链接: https://arxiv.org/abs/2603.12864
作者: Yifan Zhan,Zhengqing Chen,Qingjie Wang,Zhuo He,Muyao Niu,Xiaoyang Guo,Wei Yin,Weiqiang Ren,Qian Zhang,Yinqiang Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A major challenge in autonomous driving is the “long tail” of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. To support controllable identity replacement of scene elements, we propose a noise-level identity injection, allowing pose-agnostic identity generation across diverse element poses, all from a single reference image. Furthermore, a hierarchical dual-branch action control mechanism is introduced to improve action controllability. Such disentangled control enables adversarial scenario synthesis-systematically combining safe elements into dangerous configurations that entangled generators cannot produce. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average 3-second collision rate increases by 173%.

[CV-48] Wear Classification of Abrasive Flap Wheels using a Hierarchical Deep Learning Approach

【速读】:该论文旨在解决 abrasive flap wheels(砂带磨轮)在加工复杂自由曲面时因柔性导致的磨损模式复杂化问题,如凹形/凸形砂带轮廓或砂带撕裂等,这些问题会显著影响磨削质量。解决方案的关键在于提出一种基于视觉的分层分类框架,将磨损状态监测任务分解为三个逻辑层级:(1)状态检测(新 vs. 磨损),(2)磨损类型识别(矩形、凹形、凸形及砂带撕裂),(3)严重程度评估(局部 vs. 完全变形)。该方法利用自建的真实砂带图像数据集,并采用 EfficientNetV2 的迁移学习策略实现高精度分类(准确率93.8%~99.3%),同时结合 Grad-CAM 可视化技术验证模型学习到物理相关特征并辅助分析误分类原因,从而为自动化砂带磨削过程中的自适应控制与磨损补偿提供可靠依据。

链接: https://arxiv.org/abs/2603.12852
作者: Falko Kähler,Maxim Wille,Ole Schmedemann,Thorsten Schüppstuhl
机构: Hamburg University of Technology, Institute of Aircraft Production Technology(汉堡工业大学航空制造技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 11 figures, 8 tables

点击查看摘要

Abstract:Abrasive flap wheels are commonly used for finishing complex free-form surfaces due to their flexibility. However, this flexibility results in complex wear patterns such as concave/convex flap profiles or flap tears, which influence the grinding result. This paper proposes a novel, vision-based hierarchical classification framework to automate the wear condition monitoring of flap wheels. Unlike monolithic classification approaches, we decompose the problem into three logical levels: (1) state detection (new vs. worn), (2) wear type identification (rectangular, concave, convex) and flap tear detection, and (3) severity assessment (partial vs. complete deformation). A custom-built dataset of real flap wheel images was generated and a transfer learning approach with EfficientNetV2 architecture was used. The results demonstrate high robustness with classification accuracies ranging from 93.8% (flap tears) to 99.3% (concave severity). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) is utilized to validate that the models learn physically relevant features and examine false classifications. The proposed hierarchical method provides a basis for adaptive process control and wear consideration in automated flap wheel grinding.
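该方法将磨损监测拆解为三级级联决策。下面的示意代码按论文描述的层级组织推理流程,其中各 `*_model` 为训练好的 EfficientNetV2 分类器的占位接口(接口形式与返回值均为本文假设):

```python
def classify_wear(img, state_model, type_model, severity_model):
    """三级级联:(1) 新 vs. 磨损;(2) 磨损类型/砂带撕裂;
    (3) 仅对凹/凸轮廓进一步判定局部 vs. 完全变形。"""
    if state_model(img) == "new":
        return {"state": "new"}
    wear_type = type_model(img)  # 'rectangular' | 'concave' | 'convex' | 'tear'
    result = {"state": "worn", "type": wear_type}
    if wear_type in ("concave", "convex"):
        result["severity"] = severity_model(img)  # 'partial' | 'complete'
    return result

# 用桩函数演示调用方式
out = classify_wear("img0",
                    state_model=lambda _: "worn",
                    type_model=lambda _: "concave",
                    severity_model=lambda _: "partial")
print(out)  # {'state': 'worn', 'type': 'concave', 'severity': 'partial'}
```

级联结构的好处是每级分类器只需解决一个较简单的子问题,这也与文中各层级 93.8%~99.3% 的准确率划分相对应。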

[CV-49] am LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

【速读】:该论文旨在解决无约束视频中情感 ambivalence/hesitancy(矛盾/犹豫)识别的难题,其挑战源于该行为状态具有细微性、多模态性和情境依赖性。解决方案的关键在于提出一种融合四种互补模态(场景、人脸、音频和文本)的多模态方法:利用 VideoMAE 模型捕捉场景动态,通过统计池化聚合面部情绪帧级嵌入,采用 EmotionWav2Vec2.0 提取声学特征并由 Mamba-based 时间编码器处理,以及基于微调的 Transformer 文本模型建模语言线索;随后,将各单模态嵌入通过原型增强型融合策略进行整合,显著提升了识别性能(最高平均 MF1 达 83.25%),表明互补模态信息与鲁棒融合机制对 ambivalence/hesitancy 识别至关重要。

链接: https://arxiv.org/abs/2603.12848
作者: Elena Ryumina(1),Alexandr Axyonov(1),Dmitry Sysoev(2),Timur Abdulkadirov(2),Kirill Almetov(2),Yulia Morozova(2),Dmitry Ryumin(1 and 2) ((1) St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, (2) HSE University, St. Petersburg, Russia)
机构: St. Petersburg Federal Research Center of the Russian Academy of Sciences (圣彼得堡联邦研究中心,俄罗斯科学院); HSE University (高等经济学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
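摘要提到对人脸情绪的帧级嵌入做统计池化以得到视频级表示。常见做法之一是拼接时间维上的均值与标准差(论文未给出具体池化形式,mean+std 为本文假设):

```python
import numpy as np

def stat_pool(frame_embeddings):
    """把 (T, D) 的逐帧嵌入聚合为一个 (2*D,) 的片段级向量:
    前 D 维是时间均值,后 D 维是时间标准差。"""
    mu = frame_embeddings.mean(axis=0)
    sigma = frame_embeddings.std(axis=0)
    return np.concatenate([mu, sigma])

frames = np.random.default_rng(0).normal(size=(120, 256))  # 120 帧、256 维嵌入
clip_vec = stat_pool(frames)
print(clip_vec.shape)  # (512,)
```

与仅取均值相比,拼入标准差保留了表情随时间波动的信息,这对识别"矛盾/犹豫"这类动态状态往往更有用。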

[CV-50] Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation CVPR2026

【速读】:该论文旨在解决酶动力学参数预测中忽视催化过程阶段性特征的问题,即传统方法将酶与底物的相互作用简化为静态兼容性问题,忽略了底物识别和活性位点构象适应这两个关键阶段。解决方案的关键在于提出一种分阶段的多模态条件建模框架——Enzyme-Reaction Bridging Adapter (ERBA),其核心包括:1)分子识别交叉注意力(Molecular Recognition Cross-Attention, MRCA),用于在酶表示中注入底物信息以捕获特异性;2)几何感知专家混合模型(Geometry-aware Mixture-of-Experts, G-MoE),通过活性位点结构引导样本至口袋专用专家,体现诱导契合机制;3)酶-底物分布对齐(Enzyme-Substrate Distribution Alignment, ESDA),在再生核希尔伯特空间中保持PLM流形内的分布一致性,从而提升模型的生物合理性与泛化能力。

链接: https://arxiv.org/abs/2603.12845
作者: Fei Wang,Xinye Zheng,Kun Li,Yanyan Wei,Yuxin Liu,Ganpeng Hu,Tong Bao,Jingwen Yang
机构: Hefei University of Technology (合肥工业大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); CVLab, College of Information Technology, United Arab Emirates University (阿联酋大学信息学院CV实验室); Intelligent Interconnected Systems Laboratory of Anhui Province (安徽省智能互联系统实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ( k_\text{cat} ), Michaelis constant ( K_\text{m} ), and inhibition constant ( K_\text{i} ) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. In experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.

[CV-51] Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning

【速读】:该论文旨在解决无人机(UAV)动态航拍图像中场景变化的自然语言描述问题,即提出一种新的任务——无人机场景变化描述(UAV Scene Change Captioning, UAV-SCC),其核心挑战在于如何从移动视角拍摄的图像对中理解由视点变化引起的语义差异,尤其当两帧图像仅部分重叠时,需有效建模空间布局变化并利用图像间的相对方位关系。解决方案的关键在于提出一种分层双变化协同学习方法(Hierarchical Dual-Change Collaborative Learning, HDC-CL),其中设计了动态自适应布局Transformer(Dynamic Adaptive Layout Transformer, DALT),能够自适应地建模图像对中重叠与非重叠区域之间的关联特征;同时引入分层跨模态方向一致性校准机制(Hierarchical Cross-modal Orientation Consistency Calibration, HCM-OCC),增强模型对视点偏转方向的敏感性,从而提升变化描述的准确性。

链接: https://arxiv.org/abs/2603.12832
作者: Fuhai Chen,Pengpeng Huang,Junwen Wu,Hehong Zhang,Shiping Wang,Xiaoguang Ma,Xuri Ge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages,10 figures

点击查看摘要

Abstract:This paper proposes a novel task for UAV scene understanding - UAV Scene Change Captioning (UAV-SCC) - which aims to generate natural language descriptions of semantic changes in dynamic aerial imagery captured from a movable viewpoint. Unlike traditional change captioning that mainly describes differences between image pairs captured from a fixed camera viewpoint over time, UAV scene change captioning focuses on image-pair differences resulting from both temporal and spatial scene variations dynamically captured by a moving camera. The key challenge lies in understanding viewpoint-induced scene changes from UAV image pairs that share only partially overlapping scene content due to viewpoint shifts caused by camera rotation, while effectively exploiting the relative orientation between the two images. To this end, we propose a Hierarchical Dual-Change Collaborative Learning (HDC-CL) method for UAV scene change captioning. In particular, a novel transformer, \emphi.e. Dynamic Adaptive Layout Transformer (DALT) is designed to adaptively model diverse spatial layouts of the image pair, where the interrelated features derived from the overlapping and non-overlapping regions are learned within the flexible and unified encoding layer. Furthermore, we propose a Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC) method to enhance the model’s sensitivity to viewpoint shift directions, enabling more accurate change captioning. To facilitate in-depth research on this task, we construct a new benchmark dataset, named UAV-SCC dataset, for UAV scene change captioning. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on this task. The dataset and code will be publicly released upon acceptance of this paper.

[CV-52] coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation CVPR2026

【速读】:该论文旨在解决当前文本到图像生成模型在复杂场景中难以准确组合多个物体并保持其属性一致性的难题(text-to-image generation challenges in faithfully composing multiple objects and preserving their attributes in complex scenes)。解决方案的关键在于提出 coDrawAgents 框架,该框架由四个专业化代理(Interpreter、Planner、Checker 和 Painter)协同工作:Interpreter 通过语义显著性对对象进行分组并选择是否启用布局感知流程;Planner 在视觉上下文基础上采用分而治之策略逐步规划同一优先级层级的物体布局;Checker 引入显式纠错机制验证空间一致性与属性对齐;Painter 则逐步合成图像并将新规划对象融入画布以增强后续迭代的上下文信息。这一多代理协作机制有效缓解了布局复杂度高、规划缺乏视觉 grounding 及缺乏错误修正能力等三大核心挑战。

链接: https://arxiv.org/abs/2603.12829
作者: Chunhan Li,Qifeng Wu,Jia-Hui Pan,Ka-Hei Hui,Jingyu Hu,Yuming Jiang,Bin Sheng,Xihui Liu,Wenjuan Gong,Zhengzhe Liu
机构: Lingnan University (岭南大学); CMU (卡内基梅隆大学); CUHK (香港中文大学); Alibaba DAMO Academy (阿里巴巴达摩院); SJTU (上海交通大学); HKU (香港大学); China University of Petroleum (中国石油大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Findings

点击查看摘要

Abstract:Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents: Interpreter, Planner, Checker, and Painter that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text-to-image pathway and a layout-aware multi-agent process. In the layout-aware mode, it parses the prompt into attribute-rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide-and-conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error-correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on benchmarks GenEval and DPG-Bench demonstrate that coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding compared to existing methods.
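
摘要中的 Checker 代理负责在渲染前校验布局的空间一致性。下面给出一个最小的纯 Python 示意(非论文实现,画布范围与重叠阈值均为假设值),展示"越界检测 + 同组物体过度重叠检测"这一类校验逻辑:

```python
def iou(a, b):
    # a、b 均为 (x0, y0, x1, y1) 形式的归一化边界框
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def check_layout(boxes, canvas=(0.0, 0.0, 1.0, 1.0), max_iou=0.3):
    """返回布局问题列表:越出画布的框,以及两两重叠超过阈值的框对。"""
    issues = []
    for i, b in enumerate(boxes):
        if b[0] < canvas[0] or b[1] < canvas[1] or b[2] > canvas[2] or b[3] > canvas[3]:
            issues.append(("out_of_canvas", i))
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > max_iou:
                issues.append(("overlap", i, j))
    return issues
```

Checker 据此类校验结果在渲染前修正布局,避免错误传播到 Painter 阶段。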

[CV-53] Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning

【速读】:该论文旨在解决域增量学习(Domain-Incremental Learning, DIL)中因任务标识符不可用且无法存储历史数据而导致的灾难性遗忘问题。现有基于提示的持续学习(Prompt-based Continual Learning, PCL)方法虽能通过冻结主干网络适应新域,但受限于提示选择不佳和域迁移下的分类器级不稳定,性能提升有限。其解决方案的关键在于提出Residual SODAP框架,该框架创新性地联合优化提示引导的表征适配与分类器级知识保留:通过α-entmax稀疏提示选择实现高效提示筛选,结合残差聚合增强特征表达;引入无数据蒸馏与伪特征回放机制实现分类器知识迁移;并利用提示使用情况检测概念漂移,辅以不确定性感知的多损失平衡策略,从而在无需任务标签或额外数据存储的情况下显著提升模型稳定性与准确率。

链接: https://arxiv.org/abs/2603.12816
作者: Gyutae Oh,Jungwoo Bae,Jitae Shin
机构: SKKU(成均馆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 10 figures

点击查看摘要

Abstract:Continual learning (CL) suffers from catastrophic forgetting, which is exacerbated in domain-incremental learning (DIL) where task identifiers are unavailable and storing past data is infeasible. While prompt-based CL (PCL) adapts representations with a frozen backbone, we observe that prompt-only improvements are often insufficient due to suboptimal prompt selection and classifier-level instability under domain shifts. We propose Residual SODAP, which jointly performs prompt-based representation adaptation and classifier-level knowledge preservation. Our framework combines α-entmax sparse prompt selection with residual aggregation, data-free distillation with pseudo-feature replay, prompt-usage–based drift detection, and uncertainty-aware multi-loss balancing. Across three DIL benchmarks without task IDs or extra data storage, Residual SODAP achieves state-of-the-art AvgACC/AvgF of 0.850/0.047 (DR), 0.760/0.031 (Skin Cancer), and 0.995/0.003 (CORe50).
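
摘要中提到的 α-entmax 稀疏提示选择,在 α=2 时退化为经典的 sparsemax:把得分投影到概率单纯形上,并把低分提示的权重精确置零。下面是一个最小 numpy 示意(论文实际使用的 α 与实现可能不同):

```python
import numpy as np

def sparsemax(z):
    """sparsemax(即 α=2 的 entmax):向概率单纯形的欧氏投影,产生稀疏权重。"""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum      # 判断哪些条目留在支撑集内
    k_z = k[support][-1]                     # 支撑集大小
    tau = (cumsum[k_z - 1] - 1) / k_z        # 阈值
    return np.maximum(z - tau, 0.0)
```

与 softmax 不同,sparsemax 会把得分明显偏低的提示权重置为精确的 0,从而实现摘要所说的"稀疏提示选择"。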

[CV-54] OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution

【速读】:该论文旨在解决生成式真实图像超分辨率(Real-ISR)模型与人类视觉偏好对齐的难题,其核心挑战在于感知质量与保真度之间的权衡(perception–fidelity trade-off)以及未知且多样的退化类型。现有方法依赖离线偏好优化和静态指标聚合,往往缺乏可解释性,并在强条件约束下易产生伪多样性(pseudo-diversity)。解决方案的关键在于提出OARS框架,该框架基于COMPASS——一种基于多模态大语言模型(MLLM)的奖励机制,能够通过联合建模保真度保持与感知增益,并引入输入质量自适应的权衡策略来评估低分辨率到高分辨率的转换过程。为训练COMPASS,作者构建了涵盖合成与真实退化的COMPASS-20K数据集,并设计三阶段感知标注流程以获得校准且细粒度的训练标签;随后,OARS采用渐进式在线对齐策略,从冷启动流匹配逐步过渡至全参考和无参考强化学习(RL),并通过浅层LoRA优化实现策略内探索,从而在保持保真度的同时显著提升感知质量,在Real-ISR基准上达到最先进性能。

链接: https://arxiv.org/abs/2603.12811
作者: Shijie Zhao,Xuanyu Zhang,Bin Chen,Weiqi Li,Qunliang Xing,Kexin Zhang,Yan Wang,Junlin Li,Li Zhang,Jian Zhang,Tianfan Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Super-Resolution, Reinforcement Learning

点击查看摘要

Abstract:Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception–fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, a MLLM-based reward that evaluates the LR to SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.
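
摘要中的"浅层 LoRA 优化"采用标准的低秩适配模式:冻结预训练权重 W,仅训练低秩增量 BA,且 B 零初始化以保证训练起点与基座模型一致。下面是一个最小 numpy 示意(维度与秩均为假设值,非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4              # 示意用的维度与秩

W = rng.normal(size=(d_out, d_in))      # 冻结的预训练权重
A = rng.normal(size=(r, d_in)) * 0.01   # 可训练低秩因子 A
B = np.zeros((d_out, r))                # B 零初始化:起点行为与基座完全一致

def lora_forward(x, scale=1.0):
    # LoRA 前向:基座输出加上低秩残差 scale * B @ A @ x
    return W @ x + scale * (B @ (A @ x))
```

由于 B 初始为零,LoRA 分支在训练开始时不改变输出;可训练参数量 (d_out + d_in) × r 远小于全量微调的 d_out × d_in,这也是它适合在线强化学习中策略内探索的原因之一。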

[CV-55] What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中对抗鲁棒性与干净数据准确率之间的固有权衡问题,即提升模型在对抗攻击下的鲁棒性通常会损害其在正常输入上的性能。解决方案的关键在于发现并利用对抗鲁棒性在模型深度上分布不均的特性:通过详细分析对抗微调后的模型,研究揭示鲁棒性主要集中在浅层网络,受低频谱偏差(low-frequency spectral bias)和输入无关的注意力模式驱动;而深层更新则削弱了干净准确率和鲁棒泛化能力。基于此洞察,作者提出Adversarial Robustness Adaptation (R-Adapt) 框架,仅对预训练模型的初始层进行最小且有针对性的适应性调整,冻结其余所有权重,从而实现对抗鲁棒性与干净准确率之间的卓越平衡。该方法支持无训练、模型引导和数据驱动等多种灵活范式,适用于多种任务和大规模模型(如LLaVA和Qwen-VL)。

链接: https://arxiv.org/abs/2603.12799
作者: Sen Nie,Jie Zhang,Zhongqi Wang,Zhaoyang Wei,Shiguang Shan,Xilin Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages

点击查看摘要

Abstract:Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at this https URL.

[CV-56] Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在训练过程中因资源目标攻击(resource-targeting attack)导致的异常高斯增长问题,该攻击通过污染训练图像引入隐蔽扰动,使系统将噪声误判为细节结构,从而引发资源耗尽和场景保真度下降。解决方案的关键在于提出一种双域频谱防御机制:首先设计一个三维频率滤波器,针对性地剔除具有异常高频成分的高斯点;其次引入二维渲染频谱正则化策略,在保持自然场景各向同性频谱特性的同时,惩罚各向异性角能量分布以抑制噪声模式。该方法有效遏制了高斯过度增长,显著提升了3DGS的鲁棒性、准确性和安全性。

链接: https://arxiv.org/abs/2603.12796
作者: Yang Chen,Yi Yu,Jiaming He,Yueqi Duan,Zheng Zhu,Yap-Peng Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) deliver high-quality rendering, yet the Gaussian representation exposes a new attack surface, the resource-targeting attack. This attack poisons training images, excessively inducing Gaussian growth to cause resource exhaustion. Although efficiency-oriented methods such as smoothing, thresholding, and pruning have been explored, these spatial-domain strategies operate on visible structures but overlook how stealthy perturbations distort the underlying spectral behaviors of training data. As a result, poisoned inputs introduce abnormal high-frequency amplifications that mislead 3DGS into interpreting noisy patterns as detailed structures, ultimately causing unstable Gaussian overgrowth and degraded scene fidelity. To address this, we propose Spectral Defense in Gaussian and image fields. We first design a 3D frequency filter to selectively prune Gaussians exhibiting abnormally high frequencies. Since natural scenes also contain legitimate high-frequency structures, directly suppressing high frequencies is insufficient, and we further develop a 2D spectral regularization on renderings, distinguishing naturally isotropic frequencies while penalizing anisotropic angular energy to constrain noisy patterns. Experiments show that our defense builds robust, accurate, and secure 3DGS, suppressing overgrowth by up to 5.92×, reducing memory by up to 3.66×, and improving speed by up to 4.34× under attacks.
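
摘要中二维渲染频谱正则化的出发点是:自然场景的频谱在方向上近似各向同性,而对抗噪声往往呈现各向异性的角能量分布。下面用 numpy 给出一个度量角能量各向异性的最小示意(角度分箱数为假设值,非论文实现):

```python
import numpy as np

def angular_energy_anisotropy(img, n_bins=16):
    """按空间频率的方向把二维 FFT 能量分箱,返回各箱归一化能量的方差。
    各向同性频谱该值较低;方向性噪声(如条纹)该值明显升高。"""
    F = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(F) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ang = np.arctan2(yy - h / 2, xx - w / 2) % np.pi          # 方向取 [0, π)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    energy = np.bincount(bins.ravel(), weights=power.ravel(), minlength=n_bins)
    energy /= energy.sum() + 1e-12
    return energy.var()
```

训练时惩罚这类各向异性统计量,即可在保留合法高频细节的同时抑制方向性噪声模式。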

[CV-57] Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

【速读】:该论文旨在解决多模态建模中视觉理解与生成任务在共享特征空间内联合优化的难题,其核心挑战在于两者对解码机制和视觉表征的需求不匹配。解决方案的关键在于提出一个名为Cheers的统一多模态模型,通过解耦图像的补丁级细节与语义表示,实现语义稳定性和生成保真度的双重提升:具体包括三个关键组件——(i) 统一视觉分词器(unified vision tokenizer),将图像潜在状态编码压缩为用于高效大语言模型(LLM)条件控制的语义标记;(ii) 基于LLM的Transformer架构,统一自回归解码(用于文本生成)与扩散解码(用于图像生成);(iii) 级联流匹配头(cascaded flow matching head),先解码视觉语义,再注入由视觉分词器提取的语义门控细节残差以精细重构高频内容。这一设计实现了视觉理解与生成性能的同步提升,并带来4倍token压缩效率,显著降低训练成本。

链接: https://arxiv.org/abs/2603.12793
作者: Yichen Zhang,Da Peng,Zonghao Guo,Zijian Zhang,Xuesong Yang,Tong Sun,Shichu Sun,Yidan Zhang,Yanghao Li,Haiyan Zhao,Wang Xu,Qi Shi,Yangang Sun,Chi Chen,Shuo Wang,Yukun Yan,Xu Han,Qiang Ma,Wei Ke,Liang Wang,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学); Xi’an Jiaotong University (西安交通大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
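
摘要中的扩散式图像生成分支基于流匹配(flow matching)。其最常见的线性路径形式是:训练输入取噪声 x0 到数据 x1 直线路径上的点,回归目标是恒定速度 x1 − x0。下面给出一个最小 numpy 示意(仅演示通用的流匹配训练目标,与论文具体实现无关):

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """线性路径流匹配:返回 t 时刻的插值点 x_t 与速度回归目标 v = x1 - x0。"""
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target
```

训练时模型学习在 (x_t, t) 处预测 v_target;采样时从纯噪声出发,沿预测速度积分即可得到数据样本。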

[CV-58] Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

【速读】:该论文旨在解决多视角视频中多人场景下同时估计相机参数、场景点云与人体网格的联合重建问题,现有方法通常依赖单目输入或额外模块及预处理数据,难以实现高效统一建模。其解决方案的关键在于提出一个统一框架CHROMM,通过整合Pi3X和Multi-HMR中的强几何与人体先验到单一可训练神经网络架构,并引入尺度调整模块以解决人与场景之间的尺度不一致问题;同时设计多视角融合策略在测试时聚合各视角估计结果,并采用基于几何信息的多人关联方法提升鲁棒性,从而在无需外部模块或预处理的情况下实现高精度、高效率的多视角多人三维重建。

链接: https://arxiv.org/abs/2603.12789
作者: Sangmin Kim,Minhyuk Hwang,Geonho Cha,Dongyoon Wee,Jaesik Park
机构: NAVER Cloud(NAVER云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test-time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: this https URL.

[CV-59] Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing

【速读】:该论文旨在解决遥感视觉定位(remote sensing visual grounding)任务中现有方法局限于感知级匹配和单实体建模的问题,从而限制了显式推理能力和实体间关系建模的潜力。其解决方案的关键在于提出一个全新的多实体推理定位基准数据集(Multi-Entity Reasoning Grounding in Remote Sensing, ME-RSRG),并构建基于视觉-语言基础模型的实体感知推理(Entity-Aware Reasoning, EAR)框架。EAR通过生成结构化的推理路径与主体-对象定位输出,结合监督微调进行冷启动初始化,并进一步利用实体感知奖励驱动的组相对策略优化(entity-aware reward-driven Group Relative Policy Optimization, GRPO)进行优化,有效提升了多实体场景下的推理能力与定位精度。

链接: https://arxiv.org/abs/2603.12788
作者: Shuchang Lyu,Haiquan Wen,Guangliang Cheng,Meng Li,Zheng Zhou,You Zhou,Dingding Yao,Zhenwei Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 9 figures, 5 tables

点击查看摘要

Abstract:Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at this https URL.
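
摘要中的 GRPO(Group Relative Policy Optimization)以"组内相对奖励"替代价值网络:对同一输入采样一组 rollout,把各自奖励减去组均值再除以组标准差作为优势。其核心计算的最小 numpy 示意如下:

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO 的组相对优势:组内标准化奖励,无需训练额外的价值网络。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

得到的优势随后用于带裁剪的策略梯度更新;在本文场景中,奖励由实体感知的可验证指标给出。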

[CV-60] Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

【速读】:该论文旨在解决如何在不同外科专科中准确识别和理解基本手术动作(Basic Surgical Actions, BSA)这一核心问题,以推动生成式AI在手术实践、培训与自动化中的应用。其解决方案的关键在于构建了目前规模最大的BSA数据集(涵盖6个外科专科、10类基本动作,超过11,000段视频片段),并基于此开发了一种通用的基础模型(foundation model),实现了跨专科的鲁棒性BSA识别。该模型不仅在多种术式和解剖部位上表现稳定,还进一步赋能下游任务,如通过领域知识进行前列腺切除术中的手术技能评估,以及利用大视觉语言模型实现胆囊切除术和肾切除术中的动作规划,并获得多国外科医生对可解释文本输出的临床有效性验证,表明BSA理解模型是迈向手术智能(surgical superintelligence)的重要基础。

链接: https://arxiv.org/abs/2603.12787
作者: Mengya Xu,Daiyun Shen,Jie Zhang,Hon Chi Yip,Yujia Gao,Cheng Chen,Dillan Imans,Yonghao Long,Yiru Ye,Yixiao Liu,Rongyun Mai,Kai Chen,Hongliang Ren,Yutong Ban,Guangsuo Wang,Francis Wong,Chi-Fai Ng,Kee Yuan Ngiam,Russell H. Taylor,Daguang Xu,Yueming Jin,Qi Dou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 8 figures

点击查看摘要

Abstract:Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialist performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Multinational surgeons’ evaluation of the language model’s output of the action planning explainable texts demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and an accurate BSA understanding model can essentially facilitate complex applications and speed up the realization of surgical superintelligence.

[CV-61] Empowering Semantic-Sensitive Underwater Image Enhancement with VLM AAAI2026

【速读】:该论文旨在解决当前基于学习的水下图像增强(Underwater Image Enhancement, UIE)方法中存在的分布偏移问题,即增强后的图像与自然图像在特征分布上的差异会干扰下游视觉任务(如目标检测和分割)中语义线索的提取,从而限制了增强模型的适应性。解决方案的关键在于引入视觉-语言模型(Vision-Language Models, VLMs),通过生成图像中关键物体的文本描述,并利用文本-图像对齐模型构建空间语义引导图(spatial semantic guidance map),进而以双引导机制(包括交叉注意力和显式对齐损失)驱动UIE网络,使恢复过程聚焦于语义敏感区域而非全局均匀优化,从而保障关键对象特征的忠实还原。

链接: https://arxiv.org/abs/2603.12773
作者: Guodong Fan,Shengning Zhou,Genji Yuan,Huiyu Li,Jingchun Zhou,Jinjiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Accepted as an Oral presentation at AAAI 2026

点击查看摘要

Abstract:In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. To be concrete, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, significantly boosts their performance on perceptual quality metrics as well as enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.

[CV-62] PVI: Plug-in Visual Injection for Vision-Language-Action Models

【速读】:该论文旨在解决当前视觉语言动作(VLA)架构中因预训练视觉语言模型(VLM)侧重语义抽象而忽视细粒度几何信息与时间动态特征的问题,这导致动作专家难以有效利用视觉输入进行精准操作。解决方案的关键在于提出一种轻量级、编码器无关的模块——Plug-in Visual Injection (PVI),其通过零初始化残差路径向预训练动作专家注入辅助视觉表示,无需修改原有结构即可保留预训练行为,仅需单阶段微调即可显著提升性能;实验表明,引入时序视频特征(如V-JEPA2)比静态图像特征(如DINOv2)更能提升多阶段任务中的状态跟踪与协调能力,验证了PVI在仿真与真实机器人场景下的有效性。

链接: https://arxiv.org/abs/2603.12772
作者: Zezhou Zhang,Songxin Zhang,Xiao Xiong,Junjie Zhang,Zejian Xie,Jingyi Xi,Zunyao Mao,Zan Mao,Zhixin Mai,Zhuoyang Song,Jiaxing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.
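
PVI 的关键在于零初始化的残差通路:辅助视觉特征经过一个初始为零的映射后与动作专家输出相加,因此初始化时完全不改变预训练行为。最小 numpy 示意如下(以线性层代替真实的动作专家,维度均为假设值):

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, d_vis = 32, 48                        # 示意维度

W_expert = rng.normal(size=(d_act, d_act))   # 代替冻结的预训练动作专家
W_inject = np.zeros((d_act, d_vis))          # 零初始化的残差注入通路

def policy(x, visual_feat):
    # 辅助视觉特征(如 V-JEPA2 的时序特征)经零初始化分支注入;
    # 训练过程中 W_inject 逐渐学到非零映射
    return W_expert @ x + W_inject @ visual_feat
```

正因为注入分支起点为零,该模块可以"即插即用"地接到任意预训练动作专家上,只需单阶段微调。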

[CV-63] Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

【速读】:该论文旨在解决动态场景编辑中因直接将2D扩散模型扩展至4D而导致的运动伪影、时间闪烁和风格传播不一致等问题。其核心解决方案是提出Catalyst4D框架,关键在于两个模块:一是基于锚点的运动引导(Anchor-based Motion Guidance, AMG),通过构建原始与编辑后高斯点云中的结构稳定锚点集,并利用最优传输建立区域级对应关系,实现无跨区域干扰的一致形变传播;二是颜色不确定性引导的外观优化(Color Uncertainty-guided Appearance Refinement, CUAR),通过估计每个高斯点的颜色不确定性并选择性地精修易受遮挡影响的区域,从而保持时间上的一致性外观。

链接: https://arxiv.org/abs/2603.12766
作者: Shifeng Chen,Yihui Li,Jun Liao,Hongyu Yang,Di Huang
机构: Beihang University (北京航空航天大学); State Key Laboratory of Complex and Critical Software Environment (复杂与关键软件环境国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.
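
Catalyst4D 通过最优传输在原始与编辑后的锚点集合之间建立对应关系。熵正则化最优传输可用 Sinkhorn 迭代求解;下面给出一个针对均匀边缘分布的最小 numpy 示意(ε 与迭代次数均为假设值,非论文实现):

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iter=200):
    """熵正则最优传输(均匀边缘分布):返回行和为 1/n、列和为 1/m 的软对应矩阵。"""
    n, m = cost.shape
    K = np.exp(-cost / eps)                  # Gibbs 核
    a, b = np.ones(n) / n, np.ones(m) / m    # 均匀边缘分布
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

当两组锚点几何结构接近时,传输计划集中在对角附近,即得到区域级的一一对应,可据此传播形变而不发生跨区域干扰。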

[CV-64] SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion CVPR2026

【速读】:该论文旨在解决第一人称(ego)到第三人称(exo)模仿中的错误检测问题,即在异步、长度不匹配的ego与exo视频中,定位ego视角下的操作步骤并判断其是否错误。这一任务面临跨视角域偏移(cross-view domain shift)、时间错位(temporal misalignment)和大量冗余信息等挑战。解决方案的关键在于提出SAVA-X框架,包含三个核心组件:(i) 视角条件自适应采样机制,用于优化跨视角特征提取;(ii) 场景自适应视角嵌入,缓解跨视角语义差异;(iii) 双向交叉注意力融合模块,实现多模态特征的有效对齐与整合。该方法在EgoMe基准上显著优于现有基线模型,在AUPRC和平均tIoU指标上均取得提升。

链接: https://arxiv.org/abs/2603.12764
作者: Xiang Li,Heqian Qiu,Lanxiao Wang,Benliu Qiu,Fanman Meng,Linfeng Xu,Hongliang Li
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This article was accepted by CVPR 2026

点击查看摘要

Abstract:Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego→Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at this https URL.

[CV-65] TerraFlow: Multimodal Multitemporal Representation Learning for Earth Observation

【速读】:该论文旨在解决地球观测(Earth observation)领域中多模态、多时相学习的挑战,尤其是如何在真实世界数据中处理变长输入并实现跨时空模态的序列感知学习。其解决方案的关键在于提出TerraFlow框架,该框架基于时间训练目标(temporal training objectives),能够有效建模空间、时间和模态之间的复杂依赖关系,同时保持对变长输入的鲁棒性。实验表明,TerraFlow在GEO-Bench-2基准的所有时序任务上均优于现有最优基础模型,并在自然灾害风险图预测任务中展现出显著优势,F1分数提升最高达50%,Brier分数改善24%。

链接: https://arxiv.org/abs/2603.12762
作者: Nazar Puriy,Johannes Jakubik,Benedikt Blumenstiel,Konrad Schindler
机构: IBM Research Europe, Zurich, Switzerland (IBM研究欧洲,苏黎世,瑞士); ETH Zurich, Zurich, Switzerland (苏黎世联邦理工学院,苏黎世,瑞士)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow is able to make initial steps towards deep-learning based risk map prediction for natural disasters – a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.

[CV-66] HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks CVPR2026

【速读】:该论文旨在解决大模型在上下文学习(In-Context Learning, ICL)中因演示配置敏感性和高计算成本导致的性能不稳定问题。现有近似方法通过学习一个“偏移向量”简化了注意力机制,但未能准确建模ICL的本质动态混合特性。解决方案的关键在于提出高保真上下文学习(High-Fidelity In-Context Learning, HiFICL),其核心包括:1)引入一组可学习的“虚拟键值对”作为增强上下文表示;2)采用低秩分解实现稳定且正则化的训练;3)设计简单端到端的训练目标。该方法从机制上更精确地模拟了ICL中标准注意力输出与上下文值之间的动态混合关系,本质上是一种上下文感知的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)策略,在多个多模态基准上显著优于现有近似方法。

链接: https://arxiv.org/abs/2603.12760
作者: Xiaoyu Li,Yuhang Liu,Zheng Luo,Xuanshuo Kang,Fangqi Lou,Xiaohua Wu,Zihan Xiong
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Code available at this https URL

点击查看摘要

Abstract:In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a “shift vector”. Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HIFICL) to more faithfully model the ICL mechanism. HIFICL consists of three key components: 1) a set of “virtual key-value pairs” to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks. The code is available at this https URL.
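
HiFICL 的"虚拟键值对"可以理解为:在注意力的真实上下文之后拼接一组可学习的 K/V 向量,用学到的上下文替代真实演示。下面是一个单头注意力的最小 numpy 示意(维度与初始化尺度均为假设值,非论文实现):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, n_ctx, n_virtual = 16, 5, 3                    # 示意尺寸

K = rng.normal(size=(n_ctx, d))                   # 来自真实输入的键/值
V = rng.normal(size=(n_ctx, d))
K_virt = rng.normal(size=(n_virtual, d)) * 0.1    # 可学习的虚拟键值对(小尺度初始化)
V_virt = rng.normal(size=(n_virtual, d)) * 0.1

def attend(q, keys, values):
    return softmax(q @ keys.T / np.sqrt(d)) @ values

q = rng.normal(size=d)
# 虚拟键值对与真实上下文拼接后一同参与注意力计算
out = attend(q, np.vstack([K, K_virt]), np.vstack([V, V_virt]))
```

训练时只更新虚拟 K/V(论文中进一步做了低秩分解),推理时无需任何演示样本,这也是它可视为一种"上下文感知 PEFT"的原因。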

[CV-67] SAP: Segment Any 4K Panorama

【速读】:该论文旨在解决基础模型在360°全景图像上进行实例分割时性能下降的问题,尤其是在高分辨率(4K)场景下,传统基于透视图像训练的模型难以有效泛化。其解决方案的关键在于提出一种新的建模范式——将全景分割重构为固定轨迹的透视视频分割,通过沿球面连续遍历采样生成重叠的透视块(patch),从而在保持原生4K分辨率的同时恢复跨视角传播所需的平滑视点过渡。该方法结合大规模合成数据(使用InfiniGen引擎生成183,440张带实例标签的4K全景图)进行训练,使模型在真实世界4K全景图像上实现显著提升,相比不同尺寸的SAM2模型零样本mIoU提升达+17.2%。

链接: https://arxiv.org/abs/2603.12759
作者: Lutao Jiang,Zidong Cao,Weikai Chen,Xu Zheng,Yuanhuiyi Lyu,Zhenyang Li,Zeyu HU,Yingda Yin,Keyang Luo,Runze Zhang,Kai Yan,Shengju Qian,Haidi Fan,Yifan Peng,Xin Wang,Hui Xiong,Ying-Cong Chen
机构: Tsinghua University (清华大学); Alibaba Group (阿里巴巴集团); Nankai University (南开大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Promptable instance segmentation is widely adopted in embodied and AR systems, yet the performance of foundation models trained on perspective imagery often degrades on 360° panoramas. In this paper, we introduce Segment Any 4K Panorama (SAP), a foundation model for 4K high-resolution panoramic instance-level segmentation. We reformulate panoramic segmentation as fixed-trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal. This memory-aligned reformulation preserves native 4K resolution while restoring the smooth viewpoint transitions required for stable cross-view propagation. To enable large-scale supervision, we synthesize 183,440 4K-resolution panoramic images with instance segmentation labels using the InfiniGen engine. Trained under this trajectory-aligned paradigm, SAP generalizes effectively to real-world 360° images, achieving +17.2 zero-shot mIoU gain over vanilla SAM2 of different sizes on real-world 4K panorama benchmark.

[CV-68] FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking

【速读】:该论文旨在解决在线多目标跟踪(Online Multi-Object Tracking, MOT)中因频繁遮挡和目标重叠导致的身份切换(identity switch)问题,此类错误关联会随时间传播并显著降低跟踪可靠性。解决方案的关键在于提出一种轻量级的后关联校正框架(FC-Track),其核心机制包括:1)基于交并比面积比(Intersection over Area, IoA)的过滤策略,在高重叠条件下抑制不可靠的外观更新;2)通过重叠轨迹对之间的外观相似性比较,局部修正检测到轨迹片段的误匹配。该方法无需全局优化或重识别(re-identification),即可有效防止短期误匹配传播,从而大幅减少长期身份切换,同时保持实时性能,适用于机器人系统的在线部署。

链接: https://arxiv.org/abs/2603.12758
作者: Cheng Ju,Zejing Zhao,Akio Namiki
机构: Chiba University (千叶大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable multi-object tracking (MOT) is essential for robotic systems operating in complex and dynamic environments. Despite recent advances in detection and association, online MOT methods remain vulnerable to identity switches caused by frequent occlusions and object overlap, where incorrect associations can propagate over time and degrade tracking reliability. We present a lightweight post-association correction framework (FC-Track) for online MOT that explicitly targets overlap-induced mismatches during inference. The proposed method suppresses unreliable appearance updates under high-overlap conditions using an Intersection over Area (IoA)-based filtering strategy, and locally corrects detection-to-tracklet mismatches through appearance similarity comparison within overlapped tracklet pairs. By preventing short-term mismatches from propagating, our framework effectively mitigates long-term identity switches without resorting to global optimization or re-identification. The framework operates online without global optimization or re-identification, making it suitable for real-time robotic applications. We achieve 81.73 MOTA, 82.81 IDF1, and 66.95 HOTA on the MOT17 test set with a running speed of 5.7 FPS, and 77.52 MOTA, 80.90 IDF1, and 65.67 HOTA on the MOT20 test set with a running speed of 0.6 FPS. Specifically, our framework FC-Track produces only 29.55% long-term identity switches, which is substantially lower than existing online trackers. Meanwhile, our framework maintains state-of-the-art performance on the MOT20 benchmark.
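
IoA 过滤的核心计算可示意如下(框的元组表示与 0.5 阈值均为演示假设,并非论文取值):

```python
# Illustrative sketch of Intersection-over-Area (IoA) filtering for
# suppressing appearance updates under high overlap.

def ioa(box_a, box_b):
    """IoA = area(A intersect B) / area(A); boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    return inter / area_a if area_a > 0 else 0.0

def should_update_appearance(track_box, other_boxes, thresh=0.5):
    """Skip the appearance update when any other track overlaps heavily."""
    return all(ioa(track_box, b) < thresh for b in other_boxes)
```

与 IoU 不同,IoA 以单个框自身面积为分母,能刻画“我被别人遮住了多少”,因此更适合遮挡场景下的更新门控。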

[CV-69] Show, Don’t Tell: Detecting Novel Objects by Watching Human Videos

【速读】:该论文旨在解决机器人在人类示范过程中快速识别和辨认新物体的问题,现有闭集目标检测器因物体分布外而失效,而开放集检测器(如视觉语言模型 VLMs)虽能部分成功,却常需昂贵且繁琐的人工提示工程才能唯一识别新物体实例。解决方案的关键在于提出一种自监督系统——“Show, Don’t Tell”,即通过人类示范本身自动构建数据集并训练专用目标检测器,无需复杂语言描述或人工提示工程;该方法直接向检测器展示目标物体,而非用语言描述,从而实现对示范中相关物体的快速、定制化检测,显著提升机器人任务完成率。

链接: https://arxiv.org/abs/2603.12751
作者: James Akl,Jose Nicolas Avendano Arbelaez,James Barabas,Jennifer L. Barry,Kalie Ching,Noam Eshed,Jiahui Fu,Michel Hidalgo,Andrew Hoelscher,Tushar Kusnur,Andrew Messing,Zachary Nagler,Brian Okorn,Mauro Passerino,Tim J. Perkins,Eric Rosen,Ankit Shah,Tanmay Shankar,Scott Shaw
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, “Show, Don’t Tell,” we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our “Show, Don’t Tell” paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.

[CV-70] SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

【速读】:该论文旨在解决当前基于扩散模型的图像溯源水印技术在面对语义引导的再生攻击时鲁棒性不足的问题,尤其是现有语义感知水印方法因依赖单一全局语义绑定而难以抵御局部但整体语义一致的篡改。其解决方案的关键在于提出一种名为SLICE(Semantic Latent Injection via Compartmentalized Embedding)的新框架,该框架将图像语义解耦为四个独立因素(主体、环境、行为和细节),并将这些语义因子精确锚定到初始高斯噪声的不同区域,从而实现细粒度的语义绑定,使水印验证能够检测并定位语义篡改,同时提供统计上的误接受率保障,显著提升对现实对抗性操作的鲁棒性。

链接: https://arxiv.org/abs/2603.12749
作者: Zheng Gao,Yifan Yang,Xiaoyu Li,Xiaoyan Feng,Haoran Fan,Yang Song,Jiaojiao Jiang
机构: University of New South Wales (新南威尔士大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Watermarking the initial noise of diffusion models has emerged as a promising approach for image provenance, but content-independent noise patterns can be forged via inversion and regeneration attacks. Recent semantic-aware watermarking methods improve robustness by conditioning verification on image semantics. However, their reliance on a single global semantic binding makes them vulnerable to localized but globally coherent semantic edits. To address this limitation and provide a trustworthy semantic-aware watermark, we propose Semantic Latent Injection via Compartmentalized Embedding (SLICE). Our framework decouples image semantics into four semantic factors (subject, environment, action, and detail) and precisely anchors them to distinct regions in the initial Gaussian noise. This fine-grained semantic binding enables advanced watermark verification where semantic tampering is detectable and localizable. We theoretically justify why SLICE enables robust and reliable tamper localization and provides statistical guarantees on false-accept rates. Experimental results demonstrate that SLICE significantly outperforms existing baselines against advanced semantic-guided regeneration attacks, substantially reducing attack success while preserving image quality and semantic fidelity. Overall, SLICE offers a practical, training-free provenance solution that is both fine-grained in diagnosis and robust to realistic adversarial manipulations.
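
下面给出“分区化噪声嵌入与按因子验证”思路的玩具示意(符号键控方案为假设性简化,并非 SLICE 的实际算法):

```python
# Toy sketch of compartmentalized watermark embedding in initial noise:
# the noise vector is split into four regions (subject, environment,
# action, detail) and each region's sign pattern is tied to a factor key.
import random

FACTORS = ["subject", "environment", "action", "detail"]

def embed(noise, keys):
    """Force the sign of each compartment to match its factor key bits."""
    n = len(noise) // len(FACTORS)
    out = list(noise)
    for f_idx, factor in enumerate(FACTORS):
        for j, bit in enumerate(keys[factor][:n]):
            i = f_idx * n + j
            out[i] = abs(out[i]) if bit else -abs(out[i])
    return out

def verify(noise, keys):
    """Per-compartment match rate; a low rate localizes tampering."""
    n = len(noise) // len(FACTORS)
    report = {}
    for f_idx, factor in enumerate(FACTORS):
        hits = sum(
            (noise[f_idx * n + j] >= 0) == bool(bit)
            for j, bit in enumerate(keys[factor][:n])
        )
        report[factor] = hits / n
    return report

rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(64)]
keys = {f: [rng.randint(0, 1) for _ in range(16)] for f in FACTORS}
marked = embed(noise, keys)
```

按分区统计匹配率即可在验证失败时指出是哪一语义因子被篡改,这正是“可定位”的直观含义。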

[CV-71] Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track and Reason Dynamics in Physical 4D World

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态场景中对时空演化规律的理解与推理能力不足的问题,即如何让模型具备“动态思维”(thinking in dynamics)的能力,以实现对物理世界4D现实(包含空间维度和时间维度的动态结构)中运动、交互和局部动态变化的感知与推理。其解决方案的关键在于构建了一个大规模基准测试集 Dyn-Bench,涵盖真实世界与合成视频数据,提供高质量的动态场景样本(1k视频、7k视觉问答对和3k动态物体定位对),并通过结构化整合策略(如Mask-Guided Fusion 和 Spatio-Temporal Textual Cognitive Map, ST-TCM)显著提升模型在时空推理与动态对象定位上的表现,相较传统提示方法(如链式思维或基于描述的提示)效果更优。

链接: https://arxiv.org/abs/2603.12746
作者: Yuzhi Huang,Kairun Wen,Rongxin Gao,Dongxuan Liu,Yibin Lou,Jie Wu,Jing Xu,Jian Zhang,Zheng Yang,Yunlong Lin,Chenxin Li,Panwang Pan,Junbin Lu,Jingyan Jiang,Xinghao Ding,Yue Huang,Zhi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at “thinking in dynamics”, i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs’ dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at this https URL.

[CV-72] CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment

【速读】:该论文旨在解决从脑电图(EEG)信号中重建视觉刺激时存在的保真度损失和表征偏移问题。其解决方案的关键在于提出了一种增强型框架 CognitionCapturerPro,通过协同训练将 EEG 与多模态先验信息(图像、文本、深度图和边缘图)进行融合;核心创新包括一种不确定性加权的相似性评分机制,用于量化各模态的特定保真度,以及一个融合编码器以整合共享表征,同时结合简化的对齐模块和预训练扩散模型,在 THINGS-EEG 数据集上显著提升了检索准确率(Top-1 和 Top-5 分别提升 25.9% 和 10.6%)。

链接: https://arxiv.org/abs/2603.12722
作者: Kaifan Zhang,Lihuo He,Junjie Ke,Yuqi Ji,Lukun Wu,Lizi Wang,Xinbo Gao
机构: Xidian University (西安电子科技大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi-modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty-weighted similarity scoring mechanism to quantify modality-specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: this https URL.
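
不确定性加权的相似度融合可示意如下(softmax(-不确定性) 的加权形式是演示性假设,论文的具体机制可能不同):

```python
# Minimal sketch of uncertainty-weighted fusion of modality similarity
# scores: modalities with lower uncertainty receive larger weights.
import math

def fuse_scores(similarities, uncertainties):
    """Weight each modality's similarity by softmax(-uncertainty)."""
    logits = [-u for u in uncertainties]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * s for w, s in zip(weights, similarities))
```

例如图像模态不确定性 0.1、文本模态不确定性 5.0 时,融合得分几乎完全由图像相似度决定。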

[CV-73] CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration

【速读】:该论文旨在解决点云配准(Point Cloud Registration)在复杂真实场景中性能下降的问题,尤其是在数据不完整、传感器噪声和低重叠区域等挑战下,现有基于学习的方法鲁棒性不足。解决方案的关键在于提出一种新型的跨模态混合注意力网络(Cross-Modal Hybrid Attention Network, CMHANet),通过融合2D图像的丰富上下文信息与3D点云的几何细节,构建全面且鲁棒的特征表示;同时引入基于对比学习的优化函数,强化几何一致性,显著提升模型对噪声和部分观测的鲁棒性。

链接: https://arxiv.org/abs/2603.12721
作者: Dongxu Zhang,Yingsen Wang,Yiding Sun,Haoran Xu,Peilin Fan,Jihua Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large-scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning-based methods often degrades in complex, real-world scenarios characterized by incomplete data, sensor noise, and low-overlap regions. To address these limitations, we propose CMHANet, a novel Cross-Modal Hybrid Attention Network. Our method integrates the fusion of rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an innovative optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model’s robustness to noise and partial observations. We evaluated CMHANet on the 3DMatch and the challenging 3DLoMatch datasets. Additionally, zero-shot evaluations on the TUM RGB-D SLAM dataset verify the model’s generalization capability to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. We also release our code at this https URL.

[CV-74] IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration

【速读】:该论文旨在解决点云配准(Point Cloud Registration, PCR)在真实场景中因噪声严重、显著遮挡及大尺度变换等因素导致的精度下降和鲁棒性不足的问题。解决方案的关键在于提出一种基于分层金字塔架构(Hierarchical Pyramid Architecture, HPA)的新框架IGASA,其核心创新包括两个模块:一是分层跨层注意力(Hierarchical Cross-Layer Attention, HCLA)模块,通过跳跃注意力机制对多分辨率特征进行对齐并增强局部几何一致性;二是迭代几何感知精化(Iterative Geometry-Aware Refinement, IGAR)模块,在粗匹配基础上利用可靠对应关系进行精细匹配优化。二者协同作用使IGASA能够适应多样化的点云结构和复杂变换,从而显著提升配准精度与稳定性。

链接: https://arxiv.org/abs/2603.12719
作者: Dongxu Zhang,Jihua Zhu,Shiqi Li,Wenbiao Yan,Haoran Xu,Peilin Fan,Huimin Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available at this https URL.

[CV-75] The COTe score: A decomposable framework for evaluating Document Layout Analysis models

【速读】:该论文旨在解决文档布局分析(Document Layout Analysis, DLA)中模型性能评估存在的局限性问题,即传统目标检测指标(如IoU、F1或mAP)因设计初衷为三维空间的二维投影图像而无法准确反映印刷媒体等原生二维内容的解析质量,导致对模型表现的解读可能误导或缺乏深度。其解决方案的关键在于提出两个核心创新:一是结构语义单元(Structural Semantic Unit, SSU),一种从物理边界转向语义结构关系的标注方法;二是覆盖、重叠、越界与冗余(Coverage, Overlap, Trespass, and Excess, COTe)评分机制,一种可分解且更贴近实际需求的页面解析质量度量指标。实验表明,COTe能揭示不同模型在语义边界破坏或重复区域解析等方面的独特失败模式,并显著缩小评估指标与真实性能之间的差距(最多降低76%),且其鲁棒性在未显式使用SSU标注时依然成立,从而降低了应用门槛。

链接: https://arxiv.org/abs/2603.12718
作者: Jonathan Bourne,Mwiza Simbeye,Ishtar Govia
机构: THE 3TC AI; University College London; Amagi Brain Health
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6906 words, 4 Figures, 10 Tables,

点击查看摘要

Abstract:Document Layout Analysis (DLA) is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1 or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can result in misleading or uninformative interpretation of model performance by the metrics. To encourage more robust, comparable, and nuanced DLA, we introduce: The Structural Semantic Unit (SSU), a relational labelling approach that shifts the focus from the physical to the semantic structure of the content; and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to the F1. Notably, we find that the COTe’s granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU labelled dataset and a Python library for applying COTe in DLA projects.
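
COTe 四个分量的一种像素集合式玩具解读如下(仅作示意;论文的正式定义基于 SSU,更为细致):

```python
# A toy pixel-set reading of the four COTe components (illustrative only;
# the paper's exact, SSU-aware definitions are richer).
def cote_components(predictions, ground_truth_units):
    """predictions / ground_truth_units: lists of pixel-index sets."""
    gt_all = set().union(*ground_truth_units) if ground_truth_units else set()
    pred_all = set().union(*predictions) if predictions else set()
    # Coverage: fraction of ground-truth pixels captured by any prediction.
    coverage = len(pred_all & gt_all) / len(gt_all) if gt_all else 1.0
    # Overlap: fraction of predicted pixels claimed by more than one box.
    counts = {}
    for p in predictions:
        for px in p:
            counts[px] = counts.get(px, 0) + 1
    overlap = (sum(1 for c in counts.values() if c > 1) / len(pred_all)
               if pred_all else 0.0)
    # Trespass: fraction of predictions that breach a semantic boundary,
    # i.e. intersect more than one ground-truth unit.
    trespass = (sum(1 for p in predictions
                    if sum(bool(p & g) for g in ground_truth_units) > 1)
                / len(predictions) if predictions else 0.0)
    # Excess: fraction of predicted pixels outside every ground-truth unit.
    excess = len(pred_all - gt_all) / len(pred_all) if pred_all else 0.0
    return {"coverage": coverage, "overlap": overlap,
            "trespass": trespass, "excess": excess}
```

这种分解让“重复解析同一区域”(overlap 偏高)与“跨越语义边界”(trespass 偏高)等失败模式可以被分别读出,而单一 F1 会把它们混在一起。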

[CV-76] UNIStainNet: Foundation-Model-Guided Virtual Staining of H&amp;E to IHC

【速读】:该论文旨在解决从苏木精-伊红(Hematoxylin and Eosin, HE)图像中进行虚拟免疫组化(Virtual Immunohistochemistry, IHC)染色时,现有方法在生成真实感不足、缺乏组织层面语义引导以及需为每个IHC标记单独训练模型的问题。其解决方案的关键在于提出UNIStainNet——一种基于冻结的病理基础模型(Pathology Foundation Model, UNI)提取的密集空间token作为条件的SPADE-UNet架构,通过引入组织级语义指导实现染色转换;同时设计了考虑错位感知的损失函数以保持染色定量准确性,并利用学习到的染色嵌入使单一模型可同时服务于多种IHC标记。该方法在MIST和BCI数据集上均实现了优于现有技术的分布一致性指标,且显著减少了模型冗余与训练成本。

链接: https://arxiv.org/abs/2603.12716
作者: Jillur Rahman Saurav,Thuong Le Hoai Pham,Pritam Mukherjee,Paul Yi,Brent A. Orr,Jacob M. Luber
机构: University of Texas at Arlington (德克萨斯大学阿灵顿分校); St. Jude Children’s Research Hospital (圣裘德儿童研究医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Virtual immunohistochemistry (IHC) staining from hematoxylin and eosin (H&amp;E) images can accelerate diagnostics by providing preliminary molecular insight directly from routine sections, reducing the need for repeat sectioning when tissue is limited. Existing methods improve realism through contrastive objectives, prototype matching, or domain alignment, yet the generator itself receives no direct guidance from pathology foundation models. We present UNIStainNet, a SPADE-UNet conditioned on dense spatial tokens from a frozen pathology foundation model (UNI), providing tissue-level semantic guidance for stain translation. A misalignment-aware loss suite preserves stain quantification accuracy, and learned stain embeddings enable a single model to serve multiple IHC markers simultaneously. On MIST, UNIStainNet achieves state-of-the-art distributional metrics on all four stains (HER2, Ki67, ER, PR) from a single unified model, where prior methods typically train separate per-stain models. On BCI, it also achieves the best distributional metrics. A tissue-type stratified failure analysis reveals that remaining errors are systematic, concentrating in non-tumor tissue. Code is available at this https URL.

[CV-77] Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

【速读】:该论文旨在解决无监督跨域图像检索(Unsupervised Cross-Domain Image Retrieval, UCDIR)中的关键挑战,即如何在缺乏标注数据的情况下实现跨域图像的准确语义对齐与表示学习。现有方法依赖聚类生成的离散伪标签作为监督信号,但此类标签难以提供精确且全面的语义指导,同时在特征对齐过程中常忽略领域特异性信息与语义信息的纠缠,导致语义退化并损害检索性能。解决方案的关键在于提出文本-相位协同网络(Text-Phase Synergy Network with Dual Priors, TPSNet),其核心创新是引入双先验机制:一是利用CLIP模型为每个域生成类别特定的文本提示(domain prompt),作为文本先验以提供更精准的语义监督;二是引入域不变相位特征(domain-invariant phase features)作为相位先验,嵌入原始图像表征中以弥合域分布差异的同时保持语义完整性。通过融合这两个先验的协同作用,TPSNet显著提升了UCDIR任务的性能。

链接: https://arxiv.org/abs/2603.12711
作者: Jing Yang,Hui Xue,Shipeng Zhu,Pengfei Fang
机构: Southeast University (东南大学); Ministry of Education (教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses the limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed the domain prompt, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.
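
“相位先验”的核心想法(保留相位、替换幅值)可用一维 DFT 草图示意(实际图像处理会使用二维 FFT;此处为演示性简化):

```python
# Sketch of the "phase prior" idea: via a DFT, keep a signal's phase
# (structure/semantics) while replacing its amplitude (style/appearance).
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def swap_amplitude(content, style):
    """Combine `style` amplitudes with `content` phases."""
    C, S = dft(content), dft(style)
    mixed = [abs(s) * cmath.exp(1j * cmath.phase(c)) for c, s in zip(C, S)]
    return idft(mixed)
```

相位谱对域风格(幅值)变化不敏感,这为“域不变”提供了直观依据:把风格信号的幅值换成任意倍数,只要相位不变,重建结果的结构就保持不变。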

[CV-78] HFP-SAM: Hierarchical Frequency Prompted SAM for Efficient Marine Animal Segmentation

【速读】:该论文旨在解决海洋动物分割(Marine Animal Segmentation, MAS)中因复杂背景导致的长距离建模困难问题,以及现有通用图像分割模型如Segment Anything Model (SAM) 在细粒度细节和频率信息感知方面的不足。解决方案的关键在于提出一种名为分层频率提示SAM(Hierarchical Frequency Prompted SAM, HFP-SAM)的新框架:首先设计频率引导适配器(Frequency Guided Adapter, FGA),通过频域先验掩码高效注入海洋场景信息至冻结的SAM主干网络;其次引入频率感知点选择(Frequency-aware Point Selection, FPS)机制,基于频率分析生成显著区域并结合SAM粗预测生成点提示,输入其解码器以实现精细分割;最后采用全视图Mamba(Full-View Mamba, FVM)模块,以线性计算复杂度高效提取空间与通道上下文信息,从而获得更全面的分割掩码。

链接: https://arxiv.org/abs/2603.12708
作者: Pingping Zhang,Tianyu Yan,Yuhao Wang,Yang Liu,Tongdan Tang,Yili Ma,Long Lv,Feng Tian,Weibing Sun,and Huchuan Lu
机构: Dalian University of Technology (大连理工大学); Central Hospital of Dalian University of Technology (大连理工大学附属医院); Affiliated Zhongshan Hospital of Dalian University (大连大学附属中山医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TIP2026. More modifications may be performed

点击查看摘要

Abstract:Marine Animal Segmentation (MAS) aims at identifying and segmenting marine animals from complex marine environments. Most previous deep learning-based MAS methods struggle with the long-distance modeling issue. Recently, Segment Anything Model (SAM) has gained popularity in general image segmentation. However, it lacks the ability to perceive fine-grained details and frequency information. To this end, we propose a novel learning framework, named Hierarchical Frequency Prompted SAM (HFP-SAM), for high-performance MAS. First, we design a Frequency Guided Adapter (FGA) to efficiently inject marine scene information into the frozen SAM backbone through frequency domain prior masks. Additionally, we introduce a Frequency-aware Point Selection (FPS) to generate highlighted regions through frequency analysis. These regions are combined with the coarse predictions of SAM to generate point prompts and integrated into SAM’s decoder for fine predictions. Finally, to obtain comprehensive segmentation masks, we introduce a Full-View Mamba (FVM) to efficiently extract spatial and channel contextual information with linear computational complexity. Extensive experiments on four public datasets demonstrate the superior performance of our approach. The source code is publicly available at this https URL.
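
频率感知点选择可用如下玩具草图示意(以拉普拉斯幅值作为高频能量的廉价代理、取 top-k 作为点提示,均为演示假设,并非 FPS 的实际实现):

```python
# Toy sketch of frequency-aware point selection: score each interior cell
# of a grid by a Laplacian-magnitude proxy for high-frequency energy and
# keep the top-k cells as candidate point prompts.
def high_freq_scores(img):
    """img: 2-D list of floats; returns {(r, c): |4x - neighbours|}."""
    h, w = len(img), len(img[0])
    scores = {}
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            lap = (4 * img[r][c] - img[r - 1][c] - img[r + 1][c]
                   - img[r][c - 1] - img[r][c + 1])
            scores[(r, c)] = abs(lap)
    return scores

def select_points(img, k=2):
    scores = high_freq_scores(img)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

高频响应强的位置往往对应边缘与纹理细节,把它们转为点提示,正对应“用频率分析生成显著区域、再交给 SAM 解码器细化”的流程。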

[CV-79] VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

【速读】:该论文旨在解决当前视频理解模型在连续播放过程中对世界状态(world state)维持能力不足的问题,尤其是缺乏对模型在时间维度上持续跟踪对象和事件的能力进行系统性评估。解决方案的关键在于提出VCBench——一个流式计数基准,将计数任务重构为诊断世界状态维持能力的最小探针,并将其细分为对象计数(追踪当前可见对象 vs. 追踪累积唯一身份)和事件计数(检测瞬时动作 vs. 追踪完整活动周期)共8个子类别;同时设计三类互补指标(数值精度、轨迹一致性、时间感知),通过帧级标注的10,071个事件发生时刻和4,576个查询点,实现对模型状态维持轨迹的多点流式观测,从而揭示现有主流视频-语言模型在时空状态维护上的显著缺陷,尤其在周期性事件计数等任务中表现不佳。

链接: https://arxiv.org/abs/2603.12703
作者: Pengyiang Liu,Zhongyue Shi,Hongye Hao,Qi Fu,Xueting Bi,Siwei Zhang,Xiaoyang Hu,Zitian Wang,Linjiang Huang,Si Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain world state remains insufficient. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world state maintenance capability. We decompose this capability into object counting (tracking currently visible objects vs. tracking cumulative unique identities) and event counting (detecting instantaneous actions vs. tracking complete activity cycles), forming 8 fine-grained subcategories. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrence moments and object state change moments, generating 1,000 streaming QA pairs with 4,576 query points along timelines. By observing state maintenance trajectories through streaming multi-point queries, we design three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness. Evaluation on mainstream video-language models shows that current models still exhibit significant deficiencies in spatial-temporal state maintenance, particularly struggling with tasks like periodic event counting. VCBench provides a diagnostic framework for measuring and improving state maintenance in video understanding systems.
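
流式计数评测中的两类指标可简化示意如下(查询点处的数值误差与累计计数的单调性检查,均为对论文三类指标的演示性简化):

```python
# Illustrative streaming-count metrics: numerical error at aligned query
# points and a monotonicity check for cumulative counts.
def count_mae(pred, gt):
    """Mean absolute error over aligned query points."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def trajectory_consistent(pred):
    """A cumulative unique-identity count must never decrease."""
    return all(a <= b for a, b in zip(pred, pred[1:]))
```

沿时间轴多点查询后,前者衡量数值精度,后者捕捉“状态轨迹”是否自洽(例如累计计数突然减小即为维护失败)。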

[CV-80] HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation

【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)从依赖密集指令的逐步执行向开放词汇、目标导向自主导航演进过程中,因缺乏结构化先验而导致的鲁棒性不足问题。传统方法多依赖计算复杂的2D/3D度量地图,而本文提出一种轻量级文本驱动的拓扑表示——osmAG(OpenStreetMap Area Graph),作为全局规划的基础;其关键创新在于构建了一个分层导航框架HaltNav,将osmAG的稳健全局路径规划与VLN代理的局部探索和指令对齐能力相结合:通过一个基于多模态大语言模型(MLLM)的“大脑模块”实现高层任务语义理解与障碍感知,并生成以目标为中心的局部子指令序列;同时引入反应式视觉暂停机制(Reactive Visual Halting, RVH),在检测到局部连通性变化(如门关闭或通道拥堵)时中断控制循环,更新osmAG拓扑并触发重规划以实现可行绕行。该方案显著提升了长期导航任务在环境动态变化下的适应性和鲁棒性。

链接: https://arxiv.org/abs/2603.12696
作者: Pingcong Li,Zihui Yu,Bichi Zhang,Sören Schwertfeger
机构: ShanghaiTech University (上海科技大学); Key Laboratory of Intelligent Perception and Human-Machine Collaboration - ShanghaiTech University, Ministry of Education, China (教育部智能感知与人机协同重点实验室-上海科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.
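
“全局规划 + RVH 失效重规划”的循环可用如下草图示意(区域图用邻接表、规划用 BFS,均为演示假设,并非 osmAG 的实际格式):

```python
# Toy sketch of the replanning loop: plan over a lightweight area graph,
# and when Reactive Visual Halting reports a blocked edge, invalidate it
# and replan.
from collections import deque

def plan(graph, start, goal):
    """Shortest path by BFS over the area graph; None if unreachable."""
    prev, frontier = {start: None}, deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                frontier.append(nxt)
    return None

def invalidate(graph, a, b):
    """RVH detected a blocked passage: drop the edge in both directions."""
    graph[a] = [n for n in graph.get(a, []) if n != b]
    graph[b] = [n for n in graph.get(b, []) if n != a]
```

当 `plan` 在失效后返回 `None` 或给出绕行路径时,正对应文中“更新 osmAG 拓扑并触发重规划”的行为。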

[CV-81] HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification CVPR2026

【速读】:该论文旨在解决跨场景下人脸情绪理解(frame-wise facial emotion understanding)与细粒度暴力检测(fine-grained violence detection)的挑战,尤其在复杂真实环境(in-the-wild)中提升模型的鲁棒性与准确性。其解决方案的关键在于:对于情绪识别任务,采用基于预训练EfficientNet模型提取面部嵌入(facial embedding),当模型置信度高于阈值时直接使用预测结果;否则将嵌入输入至一个在AffWild2数据集上训练的简单多层感知机(multi-layered perceptron),并通过固定窗口滑动平滑帧级分类得分以降低噪声影响;对于暴力检测任务,则探索多种预训练网络架构提取帧级嵌入并进行聚合用于视频分类。该方法在ABAW竞赛的四个任务中显著优于现有基线模型。

链接: https://arxiv.org/abs/2603.12693
作者: Andrey V. Savchenko,Kseniia Tsypliakova
机构: Sber AI Lab; HSE University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: to be submitted to ABAW-10 workshop of CVPR 2026

点击查看摘要

Abstract:This article presents our results for the 10th Affective Behavior Analysis in-the-Wild (ABAW) competition. For frame-wise facial emotion understanding tasks (frame-wise facial expression recognition, valence-arousal estimation, action unit detection), we propose a fast approach based on facial embedding extraction with pre-trained EfficientNet-based emotion recognition models. If the latter model’s confidence exceeds a threshold, its prediction is used. Otherwise, we feed embeddings into a simple multi-layered perceptron trained on the AffWild2 dataset. Estimated class-level scores are smoothed in a sliding window of fixed size to mitigate noise in frame-wise predictions. For the fine-grained violence detection task, we examine several pre-trained architectures for frame embeddings and their aggregation for video classification. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines.
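
“高置信度直接采用、否则回退 MLP,再做滑窗平滑”的流程可示意如下(阈值与窗口大小为演示假设,并非论文调优值):

```python
# Sketch of the two-stage frame-wise pipeline: trust the pre-trained
# model when it is confident, otherwise fall back to the MLP head, then
# smooth class scores in a fixed centered window.
def predict_frame(pretrained_scores, mlp_scores, threshold=0.6):
    return (pretrained_scores if max(pretrained_scores) >= threshold
            else mlp_scores)

def smooth(frame_scores, window=3):
    """Centered moving average of per-frame score vectors."""
    n, k = len(frame_scores), window // 2
    out = []
    for t in range(n):
        lo, hi = max(0, t - k), min(n, t + k + 1)
        chunk = frame_scores[lo:hi]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out
```

滑窗平均抑制了逐帧预测中的瞬时抖动,这正是文中“在固定窗口内平滑类级得分以降低噪声”的做法。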

[CV-82] CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images

【速读】:该论文旨在解决跨模态特征匹配(cross-modal feature matching)在红外-可见光(IR-VIS)图像中因显著外观差异导致的匹配困难问题,以及当前研究缺乏标准化评估基准和指标的瓶颈。其解决方案的关键在于构建一个全面的跨模态特征匹配基准CM-Bench,涵盖30种不同类型的特征匹配算法,并通过三个核心任务(单应性估计、相对位姿估计和基于特征匹配的地理定位)进行系统评估;同时提出一种基于分类网络的自适应预处理前端,可自动选择最优增强策略以提升匹配性能,并引入一个新型红外-卫星跨模态数据集,包含人工标注的真实对应关系,用于实际地理定位场景的验证。

链接: https://arxiv.org/abs/2603.12690
作者: Liangzheng Sun,Mengfan He,Xingyu Shao,Binbin Li,Zhiqiang Yan,Chunyu Li,Ziyang Meng,Fei Xing
机构: Beijing Information Science and Technology University (北京信息科技大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation and perception. Along with the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, cross-modal feature matching is still a challenging task due to the significant appearance difference. A significant gap for cross-modal feature matching research lies in the absence of standardized benchmarks and metrics for evaluations. In this paper, we introduce a comprehensive cross-modal feature matching benchmark, CM-Bench, which encompasses 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semi-dense, and dense methods. These methods are evaluated on different tasks including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resource will be available at: this https URL.
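
单应性估计任务的常用评测方式(角点重投影误差)可示意如下(纯演示代码,与该基准的具体实现无关):

```python
# Sketch of a standard homography-accuracy check used in matching
# benchmarks: warp the image corners with the estimated and ground-truth
# homographies and average the corner displacement.
def apply_h(H, pt):
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def corner_error(H_est, H_gt, width, height):
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    errs = []
    for c in corners:
        (xe, ye), (xg, yg) = apply_h(H_est, c), apply_h(H_gt, c)
        errs.append(((xe - xg) ** 2 + (ye - yg) ** 2) ** 0.5)
    return sum(errs) / len(errs)
```

以角点位移而非矩阵元素差作为误差,能直接反映匹配误差对几何对齐的实际影响。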

[CV-83] STRAP-ViT: Segregated Tokens with Randomized – Transformations for Defense against Adversarial Patches in ViTs

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)模型在面对对抗补丁(Adversarial Patch)攻击时的脆弱性问题,即物理可实现的局部噪声能够通过干扰自注意力机制,诱导模型将关注点集中于高对比度区域,并篡改分类令牌(class token)以产生高置信度的错误分类。解决方案的关键在于识别出受扰动影响的token——这些token在统计特性上与未受影响的token存在差异;基于此,作者提出STRAP-ViT防御机制,其核心是利用Jensen-Shannon散度(Jensen-Shannon Divergence)作为异常检测指标,在检测阶段筛选出异常token,并在缓解阶段对其施加随机复合变换(randomized composite transformations),从而破坏对抗噪声的有效性。该方法无需训练、可直接嵌入ViT架构中用于推理,计算开销极低,且在多个预训练模型和攻击场景下均表现出优异的鲁棒性,准确率仅比干净基线下降2–3%。

链接: https://arxiv.org/abs/2603.12688
作者: Nandish Chattopadhyay,Anadi Goyal,Chandan Karfa,Anupam Chattopadhyay
机构: Indian Institute of Technology, Guwahati, India (印度理工学院古瓦哈蒂分校); Nanyang Technological University, Singapore (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at IEEE/ACM Design Automation Conference (DAC) 2026

点击查看摘要

Abstract:Adversarial patches are physically realizable localized noise, which are able to hijack Vision Transformers (ViT) self-attention, pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that the tokens which correspond to the areas of the image that contain the adversarial noise, have different statistical properties when compared to the tokens which do not overlap with the adversarial perturbations. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric for segregating tokens that behave as anomalies in the Detection Phase, and then apply randomized composite transformations on them during the Mitigation Phase to make the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter for the defense mechanism and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable plug-and-play block within the ViT architectures, for inference purposes only, with a minimal computational cost and does not require any additional training cost/effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101), across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA and RP2), and found to provide excellent robust accuracies lying within a 2-3% range of the clean baselines, and outperform the state-of-the-art.
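摘要中以Jensen-Shannon散度筛选异常token的思路可用如下极简代码示意(假设性草图,并非论文官方实现:token分布的构造方式与top-k选择策略均为演示假设):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def flag_anomalous_tokens(token_dists, k):
    # score each token's distribution against the group mean distribution
    # and return the indices of the k most divergent (anomalous) tokens
    mean_dist = token_dists.mean(axis=0)
    scores = np.array([js_divergence(t, mean_dist) for t in token_dists])
    return np.argsort(scores)[-k:]
```

在STRAP-ViT中,被标记的token会在缓解阶段接受随机复合变换;此处仅示意检测阶段的度量计算。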

[CV-84] RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection

【速读】:该论文旨在解决RGB与热成像(thermal image)之间显著区域不一致的问题,即在多模态显著目标检测(RGB-T Salient Object Detection)中,由于两种模态对同一场景的感知差异导致显著目标分布不匹配,从而影响检测精度。解决方案的关键在于提出一种区域引导的选择性优化网络(Region-guided Selective Optimization Network, RSONet),其核心结构包含两个阶段:第一阶段通过三个并行编码器-解码器分支(配备上下文交互模块和空间感知融合模块)生成引导图以计算模态间相似性得分;第二阶段利用选择性优化(SO)模块根据相似性值动态融合RGB与热图像特征,缓解显著目标分布不一致的影响;此外,还引入密集细节增强(DDE)模块和互交互语义(MIS)模块分别优化低层细节信息与高层位置线索,从而提升最终检测结果的质量。

链接: https://arxiv.org/abs/2603.12685
作者: Bin Wan,Runmin Cong,Xiaofei Zhou,Hao Fang,Chengtao Lv,Sam Kwong
机构: Shandong University (山东大学); Key Laboratory of Machine Intelligence and System Control, Ministry of Education (教育部机器智能与系统控制重点实验室); Hangzhou Dianzi University (杭州电子科技大学); Huzhou University (湖州学院); School of Data Science, Lingnan University (岭南大学数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper focuses on the inconsistency in salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network for RGB-T Salient Object Detection, which consists of a region guidance stage and a saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and spatial-aware fusion (SF) module, are designed to generate guidance maps, which are leveraged to calculate similarity scores. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on the previously obtained similarity values to mitigate the impact of inconsistent distribution of salient targets between the two modalities. After that, to generate high-quality detection results, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to low-level features for optimizing the detail information. In addition, the mutual interaction semantic (MIS) module is applied to the high-level features to mine location cues via the mutual fusion strategy. We conduct extensive experiments on the RGB-T dataset, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.

[CV-85] G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images

【速读】:该论文旨在解决光学遥感图像中显著目标检测(SOD)因尺度变化大和背景复杂而导致的检测精度不足问题。现有方法通常在单一尺度下使用统一注意力机制提取多级特征,导致表征不充分且检测结果不完整。其解决方案的关键在于提出GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet),该网络通过引入三个核心模块实现对几何(geo)与粒度(granular)线索的深度挖掘:一是多尺度细节增强(MDE)模块以应对目标尺度变化并丰富细粒度信息;二是双分支地理-粒度互补(DGC)模块联合捕获中层特征中的细粒度细节与位置信息;三是深层语义感知(DSP)模块利用自注意力机制优化高层位置线索。此外,采用局部-全局引导融合(LGF)模块替代传统卷积进行多级特征融合,从而提升整体检测性能,在复杂遥感场景中显著改善了显著性图的质量与检测准确性。

链接: https://arxiv.org/abs/2603.12680
作者: Bin Wan,Runmin Cong,Xiaofei Zhou,Hao Fang,Chengtao Lv,Sam Kwong
机构: Shandong University (山东大学); State Key Laboratory of Autonomous Intelligent Unmanned Systems (自主智能无人系统国家重点实验室); Hangzhou Dianzi University (杭州电子科技大学); Huzhou University (湖州师范学院); Lingnan University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.

[CV-86] Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

【速读】:该论文旨在解决多模态视觉-语言模型(Vision-Language Models, VLMs)在集成推理中如何有效选择和融合异构模型以提升性能的问题,尤其关注在缺乏多数共识或多数模型预测错误时仍能保持高准确率的挑战。解决方案的关键在于提出一种基于焦点误差多样性(focal error diversity)与基于中心核对齐(Centered Kernel Alignment, CKA)的焦点多样性度量(CKA-focal)的方法,用于量化不同VLM在视觉嵌入层面的分歧程度,并结合遗传算法(Genetic Algorithm)从候选模型池中自动筛选出对融合性能有贡献的组件模型,从而构建动态且具有抗幻觉能力的双焦点多样性融合策略(V3Fusion)。该方法能够识别最优模型组合并实现输出融合,在多个主流VLM基准测试(如MMMU、A-OKVQA等)上显著优于单个最佳模型及当前顶尖生成式模型。

链接: https://arxiv.org/abs/2603.12669
作者: Selim Furkan Tekin,Yichang Xu,Gaowen Liu,Ramana Rao Kompella,Margaret L. Loper,Ling Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address the diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the constructed ensemble surface from a pool of candidate VLMs, we applied a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task as well as fuse the outputs of each VLMs in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM on MMMU by 8.09% and MMMU-Pro by 4.87% gain in accuracy. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at this https URL.
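作为参考,下面给出线性CKA(Centered Kernel Alignment)相似度的最小实现示意(假设性代码:CKA-focal中"焦点多样性"如何由CKA派生,摘要未给出具体公式,此处仅以1−CKA表示视觉嵌入分歧):

```python
import numpy as np

def linear_cka(X, Y):
    # linear CKA between two feature matrices of shape (n_samples, dim)
    X = X - X.mean(axis=0, keepdims=True)  # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') *
                   np.linalg.norm(Y.T @ Y, 'fro'))

def embedding_disagreement(X, Y):
    # one plausible divergence score: higher means more complementary views
    return 1.0 - linear_cka(X, Y)
```

线性CKA对正交变换与各向同性缩放不变,因此适合比较不同VLM视觉编码器的嵌入空间。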

[CV-87] Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies

【速读】:该论文旨在解决建筑与交通基础设施中集料(Aggregate)颗粒形态(Morphology)在质量保证/质量控制(QA/QC)过程中难以实现高效、准确三维表征的问题。传统方法如2D图像分析或依赖昂贵设备(如3D激光扫描仪或X射线计算机断层成像CT)的3D扫描存在局限性,无法满足现场快速、低成本的粒径与形状信息获取需求。论文提出的解决方案关键在于采用基于标记点的摄影测量法(Photogrammetry),通过标记点设计实现背景抑制、点云拼接及尺度参考,从而在不依赖高成本设备的前提下,获得高质量的集料三维模型,并验证了其精度与2D分析相比具有显著差异,为集料的便捷检测、数据采集和三维形态分析提供了可行路径。

链接: https://arxiv.org/abs/2603.12667
作者: Haohang Huang,Jiayi Luo,Issam Qamhia,Erol Tutumluer,John M. Hart,Andrew J. Stolba
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Aggregates, serving as the main skeleton in assemblies of construction materials, are important functional components in various building and transportation infrastructures. They can be used in unbound layer applications, e.g. pavement base and railroad ballast, bound applications of cement concrete and asphalt concrete, and as riprap and large-sized primary crushed rocks. Information on the size and shape or morphology of aggregates can greatly facilitate the Quality Assurance/Quality Control (QA/QC) process by providing insights of aggregate behavior during composition and packing. A full 3D characterization of aggregate particle morphology is difficult both during production in a quarry and at a construction site. Many aggregate imaging approaches have been developed to quantify the particle morphology by computer vision, including 2D image-based approaches that analyze particle silhouettes and 3D scanning-based methods that require expensive devices such as 3D laser scanners or X-Ray Computed Tomography (CT) equipment. This paper presents a flexible and cost-effective photogrammetry-based approach for the 3D reconstruction of aggregate particles. The proposed approach follows a marker-based design that enables background suppression, point cloud stitching, and scale referencing to obtain high-quality aggregate models. The accuracy of the reconstruction results was validated against ground-truth for selected aggregate samples. Comparative analyses were conducted on 2D and 3D morphological properties of the selected samples. Significant differences were found between the 2D and 3D statistics. Based on the presented approach, 3D shape information of aggregates can be obtained easily and at a low cost, thus allowing convenient aggregate inspection, data collection, and 3D morphological analysis.

[CV-88] Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization

【速读】:该论文旨在解决室外场景语义分类问题(semantic place categorization),这是自主机器人和车辆在陌生环境中实现自主决策与导航的关键任务。由于室外环境存在光照变化、遮挡等感知差异,其分类难度显著高于室内场景。解决方案的关键在于提出一种基于卷积神经网络(CNN)的新方法,利用3D激光雷达(LiDAR)获取的全向深度/反射率图像作为输入,同时融合深度与反射率两种模态信息进行分类建模,并构建了一个大规模的多模态全景3D室外场景数据集(Multi-modal Panoramic 3D Outdoor, MPO),从而显著优于传统方法,在多个户外场景类别上实现了更优的分类性能。

链接: https://arxiv.org/abs/2603.12663
作者: Kazuto Nakashima,Hojung Jung,Yuki Oto,Yumi Iwashita,Ryo Kurazume,Oscar Martinez Mozos
机构: Kyushu University (九州大学); California Institute of Technology (加州理工学院); Technical University of Cartagena (卡塔赫纳理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Published in Advanced Robotics on 31 Jul 2018

点击查看摘要

Abstract:Semantic place categorization, which is one of the essential tasks for autonomous robots and vehicles, allows them to have capabilities of self-decision and navigation in unfamiliar environments. In particular, outdoor places are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method of categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as the inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO) comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor/outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach with the MPO dataset. Our results on the MPO dataset outperform traditional approaches and show the effectiveness in which we use both depth and reflectance modalities. To analyze our trained deep networks we visualize the learned features.
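LiDAR点云到全向深度/反射率图像的转换可用球面投影示意如下(假设性实现:垂直视场角等传感器参数为演示假设,并非MPO数据集的实际参数;碰撞像素简单地保留最后写入的点):

```python
import numpy as np

def spherical_projection(points, intensity, h=64, w=1024,
                         fov_up=np.radians(3.0), fov_down=np.radians(-25.0)):
    # project a LiDAR point cloud (N, 3) with per-point intensity (N,)
    # onto a panoramic depth/reflectance image pair of shape (2, h, w)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-8), -1.0, 1.0))
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w
    fov = fov_up - fov_down
    v = np.clip(((fov_up - pitch) / fov * h).astype(int), 0, h - 1)
    depth_img = np.zeros((h, w))
    refl_img = np.zeros((h, w))
    depth_img[v, u] = depth       # last written point wins on collisions
    refl_img[v, u] = intensity
    return np.stack([depth_img, refl_img], axis=0)  # 2-channel CNN input
```

两通道图像可直接作为CNN输入,对应摘要中"同时利用深度与反射率模态"的做法。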

[CV-89] AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network CVPR2026

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在遥感图像适配中的两大挑战:一是文本表示的语义覆盖不足,二是视觉特征适应性有限,尤其在航空场景中,由于目标外观多样且细粒度差异显著,问题更为突出。解决方案的关键在于提出AVION框架,通过知识蒸馏机制实现跨模态对齐:教师模块利用大语言模型生成语义丰富的文本原型,并结合遥感图像特征验证其有效性;学生模块则在视觉和语言编码器中嵌入轻量且可学习的提示(prompt),在教师指导下优化嵌入空间及其跨模态关系,从而在少量样本下提升分类准确率与跨模态检索召回率,同时保持对新类别的泛化能力,且仅引入极少额外训练参数。

链接: https://arxiv.org/abs/2603.12659
作者: Yu Hu,Jianyang Gu,Hao Liu,Yue Cao,Jozsef Hamari,Zheng Liu,Mohsen Zardadi
机构: The University of British Columbia, Okanagan, Kelowna, BC, Canada; The Ohio State University, Columbus, OH, USA; TerraSense Analytics, Kelowna, BC, Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.

[CV-90] VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors

【速读】:该论文旨在解决单目视频场景级神经体积重建(scene-level neural volumetric reconstruction)在严重域偏移(domain shift)下的挑战,特别是预训练视觉基础模型(Vision Foundation Models, VFMs)的尺度模糊预测与体积融合所需的尺度一致性之间的不兼容问题。解决方案的关键在于提出VFMRecon框架:首先引入一个轻量级尺度对齐阶段以恢复多视角尺度一致性;其次通过轻量级任务特定适配器(task-specific adapters)将预训练VFMs特征集成到神经体积重建流程中,在保证重建性能的同时保留预训练表示的跨域鲁棒性。该方法在ScanNet、TUM RGB-D和Tanks and Temples等多个数据集上均取得SOTA性能,尤其在具有挑战性的户外Tanks and Temples数据集上,重建网格的F1分数达到70.1,显著优于现有最先进方法VGGT(51.8)。

链接: https://arxiv.org/abs/2603.12657
作者: Yuhang Ming,Tingkang Xi,Xingrui Yang,Lixin Yang,Yong Peng,Cewu Lu,Wanzeng Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFMRecon, the first attempt to bridge transferable VFM priors with scale-consistent requirements in scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multi-view scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of pretrained representations. We train our model on the ScanNet train split and evaluate on both the in-distribution ScanNet test split and out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all dataset domains. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed mesh evaluation, substantially outperforming the closest competitor, VGGT, which only attains 51.8.
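摘要中的"尺度对齐阶段"思路可用最小二乘尺度/平移对齐示意(假设性草图:以带尺度的参考深度恢复单目预测的尺度与偏移,具体对齐目标与论文实现可能不同):

```python
import numpy as np

def align_scale_shift(pred, ref):
    # least-squares scale s and shift t minimizing ||s * pred + t - ref||^2
    # pred: scale-ambiguous depth values (flattened), ref: metric reference
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
    return s, t
```

对齐后的预测 s * pred + t 才具备多视角一致的度量尺度,可参与体积融合。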

[CV-91] VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

【速读】:该论文旨在解决现有世界模型在预测场景演化时过度关注光度细节(photometric details)而导致几何不一致的问题。其核心挑战在于如何在保持高效性的同时,提升对未来几何结构的准确预测能力。解决方案的关键在于提出VGGT-World,一种基于冻结几何基础模型(Geometry-Foundation-Model, GFM)特征的世界模型:它摒弃了生成未来视频帧的传统方法,转而直接预测冻结的VGGT潜空间特征随时间演化的轨迹;并通过两个关键技术突破克服高维特征空间(d=1024)中的两大难题——一是采用“干净目标”(z-prediction)参数化缓解标准速度预测流匹配的崩溃问题,二是设计两阶段潜在流强制课程训练策略以缓解自回归滚动过程中的累积暴露偏差问题。实验表明,该方法在深度预测任务上显著优于最强基线,且推理速度提升3.6–5倍,参数量仅为0.43B。

链接: https://arxiv.org/abs/2603.12655
作者: Xiangyu Sun,Shijie Wang,Fengyi Zhang,Lin Liu,Caiyan Jia,Ziying Song,Zi Huang,Yadan Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
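摘要中"clean-target (z-prediction)"参数化与标准速度预测的关系,可用如下线性插值路径的草图说明(假设性示意代码,符号与论文可能不一致):

```python
import numpy as np

def interpolate(z0, z1, t):
    # linear flow-matching path from noise z0 (t=0) to clean latent z1 (t=1)
    return (1.0 - t) * z0 + t * z1

def velocity_target(z0, z1):
    # standard velocity-prediction target along the linear path
    return z1 - z0

def z_prediction_loss(pred_clean, z1):
    # clean-target (z-prediction) objective: regress the clean latent
    return float(np.mean((pred_clean - z1) ** 2))

def velocity_from_z(z_hat, z_t, t, eps=1e-6):
    # a velocity estimate recovered from a clean-target prediction
    return (z_hat - z_t) / max(1.0 - t, eps)
```

当预测恰为干净目标z1时,由z-预测恢复的速度与标准速度目标一致;两者的差别仅在回归目标的信噪比上。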

[CV-92] From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型中偏好对齐(preference alignment)时,标准Group Relative Policy Optimization (GRPO) 方法因仅基于单一条件评估一组生成样本而导致的样本间关系探索不足问题,从而限制了对齐效果和性能上限。其解决方案的关键在于提出多视角GRPO(Multi-View GRPO, MV-GRPO),通过引入一个灵活的条件增强器(Condition Enhancer)在条件空间中生成语义相邻但多样化的补充描述词(captions),构建密集的多视角奖励映射,进而实现多视角优势重估计(multi-view advantage re-estimation),捕获更丰富的语义属性并提供更强的优化信号;同时,通过推导原始样本在新条件下的概率分布,无需重新生成样本即可将这些多视角信息融入训练过程,显著提升了对齐性能。

链接: https://arxiv.org/abs/2603.12648
作者: Jiazi Bu,Pengyang Ling,Yujie Zhou,Yibin Wang,Yuhang Zang,Tianyi Wei,Xiaohang Zhan,Jiaqi Wang,Tong Wu,Xingang Pan,Dahua Lin
机构: Tsinghua University (清华大学); Alibaba Group (阿里巴巴集团); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm where evaluating a group of generated samples against a single condition suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
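多视角优势重估计的核心计算可示意如下(假设性草图:对每个caption视角做GRPO式组内标准化,再跨视角平均;与论文实际公式可能有出入):

```python
import numpy as np

def multiview_advantages(rewards, eps=1e-8):
    # rewards: (n_samples, n_views); column j scores the same group of
    # generated samples against the j-th semantically adjacent caption
    mean = rewards.mean(axis=0, keepdims=True)   # per-view group mean
    std = rewards.std(axis=0, keepdims=True)     # per-view group std
    per_view = (rewards - mean) / (std + eps)    # GRPO-style normalization
    return per_view.mean(axis=1)                 # average across views
```

相比单一条件下的稀疏评估,多视角平均为每个样本提供了更稠密的优化信号。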

[CV-93] LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

【速读】:该论文旨在解决现有3D Gaussian Splatting (3DGS) 方法在自动驾驶场景中因仅依赖摄像头或有限利用LiDAR信息而导致的重建质量下降问题,特别是在高自运动和复杂光照条件下表现不佳。其关键解决方案是提出一种鲁棒且高效的LiDAR-reflectance-guided Salient Gaussian Splatting (LR-SGS) 方法:首先从LiDAR点云中提取几何与反射率特征点进行结构感知的显著高斯表示初始化;随后通过显著变换(salient transform)和改进的密度控制策略增强边缘与平面结构的捕捉能力;同时将校准后的LiDAR强度转化为反射率并作为光照不变的材质通道附加至每个高斯,与RGB联合对齐以强化边界一致性,从而提升重建精度与效率。

链接: https://arxiv.org/abs/2603.12647
作者: Ziyu Chen,Fan Zhu,Hui Zhu,Deyi Kong,Xinkai Kuang,Yujia Zhang,Chunmao Jiang
机构: Hefei Institutes of Physical Science, Chinese Academy of Sciences (中国科学院合肥物质科学研究院); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, conference

点击查看摘要

Abstract:Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

[CV-94] RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

【速读】:该论文旨在解决可扩展的具身人工智能(Embodied AI)在现实世界交互中面临的高昂成本与安全风险问题,以及现有具身世界模型(Embodied World Models, EWMs)存在的几何幻觉和缺乏统一策略优化框架的局限性。其解决方案的关键在于提出RoboStereo——一种对称双塔四维(4D)世界模型,通过双向跨模态增强机制确保时空几何一致性并缓解物理幻觉;在此高保真4D模拟器基础上,构建了首个基于世界模型的统一策略优化框架,包含测试时策略增强(Test-Time Policy Augmentation, TTPA)、模仿进化策略学习(Imitative-Evolutionary Policy Learning, IEPL)和开放探索策略学习(Open-Exploration Policy Learning, OEPL),实现了策略预执行验证、从专家示范中学习及自主技能发现与自我修正的协同优化。

链接: https://arxiv.org/abs/2603.12639
作者: Ruicheng Zhang,Guangyu Chen,Zunnan Xu,Zihao Liu,Zhizhou Zhong,Mingyang Zhang,Jun Zhou,Xiu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering 97% average relative improvement on fine-grained manipulation tasks.

[CV-95] Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains

【速读】:该论文旨在解决货运列车视觉故障检测中因复杂运行环境、结构重复部件以及关键区域频繁遮挡或污染导致的分割精度低和泛化能力差的问题。解决方案的关键在于提出一种轻量级自提示实例分割框架,其核心创新包括:利用Segment Anything Model(SAM)并引入自提示生成模块以自动构建任务特定提示,从而实现基础模型到领域任务的有效知识迁移;同时采用Tiny Vision Transformer作为骨干网络,在保证高精度的同时显著降低计算开销,使系统适用于铁路监控场景中的边缘设备实时部署。

链接: https://arxiv.org/abs/2603.12624
作者: Guodong Sun,Qihang Liang,Xingyu Pan,Moyun Liu,Yang Zhang
机构: Hubei University of Technology (湖北工业大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 AP^box and 74.2 AP^mask on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios. Project page: this https URL

[CV-96] Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning

【速读】:该论文旨在解决当前视觉-语言检测与定位模型在处理包含否定语义(negative semantics)的复杂表达时表现不佳的问题,其根本原因在于缺乏高质量的训练数据来明确捕捉区分性的负样本及具备否定意识的语言描述。解决方案的关键在于提出两个核心创新:一是构建了D-Negation数据集,其中对物体同时标注了正向和负向语义描述;二是设计了一种基于分组对立的学习框架(grouped opposition-based learning framework),通过将D-Negation中的对立语义描述组织为结构化组,并引入两种互补的损失函数,促使模型从有限样本中学习到具备否定感知能力的表示。该方法在仅微调少于10%模型参数的情况下,显著提升了模型在正向和负向语义评估上的性能(分别提高4.4 mAP和5.7 mAP),验证了显式建模否定语义对增强视觉-语言定位模型鲁棒性和精度的有效性。

链接: https://arxiv.org/abs/2603.12606
作者: Zesheng Yang,Xi Jiang,Bingzhang Hu,Weili Guan,Runmin Cong,Guo-Jun Qi,Feng Zheng
机构: Southern University of Science and Technology (南方科技大学); Hefei CAS Dihuge Automation Co., LTD (合肥中科大迪虎自动化有限公司); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)); Shandong University (山东大学); Westlake University (西湖大学); OPPO Research (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers. We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.
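摘要所述"grouped opposition-based learning"的一种可能的损失形式如下(纯假设性示意:以组内正/负语义描述的相似度差构造margin排序损失,论文两种互补损失的具体定义未在摘要中给出):

```python
import numpy as np

def opposition_margin_loss(sim_pos, sim_neg, margin=0.2):
    # sim_pos: similarity of a region with its matching description
    # sim_neg: similarity with the opposing negated description in the group
    # push the matching similarity above the negated one by a margin
    return float(np.maximum(0.0, margin - (sim_pos - sim_neg)).mean())
```

这一形式鼓励模型区分"含有X"与"不含X"的成对描述,从而在有限样本下学到否定感知表示。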

[CV-97] A2Z-10M: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering CVPR2026

【速读】:该论文旨在解决工业产品设计中从3D扫描、草图或简单文本提示中进行计算机辅助设计(CAD)模型的逆向工程与快速原型构建问题,其核心挑战在于当前几何深度学习技术对参数化CAD特征在边界表示(BRep)中的多模态理解不足。解决方案的关键在于构建了迄今为止规模最大的多模态标注数据集A2Z,包含1000万条多模态注释和元数据,涵盖100万份ABC CAD模型,具体包括高分辨率网格、手绘草图(附带BRep共边、角点和面的几何与拓扑信息)以及描述机械世界产品的文本标题和标签。该数据集通过新型评估指标、GPT-5、Gemini及大量人工反馈机制确保高质量与多样性,并进一步融合2.5万份专业设计的电子外壳CAD模型以增强领域覆盖。基于此,研究者训练并基准测试了一个基础模型,在15万份CAD模型子集上实现了从3D扫描中检测BRep共边和角点顶点的能力,这是CAD逆向工程中的关键下游任务。

Link: https://arxiv.org/abs/2603.12605
Authors: Pritham Kumar Jena, Bhavika Baburaj, Tushar Anand, Vedant Dutta, Vineeth Ulavala, Sk Aziz Ali
Affiliations: BITS Pilani, Hyderabad, India; 3D Vision Group (3DVG)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, accepted to IEEE CVF CVPR 2026


Abstract:Reverse engineering and rapid prototyping of computer-aided design (CAD) models from 3D scans, sketches, or simple text prompts are vital in industrial product design. However, recent advances in geometric deep learning techniques lack a multi-modal understanding of parametric CAD features stored in their boundary representation (BRep). This study presents the largest compilation of 10 million multi-modal annotations and metadata for 1 million ABC CAD models, namely A2Z, to unlock an unprecedented level of BRep learning. A2Z comprises (i) high-resolution meshes with salient 3D scanning features, (ii) 3D hand-drawn sketches equipped with (iii) geometric and topological information about BRep co-edges, corners, and surfaces, and (iv) textual captions and tags describing the product in the mechanical world. Creating such carefully structured, large-scale data, which requires nearly 5 terabytes of storage to leverage unparalleled CAD learning/retrieval tasks, is very challenging. The scale, quality, and diversity of our multi-modal annotations are assessed using novel metrics, GPT-5, Gemini, and extensive human feedback mechanisms. To this end, we also merge an additional 25,000 CAD models of electronic enclosures (e.g., tablets, ports) designed by skilled professionals with our A2Z dataset. Subsequently, we train and benchmark a foundation model on a subset of 150K CAD models to detect BRep co-edges and corner vertices from 3D scans, a key downstream task in CAD reverse engineering. The annotated dataset, metrics, and checkpoints will be publicly released to support numerous research directions.

[CV-98] A Prediction-as-Perception Framework for 3D Object Detection

[Quick Read]: This paper targets the accuracy and efficiency bottlenecks of 3D object perception in dynamic autonomous-driving scenes, in particular the tracking errors and excessive computation caused by insufficient use of inter-frame information when handling fast-moving objects. The key to the biologically inspired Prediction-As-Perception (PAP) framework is co-designing the prediction and perception modules: the prediction module forecasts the future positions of the ego vehicle and surrounding traffic participants from the current frame's perception results, passes these predicted positions as queries to the next frame's perception module, and the perception results are fed back to the prediction module, forming a closed loop that fuses information across frames and focuses attention. This architecture improves target tracking accuracy (by 10% for UniAD), speeds up inference (by 15%), and reduces computational cost.

Link: https://arxiv.org/abs/2603.12599
Authors: Song Zhang, Haoyu Chen, Ruibo Wang
Affiliations: Z-one Technology Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Humans combine prediction and perception to observe the world. When faced with rapidly moving birds or insects, we can only perceive them clearly by predicting their next position and focusing our gaze there. Inspired by this, this paper proposes the Prediction-As-Perception (PAP) framework, integrating a prediction-perception architecture into 3D object perception tasks to enhance the model’s perceptual accuracy. The PAP framework consists of two main modules: prediction and perception, primarily utilizing continuous frame information as input. Firstly, the prediction module forecasts the potential future positions of ego vehicles and surrounding traffic participants based on the perception results of the current frame. These predicted positions are then passed as queries to the perception module of the subsequent frame. The perceived results are iteratively fed back into the prediction module. We evaluated the PAP structure using the end-to-end model UniAD on the nuScenes dataset. The results demonstrate that the PAP structure improves UniAD’s target tracking accuracy by 10% and increases the inference speed by 15%. This indicates that such a biomimetic design significantly enhances the efficiency and accuracy of perception models while reducing computational resource consumption.
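The closed prediction-to-perception loop described above can be sketched as a minimal control flow. This is only an illustration: `perceive` and `predict` are hypothetical stand-ins for the paper's perception and prediction modules, which in practice operate on image features and 3D detections rather than plain values.

```python
def run_pap(frames, perceive, predict):
    """Prediction-As-Perception loop (sketch).

    Each frame's detections seed the prediction module, whose
    forecast positions become the queries for the next frame's
    perception module, closing the feedback loop.
    """
    queries = None  # no forecast is available for the first frame
    detections_per_frame = []
    for frame in frames:
        detections = perceive(frame, queries)  # perception module
        queries = predict(detections)          # forecast -> next-frame queries
        detections_per_frame.append(detections)
    return detections_per_frame
```

With stub modules, one can verify that frame t+1's perception indeed receives the forecast produced from frame t's detections.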

[CV-99] Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating

[Quick Read]: This paper addresses privacy-leakage risks in deployed Large Vision-Language Models (LVLMs), in particular their inconsistent refusal of requests involving sensitive information, since existing privacy-protection methods fall short in both generalization and non-destructiveness. The key to the proposed Neural Gate is neuron-level model editing: it first learns a feature vector to localize the neurons associated with privacy-related concepts inside the model, and then uses this localization to precisely guide the parameter update. This markedly raises the refusal rate for privacy-related questions, extends the protective behavior to novel sensitive queries unseen during editing, and preserves the model's original task performance.

Link: https://arxiv.org/abs/2603.12598
Authors: Xiangkui Cao, Jie Zhang, Meina Kan, Shiguang Shan, Xilin Chen
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Large Vision-Language Models (LVLMs) have shown remarkable potential across a wide array of vision-language tasks, leading to their adoption in critical domains such as finance and healthcare. However, their growing deployment also introduces significant security and privacy risks. Malicious actors could potentially exploit these models to extract sensitive information, highlighting a critical vulnerability. Recent studies show that LVLMs often fail to consistently refuse instructions designed to compromise user privacy. While existing work on privacy protection has made meaningful progress in preventing the leakage of sensitive data, they are constrained by limitations in both generalization and non-destructiveness. They often struggle to robustly handle unseen privacy-related queries and may inadvertently degrade a model’s performance on standard tasks. To address these challenges, we introduce Neural Gate, a novel method for mitigating privacy risks through neuron-level model editing. Our method improves a model’s privacy safeguards by increasing its rate of refusal for privacy-related questions, crucially extending this protective behavior to novel sensitive queries not encountered during the editing process. Neural Gate operates by learning a feature vector to identify neurons associated with privacy-related concepts within the model’s representation of a subject. This localization then precisely guides the update of model parameters. Through comprehensive experiments on MiniGPT and LLaVA, we demonstrate that our method significantly boosts the model’s privacy protection while preserving its original utility.

[CV-100] SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

[Quick Read]: This paper tackles cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery, whose core challenge is the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. Existing methods rely mainly on statistical distribution alignment or semantic matching, overlooking a critical physical prior: ships are rigid objects whose geometric structure stays stable across modalities, while texture appearance is highly modality-dependent. The key to the proposed SDF-Net (Structure-Aware Disentangled Feature Learning Network) is a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor the representations, plus a terminal stage that disentangles the learned features into modality-invariant identity features and modality-specific characteristics, integrated via parameter-free additive residual fusion to enhance discriminative power.

Link: https://arxiv.org/abs/2603.12588
Authors: Furui Chen, Han Wang, Yuhan Sun, Jianing You, Yixuan Lv, Zhuang Zhou, Hong Tan, Shengyang Li
Affiliations: Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences; Key Laboratory of Space Utilization, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Software, Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. While existing approaches primarily rely on statistical distribution alignment or semantic matching, they often overlook a critical physical prior: ships are rigid objects whose geometric structures remain stable across sensing modalities, whereas texture appearance is highly modality-dependent. In this work, we propose SDF-Net, a Structure-Aware Disentangled Feature Learning Network that systematically incorporates geometric consistency into optical–SAR ship ReID. Built upon a ViT backbone, SDF-Net introduces a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor representations against radiometric variations. At the terminal stage, SDF-Net disentangles the learned representations into modality-invariant identity features and modality-specific characteristics. These decoupled cues are then integrated through a parameter-free additive residual fusion, effectively enhancing discriminative power. Extensive experiments on the HOSS-ReID dataset demonstrate that SDF-Net consistently outperforms existing state-of-the-art methods. The code and trained models are publicly available at this https URL.

[CV-101] MRGeo: Robust Cross-View Geo-Localization of Corrupted Images via Spatial and Channel Feature Enhancement

[Quick Read]: This paper addresses the limited robustness of cross-view geo-localization (CVGL) in real-world corrupted environments: when street-view images suffer degradations such as blur or weather, existing methods degrade sharply or fail outright, limiting practical deployment. The key to MRGeo, a systematic robust CVGL method, is a hierarchical defense strategy. A Spatial-Channel Enhancement Block first improves intrinsic feature quality, combining a Spatial Adaptive Representation Module (parallel global and local feature modeling with dynamically gated fusion) and a Channel Calibration Module (multi-granularity channel-dependency modeling to compensate for information loss). A Region-level Geometric Alignment Module then imposes geometric structure on the final descriptors, preventing spatial misalignment under severe corruption and yielding strong robustness and generalization.

Link: https://arxiv.org/abs/2603.12587
Authors: Le Wu, Lv Bo, Songsong Ouyang, Yingying Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Cross-view geo-localization (CVGL) aims to accurately localize street-view images through retrieval of corresponding geo-tagged satellite images. While prior works have achieved nearly perfect performance on certain standard datasets, their robustness in real-world corrupted environments remains under-explored. This oversight causes severe performance degradation or failure when images are affected by corruption such as blur or weather, significantly limiting practical deployment. To address this critical gap, we introduce MRGeo, the first systematic method designed for robust CVGL under corruption. MRGeo employs a hierarchical defense strategy that enhances the intrinsic quality of features and then enforces a robust geometric prior. Its core is the Spatial-Channel Enhancement Block, which contains: (1) a Spatial Adaptive Representation Module that models global and local features in parallel and uses a dynamic gating mechanism to arbitrate their fusion based on feature reliability; and (2) a Channel Calibration Module that performs compensatory adjustments by modeling multi-granularity channel dependencies to counteract information loss. To prevent spatial misalignment under severe corruption, a Region-level Geometric Alignment Module imposes a geometric structure on the final descriptors, ensuring coarse-grained consistency. Comprehensive experiments on both robustness benchmark and standard datasets demonstrate that MRGeo not only achieves an average R@1 improvement of 2.92% across three comprehensive robustness benchmarks (CVUSA-C-ALL, CVACT_val-C-ALL, and CVACT_test-C-ALL) but also establishes superior performance in cross-area evaluation, thereby demonstrating its robustness and generalization capability.

[CV-102] DINOLight: Robust Ambient Light Normalization with Self-supervised Visual Prior Integration ICPR2026

[Quick Read]: This paper addresses image degradation from non-uniform shadows and uneven illumination caused by multiple light sources and complex scene geometry (ambient light normalization). The key to the solution is injecting the semantic and geometric information extracted by the self-supervised model DINOv2 into the restoration process as a visual prior: an adaptive feature fusion module first combines features from different DINOv2 layers via a point-wise softmax mask, and the fused features are then integrated into the restoration network in both the spatial and frequency domains through an auxiliary cross-attention mechanism, significantly improving lighting-normalization performance.

Link: https://arxiv.org/abs/2603.12579
Authors: Youngjin Oh, Junhyeong Kwon, Nam Ik Cho
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICPR 2026 (under review)


Abstract:This paper presents a new ambient light normalization framework, DINOLight, that integrates the self-supervised model DINOv2’s image understanding capability into the restoration process as a visual prior. Ambient light normalization aims to restore images degraded by non-uniform shadows and lighting caused by multiple light sources and complex scene geometries. We observe that DINOv2 can reliably extract both semantic and geometric information from a degraded image. Based on this observation, we develop a novel framework to utilize DINOv2 features for lighting normalization. First, we propose an adaptive feature fusion module that combines features from different DINOv2 layers using a point-wise softmax mask. Next, the fused features are integrated into our proposed restoration network in both spatial and frequency domains through an auxiliary cross-attention mechanism. Experiments show that DINOLight achieves superior performance on the Ambient6K dataset, and that DINOv2 features are effective for enhancing ambient light normalization. We also apply our method to shadow-removal benchmark datasets, achieving competitive results compared to methods that use mask priors. Codes will be released upon acceptance.

[CV-103] AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation

[Quick Read]: This paper aims to reduce the inference latency of Diffusion Transformers (DiTs) in high-fidelity text-to-image generation, which is dominated by quadratic self-attention over dense spatial tokens and limits deployment. The key to AccelAes, a training-free acceleration framework, is aesthetics-aware spatio-temporal sparsification: it first builds AesMask, a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals, to identify regions that respond strongly to aesthetic descriptors; a SkipSparse mechanism then concentrates computation and guidance on those highly sensitive regions, cutting redundant computation in low-affinity areas; and a lightweight step-level prediction cache reduces temporal redundancy. The result is faster inference with maintained or improved aesthetic quality.

Link: https://arxiv.org/abs/2603.12575
Authors: Xuanhua Yin, Chuanzhi Xu, Haoxian Zhou, Boyu Wei, Weidong Cai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 13 tables, 12 figures


Abstract:Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low-affinity regions evolve smoothly with redundant computation. Based on this insight, we propose AccelAes, a training-free framework that accelerates DiTs through aesthetics-aware spatio-temporal reduction while improving perceptual aesthetics. AccelAes builds AesMask, a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals. When localized computation is feasible, SkipSparse reallocates computation and guidance to masked regions. We further reduce temporal redundancy using a lightweight step-level prediction cache that periodically replaces full Transformer evaluations. Experiments on representative DiT families show consistent acceleration and improved aesthetics-oriented quality. On Lumina-Next, AccelAes achieves a 2.11 \times speedup and improves ImageReward by +11.9% over the dense baseline. Code is available at this https URL.
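Of AccelAes's components, the step-level prediction cache is the simplest to illustrate. A minimal sketch, assuming a plain Euler-style sampling loop and treating `model` as the full Transformer evaluation; the paper's actual refresh cadence and update rule may differ:

```python
def cached_denoise(model, latents, num_steps, refresh_every=3):
    """Step-level prediction cache (sketch).

    Runs the full model only every `refresh_every` steps and reuses
    the last prediction in between, trading a small approximation
    error for fewer Transformer evaluations.
    """
    cached = None
    calls = 0
    for step in range(num_steps):
        if step % refresh_every == 0:
            cached = model(latents, step)  # expensive full evaluation
            calls += 1
        latents = latents - cached / num_steps  # Euler-style update
    return latents, calls
```

With `refresh_every=3`, a 6-step loop invokes the model only twice while still applying an update at every step.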

[CV-104] Lyapunov Stable Graph Neural Flow

[Quick Read]: This paper addresses the vulnerability of Graph Neural Networks (GNNs) to adversarial perturbations in both topology and node features, with the goal of learning robust representations. The key to the solution is bringing integer- and fractional-order Lyapunov stability from control theory into the GNN framework: a defense built on a learnable Lyapunov function and a novel projection mechanism maps the network state into a stable space, constraining the feature-update dynamics themselves. This provides theoretically provable stability guarantees and is orthogonal to existing defense strategies, allowing seamless integration for cumulative robustness.

Link: https://arxiv.org/abs/2603.12557
Authors: Haoyu Chu, Xiaotong Chen, Wei Zhou, Wenjun Cui, Kai Zhao, Shikui Wei, Qiyu Kang
Affiliations: China University of Mining and Technology; University of Science and Technology of China
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Graph Neural Networks (GNNs) are highly vulnerable to adversarial perturbations in both topology and features, making the learning of robust representations a critical challenge. In this work, we bridge GNNs with control theory to introduce a novel defense framework grounded in integer- and fractional-order Lyapunov stability. Unlike conventional strategies that rely on resource-heavy adversarial training or data purification, our approach fundamentally constrains the underlying feature-update dynamics of the GNN. We propose an adaptive, learnable Lyapunov function paired with a novel projection mechanism that maps the network’s state into a stable space, thereby offering theoretically provable stability guarantees. Notably, this mechanism is orthogonal to existing defenses, allowing for seamless integration with techniques like adversarial training to achieve cumulative robustness. Extensive experiments demonstrate that our Lyapunov-stable graph neural flows substantially outperform base neural flows and state-of-the-art baselines across standard benchmarks and various adversarial attack scenarios.
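The general idea of projecting a flow to satisfy a Lyapunov condition can be sketched for a fixed quadratic Lyapunov function V(x) = ½‖x‖². This is an illustration of the technique family only: the paper's V is learnable and its projection mechanism may differ in form.

```python
import numpy as np

def project_stable(x, f, eps=0.1):
    """Lyapunov-stability projection (sketch, fixed V(x) = 0.5*||x||^2).

    Enforces grad_V(x) . f <= -eps * V(x) by subtracting the
    violating component of the proposed update direction f along
    grad_V, so V is strictly decreasing along the projected flow.
    """
    grad_v = x                       # gradient of 0.5*||x||^2
    v = 0.5 * float(x @ x)
    violation = float(grad_v @ f) + eps * v
    if violation > 0 and v > 0:
        f = f - violation * grad_v / float(grad_v @ grad_v)
    return f
```

After projection, the directional derivative of V along the flow is at most -eps*V(x), which is the standard exponential-stability certificate.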

[CV-105] Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation

[Quick Read]: This paper addresses two core issues of current world-model-based Vision-Language-Action (VLA) architectures for robotic manipulation: dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift; and existing sparse methods rely on high-level semantic subtasks or implicit latent states that lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. The key to the proposed StructVLA is reformulating the generative world model into an explicit structured planner that predicts sparse but physically meaningful structured frames instead of dense rollouts or abstract goals. Derived from intrinsic kinematic cues such as gripper transitions and kinematic turning points, these frames capture spatiotemporal milestones closely tied to task progress. A two-stage training paradigm first trains the world model to predict structured frames and then optimizes the mapping from this foresight to low-level actions, providing clear physical guidance that bridges visual planning and motion control, with high success rates and strong generalization in both simulation and the real world.

Link: https://arxiv.org/abs/2603.12553
Authors: Minghao Jin, Mozheng Liao, Mingfei Han, Zhihui Li, Xiaojun Chang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.

[CV-106] CVGL: Causal Learning and Geometric Topology

[Quick Read]: This paper addresses the accuracy drop in cross-view geo-localization (CVGL) caused by large viewpoint differences and confounding factors; the core challenge is building robust, accurate matches between street-view and aerial images in complex real-world scenes. The key to the Causal Learning and Geometric Topology (CLGT) framework lies in two components: a Causal Feature Extractor (CFE) that weakens the influence of confounders via causal intervention, encouraging the model to focus on stable, task-relevant semantics; and a Geometric Topology Fusion (GT Fusion) module that injects Bird's Eye View (BEV) road topology into street-view features to alleviate cross-view inconsistencies from extreme viewpoint changes. A Data-Adaptive Pooling (DA Pooling) module further enhances the representation of semantically rich regions, achieving state-of-the-art performance across multiple benchmarks.

Link: https://arxiv.org/abs/2603.12551
Authors: Songsong Ouyang, Yingying Zhu
Affiliations: Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Cross-view geo-localization (CVGL) aims to estimate the geographic location of a street image by matching it with a corresponding aerial image. This is critical for autonomous navigation and mapping in complex real-world scenarios. However, the task remains challenging due to significant viewpoint differences and the influence of confounding factors. To tackle these issues, we propose the Causal Learning and Geometric Topology (CLGT) framework, which integrates two key components: a Causal Feature Extractor (CFE) that mitigates the influence of confounding factors by leveraging causal intervention to encourage the model to focus on stable, task-relevant semantics; and a Geometric Topology Fusion (GT Fusion) module that injects Bird’s Eye View (BEV) road topology into street features to alleviate cross-view inconsistencies caused by extreme perspective changes. Additionally, we introduce a Data-Adaptive Pooling (DA Pooling) module to enhance the representation of semantically rich regions. Extensive experiments on CVUSA, CVACT, and their robustness-enhanced variants (CVUSA-C-ALL and CVACT-C-ALL) demonstrate that CLGT achieves state-of-the-art performance, particularly under challenging real-world corruptions. Our codes are available at this https URL.

[CV-107] Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation

[Quick Read]: This paper addresses the task-specificity of current deep learning for medical image segmentation, where models excel on individual datasets but generalize poorly across imaging modalities, as well as the high computational cost of over-relying on large pretrained encoders. The key to the solution is Deco-Mamba, a decoder-centric framework for generalized 2D medical image segmentation with a U-Net-like structure: the encoder combines a CNN block and a Transformer for efficient feature extraction, while the decoder introduces novel modules, a Co-Attention Gate (CAG), a Vision State Space Module (VSSM), and a deformable convolutional refinement block, to strengthen multi-scale contextual representation. A windowed distribution-aware KL-divergence loss provides deep supervision across multiple decoding stages, improving performance and generalization while keeping model complexity moderate.

Link: https://arxiv.org/abs/2603.12547
Authors: Fares Bougourzi, Fadi Dornaika, Abdenour Hadid
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Deep learning has achieved remarkable success in medical image segmentation, often reaching expert-level accuracy in delineating tumors and tissues. However, most existing approaches remain task-specific, showing strong performance on individual datasets but limited generalization across diverse imaging modalities. Moreover, many methods focus primarily on the encoder, relying on large pretrained backbones that increase computational complexity. In this paper, we propose a decoder-centric approach for generalized 2D medical image segmentation. The proposed Deco-Mamba follows a U-Net-like structure with a Transformer-CNN-Mamba design. The encoder combines a CNN block and Transformer backbone for efficient feature extraction, while the decoder integrates our novel Co-Attention Gate (CAG), Vision State Space Module (VSSM), and deformable convolutional refinement block to enhance multi-scale contextual representation. Additionally, a windowed distribution-aware KL-divergence loss is introduced for deep supervision across multiple decoding stages. Extensive experiments on diverse medical image segmentation benchmarks yield state-of-the-art performance and strong generalization capability while maintaining moderate model complexity. The source code will be released upon acceptance.
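A windowed distribution-aware KL loss can be sketched as follows, assuming each non-overlapping window of the prediction and ground-truth maps is softmax-normalized into a local distribution before the KL penalty is applied; the windowing scheme and temperature `tau` below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def windowed_kl_loss(pred, target, window=8, tau=1.0):
    """Windowed distribution-aware KL loss (sketch).

    pred, target: 1-D arrays whose length is a multiple of `window`.
    Each window is softmax-normalized, and the mean KL(target || pred)
    over windows is returned, making the loss sensitive to the local
    distribution of activations rather than raw per-pixel errors.
    """
    def soft(x):
        x = x.reshape(-1, window) / tau
        e = np.exp(x - x.max(axis=1, keepdims=True))  # stable softmax
        return e / e.sum(axis=1, keepdims=True)
    p, q = soft(target), soft(pred)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

The loss vanishes when prediction and target induce identical window distributions and grows as their local shapes diverge.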

[CV-108] Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA ICLR2026

[Quick Read]: This paper targets the weakness of current vision-language models (VLMs) in basic spatial reasoning, in particular their brittleness at understanding 2D spatial relationships such as relative position, layout, and counting. The study argues that this limitation is not merely a data problem but is closely tied to key design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. Through controlled diagnostic experiments within the LLaVA framework, it systematically evaluates how different image-encoder objectives (e.g., dense or generative supervision) and the addition of 2D positional encoding affect spatial grounding. The results show that encoder objectives and positional structure significantly shape spatial behavior yet do not fully resolve spatial reasoning, pointing to better spatial modeling mechanisms as the direction for future improvement.

Link: https://arxiv.org/abs/2603.12545
Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A., Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna
Affiliations: Cohere Labs Community; Indian Institute of Science, Bangalore; UIUC; Imperial College London; Eisai Inc.; EleutherAI; Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a poster at ICLR 2026 workshop ICBINB


Abstract:Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.
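To make the 1D-vs-2D positional encoding distinction concrete, here is a minimal sketch of a 2D sinusoidal positional encoding for a patch grid, where half of the channels encode the row index and half the column index instead of a single flattened 1D position. This illustrates the general technique, not the specific variant evaluated in the paper.

```python
import numpy as np

def pos_encoding_2d(h, w, dim):
    """2D sinusoidal positional encoding for an (h, w) patch grid.

    Returns an (h*w, dim) array in row-major token order: the first
    dim/2 channels depend only on the row index, the last dim/2 only
    on the column index, so tokens in the same row share half their
    encoding instead of receiving unrelated flattened 1D positions.
    """
    assert dim % 4 == 0
    def enc(pos, d):
        i = np.arange(d // 2)
        freq = 1.0 / (10000 ** (2 * i / d))
        ang = pos[:, None] * freq[None, :]
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)
    row = enc(np.arange(h, dtype=float), dim // 2)   # (h, dim/2)
    col = enc(np.arange(w, dtype=float), dim // 2)   # (w, dim/2)
    return np.concatenate(
        [np.repeat(row, w, axis=0), np.tile(col, (h, 1))], axis=1)
```

Under a flattened 1D scheme, tokens (0,0) and (0,3) would get entirely different encodings; here they agree on the row half and differ only in the column half.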

[CV-109] Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

[Quick Read]: This paper addresses the fragmented regions, inaccurate boundaries, and even wrong-object predictions that arise when existing referring image segmentation methods apply uniform refinement strategies to diverse natural-language expressions, a problem amplified when pretrained vision backbones are frozen for computational efficiency. The key innovations of the proposed Spatio-Semantic Expert Routing Architecture (SERA) are: 1) SERA-Adapter inserts expression-conditioned adapter modules into the backbone, enhancing spatial coherence and boundary precision through expert-guided refinement and cross-modal attention; 2) SERA-Fusion strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations; 3) a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations, and a parameter-efficient fine-tuning strategy that updates only normalization and bias terms (affecting less than 1% of backbone parameters) keeps optimization stable under frozen encoders.

Link: https://arxiv.org/abs/2603.12538
Authors: Alaa Dalaq, Muzammil Behzad
Affiliations: King Fahd University of Petroleum and Minerals (KFUPM)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.

[CV-110] Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering CVPR2026

[Quick Read]: This paper addresses the poor performance of current Multimodal Large Language Models (MLLMs) on understanding and answering questions grounded in a user's pointing gesture, rooted in the lack of gesture-rich training data and the models' limited ability to reason about fine-grained pointing intent in egocentric video. The key to the solution is the EgoPointVQA dataset and benchmark, comprising 4000 synthetic and 400 real-world videos covering multiple deictic reasoning tasks, together with Hand Intent Tokens (HINT): gesture features extracted by a pretrained 3D hand-keypoint reconstruction model are encoded as tokens and interleaved with the input sequence, explicitly providing spatial and temporal context for interpreting pointing intent. Experiments show that HINT-equipped models outperform existing methods across backbones and scales, with HINT-14B reaching 68.1% average accuracy over 6 tasks, 6.6% above the state-of-the-art InternVL3-14B.

Link: https://arxiv.org/abs/2603.12533
Authors: Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng, Stefanos Zafeiriou
Affiliations: Imperial College London; Huawei Noah's Ark Lab, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026


Abstract:Understanding and answering questions based on a user’s pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others in different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%. To further facilitate the open research, we will release the code, model, and dataset. Project page: this https URL

[CV-111] Curriculum Sampling: A Two-Phase Curriculum for Efficient Training of Flow Matching

[Quick Read]: This paper examines how the timestep sampling strategy p(t) in Flow Matching models shapes the trade-off between training efficiency and generation quality. Prevailing practice favors static middle-biased distributions (e.g., Logit-Normal), which accelerate early convergence but hurt final sample fidelity. By analyzing per-timestep training losses, the paper identifies a U-shaped difficulty profile in which boundary regions remain under-optimized because they are under-sampled. The key contribution is Curriculum Sampling, a two-phase schedule: middle-biased sampling early on for rapid structure learning, then a switch to Uniform sampling to refine the boundary regions. On CIFAR-10 it improves the best FID from 3.85 to 3.22 and reaches peak performance at 100k rather than 150k steps, suggesting that timestep sampling should be treated as an evolving curriculum rather than a fixed hyperparameter.

Link: https://arxiv.org/abs/2603.12517
Authors: Pengwei Sun
Affiliations: Stanford University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Timestep sampling p(t) is a central design choice in Flow Matching models, yet common practice increasingly favors static middle-biased distributions (e.g., Logit-Normal). We show that this choice induces a speed–quality trade-off: middle-biased sampling accelerates early convergence but yields worse asymptotic fidelity than Uniform sampling. By analyzing per-timestep training losses, we identify a U-shaped difficulty profile with persistent errors near the boundary regimes, implying that under-sampling the endpoints leaves fine details unresolved. Guided by this insight, we propose \textbfCurriculum Sampling, a two-phase schedule that begins with middle-biased sampling for rapid structure learning and then switches to Uniform sampling for boundary refinement. On CIFAR-10, Curriculum Sampling improves the best FID from 3.85 (Uniform) to 3.22 while reaching peak performance at 100 k rather than 150 k training steps. Our results highlight that timestep sampling should be treated as an evolving curriculum rather than a fixed hyperparameter.
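The two-phase schedule maps directly to code. A minimal sketch, assuming a Logit-Normal(0, 1) distribution for the middle-biased phase and the paper's reported ~100k-step switch point:

```python
import math
import random

def sample_timestep(step, switch_step=100_000, mu=0.0, sigma=1.0):
    """Curriculum Sampling for Flow Matching timesteps t in (0, 1).

    Phase 1 (step < switch_step): middle-biased Logit-Normal sampling
    for rapid structure learning.
    Phase 2 (step >= switch_step): Uniform sampling so the boundary
    timesteps near t=0 and t=1 also get refined.
    """
    if step < switch_step:
        z = random.gauss(mu, sigma)        # Gaussian in logit space
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> Logit-Normal t
    return random.uniform(0.0, 1.0)
```

During training, the current optimizer step is passed in, so the sampler switches distributions without any other change to the loss.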

[CV-112] Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding

[Quick Read]: This paper targets accurate detection and localization of traumatic injuries in abdominal CT scans, where the core challenge is the severe scarcity of annotated medical data. The key to the solution is combining self-supervised pre-training with semi-supervised detection: a 3D U-Net encoder is first pre-trained with patch-based Masked Image Modeling (MIM) on 1,206 unannotated CT volumes to learn robust anatomical representations, then applied to two downstream tasks: 3D injury detection with a VDETR model using Vertex Relative Position Encoding, and multi-label injury classification. For detection, semi-supervised learning with 2,000 unlabeled volumes and consistency regularization reaches 56.57% validation mAP@0.50 and 45.30% test mAP@0.50 from only 144 labeled samples, a 115% improvement over supervised-only training; for classification, a frozen encoder with 2,244 labeled samples achieves 94.07% test accuracy, confirming the transferability of the self-supervised features. The approach effectively mitigates annotation scarcity, enabling high-performance 3D medical image analysis at low labeling cost.

Link: https://arxiv.org/abs/2603.12514
Authors: Shivam Chaudhary, Sheethal Bhat, Andreas Maier
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures, 6 tables. The code is available at this https URL


Abstract:Accurate detection and localization of traumatic injuries in abdominal CT scans remains a critical challenge in emergency radiology, primarily due to severe scarcity of annotated medical data. This paper presents a label-efficient approach combining self-supervised pre-training with semi-supervised detection for 3D medical image analysis. We employ patch-based Masked Image Modeling (MIM) to pre-train a 3D U-Net encoder on 1,206 CT volumes without annotations, learning robust anatomical representations. The pretrained encoder enables two downstream clinical tasks: 3D injury detection using VDETR with Vertex Relative Position Encoding, and multi-label injury classification. For detection, semi-supervised learning with 2,000 unlabeled volumes and consistency regularization achieves 56.57% validation mAP@0.50 and 45.30% test mAP@0.50 with only 144 labeled training samples, representing a 115% improvement over supervised-only training. For classification, expanding to 2,244 labeled samples yields 94.07% test accuracy across seven injury categories using only a frozen encoder, demonstrating immediately transferable self-supervised features. Our results validate that self-supervised pre-training combined with semi-supervised learning effectively addresses label scarcity in medical imaging, enabling robust 3D object detection with limited annotations.
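Patch-based masking for 3D Masked Image Modeling can be sketched as follows; the 8x8x8 patch size and 60% mask ratio are illustrative assumptions, and the paper's exact settings may differ:

```python
import numpy as np

def mask_volume(volume, patch=(8, 8, 8), mask_ratio=0.6, rng=None):
    """Patch-based masking for 3D MIM pre-training (sketch).

    Splits a (D, H, W) CT volume into non-overlapping patches and
    zeroes out a random mask_ratio fraction; returns the masked
    volume plus the boolean patch mask (True = masked), which selects
    the reconstruction targets.
    """
    rng = rng or np.random.default_rng()
    d, h, w = (s // p for s, p in zip(volume.shape, patch))
    n = d * h * w
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(n * mask_ratio), replace=False)] = True
    mask3d = mask.reshape(d, h, w)
    masked = volume.copy()
    for i in range(d):
        for j in range(h):
            for k in range(w):
                if mask3d[i, j, k]:
                    masked[i*patch[0]:(i+1)*patch[0],
                           j*patch[1]:(j+1)*patch[1],
                           k*patch[2]:(k+1)*patch[2]] = 0.0
    return masked, mask3d
```

The encoder then sees the masked volume, and the pre-training loss reconstructs the voxels inside the masked patches.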

[CV-113] MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens

【速读】:该论文旨在解决自回归扩散模型在长时间视频生成中因滑动窗口缓存机制丢弃历史上下文而导致的保真度下降、身份漂移和运动停滞问题。现有方法通过固定早期token作为注意力锚点来维持一致性,但无法适应不断演化的视频内容。其解决方案的关键在于提出一种无需训练的MemRoPE框架,包含两个协同设计的核心组件:一是记忆令牌(Memory Tokens),利用指数移动平均持续将所有历史键压缩为长短时双流表示,在固定大小缓存中同时保持全局身份与近期动态;二是在线RoPE索引(Online RoPE Indexing),缓存未旋转的键并在注意力计算时动态应用位置嵌入,避免冲突的位置相位干扰。二者相互促进:位置解耦使时间聚合具有明确定义,而聚合机制则使得固定大小缓存适用于无限长度生成任务。

链接: https://arxiv.org/abs/2603.12513
作者: Youngrae Kim,Qixin Hu,C.-C. Jay Kuo,Peter A. Beerel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages main, 3 pages references, 6 pages appendix. Project page: this https URL

点击查看摘要

Abstract:Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
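
MemRoPE 的两个组件可以用几行代码示意:用快、慢两条 EMA 流把历史 key 压缩进固定大小缓存,并缓存未旋转的 key、到注意力计算时才施加 RoPE 旋转。以下为概念性草图(非官方实现;维度、衰减系数与旋转基数均为假设值):

```python
import math

class DualStreamMemory:
    """示意:快/慢两条 EMA 流把历史 key 压缩进固定大小缓存。"""
    def __init__(self, dim, slow=0.99, fast=0.7):
        self.slow_mem = [0.0] * dim   # 长时流:保持全局身份信息
        self.fast_mem = [0.0] * dim   # 短时流:跟踪近期动态
        self.slow, self.fast = slow, fast

    def update(self, key):
        self.slow_mem = [self.slow * m + (1 - self.slow) * k
                         for m, k in zip(self.slow_mem, key)]
        self.fast_mem = [self.fast * m + (1 - self.fast) * k
                         for m, k in zip(self.fast_mem, key)]

def rope_rotate(vec2, pos, theta=10000.0):
    """对二维向量按位置 pos 施加 RoPE 旋转;缓存中保存未旋转的 key,
    注意力计算时再调用本函数,避免冲突的位置相位被聚合进记忆。"""
    angle = pos / theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec2
    return [x * c - y * s, x * s + y * c]

mem = DualStreamMemory(dim=2)
for t in range(100):
    mem.update([1.0, 0.0])              # 恒定内容:快流应更快收敛到它
rotated = rope_rotate(mem.fast_mem, pos=0)  # pos=0 时旋转为恒等变换
```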

[CV-114] Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation

【速读】:该论文旨在解决扩散模型(Diffusion Models, DM)在文本到图像(Text-to-Image, T2I)生成中因随机高斯噪声导致结果不可控的问题,即相同输入提示(prompt)下生成结果具有高度不确定性,迫使用户反复尝试以获取满意图像,形成“赌徒负担”(gambler’s burden)。解决方案的关键在于提出 Naïve PAINE 方法,其核心是利用 T2I 偏好基准数据集直接从初始噪声和提示中预测图像质量数值,并据此筛选出高质量候选噪声,仅将其输入扩散模型进行生成,从而提升生成效率与质量。该方法轻量且可无缝集成至现有 DM 流程中,同时提供关于模型在特定提示下生成能力的反馈。

链接: https://arxiv.org/abs/2603.12506
作者: Joong Ho Kim,Nicholas Thai,Souhardya Saha Dip,Dong Lao,Keith G. Mills
机构: LSU-ATHENA; Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code available at this https URL

点击查看摘要

Abstract:Text-to-Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user-defined inputs. This imposes a gambler’s burden: To perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.
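
Naïve PAINE 的核心流程是“先打分、后生成”:对一批候选初始噪声用质量预测器打分,只把得分最高的若干噪声送入扩散模型。下面的草图仅演示这一选择逻辑,其中 `predicted_quality` 是假设的占位函数,真实方法使用在 T2I 偏好基准上训练的预测模型:

```python
import random

def predicted_quality(noise, prompt_seed):
    """假设的质量预测器:此处仅用确定性伪随机打分代替,
    真实方法从初始噪声与提示直接回归图像质量数值。"""
    rnd = random.Random(prompt_seed + sum(noise))
    return rnd.random()

def select_noises(candidates, prompt_seed, k=2):
    """对候选噪声打分,只保留得分最高的 k 个并送入扩散模型。"""
    scored = sorted(candidates,
                    key=lambda n: predicted_quality(n, prompt_seed),
                    reverse=True)
    return scored[:k]

random.seed(42)
candidates = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
chosen = select_noises(candidates, prompt_seed=7, k=2)
```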

[CV-115] RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution CVPR

【速读】:该论文旨在解决智能手机数字变焦中基于学习的超分辨率(Super-Resolution, SR)模型在真实场景下性能受限的问题,其核心挑战在于缺乏传感器特定的高质量训练数据(即RAW图像与对应的高分辨率(High-Resolution, HR)图像对)。为应对这一问题,论文提出的关键解决方案是:通过精心设计的退化建模(degradation modeling)方法,利用校准后的“逆处理”(unprocessing)流程将公开可用的渲染图像转换至不同智能手机的RAW域,从而生成设备特异性的图像对用于训练。相较于依赖通用先验假设的退化模拟方式,该方法显著缩小了合成数据与真实数据之间的域差距(domain gap),实验证明其可有效提升SR模型在未见设备上的真实表现。

链接: https://arxiv.org/abs/2603.12493
作者: Ali Mosleh,Faraz Ali,Fengjia Zhang,Stavros Tsogkas,Junyong Lee,Alex Levinshtein,Michael S. Brown
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted to The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations.
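
论文的“逆处理”退化建模大致对应“模糊 → 下采样 → RAW 域噪声”的流水线。以下一维示意用盒式模糊代替实际标定的设备 PSF,噪声采用 RAW 域常见的散粒 + 读出噪声模型(方差 = shot·信号 + read);所有参数均为假设值,仅说明流程形态:

```python
import random

def box_blur_1d(signal, radius=1):
    """简化的一维盒式模糊,代替实际标定的设备 PSF(仅作示意)。"""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def downsample(signal, factor=2):
    return signal[::factor]

def add_raw_noise(signal, shot=0.01, read=0.001, rng=None):
    """散粒噪声 + 读出噪声:每个像素的噪声方差为 shot*信号 + read。"""
    rng = rng or random.Random(0)
    return [v + rng.gauss(0.0, (shot * max(v, 0.0) + read) ** 0.5)
            for v in signal]

hr = [0.2, 0.8, 0.8, 0.2, 0.9, 0.1, 0.5, 0.5]   # 假想的 HR 行信号
lr = add_raw_noise(downsample(box_blur_1d(hr), 2))
```

真实流水线在二维 RAW 马赛克图像上进行,并且模糊核与噪声参数均来自对目标设备的标定。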

[CV-116] CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning

【速读】:该论文旨在解决页级书法合成中字符精度与版面布局之间的矛盾问题:现有字符级模型缺乏空间上下文,而页级方法往往牺牲笔触细节。其解决方案的关键在于提出一个统一的可控生成与编辑框架CalliMaster,通过解耦空间规划与内容合成两个阶段来实现平衡。具体而言,受人类“先规划后书写”认知过程启发,设计了一个从文本到布局再到图像的粗粒度到细粒度的流水线(Text → Layout → Image),在单一多模态扩散Transformer中,首先由空间规划阶段预测字符边界框以建立全局空间排布,再将该中间布局作为几何提示输入内容合成阶段,利用流匹配(flow-matching)技术渲染高保真笔触。这种解耦机制不仅实现了最先进的生成质量,还支持语义重规划、修复和取证等下游任务。

链接: https://arxiv.org/abs/2603.12482
作者: Tianshuo Xu,Tiantian Hong,Zhifei Chen,Fei Chao,Ying-cong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present **CalliMaster**, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of "planning before writing", we introduce a coarse-to-fine pipeline **(Text → Layout → Image)** to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework’s extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.

[CV-117] Less Data Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning

【速读】:该论文旨在解决多模态指令微调(Multimodal Instruction Tuning)中因训练数据分布不均导致的计算效率低下问题,即在大规模混合图像-视频数据池中,样本效用差异显著,但现有方法未对数据进行针对性优化。解决方案的关键在于提出目标驱动的数据优化(Goal-Driven Data Optimization, GDO)框架:通过对每个候选样本计算六种描述符,构建面向不同训练目标(如最小损失、多样性、时间敏感性等)的优化子集,从而在固定训练协议下以更少样本实现更快收敛和更高精度。实验表明,GDO 在多个基准测试上均优于固定512k样本的Uni-10x基线,且更强的时间敏感性策略能持续提升长视频理解能力。

链接: https://arxiv.org/abs/2603.12478
作者: Rujie Wu,Haozhe Zhao,Hai Ci,Yizhou Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1× training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving Accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding behavior. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at this https URL.
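
GDO 的数据选择可以概括为:先为每个样本计算若干描述符,再按训练目标给描述符加权、取得分最高的子集。下面用假设的描述符名称与权重演示这一机制(与论文的六种描述符及各策略的具体定义无直接对应):

```python
def score(sample, weights):
    """按目标权重对样本描述符加权求和(描述符名与权重均为假设)。"""
    return sum(weights.get(k, 0.0) * v for k, v in sample["desc"].items())

def build_subset(pool, weights, budget):
    """目标驱动的数据选择:保留得分最高的 budget 个样本。"""
    return sorted(pool, key=lambda s: score(s, weights), reverse=True)[:budget]

pool = [
    {"id": 0, "desc": {"loss": 0.9, "temporal": 0.1, "diversity": 0.5}},
    {"id": 1, "desc": {"loss": 0.2, "temporal": 0.9, "diversity": 0.4}},
    {"id": 2, "desc": {"loss": 0.5, "temporal": 0.8, "diversity": 0.9}},
    {"id": 3, "desc": {"loss": 0.1, "temporal": 0.2, "diversity": 0.1}},
]
# “时间敏感”目标(精神上类似 Temp+):加大 temporal 描述符的权重
temp_plus = build_subset(pool, {"temporal": 1.0, "diversity": 0.2}, budget=2)
```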

[CV-118] Unleashing Video Language Models for Fine-grained HRCT Report Generation MICCAI2026

【速读】:该论文旨在解决高分辨率计算机断层扫描(High-Resolution Computed Tomography, HRCT)影像中生成精准诊断报告的难题,该问题源于3D医学图像中病灶类型的高度多样性与空间稀疏性。现有通用视频语言模型(Video Language Models, VideoLMs)虽在一般场景下展现出强大的时空推理能力,但在特定医疗领域中的高体积影像理解任务中适应性不足。解决方案的关键在于提出AbSteering框架,其核心创新包括:(i) 一种以异常为中心的思维链(Chain-of-Thought)策略,强制模型聚焦于异常区域进行推理;(ii) 一种直接偏好优化(Direct Preference Optimization)目标函数,利用临床上易混淆的异常作为难负样本,提升模型对细微差异的判别能力。实验表明,该方法显著优于依赖大规模CT预训练的专用基础模型,在保持更高检测敏感度的同时有效减少幻觉现象。

链接: https://arxiv.org/abs/2603.12469
作者: Yingying Fang,Huichi Zhou,KinHei Lee,Yijia Wang,Zhenxuan Zhang,Jiahao Huang,Guang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2026

点击查看摘要

Abstract:Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical for clinical workflow, yet it remains a formidable challenge due to the high pathological diversity and spatial sparsity within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, high-volume medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and (ii) a Direct Preference Optimization objective that utilizes clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs possess strong transferability to high-volume medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models, which are pretrained with large-scale CTs, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at this https URL
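
摘要中的偏好优化即标准 DPO 损失,其中“临床易混淆异常”对应的报告被用作 hard negative(即损失中的被拒样本 y_l)。以下为 DPO 损失的标量示意(各 log 概率数值均为假想):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """标准 DPO 损失:-log(sigmoid(beta * 隐式奖励差))。
    logp_w/logp_l 为策略模型对偏好/拒绝回答的 log 概率,ref_* 为参考模型的。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 假想数值:策略模型相对参考模型更偏好正确报告时,损失更小
loss_good = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
loss_bad  = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```

以易混淆异常构造 (y_w, y_l) 对,使模型必须在细粒度差异上拉开隐式奖励差距。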

[CV-119] Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions

【速读】:该论文旨在解决弱监督目标定位(Weakly Supervised Object Localization, WSOL)模型在跨域场景下因分布偏移(distribution shift)导致的性能退化问题,尤其在新器官或不同染色协议与扫描设备条件下,WSOL预测易偏向主导类别,造成伪标签分布严重失衡。解决方案的关键在于提出一种名为 SFDA-DeP 的源域无关(Source-Free)域适应方法,其核心机制是通过模拟机器遗忘(machine unlearning)的思想,将域适应建模为一个迭代识别并修正预测偏差的过程:周期性识别目标域中被过度预测的类别样本,并对高熵(不确定性高)的图像降低其预测置信度,同时保留高置信度预测以稳定决策边界;此外,引入联合优化的像素级分类器以恢复受分布偏移影响的定位特征,从而有效缓解类别偏倚并提升分类与定位性能。

链接: https://arxiv.org/abs/2603.12468
作者: Alexis Guichemerre,Banafsheh Karimian,Soufiane Belharbi,Natacha Gillet,Nicolas Thome,Pourya Shamsolmoali,Mohammadhadi Shateri,Luke McCaffrey,Eric Granger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Weakly Supervised Object Localization (WSOL) models enable joint classification and region-of-interest localization in histology images using only image-class supervision. When deployed in a target domain, distribution shift remains a major cause of performance degradation, especially when applied on new organs or institutions with different staining protocols and scanner characteristics. Under stronger cross-domain shifts, WSOL predictions can become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Source-Free (Unsupervised) Domain Adaptation (SFDA) methods are commonly employed to address domain shift. However, because they rely on self-training, the initial bias is reinforced over training iterations, degrading both classification and localization tasks. We identify this amplification of prediction bias as a primary obstacle to the SFDA of WSOL models in histopathology. This paper introduces SFDA-DeP, a method inspired by machine unlearning that formulates SFDA as an iterative process of identifying and correcting prediction bias. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence for uncertain (high entropy) images, while preserving confident predictions. This process reduces the drift of decision boundaries and bias toward dominant classes. A jointly optimized pixel-level classifier further restores discriminative localization features under distribution shift. Extensive experiments on cross-organ and -center histopathology benchmarks (GlaS, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP consistently improves classification and localization over state-of-the-art SFDA baselines. Code: this https URL
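
SFDA-DeP 的去偏步骤可示意如下:统计伪标签找出被过度预测的类,仅对该类中高熵(不确定)的预测向均匀分布软化,而保留高置信度预测。熵阈值与软化系数均为假设值:

```python
import math
from collections import Counter

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def debias(preds, num_classes=3, tau=0.8, alpha=0.5):
    """示意:找出被过度预测的类,对其中高熵样本的预测向均匀分布软化。"""
    labels = [max(range(num_classes), key=lambda c, p=p: p[c]) for p in preds]
    dominant, count = Counter(labels).most_common(1)[0]
    if count <= len(preds) / num_classes:     # 无明显类别偏倚则不处理
        return preds
    uniform = [1.0 / num_classes] * num_classes
    out = []
    for p, y in zip(preds, labels):
        if y == dominant and entropy(p) > tau:  # 只软化主导类中的不确定预测
            p = [alpha * pi + (1 - alpha) * u for pi, u in zip(p, uniform)]
        out.append(p)
    return out

preds = [[0.5, 0.3, 0.2], [0.45, 0.35, 0.2],
         [0.9, 0.05, 0.05], [0.4, 0.35, 0.25]]   # 伪标签全部偏向类 0
softened = debias(preds)
```

软化后主导类的不确定预测熵升高,自训练时对决策边界的牵引被削弱,而高置信度预测(如第 3 个样本)保持不变。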

[CV-120] Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group

【速读】:该论文旨在解决可旋转卷积神经网络(steerable convolutional neural networks)设计中出现的可旋转核约束(steerable kernel constraint)问题。其解决方案的关键在于构造一组显式实数和复数基底,适用于任意对称群和任意张量类型的特征图,从而避免了繁琐的Clebsch-Gordan系数的数值或解析计算;该方法的核心思想是:先在某一点 $ x_0 $ 上找到满足较简单不变性条件的核基底,再通过可旋转性的定义方程将该基底“牵引”至任意点 $ x = g \cdot x_0 $,实现全局一致性。此策略以最小的技术复杂度实现了通用且高效的可旋转核构造。

链接: https://arxiv.org/abs/2603.12459
作者: Alan Garbarz
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages. Comments are welcome

点击查看摘要

Abstract:We present an alternative way of solving the steerable kernel constraint that appears in the design of steerable equivariant convolutional neural networks. We find explicit real and complex bases which are ready to use, for different symmetry groups and for feature maps of arbitrary tensor type. A major advantage of this method is that it bypasses the need to numerically or analytically compute Clebsch-Gordan coefficients and works directly with the representations of the input and output feature maps. The strategy is to find a basis of kernels that respect a simpler invariance condition at some point x_0, and then *steer* it with the defining equation of steerability to move to some arbitrary point x = g · x_0. This idea has already been mentioned in the literature before, but not advanced in depth and with some generality. Here we describe how it works with minimal technical tools to make it accessible for a general audience.
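
摘要所述策略可以用可旋转 CNN 文献中的标准记号概括(记号为本段补充,非论文原文):核 κ 需对所有群元素满足可旋转性约束;先在基点 x_0 处求解仅对其稳定子群成立的较弱不变性条件,再用定义方程把基底“牵引”到任意点:

```latex
% 可旋转性约束:对所有 g ∈ G
\kappa(g \cdot x) \;=\; \rho_{\mathrm{out}}(g)\,\kappa(x)\,\rho_{\mathrm{in}}(g)^{-1}

% 第一步:在基点 x_0 处,仅要求对其稳定子群
% H = \{\, h \in G : h \cdot x_0 = x_0 \,\} 不变
\kappa(x_0) \;=\; \rho_{\mathrm{out}}(h)\,\kappa(x_0)\,\rho_{\mathrm{in}}(h)^{-1},
\qquad \forall h \in H

% 第二步:用定义方程把该基底“牵引”到任意点 x = g \cdot x_0
\kappa(g \cdot x_0) \;:=\; \rho_{\mathrm{out}}(g)\,\kappa(x_0)\,\rho_{\mathrm{in}}(g)^{-1}
```

其中 ρ_in、ρ_out 分别是输入与输出特征图的群表示;由于只在 x_0 处解一个较小的线性约束,无需显式计算 Clebsch-Gordan 系数。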

[CV-121] Revisiting Model Stitching In the Foundation Model Era CVPR2023

【速读】:该论文旨在解决异构视觉基础模型(Vision Foundation Models, VFMs)之间的可缝合性(stitchability)问题,即如何将不同训练目标、数据来源和模态混合(如CLIP、DINOv2、SigLIP 2)的VFMs通过轻量级缝合层(stitch layer)连接起来,并保持下游任务性能。其关键解决方案在于引入一种系统性的缝合协议,核心创新是:在目标模型的倒数第二层施加一个简单的特征匹配损失(feature-matching loss),而非传统的中间特征对齐或端到端任务损失优化;这一策略显著提升了异构VFMs的缝合成功率,尤其在浅层缝合点上表现突出,并使得深度缝合后的模型能在仅增加少量推理开销的前提下超越任一原始模型性能。基于此,作者进一步提出VFM Stitch Tree(VST),实现多VFMs共享早期特征提取层的同时保留各自后期结构,为多模态大语言模型(multimodal LLMs)提供可控的精度-延迟权衡方案。

链接: https://arxiv.org/abs/2603.12433
作者: Zheda Mai,Ke Zhang,Fu-En Wang,Zixiao Ken Wang,Albert Y. C. Chen,Lu Xia,Min Sun,Wei-Lun Chao,Cheng-Hao Kuo
机构: Amazon(亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by CVPR 2023

点击查看摘要

Abstract:Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model’s penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
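
论文的关键发现之一是:缝合层应通过匹配目标模型倒数第二层的特征来训练,而不是在缝合点处对齐中间特征。下面的线性玩具例子说明这一损失的形式(`tail` 代表目标模型缝合点之后的部分,全部数值均为假设):

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def penultimate_match_loss(W, f_src, f_tgt, tail):
    """缝合层 W 把源模型特征映射进目标模型;损失在目标模型
    倒数第二层特征上对齐,而非在缝合点本身对齐。"""
    z = matvec(tail, matvec(W, f_src))   # 缝合路径得到的倒数第二层特征
    t = matvec(tail, f_tgt)              # 目标模型自身的倒数第二层特征
    return sum((a - b) ** 2 for a, b in zip(z, t))

# 假想设置:真实缝合关系为 f_tgt = A f_src,检验该损失在 W = A 处取零
A = [[2.0, 0.0], [0.0, 0.5]]
tail = [[1.0, 1.0], [0.0, 1.0]]
f_src = [1.0, 2.0]
f_tgt = matvec(A, f_src)

loss_identity = penultimate_match_loss([[1.0, 0.0], [0.0, 1.0]], f_src, f_tgt, tail)
loss_true = penultimate_match_loss(A, f_src, f_tgt, tail)
```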

[CV-122] Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

【速读】:该论文旨在解决当前外科视觉-语言模型(Vision-Language Model, VLM)在手术场景理解中缺乏可解释推理链的问题,即现有模型虽能提供准确预测,但无法生成符合外科临床逻辑的逐步推理过程;同时,通用推理模型因缺少领域特定知识,在组合式外科任务上表现不佳。解决方案的关键在于提出Surg-R1模型,其核心创新包括:构建一个三层推理层次结构(感知定位、关系理解与情境推理),以系统化分解外科理解任务;建立包含32万条推理对的全球最大外科思维链数据集;并通过四阶段训练流程(从监督微调到群体相对策略优化及迭代自我改进)实现模型性能的持续提升。该方法显著提升了模型在多个外科任务上的准确性与可解释性,尤其在外部多中心验证中表现出优于主流商用和专用模型的综合性能。

链接: https://arxiv.org/abs/2603.12430
作者: Jian Jiang,Chenxi Lin,Yiming Gu,Zengyi Qin,Zhitao Zeng,Kun Yuan,Yonghao Long,Xiang Xia,Cheng Yuan,Yuqi Wang,Zijie Yue,Kunyi Yang,Yuting Zhang,Zhu Zhuo,Dian Qin,Xin Wang,NG Chi Fai,Brian Anthony,Daguang Xu,Guy Rosman,Ozanan Meireles,Zizhen Zhang,Nicolas Padoy,Hesheng Wang,Qi Dou,Yueming Jin,Yutong Ban
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score (64.9%) on public benchmarks versus Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), outperforming both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, with a 15.2 percentage point improvement over the strongest surgical baseline on external validation.

[CV-123] A Neuro-Symbolic Framework Combining Inductive and Deductive Reasoning for Autonomous Driving Planning

【速读】:该论文旨在解决当前端到端自动驾驶模型在复杂长尾场景中缺乏可解释性和绝对安全性保障的问题,其核心瓶颈在于纯数据驱动的归纳推理(inductive reasoning)导致的“黑箱”特性。解决方案的关键在于提出一种新颖的神经符号轨迹规划框架,通过将严格的演绎推理(deductive reasoning)无缝集成到端到端神经网络中:首先利用大语言模型(Large Language Model, LLM)动态提取场景规则,再借助答案集编程(Answer Set Programming, ASP)求解器进行确定性逻辑仲裁,生成安全且可追溯的离散驾驶决策;同时引入决策条件解码机制,将高阶逻辑决策映射为可学习嵌入向量,并结合微分化的运动学自行车模型(Kinematic Bicycle Model, KBM)约束物理初始速度与规划查询,从而在保证运动学可行性的同时实现高度透明的轨迹生成。

链接: https://arxiv.org/abs/2603.12421
作者: Hongyan Wei,Wael AbdAlmageed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. 16 pages, 2 figures

点击查看摘要

Abstract:Existing end-to-end autonomous driving models rely heavily on purely data-driven inductive reasoning. This “black-box” nature leads to a lack of interpretability and absolute safety guarantees in complex, long-tail scenarios. To overcome this bottleneck, we propose a novel neuro-symbolic trajectory planning framework that seamlessly integrates rigorous deductive reasoning into end-to-end neural networks. Specifically, our framework utilizes a Large Language Model (LLM) to dynamically extract scene rules and employs an Answer Set Programming (ASP) solver for deterministic logical arbitration, generating safe and traceable discrete driving decisions. To bridge the gap between discrete symbols and continuous trajectories, we introduce a decision-conditioned decoding mechanism that transforms high-level logical decisions into learnable embedding vectors, simultaneously constraining the planning query and the physical initial velocity of a differentiable Kinematic Bicycle Model (KBM). By combining KBM-generated physical baseline trajectories with neural residual corrections, our approach inherently guarantees kinematic feasibility while ensuring a high degree of transparency. On the nuScenes benchmark, our method comprehensively outperforms the state-of-the-art baseline MomAD, reducing the L2 mean error to 0.57 m, decreasing the collision rate to 0.075%, and optimizing trajectory prediction consistency (TPC) to 0.47 m.
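
摘要中的运动学自行车模型(KBM)可用如下欧拉积分示意(后轴参照系;轴距、步长等参数均为假设值)。论文在这一物理基线轨迹之上再叠加神经残差修正:

```python
import math

def kbm_step(state, accel, steer, L=2.7, dt=0.1):
    """后轴参照的运动学自行车模型一步欧拉积分。
    state = (x, y, theta, v);steer 为前轮转角,L 为轴距。"""
    x, y, theta, v = state
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += v * math.tan(steer) / L * dt
    v += accel * dt
    return (x, y, theta, v)

# 由离散决策(如“保持车道、匀速”)给定初速后,滚动生成物理基线轨迹
state = (0.0, 0.0, 0.0, 5.0)
traj = [state]
for _ in range(10):
    state = kbm_step(state, accel=0.0, steer=0.0)
    traj.append(state)
```

基线轨迹天然满足运动学可行性;离散逻辑决策通过约束初速与规划查询进入该模型,最终轨迹为基线加上可学习的残差。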

[CV-124] ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection

【速读】:该论文旨在解决开放词汇目标检测(Open-Vocabulary Object Detection)模型在域偏移(domain shift)场景下性能显著下降的问题,尤其是在缺乏标注数据的目标域(如夜间或雾天场景)中无法直接进行微调的情况。解决方案的关键在于提出Aligned Basis Relocation for Adaptation (ABRA),其将知识迁移建模为预训练检测器权重空间中的几何传输问题,通过对齐源域和目标域的专家特征表示,实现类特定检测知识的跨域传输,从而在无目标域训练样本的情况下完成类级专长的“ teleportation ”(远距离迁移)。

链接: https://arxiv.org/abs/2603.12409
作者: Mattia Bernardi,Chiara Cappellino,Matteo Mosconi,Enver Sangineto,Angelo Porrello,Simone Calderara
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although recent Open-Vocabulary Object Detection architectures, such as Grounding DINO, demonstrate strong zero-shot capabilities, their performance degrades significantly under domain shifts. Moreover, many domains of practical interest, such as nighttime or foggy scenes, lack large annotated datasets, preventing direct fine-tuning. In this paper, we introduce Aligned Basis Relocation for Adaptation (ABRA), a method that transfers class-specific detection knowledge from a labeled source domain to a target domain where no training images containing these classes are accessible. ABRA formulates this adaptation as a geometric transport problem in the weight space of a pretrained detector, aligning source and target domain experts to transport class-specific knowledge. Extensive experiments across challenging domain shifts demonstrate that ABRA successfully teleports class-level specialization under multiple adverse conditions. Our code will be made public upon acceptance.

[CV-125] SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs CVPR2026

【速读】:该论文旨在解决视频多模态大语言模型(Video MLLMs)在像素级定位任务中面临的时空精度与一致性难题,尤其是现有方法依赖静态分割标记([SEG])导致的空间漂移、目标身份混淆和初始化不稳定等问题。其核心解决方案是提出SPARROW框架,通过两个关键设计实现突破:一是引入目标特定追踪特征(Target-Specific Tracked Features, TSF),在训练过程中注入时序对齐的参照线索以增强时间稳定性;二是采用双提示机制解码边界框([BOX])和分割([SEG])标记,融合几何先验与语义定位信息,从而提升空间精度与时间连贯性。

链接: https://arxiv.org/abs/2603.12382
作者: Mohamad Alansari,Naufal Suryanto,Divya Velayudhan,Sajid Javed,Naoufel Werghi,Muzammal Naseer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026; Project page: this https URL Repository: this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 QA pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: this https URL

[CV-126] Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

【速读】:该论文旨在解决跨域图像分类在关键医疗任务中的泛化难题,例如基于眼底图像的糖尿病视网膜病变(Diabetic Retinopathy, DR)分级和静息态fMRI癫痫发作起始区(Seizure Onset Zone, SOZ)检测中,由于未知因果因素导致的域间差异问题。现有方法难以在缺乏直接元数据或协议信息的情况下客观评估域间差异,从而限制了模型的泛化能力。解决方案的关键在于提出领域一致性边界(Domain Conformal Bounds, DCB)理论框架以量化未知因果因素引起的域差异,并进一步设计GenEval方法,利用多模态视觉语言模型(Vision Language Models, VLM)结合基础模型(如MedGemma-4B)与人类知识通过低秩适配(Low-Rank Adaptation, LoRA)来弥合因果差距,从而提升单源域泛化(Single-Source Domain Generalization, SDG)性能,在8个DR和2个SOZ数据集上分别实现平均准确率69.2%和81%,显著优于最强基线模型。

链接: https://arxiv.org/abs/2603.12369
作者: Ayan Banerjee,Kuntal Thakur,Sandeep Gupta
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal Vision Language Models (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracy of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.
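
摘要中的 LoRA 微调可用几行代码说明:冻结预训练权重 W,只训练低秩增量 B·A,前向为 y = Wx + α·B(Ax)。以下为秩 1 的玩具示意(矩阵数值均为假设,与 MedGemma-4B 的实际适配无关):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA 前向:y = W x + alpha * B (A x)。W 冻结,仅 A、B 参与训练。"""
    def matvec(M, v):
        return [sum(m * xi for m, xi in zip(row, v)) for row in M]
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + alpha * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # 冻结的预训练权重(此处取 2x2 单位阵)
A = [[0.1, 0.2]]               # 秩 1:A 为 1x2
B = [[1.0], [0.5]]             #        B 为 2x1
y = lora_forward([1.0, 1.0], W, A, B)
```

低秩结构使可训练参数量远小于全量微调,便于把人类知识注入冻结的基础模型。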

[CV-127] Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

【速读】:该论文旨在解决结构化剪枝(structured pruning)中传统基于权重幅度或激活感知的静态启发式方法在深度视觉网络中存在幅度偏差(magnitude bias)的问题,导致关键功能路径无法有效保留,从而引发模型性能显著下降。解决方案的关键在于提出一种受交替梯度流(Alternating Gradient Flow, AGF)启发的解耦动力学范式,通过绝对特征空间泰勒展开(absolute feature-space Taylor expansion)精确捕捉网络结构的“动力学效用”(kinetic utility),实现对结构路径的更合理评估与保留。该方法不仅在极端稀疏条件下避免了拓扑崩溃,还揭示了视觉 Transformer(ViT)中的稀疏瓶颈现象,并设计了一种结合离线结构搜索与在线执行的混合路由框架,利用零成本物理先验实现高效动态推理,在 ImageNet-1K 和 ImageNet-100 上均展现出优越的压缩效率与精度平衡。

链接: https://arxiv.org/abs/2603.12354
作者: Tianhao Qian,Zhuoxuan Li,Jinde Cao,Xinli Shi,Hanjie Liu,Leszek Rutkowski
机构: Southeast University (东南大学); Systems Research Institute of the Polish Academy of Sciences (波兰科学院系统研究所); AGH University of Krakow (克拉科夫AGH大学); SAN University (SAN大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 11 pages, 6 figures, 9 tables

点击查看摘要

Abstract:Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network’s structural “kinetic utility”. First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92×) without sacrificing the full-model accuracy.
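
摘要批评的“幅度偏差”可以用一阶泰勒重要性直观说明:通道效用取 |激活 × 梯度| 之和,而非只看激活幅度;下例中激活最大的通道因梯度极小而被剪除。“绝对特征空间泰勒展开”的精确形式以论文为准,此处给出常见的逐元素绝对值版本(假设):

```python
def taylor_utility(activations, gradients):
    """一阶泰勒重要性:对每个通道,在各空间位置上累加 |激活 * 梯度|,
    近似衡量移除该通道引起的损失变化。"""
    return [sum(abs(a * g) for a, g in zip(act, grad))
            for act, grad in zip(activations, gradients)]

def prune_by_utility(utilities, keep_ratio=0.5):
    """保留效用最高的前 keep_ratio 比例通道,返回保留通道的索引。"""
    k = max(1, int(len(utilities) * keep_ratio))
    order = sorted(range(len(utilities)),
                   key=lambda i: utilities[i], reverse=True)
    return sorted(order[:k])

# 4 个通道、每通道 3 个空间位置的假想激活与梯度
acts  = [[0.9, 1.1, 1.0], [0.01, 0.02, 0.01],
         [0.5, -0.6, 0.4], [2.0, 2.1, 1.9]]      # 通道 3 的激活幅度最大
grads = [[0.1, 0.1, 0.1], [0.5, 0.5, 0.5],
         [0.2, 0.2, 0.2], [0.001, 0.001, 0.001]]  # 但其梯度几乎为零
util = taylor_utility(acts, grads)
kept = prune_by_utility(util, keep_ratio=0.5)
```

纯幅度准则会保留通道 3,而梯度加权的效用把它判为低效用通道,这正是“功能路径优先于幅度”的直观含义。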

[CV-128] DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

【速读】:该论文旨在解决扩散模型在图像压缩中面临的采样开销过高和内存占用大的问题,尤其是现有基于U-Net架构的扩散编解码器因层级下采样限制只能在浅层潜在空间(通常仅8倍空间下采样)运行,导致计算冗余。其解决方案的关键在于提出DiT-IC——一种对齐扩散Transformer(Aligned Diffusion Transformer for Image Compression),通过将U-Net替换为可在32倍下采样分辨率下完整执行扩散过程的Diffusion Transformer,并引入三项关键对齐机制:(1) 方差引导的重建流程以适应潜在不确定性实现高效重建;(2) 自蒸馏对齐确保与编码器定义的潜在几何一致性,支持单步扩散;(3) 潜在条件引导替代文本提示,实现无需文本输入的推理。这一设计使DiT-IC在保持最优感知质量的同时,解码速度提升达30倍且内存消耗显著降低,可实现在16 GB显存设备上重建2048×2048图像。

链接: https://arxiv.org/abs/2603.13162
作者: Junqi Shi,Ming Lu,Xingchen Li,Anle Ke,Ruiqi Zhang,Zhan Ma
机构: Nanjing University (南京大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.

[CV-129] Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning

【速读】:该论文旨在解决临床急性卒中磁共振成像(MRI)中因数据稀缺导致的加速重建难题,即在仅有少量目标域(如FLAIR序列)全采样数据的情况下,如何实现高质量、高效率的图像重建以缩短扫描时间。其解决方案的关键在于采用“基础模型预训练+针对性微调”的策略:首先在大规模公开脑部MRI数据集(fastMRI)上对扩散概率生成模型(DPMs)进行预训练,随后仅用20名患者的FLAIR数据进行精细微调,结合优化的学习率和微调时长,使模型在多倍加速因子下达到与使用大量目标域数据训练的模型相当的重建性能,且在盲法临床读片中证实其图像质量与标准护理相当,显著降低了对特定应用场景大数据集的依赖。

链接: https://arxiv.org/abs/2603.13007
作者: Yamin Arefeen,Sidharth Kumar,Steven Warach,Hamidreza Saber,Jonathan Tamir
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Purpose: To develop a data-efficient strategy for accelerated MRI reconstruction with Diffusion Probabilistic Generative Models (DPMs) that enables faster scan times in clinical stroke MRI when only limited fully-sampled data samples are available. Methods: Our simple training strategy, inspired by the foundation model paradigm, first trains a DPM on a large, diverse collection of publicly available brain MRI data in fastMRI and then fine-tunes on a small dataset from the target application using carefully selected learning rates and fine-tuning durations. The approach is evaluated on controlled fastMRI experiments and on clinical stroke MRI data with a blinded clinical reader study. Results: DPMs pre-trained on approximately 4000 subjects with non-FLAIR contrasts and fine-tuned on FLAIR data from only 20 target subjects achieve reconstruction performance comparable to models trained with substantially more target-domain FLAIR data across multiple acceleration factors. Experiments reveal that moderate fine-tuning with a reduced learning rate yields improved performance, while insufficient or excessive fine-tuning degrades reconstruction quality. When applied to clinical stroke MRI, a blinded reader study involving two neuroradiologists indicates that images reconstructed using the proposed approach from 2 \times accelerated data are non-inferior to standard-of-care in terms of image quality and structural delineation. Conclusion: Large-scale pre-training combined with targeted fine-tuning enables DPM-based MRI reconstruction in data-constrained, accelerated clinical stroke MRI. The proposed approach substantially reduces the need for large application-specific datasets while maintaining clinically acceptable image quality, supporting the use of foundation-inspired diffusion models for accelerated MRI in targeted applications. 

[CV-130] Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration

【速读】:该论文旨在解决传统脑萎缩定量方法SIENA(Structural Image Evaluation, using Normalization, of Atrophy)在执行过程中对经典图像处理步骤(如去颅骨分割和组织分割)的依赖问题,这些步骤的失败会传播至整个流程并导致脑体积变化(PBVC)估计偏差。解决方案的关键在于通过引入基于深度学习的模块——SynthStrip(用于去颅骨分割)和SynthSeg(用于组织分割)来替代SIENA中性能较弱的环节,从而在不破坏其可解释性框架的前提下提升整体鲁棒性和准确性。实验表明,替换去颅骨分割模块效果最显著,能增强PBVC与临床及结构退变指标的相关性,并大幅提高扫描顺序一致性(误差降低达99.1%),同时GPU加速版本还可减少46%的运行时间。

链接: https://arxiv.org/abs/2603.12951
作者: Riccardo Raciti,Lemuel Puglisi,Francesco Guarnera,Daniele Ravì,Sebastiano Battiato
机构: University of Catania (卡塔尼亚大学); Alzheimer’s Disease Neuroimaging Initiative (阿尔茨海默病神经影像计划)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Percentage Brain Volume Change (PBVC) derived from Magnetic Resonance Imaging (MRI) is a widely used biomarker of brain atrophy, with SIENA among the most established methods for its estimation. However, SIENA relies on classical image processing steps, particularly skull stripping and tissue segmentation, whose failures can propagate through the pipeline and bias atrophy estimates. In this work, we examine whether targeted deep learning substitutions can improve SIENA while preserving its established and interpretable framework. To this end, we integrate SynthStrip and SynthSeg into SIENA and evaluate three pipeline variants on the ADNI and PPMI longitudinal cohorts. Performance is assessed using three complementary criteria: correlation with longitudinal clinical and structural decline, scan-order consistency, and end-to-end runtime. Replacing the skull-stripping module yields the most consistent gains: in ADNI, it substantially strengthens associations between PBVC and multiple measures of disease progression relative to the standard SIENA pipeline, while across both datasets it markedly improves robustness under scan reversal. The fully integrated pipeline achieves the strongest scan-order consistency, reducing the error by up to 99.1%. In addition, GPU-enabled variants reduce execution time by up to 46% while maintaining CPU runtimes comparable to standard SIENA. Overall, these findings show that deep learning can meaningfully strengthen established longitudinal atrophy pipelines when used to reinforce their weakest image processing steps. More broadly, this study highlights the value of modularly modernizing clinically trusted neuroimaging tools without sacrificing their interpretability. Code is publicly available at this https URL.

[CV-131] GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification

【速读】:该论文旨在解决青光眼(glaucoma)病变评估中因单一模态信息局限性导致的诊断精度不足问题,尤其是在不同疾病阶段难以实现精准分类与辅助决策的挑战。其解决方案的关键在于构建首个公开的三模态青光眼数据集GLEAM(包含扫描激光眼底成像、视盘周围OCT图像及视野模式偏离图),并提出分层注意力掩码建模(hierarchical attentive masked modeling, HAMM)框架,通过分层注意力编码器与轻量解码器实现跨模态特征的有效融合与聚焦,从而增强多模态互补信息的利用效率,提升青光眼分期分类的准确性。

链接: https://arxiv.org/abs/2603.12800
作者: Jiao Wang,Chi Liu,Yiying Zhang,Hongchen Luo,Zhifen Guo,Ying Hu,Ke Xu,Jing Zhou,Hongyan Xu,Ruiting Zhou,Man Tang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder.

[CV-132] Deep Learning Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging

【速读】:该论文旨在解决糖尿病患者频繁监测血糖水平时依赖侵入性血液检测所带来的负担问题,提出了一种基于多视角巩膜血管图像的非侵入式糖代谢状态评估方法。其关键解决方案是构建ScleraGluNet——一个融合多方向巩膜血管图像的深度学习框架,通过并行卷积分支提取特征、Manta Ray Foraging Optimization(MRFO)优化特征表示,并利用基于Transformer的跨视角注意力机制进行特征融合,从而实现三类糖代谢状态分类(正常、控制良好糖尿病、高血糖糖尿病)和空腹血浆葡萄糖(FPG)的连续估计。

链接: https://arxiv.org/abs/2603.12715
作者: Muhammad Ahmed Khan,Manqiang Peng,Ding Lin,Saif Ur Rehman Khan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Regular monitoring of glycemic status is essential for diabetes management, yet conventional blood-based testing can be burdensome for frequent assessment. The sclera contains superficial microvasculature that may exhibit diabetes related alterations and is readily visible on the ocular surface. We propose ScleraGluNet, a multiview deep-learning framework for three-class metabolic status classification (normal, controlled diabetes, and high-glucose diabetes) and continuous fasting plasma glucose (FPG) estimation from multidirectional scleral vessel images. The dataset comprised 445 participants (150/140/155) and 2,225 anterior-segment images acquired from five gaze directions per participant. After vascular enhancement, features were extracted using parallel convolutional branches, refined with Manta Ray Foraging Optimization (MRFO), and fused via transformer-based cross-view attention. Performance was evaluated using subject-wise five-fold cross-validation, with all images from each participant assigned to the same fold. ScleraGluNet achieved 93.8% overall accuracy, with one-vs-rest AUCs of 0.971,0.956, and 0.982 for normal, controlled diabetes, and high-glucose diabetes, respectively. For FPG estimation, the model achieved MAE = 6.42 mg/dL and RMSE = 7.91 mg/dL, with strong correlation to laboratory measurements (r = 0.983; R2 = 0.966). Bland Altman analysis showed a mean bias of +1.45 mg/dL with 95% limits of agreement from -8.33 to +11.23 mg/dL. These results support multidirectional scleral vessel imaging with multiview learning as a promising noninvasive approach for glycemic assessment, warranting multicenter validation before clinical deployment.
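
The abstract's agreement figures (mean bias +1.45 mg/dL, 95% limits of agreement -8.33 to +11.23 mg/dL) follow the standard Bland-Altman construction: the bias is the mean of the paired differences and the limits are the bias plus or minus 1.96 standard deviations. A generic sketch of that computation (for reference only, not the authors' code):

```python
import statistics

def bland_altman(estimates, references):
    """Bland-Altman agreement: mean bias and 95% limits of agreement
    (bias +/- 1.96 * SD of the paired differences)."""
    diffs = [e - r for e, r in zip(estimates, references)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```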

[CV-133] Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation

【速读】:该论文旨在解决多模态磁共振成像(MRI)翻译任务中,现有扩散模型在处理任意缺失模态场景时存在的解剖结构不一致和纹理细节退化问题。解决方案的关键在于提出了一种基于潜在空间的多模态MRI翻译框架MSG-LDM,其核心创新是引入了风格-结构解耦机制,在潜在空间中显式分离模态特异性风格特征与共享结构表示,并在多尺度特征空间中联合建模低频解剖布局与高频边界细节;同时通过设计风格一致性损失和结构感知损失,抑制模态特异性风格对结构表征的干扰,从而提升结构重建的稳定性和准确性。

链接: https://arxiv.org/abs/2603.12581
作者: Jianqiang Lin(1 and 2),Zhiqiang Shen(1 and 2),Peng Cao(1, 2 and 3),Jinzhu Yang(1, 2 and 3),Osmar R. Zaiane(4),Xiaoli Liu(5) ((1) Northeastern University, Shenyang, China, (2) Key Laboratory of Intelligent Computing in Medical Image, Shenyang, China, (3) National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Shenyang, China, (4) University of Alberta, Edmonton, Canada, (5) AiShiWeiLai AI Research, Beijing, China)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although diffusion models have achieved remarkable progress in multi-modal magnetic resonance imaging (MRI) translation tasks, existing methods still tend to suffer from anatomical inconsistencies or degraded texture details when handling arbitrary missing-modality scenarios. To address these issues, we propose a latent diffusion-based multi-modal MRI translation framework, termed MSG-LDM. By leveraging the available modalities, the proposed method infers complete structural information, which preserves reliable boundary details. Specifically, we introduce a style–structure disentanglement mechanism in the latent space, which explicitly separates modality-specific style features from shared structural representations, and jointly models low-frequency anatomical layouts and high-frequency boundary details in a multi-scale feature space. During the structure disentanglement stage, high-frequency structural information is explicitly incorporated to enhance feature representations, guiding the model to focus on fine-grained structural cues while learning modality-invariant low-frequency anatomical representations. Furthermore, to reduce interference from modality-specific styles and improve the stability of structure representations, we design a style consistency loss and a structure-aware loss. Extensive experiments on the BraTS2020 and WMH datasets demonstrate that the proposed method outperforms existing MRI synthesis approaches, particularly in reconstructing complete structures. The source code is publicly available at this https URL.

[CV-134] Variational Garrote for Sparse Inverse Problems

【速读】:该论文旨在解决稀疏正则化在欠定逆问题中因先验假设与数据真实稀疏结构不匹配而导致的重建性能下降问题。其核心解决方案是通过比较传统的L1正则化与基于变分二值门控变量近似L0稀疏性的Variational Garrote(VG)方法,验证更贴近“尖刺-平板”(spike-and-slab)结构的稀疏先验在强欠定场景下的优势。实验表明,VG在支持集恢复至关重要的强欠定条件下通常能实现更低的最小泛化误差和更高的稳定性,凸显了先验与数据对齐的重要性,并为变分L0类方法在不同信息瓶颈下的行为提供了实证依据。

链接: https://arxiv.org/abs/2603.12562
作者: Kanghun Lee,Hyungjoon Soh,Junghyo Jo
机构: Seoul National University (首尔国立大学)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Sparse regularization plays a central role in solving inverse problems arising from incomplete or corrupted measurements. Different regularizers correspond to different prior assumptions about the structure of the unknown signal, and reconstruction performance depends on how well these priors match the intrinsic sparsity of the data. This work investigates the effect of sparsity priors in inverse problems by comparing conventional L1 regularization with the Variational Garrote (VG), a probabilistic method that approximates L0 sparsity through variational binary gating variables. A unified experimental framework is constructed across multiple reconstruction tasks including signal resampling, signal denoising, and sparse-view computed tomography. To enable consistent comparison across models with different parameterizations, regularization strength is swept across wide ranges and reconstruction behavior is analyzed through train-generalization error curves. Experiments reveal characteristic bias-variance tradeoff patterns across tasks and demonstrate that VG frequently achieves lower minimum generalization error and improved stability in strongly underdetermined regimes where accurate support recovery is critical. These results suggest that sparsity priors closer to spike-and-slab structure can provide advantages when the underlying coefficient distribution is strongly sparse. The study highlights the importance of prior-data alignment in sparse inverse problems and provides empirical insights into the behavior of variational L0-type methods across different information bottlenecks.
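
The mechanism the abstract describes, approximating L0 sparsity through variational binary gates, can be illustrated with a toy parameterization: each coefficient is a weight multiplied by a gate probability, and the expected L0 penalty is the sum of gate probabilities. This is a schematic of the idea only, not the paper's estimator:

```python
import math

def gate_probs(gammas):
    """Variational gate activation probabilities (sigmoid of gate logits)."""
    return [1.0 / (1.0 + math.exp(-g)) for g in gammas]

def effective_coeffs(weights, gammas):
    """Effective coefficients: weight times gate probability (soft L0 masking)."""
    return [m * w for m, w in zip(gate_probs(gammas), weights)]

def expected_l0(gammas, lam=1.0):
    """Expected L0 penalty: lambda times the expected number of active gates."""
    return lam * sum(gate_probs(gammas))
```

As the gate logits are driven strongly negative, coefficients are switched off smoothly, which is what lets gradient-based inference approximate a spike-and-slab prior.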

[CV-135] Unmasking Biases and Reliability Concerns in Convolutional Neural Networks Analysis of Cancer Pathology Images

【速读】:该论文试图解决的问题是:当前用于癌症病理诊断的卷积神经网络(Convolutional Neural Networks, CNNs)在评估过程中可能存在不可靠性,因其决策过程缺乏可解释性,且现有基准数据集可能隐含偏差,导致模型性能被高估。解决方案的关键在于通过构造不含临床信息的背景裁剪数据集(cropped segments from the background of original images),对多个CNN架构进行对比实验,验证其在无生物医学内容下的分类准确率是否仍显著高于随机水平。结果表明,部分CNN模型在无临床信息的数据上仍能达到高达93%的准确率,揭示了模型对数据偏倚的高度敏感性,从而指出当前机器学习评估范式在癌症病理领域存在系统性风险。

链接: https://arxiv.org/abs/2603.12445
作者: Michael Okonoda,Eder Martinez,Abhilekha Dalal,Lior Shamir
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Electronics, published

点击查看摘要

Abstract:Convolutional Neural Networks have shown promising effectiveness in identifying different types of cancer from radiographs. However, the opaque nature of CNNs makes it difficult to fully understand the way they operate, limiting their assessment to empirical evaluation. Here we study the soundness of the standard practices by which CNNs are evaluated for the purpose of cancer pathology. Thirteen highly used cancer benchmark datasets were analyzed, using four common CNN architectures and different types of cancer, such as melanoma, carcinoma, colorectal cancer, and lung cancer. We compared the accuracy of each model with that of datasets made of cropped segments from the background of the original images that do not contain clinically relevant content. Because the rendered datasets contain no clinical information, the null hypothesis is that the CNNs should provide mere chance-based accuracy when classifying these datasets. The results show that the CNN models provided high accuracy when using the cropped segments, sometimes as high as 93%, even though they lacked biomedical information. These results show that some CNN architectures are more sensitive to bias than others. The analysis shows that the common practices of machine learning evaluation might lead to unreliable results when applied to cancer pathology. These biases are very difficult to identify, and might mislead researchers as they use available benchmark datasets to test the efficacy of CNN methods.
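
The control experiment, classifying crops taken from image backgrounds that carry no clinical content, can be sketched with a simple corner-cropping helper. This is an illustration of the idea; the authors' exact cropping protocol may differ:

```python
def background_crops(image, size):
    """Cut the four corner patches of an image (given as a list of rows)
    to build a control set with no clinically relevant content."""
    h, w = len(image), len(image[0])
    coords = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size)]
    return [[row[c:c + size] for row in image[r:r + size]] for r, c in coords]
```

If a classifier still separates classes on such crops, the separation must come from dataset bias rather than pathology.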

[CV-136] Generation of maximal snake polyominoes using a deep neural network

【速读】:该论文旨在解决在大矩形网格中生成最大蛇形多格形(maximal snake polyominoes)的难题,因为传统方法依赖于对特定网格尺寸下所有蛇形多格形的完全枚举,这属于暴力搜索算法,在较大网格中计算复杂度极高,难以实现。为突破这一限制,论文提出使用深度神经网络进行数据驱动训练,无需显式编码最大性(maximality)和邻接(adjacency)约束,而是让模型从数据中学习这些结构特性。其关键解决方案是采用一种称为结构化像素空间扩散模型(Structured Pixel Space Diffusion, SPS Diffusion)的去噪扩散模型,该模型能够从较小网格泛化到更大网格(如28×28),生成有效蛇形多格形,并逼近当前计算极限下的最大蛇形候选解,尽管仍存在分支、环路或多个连通分量等错误。整体而言,该方法展示了深度神经网络在理解和生成复杂组合对象方面的潜力。

链接: https://arxiv.org/abs/2603.12400
作者: Benjamin Gauthier,Alain Goupil,Fadel Toure
机构: Université du Québec à Trois-Rivières (魁北克省大学三河市分校)
类目: Combinatorics (math.CO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8-page extended abstract, plus 2 pages of references; 6 figures. Submitted to GASCom 2026

点击查看摘要

Abstract:Maximal snake polyominoes are difficult to study numerically in large rectangles, as computing them requires the complete enumeration of all snakes for a specific grid size, which corresponds to a brute force algorithm. This technique is thus challenging to use in larger rectangles, which hinders the study of maximal snakes. Furthermore, most enumerable snakes lie in small rectangles, making it difficult to study large-scale patterns. In this paper, we investigate the contribution of a deep neural network to the generation of maximal snake polyominoes from a data-driven training, where the maximality and adjacency constraints are not encoded explicitly, but learned. To this extent, we experiment with a denoising diffusion model, which we call Structured Pixel Space Diffusion (SPS Diffusion). We find that SPS Diffusion generalizes from small grids to larger ones, generating valid snakes up to 28x28 squares and producing maximal snake candidates on squares close to the current computational limit. The model is, however, prone to errors such as branching, cycles, or multiple components. Overall, the diffusion model is promising and shows that complex combinatorial objects can be understood by deep neural networks, which is useful in their investigation.
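
The error modes listed above (branching, cycles, multiple components) correspond to simple graph checks on the generated cell set: no cell may have more than two orthogonal neighbours, the cells must be connected, and a path of n cells must have exactly n-1 adjacency edges. A minimal validator along those lines (an illustration, not the authors' pipeline):

```python
def is_valid_snake(cells):
    """Check that a set of grid cells forms a snake polyomino:
    no branching, a single connected component, and no cycles."""
    cells = set(cells)
    if not cells:
        return False
    def nbrs(c):
        x, y = c
        return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    degs = {c: sum(n in cells for n in nbrs(c)) for c in cells}
    if any(d > 2 for d in degs.values()):
        return False  # branching
    start = next(iter(cells))
    seen, frontier = {start}, [start]
    while frontier:  # BFS/DFS connectivity check
        c = frontier.pop()
        for n in nbrs(c):
            if n in cells and n not in seen:
                seen.add(n)
                frontier.append(n)
    if len(seen) != len(cells):
        return False  # multiple components
    edges = sum(degs.values()) // 2
    return edges == len(cells) - 1  # n-1 edges: a path, not a cycle
```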

[CV-137] Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local Centralized and Federated Learning

【速读】:该论文旨在解决下颌第三磨牙(mandibular third molar)与下牙槽神经管(mandibular canal)相邻时,因影像学评估不准确导致的下牙槽神经损伤风险问题。研究提出通过自动化分类算法识别全景片(panoramic radiography)中牙-管重叠情况,以支持临床分诊并减少不必要的锥形束CT(CBCT)检查。其关键解决方案是采用联邦学习(federated learning, FL)框架,在不共享患者数据的前提下实现多中心协作建模,相较本地训练(local learning, LL)显著提升泛化性能,并在隐私保护前提下逼近集中式训练(centralized learning, CL)的效果,验证了FL在医学影像AI模型开发中的可行性与优越性。

链接: https://arxiv.org/abs/2603.11850
作者: Johan Andreas Balle Rubak,Sara Haghighat,Sanyam Jain,Mostafa Aldesoki,Akhilanand Chaurasia,Sarah Sadat Ehsani,Faezeh Dehghan Ghanatkaman,Ahmad Badruddin Ghazali,Julien Issa,Basel Khalil,Rishi Ramani,Ruben Pauwels
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Impaction of the mandibular third molar in proximity to the mandibular canal increases the risk of inferior alveolar nerve injury. Panoramic radiography is routinely used to assess this relationship. Automated classification of molar-canal overlap could support clinical triage and reduce unnecessary CBCT referrals, while federated learning (FL) enables multi-center collaboration without sharing patient data. We compared Local Learning (LL), FL, and Centralized Learning (CL) for binary overlap/no-overlap classification on cropped panoramic radiographs partitioned across eight independent labelers. A pretrained ResNet-34 was trained under each paradigm and evaluated using per-client metrics with locally optimized thresholds and pooled test performance with a global threshold. Performance was assessed using area under the receiver operating characteristic curve (AUC) and threshold-based metrics, alongside training dynamics, Grad-CAM visualizations, and server-side aggregate monitoring signals. On the test set, CL achieved the highest performance (AUC 0.831; accuracy = 0.782), FL showed intermediate performance (AUC 0.757; accuracy = 0.703), and LL generalized poorly across clients (AUC range = 0.619-0.734; mean = 0.672). Training curves suggested overfitting, particularly in LL models, and Grad-CAM indicated more anatomically focused attention in CL and FL. Overall, centralized training provided the strongest performance, while FL offers a privacy-preserving alternative that outperforms LL.
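
Federated learning in this setting typically aggregates client updates with sample-size-weighted averaging (FedAvg-style), so no radiographs leave the sites. A minimal sketch of that aggregation step, under the simplifying assumption that each client model is a flat parameter vector:

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg-style).
    client_weights: list of equal-length lists of floats; client_sizes: ints."""
    total = sum(client_sizes)
    n = len(client_weights[0])
    return [sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(n)]
```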

人工智能

[AI-0] Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights ICLR2026

【速读】:该论文旨在解决神经网络中成员隐私保护(membership privacy preservation)问题,传统方法通过更新或重新训练全部权重来实现隐私保护,但存在计算成本高、性能损失大甚至训练数据与非训练数据预测结果严重偏离的问题。其解决方案的关键在于:基于三个核心洞察——隐私漏洞仅存在于少量权重中、这些权重对模型性能同样关键、权重重要性源于其位置而非数值——提出仅对关键权重进行评分,并通过回滚(rewind)机制进行微调,而非直接丢弃相关神经元。实验表明,该方法在多数情况下能显著提升对成员推理攻击(Membership Inference Attacks)的鲁棒性,同时保持模型性能。

链接: https://arxiv.org/abs/2603.13186
作者: Xingli Fang,Jung-Eun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: ICLR 2026

点击查看摘要

Abstract:Prior approaches for membership privacy preservation usually update or retrain all weights in neural networks, which is costly and can lead to unnecessary utility loss or even more serious misalignment in predictions between training data and non-training data. In this work, we observed three insights: i) privacy vulnerability exists in a very small fraction of weights; ii) however, most of those weights also critically impact utility performance; iii) the importance of weights stems from their locations rather than their values. According to these insights, to preserve privacy, we score critical weights, and instead of discarding those neurons, we rewind only the weights for fine-tuning. We show that, through extensive experiments, this mechanism exhibits outperforming resilience in most cases against Membership Inference Attacks while maintaining utility.
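
The mechanism described, scoring a small set of critical weights and rewinding them (rather than discarding the neurons) before fine-tuning, can be sketched as follows. The scores are taken as given here, since the abstract does not specify the exact scoring rule:

```python
def rewind_critical(weights, init_weights, scores, k):
    """Rewind the k highest-scoring weights to their initial values;
    all other weights are left untouched for subsequent fine-tuning."""
    top = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)[:k]
    out = list(weights)
    for i in top:
        out[i] = init_weights[i]
    return out
```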

[AI-1] MXNorm: Reusing MXFP block scales for efficient tensor normalisation

【速读】:该论文旨在解决深度学习训练中矩阵乘法(Matrix Multiplication)性能提升远超归约(Reduction)和逐元素计算(Elementwise Computation)性能的问题,后者仍需在高精度下执行,成为性能瓶颈。解决方案的关键在于提出MXNorm,这是一种可直接替换RMSNorm的归一化方法,其通过利用MXFP8量化过程中已计算出的块尺度(Block Scales)来估算均方根(RMS),从而将归约操作所需的数据规模减少32倍。实验表明,在Llama 3模型(125M、1B和8B参数)预训练中,MXNorm相较于使用MXFP8矩阵乘法的RMSNorm基线仅带来极小的训练精度损失,同时在实际核函数层面实现了最高达2.4倍的加速,对应Transformer层整体速度提升1.3%(MXFP8场景)和2.6%(NVFP4场景)。

链接: https://arxiv.org/abs/2603.13180
作者: Callum McLean,Luke Y. Prince,Alexandre Payot,Paul Balança,Carlo Luschi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Preprint, Under Review. 15 pages, 12 figures

点击查看摘要

Abstract:Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in performance on reductions and elementwise computations, which are still being performed in higher precision. In this work, we propose MXNorm, a drop-in replacement for RMSNorm that estimates the RMS using only the block scales calculated as part of the MXFP8 cast and enables a 32x decrease in the size of reduction needed for normalization. We validate our approximation method on pre-training of Llama 3 models of 125M, 1B and 8B parameters, finding minimal loss of training accuracy compared to a baseline using RMSNorm with MXFP8 matmuls. We also show practical kernel speedups using only this http URL of up to 2.4x for MXNorm over RMSNorm, corresponding to a 1.3% speedup in Llama 3 8B transformer layers in MXFP8 and a 2.6% speedup in NVFP4.
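
The core trick, estimating the RMS from block scales already produced by the MXFP8 cast, can be illustrated with a toy version: take the per-block absolute maxima as the scales and reduce over that 32x smaller vector instead of the full tensor. This is a schematic of the idea only; the paper's actual estimator and any calibration are not reproduced here:

```python
import math

def block_amax(x, block=32):
    """Per-block absolute maxima, computed anyway during an MX-format cast."""
    return [max(abs(v) for v in x[i:i + block]) for i in range(0, len(x), block)]

def rms_from_scales(scales):
    """Toy RMS estimate reduced over block scales: a block-times-smaller reduction."""
    return math.sqrt(sum(s * s for s in scales) / len(scales))
```

For a constant-magnitude vector the block maxima equal the true RMS, so the estimate is exact; in general it is a biased upper-bound-style proxy, which is why the paper validates training accuracy empirically.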

[AI-2] When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

【速读】:该论文旨在解决Group Relative Policy Optimization (GRPO)在训练推理模型时忽视组内正确与错误解之间自然对比信号的问题,即GRPO将每个输出视为独立样本进行优化,忽略了同一组中成功与失败推理路径之间的结构化信息。其解决方案的关键在于提出两种机制:一是Bilateral Context Conditioning (BICC),通过跨样本交叉参考成功与失败的推理轨迹,实现不同样本间的直接信息流动;二是Reward-Confidence Correction (RCC),基于一阶近似下的方差最小化估计器推导出的奖励-置信度协方差动态调整优势基线,从而稳定训练过程。这两个机制均无需额外采样或辅助模型,且可适配所有GRPO变体,在数学推理基准测试中实现了稳定提升。

链接: https://arxiv.org/abs/2603.13134
作者: Yu Li,Tian Lan,Zhengling Qi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on group mean, GRPO treats each output as an independent sample during the optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group, thus ignoring the rich, comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during the optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC) to stabilize training by dynamically adjusting the advantage baseline in GRPO using reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across comprehensive models and algorithms. Code is available at this https URL.
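
The group-mean baseline that GRPO starts from can be written in a few lines; the optional standard-deviation normalization shown below is the common variant and is an assumption here, not a detail taken from the paper:

```python
import statistics

def group_advantages(rewards, normalize=True):
    """GRPO-style advantages: subtract the group mean over sampled outputs,
    optionally dividing by the group standard deviation."""
    mean = statistics.mean(rewards)
    adv = [r - mean for r in rewards]
    if normalize:
        sd = statistics.pstdev(rewards)
        if sd > 0:
            adv = [a / sd for a in adv]
    return adv
```

Note that each advantage depends on the other samples only through the group statistics, which is exactly the "independent sample" treatment the paper's contrastive reformulation argues against.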

[AI-3] Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

【速读】:该论文旨在解决开放世界具身智能体在执行长时程任务时的核心瓶颈问题——即单步规划质量并非限制因素,而在于如何高效组织与演化交互经验。其解决方案的关键在于提出一种非参数化的自进化框架Steve-Evolving,通过细粒度执行诊断与双轨知识蒸馏的闭环耦合机制实现持续进化:首先,利用结构化经验元组(预状态、动作、诊断结果、后状态)和三层索引空间(条件签名、空间哈希、语义标签)进行经验锚定与高效可审计召回;其次,基于组合式诊断信号(状态差异摘要、失败原因枚举、连续指标及停滞/循环检测)确保信息密度以支持归因分析,并将成功轨迹抽象为带显式前提和验证标准的可复用技能,将失败提炼为捕获根本原因的可执行防护规则(guardrails);最后,在LLM规划器中注入检索到的技能与防护规则,并通过诊断触发局部重规划在线更新约束,形成无需模型参数更新的持续进化过程。

链接: https://arxiv.org/abs/2603.13131
作者: Zhengwei Xie,Zhisheng Chen,Ziyan Weng,Tingyu Wu,Chenglong Li,Vireo Zhang,Kun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open-world embodied agents must solve long-horizon tasks where the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method follows three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. In detail, Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis-result, and post-state) and organizes it in a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization for efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. Moreover, in Experience Distillation, successful trajectories are generalized into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularities. Finally, in Knowledge-Driven Closed-Loop Control, retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, forming a continual evolution process without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU demonstrate consistent improvements over static-retrieval baselines.
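
The fixed-schema experience tuple and signature-based recall described above can be sketched minimally. The field names and the single condition-signature index are simplifications of the paper's three-tier space (which also uses spatial hashing and semantic tags):

```python
class ExperienceStore:
    """Minimal experience anchoring: fixed-schema tuples indexed by a
    condition signature derived from the pre-state."""

    def __init__(self):
        self.records = []
        self._by_signature = {}

    def add(self, pre_state, action, diagnosis, post_state):
        rec = {"pre": pre_state, "action": action,
               "diagnosis": diagnosis, "post": post_state}
        self.records.append(rec)
        self._by_signature.setdefault(pre_state, []).append(rec)
        return rec

    def recall(self, signature):
        """Efficient, auditable recall of all experiences matching a signature."""
        return self._by_signature.get(signature, [])
```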

[AI-4] BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning

【速读】:该论文旨在解决当前主动学习(Active Learning, AL)策略在不同模型、标注预算和数据集上缺乏鲁棒性的问题,尤其是现有最优选择策略(oracle strategies)难以扩展到大规模数据集和复杂深度神经网络的局限。其解决方案的关键在于提出一种可扩展的oracle策略——Best-of-Strategy Selector (BoSS),该方法通过集成多种选择策略生成候选批次,并选取性能提升最高的批次作为最终选择结果;BoSS作为策略集合的组合框架,具备良好的可扩展性,能够无缝整合新出现的先进AL策略,从而在大规模场景下提供稳定且接近理论最优的性能基准。

链接: https://arxiv.org/abs/2603.13109
作者: Denis Huseljic,Paul Hahn,Marek Herde,Christoph Sandrock,Bernhard Sick
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Active learning (AL) aims to reduce annotation costs while maximizing model performance by iteratively selecting valuable instances. While foundation models have made it easier to identify these instances, existing selection strategies still lack robustness across different models, annotation budgets, and datasets. To highlight the potential weaknesses of existing AL strategies and provide a reference point for research, we explore oracle strategies, i.e., strategies that approximate the optimal selection by accessing ground-truth information unavailable in practical AL scenarios. Current oracle strategies, however, fail to scale effectively to large datasets and complex deep neural networks. To tackle these limitations, we introduce the Best-of-Strategy Selector (BoSS), a scalable oracle strategy designed for large-scale AL scenarios. BoSS constructs a set of candidate batches through an ensemble of selection strategies and then selects the batch yielding the highest performance gain. As an ensemble of selection strategies, BoSS can be easily extended with new state-of-the-art strategies as they emerge, ensuring it remains a reliable oracle strategy in the future. Our evaluation demonstrates that i) BoSS outperforms existing oracle strategies, ii) state-of-the-art AL strategies still fall noticeably short of oracle performance, especially in large-scale datasets with many classes, and iii) one possible solution to counteract the inconsistent performance of AL strategies might be to employ an ensemble-based approach for the selection.
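
The selector itself reduces to an argmax over candidate batches, where `evaluate` stands in for the oracle's ground-truth performance-gain measurement (the part unavailable in practical AL). A minimal sketch:

```python
def boss_select(candidate_batches, evaluate):
    """Best-of-Strategies selection: given candidate batches produced by an
    ensemble of strategies, keep the one with the highest measured gain."""
    return max(candidate_batches.items(), key=lambda kv: evaluate(kv[1]))
```

Because the ensemble is just a dict of strategy outputs, new state-of-the-art strategies can be added without changing the selector.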

[AI-5] Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences ICLR2026

【速读】:该论文旨在解决智能机器人系统在执行任务时如何有效理解用户指令与周围环境中的物体空间关系的问题,特别是评估视觉语言模型(Vision-Language Models, VLMs)在机器人运动规划中进行空间推理的能力。其解决方案的关键在于通过四种不同的查询方法对四个前沿VLMs进行空间推理能力的量化评估,发现Qwen2.5-VL在最优查询策略下可实现71.4%的零样本准确率,较小模型经微调后可达75%,而GPT-4o表现相对较低;同时分析了物体邻近偏好和路径风格偏好两类运动偏好,并揭示了准确性与计算成本(token数量)之间的权衡关系,为VLM集成到机器人运动规划流程提供了实证依据与优化方向。

链接: https://arxiv.org/abs/2603.13100
作者: Wenxi Wu,Jingjing Zhang,Martim Brandão
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to the First Workshop on Efficient Spatial Reasoning at ICLR 2026

点击查看摘要

Abstract:Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it is still unclear the degree to which they have the capability of spatial reasoning required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, and GPT-4o leads to lower performance. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.

[AI-6] Human-in-the-Loop LLM Grading for Handwritten Mathematics Assessments

【速读】:该论文旨在解决大规模教学中及时、个性化反馈难以实现的问题,尤其是在生成式 AI(Generative AI)削弱远程考试可靠性后,对课堂内评估需求上升的背景下。其核心解决方案是构建一个端到端的、人机协同的大型语言模型(Large Language Model, LLM)辅助评分工作流,关键在于通过三阶段设计实现高效且可信的评分:(1) 构建结构化答案标准(solution keys),(2) 开发基于评分量表(rubric-style grading keys)的指导规则以引导LLM评分,(3) 采用多轮LLM评分、自动化一致性校验与强制人工复核相结合的混合机制,从而在显著降低约23%评阅时间的同时,保持甚至优于纯人工评分的一致性与公平性。

链接: https://arxiv.org/abs/2603.13083
作者: Arne Vanhoyweghen,Vincent Holst,Melika Mobini,Lukas Van de Voorde,Tibo Vanleke,Bert Verbruggen,Brecht Verbeken,Andres Algaba,Sam Verboven,Marie-Anne Guerry,Filip Van Droogenbroeck,Vincent Ginis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures

点击查看摘要

Abstract:Providing timely and individualised feedback on handwritten student work is highly beneficial for learning but difficult to achieve at scale. This challenge has become more pressing as generative AI undermines the reliability of take-home assessments, shifting emphasis toward supervised, in-class evaluation. We present a scalable, end-to-end workflow for LLM-assisted grading of short, pen-and-paper assessments. The workflow spans (1) constructing solution keys, (2) developing detailed rubric-style grading keys used to guide the LLM, and (3) a grading procedure that combines automated scanning and anonymisation, multi-pass LLM scoring, automated consistency checks, and mandatory human verification. We deploy the system in two undergraduate mathematics courses using six low-stakes in-class tests. Empirically, LLM assistance reduces grading time by approximately 23% while achieving agreement comparable to, and in several cases tighter than, fully manual grading. Occasional model errors occur but are effectively contained by the hybrid design. Overall, our results show that carefully embedded human-in-the-loop LLM grading can substantially reduce workload while maintaining fairness and accuracy.
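The multi-pass LLM scoring with automated consistency checks can be illustrated with a small sketch. The spread threshold and median aggregation here are assumptions for illustration, not the paper's exact procedure; in the workflow above, flagged answers always go to mandatory human verification.

```python
import statistics

def consolidate_scores(passes, max_spread=1.0):
    """Combine multiple LLM grading passes for one answer.

    passes: numeric scores from independent LLM grading passes.
    Returns (score, needs_review): the median score and whether the spread
    between passes exceeds the tolerance, triggering human review.
    """
    spread = max(passes) - min(passes)
    needs_review = spread > max_spread
    return statistics.median(passes), needs_review

score, review = consolidate_scores([3.0, 3.5, 3.0])    # consistent passes
score2, review2 = consolidate_scores([1.0, 4.0, 2.5])  # inconsistent -> human check
```

The median keeps a single outlier pass from dominating the grade, while the spread check surfaces exactly the disagreements a human grader should adjudicate.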

[AI-7] GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration

【速读】:该论文旨在解决当前地球化学异常检测研究中存在的两大问题:一是模型通常仅在单一区域场景下训练,导致泛化能力受限;二是缺乏公开数据集,阻碍了研究成果的可复现性。其解决方案的关键在于构建了一个名为GeoChemAD的开源基准数据集,涵盖多个区域、采样源和目标元素,包含八个代表不同空间尺度与采样条件的子集,并提出了一种基于Transformer架构的GeoChemFormer框架,通过自监督预训练学习目标元素感知的地球化学表征,从而在多个子集上均实现了优于现有无监督方法的异常检测准确率与泛化性能。

链接: https://arxiv.org/abs/2603.13068
作者: Yihao Ding,Yiran Zhang,Chris Gonzalez,Eun-Jung Holden,Wei Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:Geochemical anomaly detection plays a critical role in mineral exploration as deviations from regional geochemical baselines may indicate mineralization. Existing studies suffer from two key limitations: (1) single region scenarios which limit model generalizability; (2) proprietary datasets, which makes result reproduction unattainable. In this work, we introduce \textbfGeoChemAD, an open-source benchmark dataset compiled from government-led geological surveys, covering multiple regions, sampling sources, and target elements. The dataset comprises eight subsets representing diverse spatial scales and sampling conditions. To establish strong baselines, we reproduce and benchmark a range of unsupervised anomaly detection methods, including statistical models, generative and transformer-based approaches. Furthermore, we propose \textbfGeoChemFormer, a transformer-based framework that leverages self-supervised pretraining to learn target-element-aware geochemical representations for spatial samples. Extensive experiments demonstrate that GeoChemFormer consistently achieves superior and robust performance across all eight subsets, outperforming existing unsupervised methods in both anomaly detection accuracy and generalization capability. The proposed dataset and framework provide a foundation for reproducible research and future development in this direction.

[AI-8] L2GTX: From Local to Global Time Series Explanations

【速读】:该论文旨在解决时间序列分类中模型决策行为的可解释性问题,特别是如何生成类级别的全局解释,以揭示模型在不同类别上的判别模式。现有方法存在三个主要局限:一是图像和表格数据中的模型无关解释方法难以直接应用于具有时序依赖性的数据;二是时间序列的全局解释合成研究不足;三是多数全局解释方法为特定模型设计,缺乏通用性。解决方案的关键在于提出一种模型无关的框架L2GTX,其核心是通过从代表性实例中提取局部解释(基于LOMATCE方法),识别出参数化的时序事件基元(如趋势上升/下降、局部极值等)及其重要性得分,再通过聚类合并跨实例的重复事件,并利用实例-簇重要性矩阵估计全局相关性,最终在用户定义的实例选择预算下挑选最具代表性的实例,聚合其事件形成简洁且可解释的类级别全局解释。

链接: https://arxiv.org/abs/2603.13065
作者: Ephrem Tibebe Mekonnen,Luca Longo,Lucas Rizzo,Pierpaolo Dondio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 4th World Conference on Explainable Artificial Intelligence (xAI 2026), 18 pages, 6 figures

点击查看摘要

Abstract:Deep learning models achieve high accuracy in time series classification, yet understanding their class-level decision behaviour remains challenging. Explanations for time series must respect temporal dependencies and identify patterns that recur across instances. Existing approaches face three limitations: model-agnostic XAI methods developed for images and tabular data do not readily extend to time series, global explanation synthesis for time series remains underexplored, and most existing global approaches are model-specific. We propose L2GTX, a model-agnostic framework that generates class-wise global explanations by aggregating local explanations from a representative set of instances. L2GTX extracts clusters of parameterised temporal event primitives, such as increasing or decreasing trends and local extrema, together with their importance scores from instance-level explanations produced by LOMATCE. These clusters are merged across instances to reduce redundancy, and an instance-cluster importance matrix is used to estimate global relevance. Under a user-defined instance selection budget, L2GTX selects representative instances that maximise coverage of influential clusters. Events from the selected instances are then aggregated into concise class-wise global explanations. Experiments on six benchmark time series datasets show that L2GTX produces compact and interpretable global explanations while maintaining stable global faithfulness measured as mean local surrogate fidelity.
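The budgeted instance selection over an instance-cluster importance matrix admits a simple greedy coverage sketch. The sparse-dict representation and the marginal-gain rule below are illustrative assumptions, not the exact L2GTX procedure.

```python
def select_representatives(importance, budget):
    """Greedy instance selection maximising coverage of influential clusters.

    importance: dict instance -> {cluster: importance score}, i.e. a sparse
    view of the instance-cluster importance matrix.
    budget: user-defined number of instances to select.
    Each step picks the instance adding the most not-yet-covered importance.
    """
    selected, covered = [], set()
    for _ in range(budget):
        def marginal(inst):
            return sum(s for c, s in importance[inst].items() if c not in covered)
        best = max((i for i in importance if i not in selected),
                   key=marginal, default=None)
        if best is None:
            break
        selected.append(best)
        covered.update(importance[best])
    return selected

# Toy matrix over three instances and three event clusters
# ("rise", "peak", "dip" stand in for temporal event primitives).
matrix = {
    "x1": {"rise": 0.9, "peak": 0.2},
    "x2": {"rise": 0.8},
    "x3": {"dip": 0.7, "peak": 0.6},
}
reps = select_representatives(matrix, budget=2)
```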

[AI-9] Competition-Aware CPC Forecasting with Near-Market Coverage

【速读】:该论文旨在解决付费搜索广告中点击成本(Cost-per-click, CPC)预测的难题,特别是在竞争环境部分可观测的情况下,如何提升中长期CPC预测的稳定性与准确性。其核心挑战在于,CPC作为拍卖机制的结果具有高度波动性,且单个广告主的历史数据难以全面反映市场中的潜在竞争态势。解决方案的关键在于通过多源信号构建对隐含竞争的近似:首先利用预训练Transformer模型提取关键词语义特征,构建语义邻域与关键词图结构;其次通过动态时间规整(Dynamic Time Warping, DTW)对CPC时序轨迹进行对齐,形成行为邻域;最后引入地理意图协变量捕捉局部需求与市场异质性。这些信号既可独立作为协变量使用,也可作为时空图神经网络中的关系先验,显著提升了在业务相关中长期预测窗口下的性能表现。

链接: https://arxiv.org/abs/2603.13059
作者: Sebastian Frey,Edoardo Beccari,Maximilian Kranz,Nicolò Alberto Pellizzari,Ali Mete Karaman,Qiwei Han,Maximilian Kaiser
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Cost-per-click (CPC) in paid search is a volatile auction outcome generated by a competitive landscape that is only partially observable from any single advertiser’s history. Using Google Ads auction logs from a concentrated car-rental market (2021–2023), we forecast weekly CPC for 1,811 keyword series and approximate latent competition through complementary signals derived from keyword text, CPC trajectories, and geographic market structure. We construct (i) semantic neighborhoods and a semantic keyword graph from pretrained transformer-based representations of keyword text, (ii) behavioral neighborhoods via Dynamic Time Warping (DTW) alignment of CPC trajectories, and (iii) geographic-intent covariates capturing localized demand and marketplace heterogeneity. We extensively evaluate these signals both as stand-alone covariates and as relational priors in spatiotemporal graph forecasters, benchmarking them against strong statistical, neural, and time-series foundation-model baselines. Across methods, competition-aware augmentation improves stability and error profiles at business-relevant medium and longer horizons, where competitive regimes shift and volatility is most consequential. The results show that broad market-outcome coverage, combined with keyword-derived semantic and geographic priors, provides a scalable way to approximate latent competition and improve CPC forecasting in auction-driven markets.
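Behavioral neighborhoods are built by aligning CPC trajectories with Dynamic Time Warping. A textbook DTW distance looks like this; the absolute-difference local cost and the absence of a warping window are simplifications relative to production usage.

```python
def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two
    1-D series, usable to define behavioral neighborhoods over weekly
    CPC trajectories."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# A time-shifted copy aligns almost perfectly under warping,
# while a reversed trend stays far away.
d_same = dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3])
d_diff = dtw_distance([0, 1, 2, 3], [3, 2, 1, 0])
```

Pairwise DTW distances like these can then feed a k-nearest-neighbor graph that serves as a relational prior for the spatiotemporal forecasters.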

[AI-10] Purify Once Edit Freely: Breaking Image Protections under Model Mismatch

【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像编辑应用中可能被滥用的问题,特别是未经授权的风格模仿和有害内容生成。为此,现有方法通常在图像发布前嵌入微小且人眼难以察觉的对抗扰动以干扰下游编辑或微调过程。然而,在现实场景中,内容所有者无法控制后续处理流程,且针对替代模型优化的保护机制在攻击者使用架构不匹配的扩散管道时易失效。论文提出一种统一的后发布净化框架,用于评估保护机制在模型不匹配下的存活能力,并设计了两种无需访问受保护图像或防御内部信息的实用净化器:VAE-Trans通过潜在空间投影进行校正,EditorClean则利用扩散变压器(Diffusion Transformer)的架构异质性执行指令引导重建。关键创新在于揭示了“一次净化、自由编辑”的失效模式——一旦净化成功,保护信号基本消除,使编辑完全不受限,从而强调必须在模型不匹配条件下评估保护效果,并设计对异构攻击者鲁棒的防御策略。

链接: https://arxiv.org/abs/2603.13028
作者: Qichen Zhao,Shengfang Zhai,Xinjian Bai,Qingni Shen,Qiqi Lin,Yansong Gao,Zhonghai Wu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models enable high-fidelity image editing but can also be misused for unauthorized style imitation and harmful content generation. To mitigate these risks, proactive image protection methods embed small, often imperceptible adversarial perturbations into images before sharing to disrupt downstream editing or fine-tuning. However, in realistic post-release scenarios, content owners cannot control downstream processing pipelines, and protections optimized for a surrogate model may fail when attackers use mismatched diffusion pipelines. Existing purification methods can weaken protections but often sacrifice image quality and rarely examine architectural mismatch. We introduce a unified post-release purification framework to evaluate protection survivability under model mismatch. We propose two practical purifiers: VAE-Trans, which corrects protected images via latent-space projection, and EditorClean, which performs instruction-guided reconstruction with a Diffusion Transformer to exploit architectural heterogeneity. Both operate without access to protected images or defense internals. Across 2,100 editing tasks and six representative protection methods, EditorClean consistently restores editability. Compared to protected inputs, it improves PSNR by 3-6 dB and reduces FID by 50-70 percent on downstream edits, while outperforming prior purification baselines by about 2 dB PSNR and 30 percent lower FID. Our results reveal a purify-once, edit-freely failure mode: once purification succeeds, the protective signal is largely removed, enabling unrestricted editing. This highlights the need to evaluate protections under model mismatch and design defenses robust to heterogeneous attackers.

[AI-11] ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning

【速读】:该论文针对生成式 AI (Generative AI) 中的代理强化学习(Agentic Reinforcement Learning, Agentic RL)在云集群中资源管理效率低下的问题展开研究。传统方法依赖静态超配(static over-provisioning),导致外部计算资源(如CPU用于代码执行、GPU用于奖励模型)无法按需共享,造成严重浪费。其解决方案的关键在于提出动作级编排(action-level orchestration),并集成至统一资源管理系统 ARL-Tangram 中,通过细粒度的外部资源调度与弹性分配机制,在满足异构资源约束的前提下最小化动作完成时间(Action Completion Time, ACT)。该方案显著提升了资源利用率和训练效率,实测显示平均ACT提升达4.3倍,训练步长加速1.5倍,外部资源节省最高达71.2%。

链接: https://arxiv.org/abs/2603.13019
作者: Bangjun Xiao,Yihao Zhao,Xiangwei Deng,Shihua Yu,Yuxing Xiang,Huaqiu Liu,Qiying Wang,Liang Zhao,Hailin Zhang,Xuanzhe Liu,Xin Jin,Fuli Luo
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with the real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL frameworks typically rely on static over-provisioning, i.e., resources are often tied to long-lived trajectories or isolated by tasks, which leads to severe resource inefficiency. We propose action-level orchestration, and incorporate it into ARL-Tangram, a unified resource management system that enables fine-grained external resource sharing and elasticity. ARL-Tangram utilizes a unified action-level formulation and an elastic scheduling algorithm to minimize action completion time (ACT) while satisfying heterogeneous resource constraints. Further, heterogeneous resource managers are tailored to efficiently support the action-level execution on resources with heterogeneous characteristics and topologies. Evaluation on real-world agentic RL tasks demonstrates that ARL-Tangram improves average ACT by up to 4.3×, speeds up the step duration of RL training by up to 1.5×, and saves the external resources by up to 71.2%. This system has been deployed to support the training of the MiMo series models.
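As a toy illustration of minimising action completion time on a shared external pool, a shortest-job-first scheduler can serve as a stand-in; ARL-Tangram's actual elastic algorithm also handles heterogeneous resource constraints and topologies, which this sketch omits.

```python
import heapq

def schedule_actions(durations, workers):
    """Shortest-job-first assignment of actions to a shared worker pool,
    a generic stand-in for minimising mean action completion time (ACT).

    durations: per-action execution times; workers: pool size.
    Returns the mean completion time over all actions."""
    free = [0.0] * workers        # next-free time per worker (min-heap)
    heapq.heapify(free)
    total = 0.0
    for d in sorted(durations):   # SJF ordering minimises mean completion
        start = heapq.heappop(free)
        finish = start + d
        total += finish
        heapq.heappush(free, finish)
    return total / len(durations)

mean_act = schedule_actions([4, 1, 2], workers=1)
```

With a single worker the SJF order (1, 2, 4) yields completions 1, 3 and 7, so the mean ACT is 11/3; adding workers or reordering actions immediately changes that figure, which is the knob an action-level scheduler optimises.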

[AI-12] Efficient Real-World Autonomous Racing via Attenuated Residual Policy Optimization

【速读】:该论文旨在解决基于残差策略学习(Residual Policy Learning, RPL)的机器人控制器在实际部署中面临的系统复杂性高和推理延迟大的问题。其解决方案的关键在于提出了一种名为衰减残差策略优化(α-RPO)的新方法:通过逐步衰减基础策略(base policy)的影响,使最终获得的策略成为独立的神经网络策略,从而简化系统架构并降低延迟;同时,该机制支持特权学习(privileged learning),允许基础策略使用最终部署时不可用的传感器模态,进一步提升训练效率与性能。α-RPO 与近端策略优化(PPO)无缝集成,确保在策略优化过程中动态补偿基础策略的衰减影响,实现在仿真和零样本真实世界迁移中的性能提升。

链接: https://arxiv.org/abs/2603.12960
作者: Raphael Trumpp,Denis Hoornaert,Mirco Theile,Marco Caccamo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Residual policy learning (RPL), in which a learned policy refines a static base policy using deep reinforcement learning (DRL), has shown strong performance across various robotic applications. Its effectiveness is particularly evident in autonomous racing, a domain that serves as a challenging benchmark for real-world DRL. However, deploying RPL-based controllers introduces system complexity and increases inference latency. We address this by introducing an extension of RPL named attenuated residual policy optimization ( \alpha -RPO). Unlike standard RPL, \alpha -RPO yields a standalone neural policy by progressively attenuating the base policy, which initially serves to bootstrap learning. Furthermore, this mechanism enables a form of privileged learning, where the base policy is permitted to use sensor modalities not required for final deployment. We design \alpha -RPO to integrate seamlessly with PPO, ensuring that the attenuated influence of the base controller is dynamically compensated during policy optimization. We evaluate \alpha -RPO by building a framework for 1:10-scaled autonomous racing around it. In both simulation and zero-shot real-world transfer to Roboracer cars, \alpha -RPO not only reduces system complexity but also improves driving performance compared to baselines - demonstrating its practicality for robotic deployment. Our code is available at: this https URL.
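The attenuation mechanism at the heart of α-RPO can be sketched as a scheduled blend of base and residual actions. The linear schedule and scalar actions are illustrative assumptions; the paper additionally compensates for the attenuation inside PPO, which is omitted here.

```python
def alpha_schedule(step, total_steps):
    """Attenuation factor for the base policy, decaying from 1 to 0 over
    training. (The linear shape is an illustrative assumption.)"""
    return max(0.0, 1.0 - step / total_steps)

def combined_action(base_action, residual_action, alpha):
    """Deployed action: attenuated base policy plus learned residual.
    Once alpha reaches 0, the residual network is a standalone policy."""
    return alpha * base_action + residual_action

early = combined_action(base_action=0.8, residual_action=0.1,
                        alpha=alpha_schedule(step=0, total_steps=100))
final = combined_action(base_action=0.8, residual_action=0.1,
                        alpha=alpha_schedule(step=100, total_steps=100))
```

Note how the base action dominates early training (bootstrapping) and vanishes entirely at the end, which is what removes the base controller, and its inference latency, from the deployed system.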

[AI-13] Delta1 with LLM: symbolic and neural integration for credible and explainable reasoning AAAI2026

【速读】:该论文旨在解决神经符号推理(neuro-symbolic reasoning)中逻辑形式化严谨性与大语言模型(LLM)可解释性之间难以融合的问题。其核心挑战在于如何构建既保证推理正确性(soundness)又具备人类可读性的解释机制。解决方案的关键在于提出一个端到端的“构造性可解释性”流水线:首先利用基于全三角标准矛盾(FTSC)的自动定理生成器Delta1,在多项式时间内确定性地构造最小不可满足子句集和完整的定理,从而从结构上确保推理的最小性和无歧义性;随后通过LLM层将每个定理及其证明轨迹转化为连贯自然语言的解释和可操作洞察。实证研究表明,该方法在医疗、合规和监管等高风险领域实现了可审计、领域对齐的推理能力,推动了逻辑、语言与学习的融合,确立了构造性定理生成作为神经符号可解释AI的坚实基础。

链接: https://arxiv.org/abs/2603.12953
作者: Yang Xu,Jun Liu,Shuwei Chen,Chris Nugent,Hailing Guo
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure, 3 tables, accepted oral presentation at AAAI2026 Bridge Program on Logic AI

点击查看摘要

Abstract:Neuro-symbolic reasoning increasingly demands frameworks that unite the formal rigor of logic with the interpretability of large language models (LLMs). We introduce an end to end explainability by construction pipeline integrating the Automated Theorem Generator Delta1 based on the full triangular standard contradiction (FTSC) with LLMs. Delta1 deterministically constructs minimal unsatisfiable clause sets and complete theorems in polynomial time, ensuring both soundness and minimality by construction. The LLM layer verbalizes each theorem and proof trace into coherent natural language explanations and actionable insights. Empirical studies across health care, compliance, and regulatory domains show that Delta1 and LLM enables interpretable, auditable, and domain aligned reasoning. This work advances the convergence of logic, language, and learning, positioning constructive theorem generation as a principled foundation for neuro-symbolic explainable AI.

[AI-14] Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在实际部署中面临的高推理成本、延迟以及透明度不足等问题,尤其针对现有路由策略依赖昂贵的大型语言模型(Large Language Model, LLM)选择器或静态策略时,在动态负载和混合意图下难以实现语义感知路由、性能不稳定及资源利用率低的问题。解决方案的关键在于提出AMRO-S框架,其核心机制包括:利用监督微调的小型语言模型进行意图识别,构建轻量级语义接口;将路由记忆分解为任务特异性的信息素专家以减少跨任务干扰;采用质量门控的异步更新机制解耦推理与学习过程,从而在不增加延迟的前提下优化路由决策。

链接: https://arxiv.org/abs/2603.12933
作者: Xudong Wang,Chaoning Zhang,Jiaquan Zhang,Chenghao Li,Qigan Sun,Sung-Ho Bae,Peng Wang,Ning Xie,Jie Zou,Yang Yang,Hengtao Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, submitted to IEEE Transactions on Artificial Intelligence

点击查看摘要

Abstract:Large Language Model (LLM)-driven Multi-Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality–cost trade-off space. Despite these advances, real-world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM-based selectors or static policies, and offer limited controllability for semantic-aware routing under dynamic loads and mixed intents, often resulting in unstable performance and inefficient resource utilization. To address these limitations, we propose AMRO-S, an efficient and interpretable routing framework for Multi-Agent Systems (MAS). AMRO-S models MAS routing as a semantic-conditioned path selection problem, enhancing routing performance through three key mechanisms: First, it leverages a supervised fine-tuned (SFT) small language model for intent inference, providing a low-overhead semantic interface for each query; second, it decomposes routing memory into task-specific pheromone specialists, reducing cross-task interference and optimizing path selection under mixed workloads; finally, it employs a quality-gated asynchronous update mechanism to decouple inference from learning, optimizing routing without increasing latency. Extensive experiments on five public benchmarks and high-concurrency stress tests demonstrate that AMRO-S consistently improves the quality–cost trade-off over strong routing baselines, while providing traceable routing evidence through structured pheromone patterns.
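The pheromone bookkeeping behind the routing can be illustrated with a quality-gated, evaporating update. The gate threshold, evaporation rate and proportional sampling below are generic ant-colony choices, not the exact AMRO-S rules.

```python
import random

def update_pheromone(pheromone, agent, reward, quality_gate=0.5, evaporation=0.1):
    """Quality-gated, ant-colony-style update for one task-specific
    pheromone table (agent -> trail strength). Only outcomes above the
    quality gate deposit pheromone; all trails evaporate each step."""
    for a in pheromone:
        pheromone[a] *= (1.0 - evaporation)
    if reward > quality_gate:
        pheromone[agent] += reward
    return pheromone

def select_agent(pheromone, rng):
    """Sample an agent proportionally to its pheromone strength."""
    agents = list(pheromone)
    weights = [pheromone[a] for a in agents]
    return rng.choices(agents, weights=weights, k=1)[0]

trails = {"small-llm": 1.0, "large-llm": 1.0}
update_pheromone(trails, "small-llm", reward=0.9)  # passes the gate
update_pheromone(trails, "large-llm", reward=0.2)  # gated out
picked = select_agent(trails, random.Random(0))
```

Because each task keeps its own pheromone table, updates for one intent do not interfere with routing for another, and the table itself is the traceable routing evidence the abstract mentions.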

[AI-15] ODRL Policy Comparison Through Normalisation ESWC2026

【速读】:该论文旨在解决ODRL(Open Digital Rights Language)政策表示中存在的复杂性问题,包括其碎片化、语义等价但表达方式多样导致的政策比较与处理困难。解决方案的关键在于提出一种参数化的归一化方法,将包含许可和禁止条款的ODRL政策转化为仅含许可条款的最小组件形式,并通过简化复杂的逻辑约束为简单约束来提升可处理性。该方法基于最近定义的ODRL语义,提供了计算政策归一形式及简化数值与符号约束的算法,证明其保持原政策语义不变,并分析了结果在属性数量上的指数级和唯一值数量上的线性级空间复杂度,从而使得复杂政策可在ODRL的基础片段中表示,且政策比较问题退化为规则是否一致的判定问题。

链接: https://arxiv.org/abs/2603.12926
作者: Jaime Osvaldo Salas,Paolo Pareti,George Konstantinidis
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Accepted at the 23rd European Semantic Web Conference (ESWC), ESWC 2026

点击查看摘要

Abstract:The ODRL language has become the standard for representing policies and regulations for digital rights. However its complexity is a barrier to its usage, which has caused many related theoretical and practical works to focus on different, and not interoperable, fragments of ODRL. Moreover, semantically equivalent policies can be expressed in numerous different ways, which makes comparing them and processing them harder. Building on top of a recently defined semantics, we tackle these problems by proposing an approach that involves a parametrised normalisation of ODRL policies into its minimal components which reformulates policies with permissions and prohibitions into policies with permissions exclusively, and simplifies complex logic constraints into simple ones. We provide algorithms to compute a normal form for ODRL policies and simplifying numerical and symbolic constraints. We prove that these algorithms preserve the semantics of policies, and analyse the size complexity of the result, which is exponential on the number of attributes and linear on the number of unique values for these attributes. We show how this makes complex policies representable in more basic fragments of ODRL, and how it reduces the problem of policy comparison to the simpler problem of checking if two rules are identical.
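The numeric-constraint simplification can be illustrated on a single attribute: a conjunction of bounds collapses to one interval, or to unsatisfiable. The operator names are illustrative, and strict vs. non-strict bounds are glossed over in this toy version.

```python
def simplify_numeric(constraints):
    """Collapse a conjunction of numeric bounds on one attribute into a
    single interval (lo, hi); returns None when unsatisfiable.

    constraints: list of (operator, value) pairs with hypothetical
    operators "gt"/"gteq"/"lt"/"lteq"/"eq"."""
    lo, hi = float("-inf"), float("inf")
    for op, v in constraints:
        if op in ("gt", "gteq"):
            lo = max(lo, v)
        elif op in ("lt", "lteq"):
            hi = min(hi, v)
        elif op == "eq":
            lo, hi = max(lo, v), min(hi, v)
    return None if lo > hi else (lo, hi)

# e.g. an age attribute constrained three ways collapses to one interval,
# while contradictory bounds are detected as unsatisfiable.
interval = simplify_numeric([("gteq", 18), ("lt", 65), ("gteq", 21)])
empty = simplify_numeric([("gt", 10), ("lt", 5)])
```

After this kind of simplification, comparing two rules on a numeric attribute reduces to comparing two intervals, which is the flavour of reduction the paper's normal form enables.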

[AI-16] Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection

【速读】:该论文旨在解决多变量时间序列异常检测中因跨通道依赖关系发生结构性变化而难以被传统残差驱动检测方法识别的问题,例如自动驾驶场景下转向指令与横向加速度之间的解耦现象。其解决方案的关键在于提出AxonAD,一种基于多头注意力机制查询向量演化预测的无监督异常检测模型:通过引入一个仅依赖历史信息的预测器来预测未来查询向量,并结合梯度更新的重建路径,利用掩码预测-目标优化策略(以指数移动平均目标编码器为基准)进行训练;推理阶段则融合重建误差与尾部聚合的查询不匹配得分(即近期时间步上预测查询与目标查询的余弦偏差),从而同时具备对结构依赖性偏移和幅度异常的敏感性。

链接: https://arxiv.org/abs/2603.12916
作者: Kadir-Kaan Özer,René Ebeling,Markus Enzweiler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Main: 17 Pages, 7 Figures, 3 Tables; Appendix: 3 Pages, 4 Tables

点击查看摘要

Abstract:Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration. Residual-based detectors can miss such anomalies when flexible sequence models still reconstruct signals plausibly despite altered coordination. We introduce AxonAD, an unsupervised detector that treats multi-head attention query evolution as a short horizon predictable process. A gradient-updated reconstruction pathway is coupled with a history-only predictor that forecasts future query vectors from past context. This is trained via a masked predictor-target objective against an exponential moving average (EMA) target encoder. At inference, reconstruction error is combined with a tail-aggregated query mismatch score, which measures cosine deviation between predicted and target queries on recent timesteps. This dual approach provides sensitivity to structural dependency shifts while retaining amplitude-level detection. On proprietary in-vehicle telemetry with interval annotations and on the TSB-AD multi-variate suite (17 datasets, 180 series) with threshold-free and range-aware metrics, AxonAD improves ranking quality and temporal localization over strong baselines. Ablations confirm that query prediction and combined scoring are the primary drivers of the observed gains. Code is available at the URL this https URL.
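The two ingredients of the anomaly score, cosine mismatch between predicted and target queries and the EMA target encoder, can be sketched directly; plain lists stand in for attention query tensors.

```python
import math

def cosine_mismatch(predicted, target):
    """1 minus cosine similarity between predicted and target query
    vectors; higher values indicate a structural dependency shift."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)

def ema_update(target_params, online_params, decay=0.99):
    """Exponential moving average update of the target encoder."""
    return [decay * t + (1 - decay) * o
            for t, o in zip(target_params, online_params)]

aligned = cosine_mismatch([1.0, 0.0], [2.0, 0.0])  # same direction -> 0
shifted = cosine_mismatch([1.0, 0.0], [0.0, 1.0])  # orthogonal -> 1
params = ema_update([0.0, 0.0], [1.0, 1.0])
```

The mismatch term is scale-invariant by construction, so it reacts to changed query directions (coordination) rather than magnitudes, complementing the amplitude-sensitive reconstruction error.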

[AI-17] Hierarchical Reference Sets for Robust Unsupervised Detection of Scattered and Clustered Outliers

【速读】:该论文旨在解决物联网(IoT)数据中因存在聚集型异常点(clustered outliers)而导致的异常检测困难问题。这类异常点由于局部密度较高,容易被误判为正常行为,从而掩盖了散点型异常(scattered outliers)和上下文相关异常的识别。解决方案的关键在于提出一种基于图结构(graph structure)的新颖异常检测范式,通过利用节点间的自然邻近关系,在局部与全局尺度上构建参考集(reference sets),实现多视角的异常评估。该方法能够有效分离并识别散点型异常,同时保留并隔离聚集型异常群组,从而提升整体异常检测的准确性与鲁棒性。

链接: https://arxiv.org/abs/2603.12847
作者: Yiqun Zhang,Zexi Tan,Xiaopeng Luo,Yunlin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 9 figures

点击查看摘要

Abstract:Most real-world IoT data analysis tasks, such as clustering and anomaly event detection, are unsupervised and highly susceptible to the presence of outliers. In addition to sporadic scattered outliers caused by factors such as faulty sensor readings, IoT systems often exhibit clustered outliers. These occur when multiple devices or nodes produce similar anomalous measurements, for instance, owing to localized interference, emerging security threats, or regional false alarms, forming micro-clusters. These clustered outliers can be easily mistaken for normal behavior because of their relatively high local density, thereby obscuring the detection of both scattered and contextual anomalies. To address this, we propose a novel outlier detection paradigm that leverages the natural neighboring relationships using graph structures. This facilitates multi-perspective anomaly evaluation by incorporating reference sets at both local and global scales derived from the graph. Our approach enables the effective recognition of scattered outliers without interference from clustered anomalies, whereas the graph structure simultaneously helps reflect and isolate clustered outlier groups. Extensive experiments, including comparative performance analysis, ablation studies, validation on downstream clustering tasks, and evaluation of hyperparameter sensitivity, demonstrate the efficacy of the proposed method. The source code is available at this https URL.

[AI-18] DAST: A Dual-Stream Voice Anonymization Attacker with Staged Training

【速读】:该论文旨在解决语音匿名化(voice anonymization)技术在保护说话人隐私方面的局限性问题,即尽管语音内容得以保留,但仍可能泄露与说话人相关的特征。为增强隐私评估的准确性与鲁棒性,作者提出了一种双流攻击者模型(dual-stream attacker),其核心在于通过并行编码器融合频谱特征与自监督学习特征,并采用三阶段训练策略:第一阶段构建基础的说话人判别表征;第二阶段利用语音转换与匿名化共享的身份变换特性,引入多样化转换语音以提升跨系统鲁棒性;第三阶段对目标匿名数据进行轻量级适应。实验表明,第二阶段是泛化能力的主要驱动力,而第三阶段仅需10%的目标数据微调即可超越当前最优攻击者在等错误率(EER)上的表现。

链接: https://arxiv.org/abs/2603.12840
作者: Ridwan Arefeen,Xiaoxiao Miao,Rong Tong,Aik Beng Ng,Simon See,Timothy Liu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encoders with a three-stage training strategy. Stage I establishes foundational speaker-discriminative representations. Stage II leverages the shared identity-transformation characteristics of voice conversion and anonymization, exposing the model to diverse converted speech to build cross-system robustness. Stage III provides lightweight adaptation to target anonymized data. Results on the VoicePrivacy Attacker Challenge (VPAC) dataset demonstrate that Stage II is the primary driver of generalization, enabling strong attacking performance on unseen anonymization datasets. With Stage III, fine-tuning on only 10% of the target anonymization dataset surpasses current state-of-the-art attackers in terms of EER.
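Attacker strength above is reported as the equal error rate (EER); a small threshold-sweep approximation over genuine and impostor similarity scores shows how such a figure is computed (the score values are toy data, and real evaluations interpolate the ROC more carefully).

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER: sweep thresholds over the observed scores and
    return the operating point where the false-reject rate (genuine
    trials below threshold) and false-accept rate (impostor trials at
    or above threshold) are closest, reported as their average."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine + impostor)):
        frr = sum(g < t for g in genuine) / len(genuine)
        far = sum(i >= t for i in impostor) / len(impostor)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

eer = equal_error_rate(genuine=[0.9, 0.8, 0.7, 0.4],
                       impostor=[0.5, 0.3, 0.2, 0.1])
```

For an attacker evaluation, a *higher* EER means the anonymization resists re-identification better, which is why the challenge ranks attackers by how low they can drive it.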

[AI-19] Mask2Flow-TSE: Two-Stage Target Speaker Extraction with Masking and Flow Matching INTERSPEECH2026

【速读】:该论文旨在解决目标说话人提取(Target Speaker Extraction, TSE)任务中现有方法的局限性:判别式方法虽推理速度快但易过度抑制目标语音信号,而生成式方法虽能合成高质量语音却需大量迭代步骤。其解决方案的关键在于提出一种两阶段框架Mask2Flow-TSE,第一阶段采用判别式掩蔽进行粗分离,第二阶段引入流匹配(flow matching)技术对掩蔽谱图进行精细化重构,从而在单次推理中实现高保真语音重建,兼具生成式方法的音质优势与判别式方法的高效性。

链接: https://arxiv.org/abs/2603.12837
作者: Junwon Moon,Hyunjin Choi,Hansol Park,Heeseung Kim,Kyuhong Shim
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:Target speaker extraction (TSE) extracts the target speaker’s voice from overlapping speech mixtures given a reference utterance. Existing approaches typically fall into two categories: discriminative and generative. Discriminative methods apply time-frequency masking for fast inference but often over-suppress the target signal, while generative methods synthesize high-quality speech at the cost of numerous iterative steps. We propose Mask2Flow-TSE, a two-stage framework combining the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage employs flow matching to refine the output toward target speech. Unlike generative approaches that synthesize speech from Gaussian noise, our method starts from the masked spectrogram, enabling high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE achieves comparable performance to existing generative TSE methods with approximately 85M parameters.
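The key design choice, starting the flow from the masked spectrogram rather than Gaussian noise, can be sketched with a linear flow-matching target and a single Euler step; spectrograms are reduced to flat lists for illustration.

```python
def flow_target(x_masked, x_clean):
    """Linear flow-matching velocity target: the model learns to predict
    the displacement from the coarse masked spectrogram to the clean one."""
    return [c - m for m, c in zip(x_masked, x_clean)]

def refine_one_step(x_masked, velocity):
    """Single Euler step starting from the masked spectrogram (not from
    Gaussian noise), which is what permits one-step inference."""
    return [m + v for m, v in zip(x_masked, velocity)]

x_masked = [0.2, 0.0, 0.5]  # coarse stage-1 output (toy values)
x_clean = [0.3, 0.1, 0.4]   # target speech spectrogram (toy values)
v = flow_target(x_masked, x_clean)
x_refined = refine_one_step(x_masked, v)
```

With a perfect velocity prediction one step recovers the target exactly; in practice the learned velocity is approximate, but the short start-to-target distance is what keeps a single step sufficient.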

[AI-20] Context is all you need: Towards autonomous model-based process design using agentic AI in flowsheet simulations

【速读】:该论文旨在解决工业流程模拟环境中化学过程流程图建模(flowsheet modelling)自动化程度低的问题,尤其是在集成大型语言模型(Large Language Models, LLMs)与推理及工具调用能力的智能体(Agentic AI)系统尚未在该领域得到充分应用的背景下。其解决方案的关键在于构建一个基于多智能体协作的框架:其中一个智能体利用工程知识抽象地解析和规划工艺流程问题,另一个智能体则将该解决方案转化为符合自研流程模拟工具Chemasim语法规范的代码;该框架通过引入如Claude Opus 4.6等先进LLM,并以技术文档和少量注释示例作为上下文,显著提升了代码生成的准确性与可用性,从而实现从概念设计到可执行流程模型的端到端辅助建模。

链接: https://arxiv.org/abs/2603.12813
作者: Pascal Schäfer,Lukas J. Krinke,Martin Wlotzka,Norbert Asprion
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic AI systems integrating large language models (LLMs) with reasoning and tool-use capabilities are transforming various domains - in particular, software development. In contrast, their application in chemical process flowsheet modelling remains largely unexplored. In this work, we present an agentic AI framework that delivers assistance in an industrial flowsheet simulation environment. To this end, we show the capabilities of GitHub Copilot (GitHub, Inc., 2026), when using state-of-the-art LLMs, such as Claude Opus 4.6 (Anthropic, PBC, 2026), to generate valid syntax for our in-house process modelling tool Chemasim using the technical documentation and a few commented examples as context. Based on this, we develop a multi-agent system that decomposes process development tasks with one agent solving the abstract problem using engineering knowledge and another agent implementing the solution as Chemasim code. We demonstrate the effectiveness of our framework for typical flowsheet modelling examples, including (i) a reaction/separation process, (ii) a pressure-swing distillation, and (iii) a heteroazeotropic distillation including entrainer selection. Along these lines, we discuss current limitations of the framework and outline future research directions to further enhance its capabilities.

[AI-21] AI Model Modulation with Logits Redistribution

【速读】:该论文旨在解决大规模模型在适配不同用户需求时需维护多个专用版本所导致的效率低下问题。其核心解决方案是提出一种名为AIM(Adaptive Inference Modulation)的新型模型调制范式,通过引入两种关键调制模式——效用调制(utility modulation)和聚焦调制(focus modulation),使单一模型能够动态调整输出质量与关注输入特征,从而满足多样化的终端需求。AIM的关键创新在于其无训练数据依赖且无需重新训练的logits重分配策略,并基于logits排序的联合概率分布构建了形式化理论基础,确保调制能力的可证明性与稳定性。

链接: https://arxiv.org/abs/2603.12755
作者: Zihan Wang,Zhongkui Ma,Xinguo Feng,Zhiyang Mei,Ethan Ma,Derui Wang,Minhui Xue,Guangdong Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The 2025 ACM Web Conference

点击查看摘要

Abstract:Large-scale models are typically adapted to meet the diverse requirements of model owners and users. However, maintaining multiple specialized versions of the model is inefficient. In response, we propose AIM, a novel model modulation paradigm that enables a single model to exhibit diverse behaviors to meet the specific end requirements. AIM enables two key modulation modes: utility and focus modulations. The former provides model owners with dynamic control over output quality to deliver varying utility levels, and the latter offers users precise control to shift model’s focused input features. AIM introduces a logits redistribution strategy that operates in a training data-agnostic and retraining-free manner. We establish a formal foundation to ensure AIM’s regulation capability, based on the statistical properties of logits ordering via joint probability distributions. Our evaluation confirms AIM’s practicality and versatility for AI model modulation, with tasks spanning image classification, semantic segmentation and text generation, and prevalent architectures including ResNet, SegFormer and Llama.
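摘要提到的“效用调制”依赖一种无需重训的 logits 重分配策略。下面给出一个假设性的 Python 草图(函数名与具体公式均为示意,并非论文实现):通过把 logits 向其均值收缩来平滑输出分布,utility=1 时保持原始输出,utility=0 时完全压平,同时全程保持 logits 的原有排序。

```python
import math

def modulate_logits(logits, utility=1.0):
    # 假设性示意: 将logits向均值收缩, utility=1保持原样,
    # utility=0压平为均匀分布; 不改变原有排序
    mean = sum(logits) / len(logits)
    return [mean + utility * (z - mean) for z in logits]

def softmax(logits):
    # 数值稳定的softmax
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs_full = softmax(modulate_logits([2.0, 1.0, 0.1], utility=1.0))
probs_flat = softmax(modulate_logits([2.0, 1.0, 0.1], utility=0.0))
```

该草图仅展示“保持排序、调节输出质量”这一思路;论文中基于 logits 排序联合分布的形式化保证以原文为准。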

[AI-22] TaoBench: Do Automated Theorem Prover LLMs Generalize Beyond MathLib?

【速读】:该论文旨在解决当前自动化定理证明(Automated Theorem Proving, ATP)系统在标准数学库(如MathLib)框架下训练和评估时所存在的偏差问题,即这些系统对MathLib的定义体系过度依赖,难以有效处理前沿数学中常见的自定义构造(bespoke constructions)。为解决这一问题,作者提出了一种新的基准测试集TaoBench,其源自Terence Tao的《Analysis I》,通过从零开始构建核心数学概念(而非直接调用MathLib定义),并混合使用自定义与MathLib构造来模拟真实研究场景。关键创新在于构建了一个代理式(agentic)流水线,可自动提取每个问题的独立可编译环境,并将每个TaoBench命题翻译为数学等价的MathLib形式,从而实现跨定义框架的成对比较。实验表明,尽管先进ATP模型在MathLib上表现良好,但在等价的Tao框架下平均性能下降约26%,揭示了当前方法的核心瓶颈在于定义框架间的泛化能力不足,而非任务本身难度。

链接: https://arxiv.org/abs/2603.12744
作者: Alexander K Taylor,Junyi Zhang,Ethan Ji,Vigyan Sahai,Haikang Deng,Yuanzhou Chen,Yifan Yuan,Di Wu,Jia-Chen Gu,Kai-Wei Chang,Nanyun Peng,Amit Sahai,Wei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Automated theorem proving (ATP) benchmarks largely consist of problems formalized in MathLib, so current ATP training and evaluation are heavily biased toward MathLib’s definitional framework. However, frontier mathematics is often exploratory and prototype-heavy, relying on bespoke constructions that deviate from standard libraries. In this work, we evaluate the robustness of current ATP systems when applied to a novel definitional framework, specifically examining the performance gap between standard library problems and bespoke mathematical constructions. We introduce TaoBench, an undergraduate-level benchmark derived from Terence Tao’s Analysis I, which formalizes analysis by constructing core mathematical concepts from scratch, without relying on standard Mathlib definitions, as well as by mixing from-scratch and MathLib constructions. For fair evaluation, we build an agentic pipeline that automatically extracts a compilable, self-contained local environment for each problem. To isolate the effect of definitional frameworks, we additionally translate every problem into a mathematically equivalent Mathlib formulation, yielding paired TaoBench-Mathlib statements for direct comparison. While state-of-the-art ATP models perform capably within the MathLib framework, performance drops by an average of roughly 26% on the definitionally equivalent Tao formulation. This indicates that the main bottleneck is limited generalization across definitional frameworks rather than task difficulty. TaoBench thus highlights a gap between benchmark performance and applicability, and provides a concrete foundation for developing and testing provers better aligned with research mathematics.
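下面用一段假设性的 Lean 4 片段直观展示 TaoBench 所考察的“定义框架差异”:同一个命题既可以在从零构建的自定义自然数类型上陈述与证明,也可以直接用标准库的 `Nat` 表达(仅为示意,并非基准中的真实题目)。

```lean
-- 从零构建的自然数(Tao式"bespoke"定义)
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

def MyNat.add : MyNat → MyNat → MyNat
  | n, MyNat.zero   => n
  | n, MyNat.succ m => MyNat.succ (MyNat.add n m)

-- 在自定义框架下陈述并证明 n + 0 = n
theorem myAdd_zero (n : MyNat) : MyNat.add n MyNat.zero = n := rfl

-- 与之数学等价的MathLib/标准库版本
example (n : Nat) : n + 0 = n := rfl
```

对人类而言两者几乎一样,但论文发现 ATP 模型在前者上的平均表现比后者低约 26%。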

[AI-23] ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning ICLR2026

【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在执行多步骤任务时,因依赖贪婪、反应式的工具选择策略而缺乏前瞻性规划能力,且未能有效处理工具间的依赖关系的问题。其解决方案的关键在于提出ToolTree——一种受蒙特卡洛树搜索(Monte Carlo Tree Search)启发的新型工具规划范式,通过双阶段LLM评估与双向剪枝机制,在工具使用轨迹探索中实现更智能、自适应的决策,同时在工具执行前和执行后对低潜力分支进行剪枝,从而在保证最高效率的同时显著提升规划性能,实证表明相较最先进方法平均提升约10%。

链接: https://arxiv.org/abs/2603.12740
作者: Shuo Yang,Soyeon Caren Han,Yihao Ding,Shuhe Wang,Eduard Hoy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Large Language Model (LLM) agents are increasingly applied to complex, multi-step tasks that require interaction with diverse external tools across various domains. However, current LLM agent tool planning methods typically rely on greedy, reactive tool selection strategies that lack foresight and fail to account for inter-tool dependencies. In this paper, we present ToolTree, a novel Monte Carlo tree search-inspired planning paradigm for tool planning. ToolTree explores possible tool usage trajectories using a dual-stage LLM evaluation and bidirectional pruning mechanism that enables the agent to make informed, adaptive decisions over extended tool-use sequences while pruning less promising branches before and after the tool execution. Empirical evaluations across both open-set and closed-set tool planning tasks on 4 benchmarks demonstrate that ToolTree consistently improves performance while keeping the highest efficiency, achieving an average gain of around 10% compared to the state-of-the-art planning paradigm.
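ToolTree 的搜索骨架源自蒙特卡洛树搜索。其双阶段 LLM 评估与双向剪枝细节未在摘要中展开,下面仅给出 MCTS 中标准 UCB1 选择环节的通用草图(字段名为示意),说明“在平均价值与探索不足之间权衡”这一选择逻辑:

```python
import math

def ucb_select(children, c=1.4):
    # MCTS标准UCB1选择: 价值高或访问次数少的子节点得分更高
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # 未访问节点优先展开
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

# 示例: 访问次数少的节点即使均值较低也可能被优先探索
children = [{"visits": 10, "value": 8.0}, {"visits": 1, "value": 0.5}]
best = ucb_select(children)
```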

[AI-24] SRAM-Based Compute-in-Memory Accelerator for Linear-decay Spiking Neural Networks

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在硬件加速中因神经元膜电位状态更新步骤存在串行计算瓶颈而导致的吞吐量受限与能效低下问题。当前多数基于存算一体(Compute-in-Memory, CIM)架构的加速器虽能高效并行化突触运算(W × I),实现矩阵-向量乘法的O(1)复杂度,但后续的神经元状态更新仍需O(N)时间完成全部膜电位刷新,成为主要延迟和能耗瓶颈。解决方案的关键在于算法与硬件协同优化:首先,在算法层面采用线性衰减漏积分放火(Linear Decay Leaky Integrate-and-Fire, LD-LIF)神经元模型替代传统指数衰减,将高代价乘法操作简化为加法,仅导致约1%精度损失;其次,在架构层面设计一种基于SRAM的片上并行更新机制,直接在存储阵列内执行就地衰减,避免全局串行更新。此方法在基准SNN任务上实现了1.1×至16.7×的SOP能量降低,并带来15.9×至69×的能效提升,同时保持近似原始模型的精度,验证了对状态更新动态进行优化是实现可扩展、低功耗、实时类脑计算的关键路径。

链接: https://arxiv.org/abs/2603.12739
作者: Hongyang Shang,Shuai Dong,Yahan Yang,Junyi Yang,Peng Zhou,Arindam Basu
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have emerged as a biologically inspired alternative to conventional deep networks, offering event-driven and energy-efficient computation. However, their throughput remains constrained by the serial update of neuron membrane states. While many hardware accelerators and Compute-in-Memory (CIM) architectures efficiently parallelize the synaptic operation (W x I) achieving O(1) complexity for matrix-vector multiplication, the subsequent state update step still requires O(N) time to refresh all neuron membrane potentials. This mismatch makes state update the dominant latency and energy bottleneck in SNN inference. To address this challenge, we propose an SRAM-based CIM for SNN with Linear Decay Leaky Integrate-and-Fire (LD-LIF) Neuron that co-optimizes algorithm and hardware. At the algorithmic level, we replace the conventional exponential membrane decay with a linear decay approximation, converting costly multiplications into simple additions while accuracy drops only around 1%. At the architectural level, we introduce an in-memory parallel update scheme that performs in-place decay directly within the SRAM array, eliminating the need for global sequential updates. Evaluated on benchmark SNN workloads, the proposed method achieves a 1.1x to 16.7x reduction of SOP energy consumption, while providing 15.9x to 69x more energy efficiency, with negligible accuracy loss relative to original decay models. This work highlights that beyond accelerating the (W x I) computation, optimizing state-update dynamics within CIM architectures is essential for scalable, low-power, and real-time neuromorphic processing.
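线性衰减 LIF 神经元的核心改动,是用“减一个常数”替代指数衰减的乘法。下面是该更新规则的最小 Python 草图(参数取值为示意):

```python
def ld_lif_step(v, i_in, decay=0.2, v_th=1.0, v_reset=0.0):
    # 线性衰减LIF单步更新: 膜电位减去固定衰减量(只有加减, 无乘法),
    # 积分输入电流, 达到阈值则发放脉冲并复位
    v = max(v - decay, 0.0) + i_in
    if v >= v_th:
        return v_reset, 1  # 发放脉冲
    return v, 0

# 恒定输入电流驱动10个时间步
v, spikes = 0.0, []
for _ in range(10):
    v, s = ld_lif_step(v, i_in=0.4)
    spikes.append(s)
```

与指数衰减 `v *= beta` 相比,这里每步只需一次减法,这正是论文在 SRAM 阵列内并行就地执行衰减的前提。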

[AI-25] On Using Machine Learning to Early Detect Catastrophic Failures in Marine Diesel Engines

【速读】:该论文旨在解决海洋发动机突发性灾难性故障(catastrophic failures)早期检测难题,这类故障具有突发性和不可预测性,易导致系统功能丧失甚至永久损坏,对航行安全构成严重威胁。传统方法通常关注传感器信号的偏差,但难以在故障达到临界阈值前提供预警。本文提出了一种基于实际故障数据的新方法,其关键在于通过计算实际传感器读数与预期值之间偏差的导数(derivative),捕捉异常动态变化趋势,从而提前识别潜在危险事件。该方法利用随机森林(Random Forest)作为最优机器学习算法进行预测,在测量值未触发工业标准报警阈值之前即可发出预警,使操作人员有足够时间采取预防措施,如停机、调整航线并规避障碍物。仿真与真实数据验证均表明该方案具备良好的有效性、鲁棒性和工程适用性。

链接: https://arxiv.org/abs/2603.12733
作者: Francesco Maione,Paolo Lino,Giuseppe Giannino,Guido Maione
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Catastrophic failures of marine engines imply severe loss of functionality and destroy or damage the systems irreversibly. Being sudden and often unpredictable events, they pose a severe threat to navigation, crew, and passengers. The abrupt nature makes early detection the only effective countermeasure. However, research has concentrated on modeling the gradual degradation of components, with limited attention to sudden and anomalous phenomena. This work proposes a new method for early detection of catastrophic failures. Based on real data from a failed engine, the approach evaluates the derivatives of the deviation between actual sensor readings and expected values of engine variables. Predictions are obtained by a Random Forest, which is the most suitable Machine Learning algorithm among the tested ones. Traditional methods focus on deviations of monitored signals, whereas the proposed approach employs the derivatives of the deviations to provide earlier indications of abnormal dynamics, and to alert that a rapid and dangerous event is breaking out within the system. The method allows the detection of anomalies before measurements reach critical thresholds and alarms are triggered, which is the common method in industry. Consequently, operators can be warned in advance and shut down the engine, then prevent damage and unexpected power loss. Moreover, they have the time to safely change the ship route and avoid potential obstacles. Simulation results confirm the effectiveness of the proposed approach in anticipating occurrence of catastrophic failures. Validation on real-world data further reinforces the robustness and practical applicability of the method. It is worth noting that data acquisition to train the predictive algorithm is not a problem, since a Deep Learning-based data augmentation procedure is used.
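论文的预警信号是“实测值与期望值偏差的导数”而非偏差本身。下面用有限差分给出这一信号的最小 Python 草图(数据为虚构示例,仅展示计算方式):

```python
def deviation_derivatives(actual, expected, dt=1.0):
    # 预警信号: 实测值与期望值偏差的一阶差分(导数近似);
    # 偏差导数持续增大说明异常动态正在加速, 可在原始测量值
    # 触发固定报警阈值之前预警
    dev = [a - e for a, e in zip(actual, expected)]
    return [(dev[i + 1] - dev[i]) / dt for i in range(len(dev) - 1)]

# 示例: 期望值恒为100, 实测值加速偏离
actual = [100.0, 100.1, 100.4, 101.0, 102.2]
derivs = deviation_derivatives(actual, [100.0] * 5)
```

论文中这些导数序列随后交给随机森林做预测;此处仅示意特征本身的构造。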

[AI-26] Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction

【速读】:该论文旨在解决在相同训练数据和训练步数条件下,对比**上下文操作符学习(in-context operator learning)**与传统单操作符学习(classical operator learning)的性能差异问题,以明确前者是否在真实世界的时空系统中具有更优的泛化能力和样本效率。其解决方案的关键在于提出GICON(Graph In-Context Operator Network),该模型通过图消息传递机制实现几何泛化(geometric generalization),并引入示例感知的位置编码(example-aware positional encoding)实现基数泛化(cardinality generalization),从而在不更新模型权重的前提下,利用上下文样例有效推断解算子(solution operator)。实验表明,该方法在跨区域空气质量预测任务中显著优于传统方法,尤其在少量训练样本下展现出更强的鲁棒性和可扩展性。

链接: https://arxiv.org/abs/2603.12725
作者: Chenghan Wu,Zongmin Yu,Boai Sun,Liu Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 figures, 2 tables

点击查看摘要

Abstract:In-context operator learning enables neural networks to infer solution operators from contextual examples without weight updates. While prior work has demonstrated the effectiveness of this paradigm in leveraging vast datasets, a systematic comparison against single-operator learning using identical training data has been absent. We address this gap through controlled experiments comparing in-context operator learning against classical operator learning (single-operator models trained without contextual examples), under the same training steps and dataset. To enable this investigation on real-world spatiotemporal systems, we propose GICON (Graph In-Context Operator Network), combining graph message passing for geometric generalization with example-aware positional encoding for cardinality generalization. Experiments on air quality prediction across two Chinese regions show that in-context operator learning outperforms classical operator learning on complex tasks, generalizing across spatial domains and scaling robustly from few training examples to 100 at inference.

[AI-27] Altered Thoughts Altered Actions: Probing Chain-of-Thought Vulnerabilities in VLA Robotic Manipulation

【速读】:该论文旨在解决生成式视觉-语言-动作(Vision-Language-Action, VLA)模型中内部推理轨迹(reasoning trace)的脆弱性问题,即:尽管这些模型依赖自然语言形式的中间计划来指导物理动作执行,但其对推理文本内容的鲁棒性尚未被系统评估,尤其是是否存在仅通过篡改推理文本即可显著降低机器人任务成功率的攻击方式。解决方案的关键在于设计了一套分层的七类文本扰动方法(分为盲噪声、机械语义级和大语言模型自适应三级攻击),并应用于40个LIBERO桌面操作任务中,发现仅对象名称替换会引发高达19.3个百分点的任务成功率下降,而其他如句序打乱、空间方向反转或由大型语言模型生成看似合理实则错误的计划均影响甚微(≤±4 pp)。这一不对称结果表明,动作解码器高度依赖于实体引用的完整性,而非推理逻辑或结构质量,揭示了推理模块中的实体指称一致性是关键脆弱点,且该漏洞在非推理型VLA中不存在,说明其为推理增强架构特有且难以被输入验证防御检测的隐蔽威胁向量。

链接: https://arxiv.org/abs/2603.12717
作者: Tuan Duong Trinh,Naveed Akhtar,Basim Azam
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent Vision-Language-Action (VLA) models increasingly adopt chain-of-thought (CoT) reasoning, generating a natural-language plan before decoding motor commands. This internal text channel between the reasoning module and the action decoder has received no adversarial scrutiny. We ask: which properties of this intermediate plan does the action decoder actually rely on, and can targeted corruption of the reasoning trace alone – with all inputs left intact – degrade a robot’s physical task performance? We design a taxonomy of seven text corruptions organized into three attacker tiers (blind noise, mechanical-semantic, and LLM-adaptive) and apply them to a state-of-the-art reasoning VLA across 40 LIBERO tabletop manipulation tasks. Our results reveal a striking asymmetry: substituting object names in the reasoning trace reduces overall success rate by 8.3 percentage points (pp) – reaching -19.3 pp on goal-conditioned tasks and -45 pp on individual tasks – whereas sentence reordering, spatial-direction reversal, token noise, and even a 70B-parameter LLM crafting plausible-but-wrong plans all have negligible impact (within ±4 pp). This asymmetry indicates that the action decoder depends on entity-reference integrity rather than reasoning quality or sequential structure. Notably, a sophisticated LLM-based attacker underperforms simple mechanical object-name substitution, because preserving plausibility inadvertently retains the entity-grounding structure the decoder needs. A cross-architecture control using a non-reasoning VLA confirms the vulnerability is exclusive to reasoning-augmented models, while instruction-level attacks degrade both architectures – establishing that the internal reasoning trace is a distinct and stealthy threat vector invisible to input-validation defenses.

[AI-28] Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)推理过程中因视觉编码与语言生成阶段硬件需求差异导致的资源调度效率低下问题。其核心挑战在于:视觉编码为计算密集型(compute-bound),而语言生成则受内存带宽限制(memory-bandwidth-bound),传统基于阶段级拆分(stage-level disaggregation)的部署方式需在设备间传输海量KV缓存(O(L * s_ctx)字节),造成高通信开销。解决方案的关键在于采用模态级拆分(modality-level partitioning),将模型在视觉编码器与语言模型之间断开,仅传输低维度嵌入向量(O(N_v * d)字节),实现跨设备通信复杂度从GB级降至MB级,显著降低传输压力。该策略使系统可在通用PCIe互连下支持异构部署(heterogeneous serving),相较NVLink等高速互联方案更具成本优势,并通过构建相位感知运行时HeteroServe验证了其有效性,在相同硬件条件下吞吐量提升达54%,且在固定预算下异构集群相比同构集群可提升37%的每美元token产出(Tokens/$)。

链接: https://arxiv.org/abs/2603.12707
作者: Donglin Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from O(L * s_ctx) bytes (GB-scale KV caches under stage-level disaggregation) to O(N_v * d) bytes (MB-scale embeddings), an O(L) reduction where L is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4xA100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster ($38k) improves Tokens/$ by 37% over a homogeneous baseline ($64k) without degrading latency.
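摘要中 O(L * s_ctx) 与 O(N_v * d) 的传输量差距可以直接算出来。下面的 Python 草图按摘要给出的复杂度估算两种拆分点的传输字节数;代入的层数、上下文长度、维度与视觉 token 数为示意性取值(非论文实测配置):

```python
def kv_cache_bytes(L, s_ctx, d, bytes_per=2):
    # 阶段级拆分需传输的KV缓存: 每层、每个token的K与V各一份
    return 2 * L * s_ctx * d * bytes_per

def embedding_bytes(N_v, d, bytes_per=2):
    # 模态级拆分只需传输视觉嵌入: 每个视觉token一个向量, 与深度L无关
    return N_v * d * bytes_per

# 假设性数值: 32层、2K上下文、4096维隐层、576个视觉token、fp16
kv = kv_cache_bytes(L=32, s_ctx=2048, d=4096)   # GB量级
emb = embedding_bytes(N_v=576, d=4096)          # MB量级
```

在这组假设参数下两者相差两个数量级以上,这解释了为何模态级拆分能跑在普通 PCIe 互连上。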

[AI-29] Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers

【速读】:该论文旨在解决联邦聚类(Federated Clustering, FC)中两个关键挑战:一是客户端数据的真实聚类数 $k^*$ 通常未知且簇大小不均衡,二是联邦学习中的隐私保护传输限制导致可用信息减少,从而影响聚类的准确性和鲁棒性。解决方案的关键在于提出一种名为 Fed-$k^*$-HC 的新型联邦聚类框架,其核心创新是通过层次聚类(hierarchical clustering)自动确定最优聚类数 $k^*$:每个客户端首先生成微子簇(micro-subclusters),并将它们的原型上传至服务器进行层次合并;利用基于密度的合并策略,可有效识别不同大小和形状的簇,并通过原型间的邻近关系实现渐进式合并并自终止,从而精准估计全局数据分布下的最优聚类数 $k^*$。

链接: https://arxiv.org/abs/2603.12684
作者: Yue Zhang,Chuanlong Qiu,Xinfa Liao,Yiqun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 29 pages, 7 figures

点击查看摘要

Abstract:Federated Clustering (FC) is an emerging and promising solution in exploring data distribution patterns from distributed and privacy-protected data in an unsupervised manner. Existing FC methods implicitly rely on the assumption that clients are with a known number of uniformly sized clusters. However, the true number of clusters is typically unknown, and cluster sizes are naturally imbalanced in real scenarios. Furthermore, the privacy-preserving transmission constraints in federated learning inevitably reduce usable information, making the development of robust and accurate FC extremely challenging. Accordingly, we propose a novel FC framework named Fed-k^*-HC, which can automatically determine an optimal number of clusters k^* based on the data distribution explored through hierarchical clustering. To obtain the global data distribution for k^* determination, we let each client generate micro-subclusters. Their prototypes are then uploaded to the server for hierarchical merging. The density-based merging design allows exploring clusters of varying sizes and shapes, and the progressive merging process can self-terminate according to the neighboring relationships among the prototypes to determine k^* . Extensive experiments on diverse datasets demonstrate the FC capability of the proposed Fed-k^*-HC in accurately exploring a proper number of clusters.
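服务器端“渐进合并并自终止”的思路可以用一个极简草图示意:反复融合最近的一对客户端原型,直到最近对的距离超过阈值即停止,存活的簇数就是对 k^* 的估计。以下为示意性实现(距离阈值合并是对论文基于密度合并策略的简化,细节以原文为准):

```python
def merge_prototypes(protos, merge_dist):
    # 层次合并草图: 反复融合最近的一对原型, 最近对距离超过
    # merge_dist时自终止, 返回存活簇数作为k*的估计
    protos = [list(p) for p in protos]
    while len(protos) > 1:
        best = None
        for i in range(len(protos)):
            for j in range(i + 1, len(protos)):
                d = sum((a - b) ** 2
                        for a, b in zip(protos[i], protos[j])) ** 0.5
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > merge_dist:
            break  # 自终止
        merged = [(a + b) / 2 for a, b in zip(protos[i], protos[j])]
        protos = [p for k, p in enumerate(protos) if k not in (i, j)]
        protos.append(merged)
    return len(protos)

# 两个客户端各上传两个原型, 实际上构成两个簇
k_star = merge_prototypes([[0, 0], [0.1, 0], [5, 5], [5.1, 5]], merge_dist=1.0)
```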

[AI-30] RetroReason er: A Reasoning LLM for Strategic Retrosynthesis Prediction

【速读】:该论文旨在解决传统逆合成预测(retrosynthesis prediction)中缺乏战略推理的问题,即现有方法往往仅基于产物的通用分析生成反应物,而未能显式地模拟化学家在选择特定键断开策略时的逻辑思维过程。解决方案的关键在于提出RetroReasoner模型,该模型通过监督微调(SFT)和强化学习(RL)联合训练:SFT阶段引入SyntheticRetro框架以生成结构化的断键推理理由;RL阶段则采用往返准确率(round-trip accuracy)作为奖励信号,确保预测的反应物经正向合成模型推导后能准确还原原始产物。这一设计使模型不仅能提升预测准确性,还能在复杂反应场景下生成更广泛可行的反应物候选方案。

链接: https://arxiv.org/abs/2603.12666
作者: Hanbum Ko,Chanhui Lee,Ye Rin Kim,Rodrigo Hormazabal,Sehui Han,Sungbin Lim,Sungwoong Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 18 figures

点击查看摘要

Abstract:Retrosynthesis prediction is a core task in organic synthesis that aims to predict reactants for a given product molecule. Traditionally, chemists select a plausible bond disconnection and derive corresponding reactants, which is time-consuming and requires substantial expertise. While recent advancements in molecular large language models (LLMs) have made progress, many methods either predict reactants without strategic reasoning or conduct only a generic product analysis, rather than reason explicitly about bond-disconnection strategies that logically lead to the choice of specific reactants. To overcome these limitations, we propose RetroReasoner, a retrosynthetic reasoning model that leverages chemists’ strategic thinking. RetroReasoner is trained using both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we introduce SyntheticRetro, a framework that generates structured disconnection rationales alongside reactant predictions. In the case of RL, we apply a round-trip accuracy as reward, where predicted reactants are passed through a forward synthesis model, and predictions are rewarded when the forward-predicted product matches the original input product. Experimental results show that RetroReasoner not only outperforms prior baselines but also generates a broader range of feasible reactant proposals, particularly in handling more challenging reaction instances.
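RL 阶段的往返准确率奖励逻辑很简单:预测反应物经正向合成模型还原出原产物才给奖励。下面用一个玩具“正向模型”示意该奖励的构造(玩具模型只是把反应物排序拼接,与真实化学无关):

```python
def round_trip_reward(product, predicted_reactants, forward_model):
    # 往返奖励: forward(预测反应物)与原产物一致时奖励1, 否则0
    return 1.0 if forward_model(predicted_reactants) == product else 0.0

# 玩具正向模型: 把反应物排序后拼接当作"产物"(仅为示意)
def toy_forward(reactants):
    return "+".join(sorted(reactants))
```

实际系统中 `forward_model` 是一个训练好的正向合成预测模型,匹配判定通常基于规范化 SMILES 而非字符串拼接。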

[AI-31] LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing

【速读】:该论文旨在解决基于混合专家(Mixture-of-Experts, MoE)的大语言模型(Large Language Models, LLMs)在部署过程中面临的高内存需求问题,尤其是由于大量专家模块加载导致的存储瓶颈。现有专家压缩方法如剪枝或合并常因不可逆的知识损失或高昂的训练开销而效果受限。其解决方案的关键在于提出一种全新的专家压缩范式——“专家替换”(expert replacing),即用参数高效的模块替代冗余专家,并通过低代价训练恢复其能力;在此基础上进一步构建LightMoE框架,引入自适应专家选择、分层专家结构和渐进式恢复策略,在保持高性能的同时显著提升内存效率与训练效率。

链接: https://arxiv.org/abs/2603.12645
作者: Jiawei Hao,Zhiwei Hao,Jianyuan Guo,Li Shen,Yong Luo,Han Hu,Dan Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) based Large Language Models (LLMs) have demonstrated impressive performance and computational efficiency. However, their deployment is often constrained by substantial memory demands, primarily due to the need to load numerous expert modules. While existing expert compression techniques like pruning or merging attempt to mitigate this, they often suffer from irreversible knowledge loss or high training overhead. In this paper, we propose a novel expert compression paradigm termed expert replacing, which replaces redundant experts with parameter-efficient modules and recovers their capabilities with low training costs. We find that even a straightforward baseline of this paradigm yields promising performance. Building on this foundation, we introduce LightMoE, a framework that enhances the paradigm by introducing adaptive expert selection, hierarchical expert construction, and an annealed recovery strategy. Experimental results show that LightMoE matches the performance of LoRA fine-tuning at a 30% compression ratio. Even under a more aggressive 50% compression rate, it outperforms existing methods and achieves average performance improvements of 5.6% across five diverse tasks. These findings demonstrate that LightMoE strikes a superior balance among memory efficiency, training efficiency, and model performance.

[AI-32] Spend Less Reason Better: Budget-Aware Value Tree Search for LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在测试时扩展(test-time scaling)过程中因缺乏预算意识而导致的资源浪费问题,即当前方法将计算资源视为无限,导致代理在冗余步骤或死胡同路径上耗尽token和工具调用预算。其核心解决方案是提出Budget-Aware Value Tree(BAVT),一种无需训练的推理时框架,通过在单一LLM骨干网络中建模多跳推理为动态搜索树,并引入基于剩余资源比例的节点选择机制——该机制以资源比率为自然缩放指数对节点价值进行加权,实现从广度探索到贪婪利用的平滑过渡。此外,BAVT采用残差价值预测器来评估相对进展而非绝对状态质量,从而有效抑制LLM自评估的过自信问题,实现可靠剪枝。理论分析进一步证明,在有限预算下,BAVT以至少 $1-\epsilon$ 的概率收敛至最终答案。

链接: https://arxiv.org/abs/2603.12634
作者: Yushu Li,Wenlong Deng,Jiajin Li,Xiaoxiao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-time scaling has become a dominant paradigm for improving LLM agent reliability, yet current approaches treat compute as an abundant resource, allowing agents to exhaust token and tool budgets on redundant steps or dead-end trajectories. Existing budget-aware methods either require expensive fine-tuning or rely on coarse, trajectory-level heuristics that cannot intervene mid-execution. We propose the Budget-Aware Value Tree (BAVT), a training-free inference-time framework that models multi-hop reasoning as a dynamic search tree guided by step-level value estimation within a single LLM backbone. Another key innovation is a budget-conditioned node selection mechanism that uses the remaining resource ratio as a natural scaling exponent over node values, providing a principled, parameter-free transition from broad exploration to greedy exploitation as the budget depletes. To combat the well-known overconfidence of LLM self-evaluation, BAVT employs a residual value predictor that scores relative progress rather than absolute state quality, enabling reliable pruning of uninformative or redundant tool calls. We further provide a theoretical convergence guarantee, proving that BAVT reaches a terminal answer with probability at least 1-\epsilon under an explicit finite budget bound. Extensive evaluations on four multi-hop QA benchmarks across two model families demonstrate that BAVT consistently outperforms parallel sampling baselines. Most notably, BAVT under strict low-budget constraints surpasses baseline performance at 4\times the resource allocation, establishing that intelligent budget management fundamentally outperforms brute-force compute scaling.
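“剩余预算比率作为节点价值的缩放指数”这一机制,一种可能的解读是把剩余比率当作温度:预算充足时各分支权重接近原始价值比例(广度探索),预算耗尽时分布急剧锐化(贪婪利用)。以下草图取 1/ρ 作为锐化指数来体现这一行为;具体公式以原文为准,此处仅为示意:

```python
def selection_weights(values, used, total):
    # 假设性的预算条件化打分: rho为剩余资源比率,
    # rho=1时权重接近价值比例, rho->0时几乎只选最高价值节点
    rho = max(1e-6, (total - used) / total)
    scores = [v ** (1.0 / rho) for v in values]
    s = sum(scores)
    return [x / s for x in scores]

w_full = selection_weights([0.9, 0.5], used=0, total=100)   # 预算充足
w_low = selection_weights([0.9, 0.5], used=99, total=100)   # 预算将尽
```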

[AI-33] When Drafts Evolve: Speculative Decoding Meets Online Learning

【速读】:该论文旨在解决生成式 AI(Generative AI)推理过程中因轻量级草稿模型(draft model)与目标模型分布差异导致的接受长度短、加速效果有限的问题。其解决方案的关键在于利用推测解码(speculative decoding)过程中自然产生的验证反馈,构建一个“草稿-提交-反馈-适应”的在线学习闭环,从而实现草稿模型的持续演化。通过动态 regret 最小化理论框架,论文提出基于乐观在线学习和在线集成学习的新算法,有效利用历史梯度信息作为预测更新提示,并动态维护多个草稿模型,显著提升系统加速比,实测最高达 24% 的速度提升。

链接: https://arxiv.org/abs/2603.12617
作者: Yu-Yang Qian,Hao-Cong Wu,Yichao Fu,Hao Zhang,Peng Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model. However, due to limited model capacity, drafts often struggle to approximate the target distribution, resulting in shorter acceptance lengths and diminished speedup. A key yet under-explored observation is that speculative decoding inherently provides verification feedback that quantifies the deviation between the draft and target models at no additional cost. This process naturally forms an iterative “draft commits-feedback provides-draft adapts” evolving loop, which precisely matches the online learning paradigm. Motivated by this connection, we propose OnlineSpec, a unified framework that systematically leverages interactive feedback to continuously evolve draft models. Grounded in dynamic regret minimization, we establish a formal link between online learning performance and speculative system’s acceleration rate, and develop novel algorithms via modern online learning techniques, including optimistic online learning that adaptively reuses historical gradients as predictive update hints, and online ensemble learning that dynamically maintains multiple draft models. Our algorithms are equipped with theoretical justifications and improved acceleration rates, achieving up to 24% speedup over seven benchmarks and three foundation models.
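“草稿提交-验证反馈-草稿适应”闭环中的免费反馈信号,就是每轮验证产生的接受长度。下面是验证步骤的简化草图:取草稿与目标模型输出的最长公共前缀作为接受长度(真实的 speculative sampling 按概率比随机接受,此处为贪心简化,仅作示意):

```python
def verify_draft(draft_tokens, target_tokens):
    # 简化的贪心验证: 接受草稿与目标输出的最长公共前缀;
    # 返回的接受长度即为可复用于在线更新草稿模型的反馈信号
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted
```

OnlineSpec 将这一每步反馈接入在线学习更新,接受长度越长说明草稿分布越贴近目标分布。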

[AI-34] FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

【速读】:该论文旨在解决最大熵强化学习(Maximum Entropy Reinforcement Learning, MaxEnt RL)在高维人形机器人控制中因“维度灾难”导致的探索效率低下和训练不稳定问题。现有方法多采用确定性策略梯度结合大规模并行仿真以提升训练效率,但牺牲了随机策略在复杂连续控制任务中的潜在优势。本文提出FastDSAC框架,其核心创新在于引入逐维熵调制(Dimension-wise Entropy Modulation, DEM),动态重分配探索预算并强制策略多样性,同时设计面向高维动作空间的连续分布 critic 以保证价值估计准确性并缓解高维值函数过估计问题。实验表明,精心设计的随机策略可在HumanoidBench等任务上稳定超越甚至显著优于确定性基线,尤其在Basketball和Balance Hard任务上分别取得180%和400%的性能提升。

链接: https://arxiv.org/abs/2603.12612
作者: Jun Xue,Junze Wang,Xinming Zhang,Shanze Wang,Yanjun Chen,Wei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a formidable challenge, as the “curse of dimensionality” induces severe exploration inefficiency and training instability in expansive action spaces. Consequently, recent high-throughput paradigms have largely converged on deterministic policy gradients combined with massive parallel simulation. We challenge this compromise with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget and enforce diversity, alongside a continuous distributional critic tailored to ensure value fidelity and mitigate high-dimensional value overestimation. Extensive evaluations on HumanoidBench and other continuous control tasks demonstrate that rigorously designed stochastic policies can consistently match or outperform deterministic baselines, achieving notable gains of 180% and 400% on the challenging Basketball and Balance Hard tasks.
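逐维熵调制(DEM)的直觉是:不再用单一全局温度系数,而是每个动作维度各自维护一个系数,按该维熵与目标的偏差调节。以下为这一思路的假设性草图(更新规则为示意,论文的具体形式以原文为准):

```python
def dem_update(alphas, entropies, target_entropy, lr=0.1):
    # 逐维熵调制示意: 每个动作维度维护自己的温度系数alpha_i;
    # 该维熵低于目标则升温鼓励探索, 高于目标则降温
    # (假设性草图, 非论文实现)
    return [max(1e-4, a + lr * (target_entropy - h))
            for a, h in zip(alphas, entropies)]

# 第0维探索不足(熵0.5), 第1维探索过度(熵1.5)
alphas = dem_update([0.2, 0.2], entropies=[0.5, 1.5], target_entropy=1.0)
```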

[AI-35] CarPLAN: Context-Adaptive and Robust Planning with Dynamic Scene Awareness for Autonomous Driving

【速读】:该论文旨在解决基于模仿学习(Imitation Learning, IL)的自动驾驶运动规划方法在真实复杂交通场景中缺乏对驾驶上下文理解与适应性决策能力的问题,从而导致规划结果难以应对多变的交通环境并保障安全性。其核心解决方案在于提出CarPLAN框架,关键创新包括:一是引入位移感知预测编码(Displacement-Aware Predictive Encoding, DPE),通过预测自车(Autonomous Vehicle, AV)与周围场景元素之间的未来位移向量,增强模型的空间感知能力,使轨迹生成时能显式考虑相对间距关系;二是设计上下文自适应多专家解码器(Context-Adaptive Multi-Expert Decoder, CMD),基于Mixture of Experts(MoE)架构,在Transformer每一层动态选择最适配当前场景结构的专家解码器,实现对多样化驾驶情境的自适应规划。这两个模块共同提升了模型在复杂交通场景下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.12607
作者: Junyong Yun,Jungho Kim,ByungHyun Lee,Dongyoung Lee,Sehwan Choi,Seunghyeop Nam,Kichun Jo,Jun Won Choi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures. Under review at IEEE Transactions on Intelligent Transportation Systems

点击查看摘要

Abstract:Imitation learning (IL) is widely used for motion planning in autonomous driving due to its data efficiency and access to real-world driving data. For safe and robust real-world driving, IL-based planning requires capturing the complex driving contexts inherent in real-world data and enabling context-adaptive decision-making, rather than relying solely on expert trajectory imitation. In this paper, we propose CarPLAN, a novel IL-based motion planning framework that explicitly enhances driving context understanding and enables adaptive planning across diverse traffic scenarios. Our contributions are twofold: We introduce Displacement-Aware Predictive Encoding (DPE) to improve the model’s spatial awareness by predicting future displacement vectors between the Autonomous Vehicle (AV) and surrounding scene elements. This allows the planner to account for relational spacing when generating trajectories. In addition to the standard imitation loss, we incorporate an augmented loss term that captures displacement prediction errors, ensuring planning decisions consider relative distances from other agents. To improve the model’s ability to handle diverse driving contexts, we propose Context-Adaptive Multi-Expert Decoder (CMD), which leverages the Mixture of Experts (MoE) framework. CMD dynamically selects the most suitable expert decoders based on scene structure at each Transformer layer, enabling adaptive and context-aware planning in dynamic environments. We evaluate CarPLAN on the nuPlan benchmark and demonstrate state-of-the-art performance across all closed-loop simulation metrics. In particular, CarPLAN exhibits robust performance on challenging scenarios such as Test14-Hard, validating its effectiveness in complex driving conditions. Additional experiments on the Waymax benchmark further demonstrate its generalization capability across different benchmark settings.

[AI-36] Optimize Wider Not Deeper: Consensus Aggregation for Policy Optimization

【速读】:该论文试图解决强化学习中近端策略优化(Proximal Policy Optimization, PPO)因多轮剪裁随机梯度下降(clipped SGD)导致的路径依赖噪声问题,即随着优化深度增加,信号(自然梯度投影)趋于饱和而浪费(Fisher正交残差)持续增长,从而引发优化深度困境。解决方案的关键在于将计算资源从“深度”转向“宽度”:提出共识聚合策略(Consensus Aggregation for Policy Optimization, CAPO),在相同数据批次上并行运行K个PPO副本(仅随机打乱mini-batch顺序),并通过两种空间下的聚合方式——欧氏参数空间和自然参数空间(采用对数意见池)——生成共识策略;其中在自然参数空间中,共识策略能严格保证更高的KL惩罚代理目标和更紧的信赖域约束,优于平均专家策略,且参数平均近似继承这些优势。实验表明,在固定样本预算下,CAPO在连续控制任务上性能显著优于PPO及计算量相当的更深基线,最高提升达8.6倍。

链接: https://arxiv.org/abs/2603.12596
作者: Zelal Su(Lain)Mustafaoglu,Sungyoung Lee,Eshan Balachandar,Risto Miikkulainen,Keshav Pingali
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: K PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves higher KL-penalized surrogate and tighter trust region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6x. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.
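以下为 CAPO 欧氏参数空间共识聚合的一个最小示意(基于 NumPy 的玩具最小二乘模型,函数名与数据均为演示假设,非论文实现):K 个副本从相同参数出发、在同一批数据上各跑一轮 SGD,彼此仅在样本洗牌顺序上不同,随后对参数取平均得到共识。

```python
import numpy as np

def sgd_epoch(theta, X, y, lr, rng):
    """在 (X, y) 上按洗牌顺序跑一轮 SGD(玩具最小二乘模型)。"""
    for i in rng.permutation(len(X)):
        grad = 2.0 * (theta @ X[i] - y[i]) * X[i]  # 最小二乘逐样本梯度
        theta = theta - lr * grad
    return theta

def capo_euclidean_consensus(theta0, X, y, lr=0.05, K=4, seed=0):
    """K 个副本共享初始参数与数据批次,仅洗牌顺序不同,最后取参数平均。"""
    replicas = [
        sgd_epoch(theta0.copy(), X, y, lr, np.random.default_rng(seed + k))
        for k in range(K)
    ]
    return np.mean(replicas, axis=0)  # 欧氏参数空间的共识
```

论文中理论保证更强的是策略分布自然参数空间上的对数意见池聚合,参数平均只是其近似;此处仅演示"加宽而非加深"的整体流程。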

[AI-37] Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback ICLR2026

【速读】:该论文旨在解决变分偏好学习(Variational Preference Learning, VPL)中因后验坍缩(posterior collapse)导致用户特定潜在变量被忽略的问题,从而限制了个性化建模能力。VPL虽引入用户特异性潜在变量以捕捉多样偏好,但在稀疏偏好数据和过强解码器的条件下,潜在变量可能被模型忽略,退化为单一奖励模型。解决方案的关键在于提出Swap-guided Preference Learning (SPL),其核心机制是通过构造虚构的交换标注者(swap annotators)并利用其偏好的镜像特性来引导编码器学习有意义的用户特定潜在表示;具体包含三个组件:(1) 交换引导的基础正则化、(2) 偏好导向的逆自回归流(Preferential Inverse Autoregressive Flow, P-IAF),以及 (3) 自适应潜在条件控制,有效缓解了坍缩问题并提升了偏好预测性能。

链接: https://arxiv.org/abs/2603.12595
作者: Gihoon Kim,Euntai Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single-reward model. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction. Our code and data are available at this https URL
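交换标注者的"镜像"性质可用如下示意说明(偏好数据格式为演示假设):将偏好对中的两项交换并翻转标签,即得到虚构的交换标注者;两次交换还原原始标注。

```python
def make_swap_annotations(prefs):
    """构造虚构的交换标注者:若真实标注者在 (a, b) 上偏好 a(label=1),
    交换标注者在 (b, a) 上偏好后者(label=0)。
    prefs 为 (item_a, item_b, label) 列表,label 取 0 或 1。"""
    return [(b, a, 1 - label) for (a, b, label) in prefs]
```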

[AI-38] Early Pruning for Public Transport Routing

【速读】:该论文旨在解决公共交通路径规划算法(如RAPTOR及其变体)在处理密集换乘图时,因换乘松弛阶段效率低下而导致的性能瓶颈问题,尤其是在支持无限次换乘的情况下,传统方法需遍历大量潜在的站点间连接(如步行、自行车、电动滑板车等),导致查询时间显著增加。为提升效率而不牺牲路径最优性,论文提出“早期剪枝”(Early Pruning)这一低开销技术:通过预先按换乘耗时排序换乘连接,并在换乘循环中引入剪枝规则——一旦当前换乘路径无法优于已知最优解,则提前终止对该站点后续更长换乘路径的探索。该方案仅需一次预处理且可无缝集成至现有代码库,实测在多个主流RAPTOR变体上对瑞士和伦敦交通网络的查询时间加速达57%,从而显著提升了多式联运路径规划的计算效率与实用性。

链接: https://arxiv.org/abs/2603.12592
作者: Andrii Rohovyi,Abdallah Abuaisha,Toby Walsh
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Routing algorithms for public transport, particularly the widely used RAPTOR and its variants, often face performance bottlenecks during the transfer relaxation phase, especially on dense transfer graphs, when supporting unlimited transfers. This inefficiency arises from iterating over many potential inter-stop connections (walks, bikes, e-scooters, etc.). To maintain acceptable performance, practitioners often limit transfer distances or exclude certain transfer options, which can reduce path optimality and restrict the multimodal options presented to travellers. This paper introduces Early Pruning, a low-overhead technique that accelerates routing algorithms without compromising optimality. By pre-sorting transfer connections by duration and applying a pruning rule within the transfer loop, the method discards longer transfers at a stop once they cannot yield an earlier arrival than the current best solution. Early Pruning can be integrated with minimal changes to existing codebases and requires only a one-time preprocessing step. Across multiple state-of-the-art RAPTOR-based solutions (RAPTOR, ULTRA-RAPTOR, McRAPTOR, BM-RAPTOR, ULTRA-McRAPTOR, and UBM-RAPTOR), tested on the Switzerland and London transit networks, we achieved query-time reductions of up to 57%. This approach provides a generalizable improvement to the efficiency of transit pathfinding algorithms. Beyond algorithmic performance, Early Pruning has practical implications for transport planning. By reducing computational costs, it enables transit agencies to expand transfer radii and incorporate additional mobility modes into journey planners without requiring extra server infrastructure. This is particularly relevant for passengers in areas with sparse direct transit coverage, such as outer suburbs and smaller towns, where richer multimodal routing can reveal viable alternatives to private car use.
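换乘循环中早期剪枝规则的简化示意如下(数据结构为演示假设;实际 RAPTOR 实现中按轮次维护标签、剪枝界也更精细):换乘表预先按耗时升序排序,一旦最短的剩余换乘已无法早于当前最优解到达,更长的换乘即可整体跳过。

```python
import math

def relax_transfers(arrival, transfers_by_stop, best_destination):
    """一次换乘松弛,带早期剪枝(简化示意)。

    transfers_by_stop[stop] 预先按 duration 升序排序,
    因此首个无法改进 best_destination 的换乘之后的换乘均可跳过。
    """
    improved = {}
    for stop, t in arrival.items():
        for duration, target in transfers_by_stop.get(stop, ()):
            reach = t + duration
            if reach >= best_destination:
                break  # 剪枝规则:剩余换乘耗时只会更长
            if reach < min(arrival.get(target, math.inf),
                           improved.get(target, math.inf)):
                improved[target] = reach
    return improved
```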

[AI-39] CA-HFP: Curvature-Aware Heterogeneous Federated Pruning with Model Reconstruction

【速读】:该论文旨在解决异构边缘设备上的联邦学习(Federated Learning)中个性化压缩与聚合兼容性及稳定收敛之间的矛盾问题。解决方案的关键在于提出Curvature-Aware Heterogeneous Federated Pruning (CA-HFP) 框架,其核心机制是:每个客户端基于曲率感知的重要性评分执行结构化、设备特定的剪枝操作,并通过轻量级重建将压缩后的子模型映射回共享的全局参数空间;同时,作者推导出包含局部计算、数据异质性和剪枝扰动因素的联邦优化收敛边界,从而导出以损失为基础的剪枝准则,确保在显著降低每客户端计算和通信开销的同时维持模型精度。

链接: https://arxiv.org/abs/2603.12591
作者: Gang Hu,Yinglei Teng,Pengfei Wu,Shijun Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning on heterogeneous edge devices requires personalized compression while preserving aggregation compatibility and stable convergence. We present Curvature-Aware Heterogeneous Federated Pruning (CA-HFP), a practical framework that enables each client to perform structured, device-specific pruning guided by a curvature-informed significance score and subsequently map its compact submodel back into a common global parameter space via a lightweight reconstruction. We derive a convergence bound for federated optimization with multiple local SGD steps that explicitly accounts for local computation, data heterogeneity, and pruning-induced perturbations, from which a principled loss-based pruning criterion is derived. Extensive experiments on FMNIST, CIFAR-10, and CIFAR-100 using VGG and ResNet architectures under varying degrees of data heterogeneity demonstrate that CA-HFP preserves model accuracy while significantly reducing per-client computation and communication costs, outperforming standard federated training and existing pruning-based baselines.
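曲率感知的重要性评分可用如下示意理解(此处采用 OBD 风格的二阶显著性 0.5·H_ii·w_i² 作为假设形式,论文的具体准则可能不同):

```python
import numpy as np

def curvature_significance(weights, hess_diag):
    """OBD 风格二阶显著性:删除 w_i 造成的损失增量约为 0.5 * H_ii * w_i^2。
    (曲率评分的一种假设形式,仅作演示。)"""
    return 0.5 * hess_diag * weights ** 2

def structured_prune_mask(scores, keep_ratio):
    """按显著性保留前 keep_ratio 比例的单元。"""
    k = max(1, int(round(len(scores) * keep_ratio)))
    threshold = np.partition(scores, -k)[-k]
    return scores >= threshold
```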

[AI-40] CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning

【速读】:该论文旨在解决分布式强化学习(Distributed Reinforcement Learning, DRL)策略在边缘设备与云服务器之间部署时,因网络延迟、抖动和丢包导致的性能显著下降问题。标准强化学习训练假设零延迟交互,无法适应现实网络环境,从而造成仿真到实际部署(sim-to-real)的性能差距。解决方案的关键在于提出一种通信感知学习框架(Communication-Aware Learning Framework, CALF),在模拟阶段即引入真实的网络模型对策略进行训练,使策略显式地学习并适应通信约束。实验表明,该方法能显著缩小部署性能差距,并在异构硬件上的分布式部署中验证了其鲁棒性,确立了网络条件作为Wi-Fi类分布式系统中sim-to-real迁移的重要维度。

链接: https://arxiv.org/abs/2603.12543
作者: Carlos Purves,Pietro Lio’
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Distributed reinforcement learning policies face network delays, jitter, and packet loss when deployed across edge devices and cloud servers. Standard RL training assumes zero-latency interaction, causing severe performance degradation under realistic network conditions. We introduce CALF (Communication-Aware Learning Framework), which trains policies under realistic network models during simulation. Systematic experiments demonstrate that network-aware training substantially reduces deployment performance gaps compared to network-agnostic baselines. Distributed policy deployments across heterogeneous hardware validate that explicitly modelling communication constraints during training enables robust real-world execution. These findings establish network conditions as a major axis of sim-to-real transfer for Wi-Fi-like distributed deployments, complementing physics and visual domain randomisation.

[AI-41] Embedded Quantum Machine Learning in Embedded Systems: Feasibility, Hybrid Architectures and Quantum Co-Processors

【速读】:该论文旨在解决将量子机器学习(Quantum Machine Learning, QML)能力部署到资源受限的边缘计算平台(如物联网节点、可穿戴设备、无人机和网络物理控制器)中的技术可行性问题。其核心挑战在于当前量子硬件的局限性与边缘系统对低延迟、低功耗和高可靠性的严苛要求之间的矛盾。解决方案的关键在于提出两条清晰的实现路径:一是基于混合工作流的架构,即嵌入式设备负责感知与经典计算,而将特定量子子程序卸载至远程量子处理单元(Quantum Processing Unit, QPU)或本地量子设备;二是早期“嵌入式量子处理单元”概念,即将紧凑型量子协处理器与传统控制硬件集成。同时,论文强调量子启发式机器学习和优化在经典嵌入式处理器及FPGA上的实践应用是现阶段的重要桥梁,并从电路与系统层面识别出主导障碍(如延迟、数据编码开销、NISQ噪声、工具链不匹配和能耗),进而映射至接口设计、控制电子、电源管理、验证与安全等具体工程方向,以推动EQML的实际落地。

链接: https://arxiv.org/abs/2603.12540
作者: Somdip Dey,Syed Muhammad Raza
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure, 5th International Conference Computing, Mathematics Engineering Technologies (iCoMET 2026)

点击查看摘要

Abstract:Embedded quantum machine learning (EQML) seeks to bring quantum machine learning (QML) capabilities to resource-constrained edge platforms such as IoT nodes, wearables, drones, and cyber-physical controllers. In 2026, EQML is technically feasible only in limited and highly experimental forms: (i) hybrid workflows where an embedded device performs sensing and classical processing while offloading a narrowly scoped quantum subroutine to a remote quantum processing unit (QPU) or nearby quantum appliance, and (ii) early-stage “embedded QPU” concepts in which a compact quantum co-processor is integrated with classical control hardware. A practical bridge is quantum-inspired machine learning and optimisation on classical embedded processors and FPGAs. This paper analyses feasibility from a circuits-and-systems perspective aligned with the academic community, formalises two implementation pathways, identifies the dominant barriers (latency, data encoding overhead, NISQ noise, tooling mismatch, and energy), and maps them to concrete engineering directions in interface design, control electronics, power management, verification, and security. We also argue that responsible deployment requires adversarial evaluation and governance practices that are increasingly necessary for edge AI systems.

[AI-42] TRACE: Temporal Rule-Anchored Chain-of-Evidence on Knowledge Graphs for Interpretable Stock Movement Prediction

【速读】:该论文旨在解决股票价格走势预测中模型可解释性不足的问题,特别是在复杂知识图谱(Knowledge Graph, KG)背景下如何实现既准确又可审计的推理过程。其解决方案的关键在于提出了一种时序规则锚定的证据链(Temporal Rule-Anchored Chain-of-Evidence, TRACE)框架,该框架通过三个核心机制实现:(i) 基于符号关系先验的规则引导多跳探索,限制搜索路径仅限于经济上合理的关联序列,从而提升搜索效率与语义相关性;(ii) 将候选推理链锚定在同期新闻文本上,并基于置信度筛选和聚合完全接地(fully grounded)的假设,而非简单平均弱信号,显著增强预测灵敏度(recall 和 F1)而不牺牲选择性(precision)。最终,TRACE 在 S&P 500 基准上实现了 60.8% 的 F1 分数,优于现有基线方法,且提供人类可读的推理路径,确保决策过程透明、可审计。

链接: https://arxiv.org/abs/2603.12500
作者: Qianggang Ding,Haochen Shi,Luis Castejón Lozano,Miguel Conner,Juan Abia,Luis Gallego-Ledesma,Joshua Fellowes,Gerard Conangla Planes,Adam Elwood,Bang Liu
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a Temporal Rule-Anchored Chain-of-Evidence (TRACE) on knowledge graphs for interpretable stock movement prediction that unifies symbolic relational priors, dynamic graph exploration, and LLM-guided decision making in a single end-to-end pipeline. The approach performs rule-guided multi-hop exploration restricted to admissible relation sequences, grounds candidate reasoning chains in contemporaneous news, and aggregates fully grounded evidence into auditable UP/DOWN verdicts with human-readable paths connecting text and structure. On an S&P 500 benchmark, the method achieves 55.1% accuracy, 55.7% precision, 71.5% recall, and 60.8% F1, surpassing strong baselines and improving recall and F1 over the best graph baseline under identical evaluation. The gains stem from (i) rule-guided exploration that focuses search on economically meaningful motifs rather than arbitrary walks, and (ii) text-grounded consolidation that selectively aggregates high-confidence, fully grounded hypotheses instead of uniformly pooling weak signals. Together, these choices yield higher sensitivity without sacrificing selectivity, delivering predictive lift with faithful, auditably interpretable explanations.
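"限制在可接受关系序列上的规则引导多跳探索"可用如下示意说明(图与规则的数据格式均为演示假设):只有当路径的关系序列是某条规则的前缀时才继续扩展,恰好匹配完整规则时记录为候选证据链。

```python
def admissible_chains(graph, start, rules, max_hops=3):
    """枚举关系序列恰好匹配某条规则的多跳路径。

    graph: 节点 -> [(relation, neighbor), ...]
    rules: 可接受关系序列的集合(元组),仅作演示格式。
    """
    chains = []
    stack = [(start, ())]
    while stack:
        node, rels = stack.pop()
        if rels in rules:  # 完整匹配一条规则,记录证据链
            chains.append((rels, node))
        if len(rels) < max_hops:
            for rel, nxt in graph.get(node, ()):
                seq = rels + (rel,)
                # 仅沿某条规则的前缀扩展,避免任意游走
                if any(rule[:len(seq)] == seq for rule in rules):
                    stack.append((nxt, seq))
    return chains
```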

[AI-43] Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with Agent Fuel

【速读】:该论文旨在解决当前生成式数据分析师(Generative Data Analysis Agents)在时序数据(timeseries data)场景下对状态相关和事件特定查询(stateful and incident-specific queries)表现不佳的问题,以及现有评估方法在领域定制数据集和领域特异性查询类型上的表达能力不足。其解决方案的关键在于提出AgentFuel——一个帮助领域专家快速构建定制化、高表达力评估基准的工具,从而实现端到端的功能性测试,并揭示现有数据代理框架的关键改进方向。

链接: https://arxiv.org/abs/2603.12483
作者: Aadyaa Maddi,Prakhar Naval,Deepti Mande,Shane Duan,Muckai Girish,Vyas Sekar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Across many domains (e.g., IoT, observability, telecommunications, cybersecurity), there is an emerging adoption of conversational data analysis agents that enable users to “talk to your data” to extract insights. Such data analysis agents operate on timeseries data models; e.g., measurements from sensors or events monitoring user clicks and actions in product analytics. We evaluate 6 popular data analysis agents (both open-source and proprietary) on domain-specific data and query types, and find that they fail on stateful and incident-specific queries. We observe two key expressivity gaps in existing evals: domain-customized datasets and domain-specific query types. To enable practitioners in such domains to generate customized and expressive evals for such timeseries data agents, we present AgentFuel. AgentFuel helps domain experts quickly create customized evals to perform end-to-end functional tests. We show that AgentFuel’s benchmarks expose key directions for improvement in existing data agent frameworks. We also present anecdotal evidence that using AgentFuel can improve agent performance (e.g., with GEPA). AgentFuel benchmarks are available at this https URL.

[AI-44] One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies

【速读】:该论文旨在解决生成式流模型(flow models)和扩散模型(diffusion models)在机器人控制中因迭代采样导致的严重推理延迟问题,这一延迟会降低控制频率并损害时敏操作任务的性能。解决方案的关键在于提出一种从零开始的自蒸馏框架——单步流策略(One-Step Flow Policy, OFP),其核心创新包括:1)引入自一致性损失(self-consistency loss)以确保时间间隔间的动作传输一致性;2)设计自引导正则化(self-guided regularization)使预测结果聚焦于高密度专家模式;3)采用热启动机制(warm-start mechanism)利用动作的时间相关性最小化生成传输距离。实验表明,OFP在56个模拟操作任务中实现了超越百步扩散与流策略的精度,同时将动作生成速度提升超过100倍,验证了其在高精度、低延迟机器人控制中的实用性与可扩展性。

链接: https://arxiv.org/abs/2603.12480
作者: Shaolong Li,Lichao Sun,Yongchao Chen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative flow and diffusion models provide the continuous, multimodal action distributions needed for high-precision robotic policies. However, their reliance on iterative sampling introduces severe inference latency, degrading control frequency and harming performance in time-sensitive manipulation. To address this problem, we propose the One-Step Flow Policy (OFP), a from-scratch self-distillation framework for high-fidelity, single-step action generation without a pre-trained teacher. OFP unifies a self-consistency loss to enforce coherent transport across time intervals, and a self-guided regularization to sharpen predictions toward high-density expert modes. In addition, a warm-start mechanism leverages temporal action correlations to minimize the generative transport distance. Evaluations across 56 diverse simulated manipulation tasks demonstrate that a one-step OFP achieves state-of-the-art results, outperforming 100-step diffusion and flow policies while accelerating action generation by over 100×. We further integrate OFP into the π0.5 model on RoboTwin 2.0, where one-step OFP surpasses the original 10-step policy. These results establish OFP as a practical, scalable solution for highly accurate and low-latency robot control.

[AI-45] Operationalising Cyber Risk Management Using AI: Connecting Cyber Incidents to MITRE ATT&CK Techniques, Security Controls and Metrics

【速读】:该论文旨在解决小企业因缺乏内部专家、知识和财务资源而在应对日益频繁的网络攻击时面临的挑战,其核心问题是将威胁情报与可操作的安全控制措施之间缺乏有效连接。解决方案的关键在于提出了一种基于自然语言处理(Natural Language Processing, NLP)的新框架,通过自动化映射网络事件到对手技术来实现威胁情报的结构化利用;其中,创新性地构建了“Cyber Catalog”知识库,系统整合了CIS关键安全控制(CIS Critical Security Controls)、MITRE ATT&CK技术以及SMART指标,并利用74,986个事件-技术对增强的数据集对all-mpnet-base-v2模型进行微调,显著提升了文本语义相似度计算的准确性(Spearman相关系数0.7894,Pearson相关系数0.8756),并降低了预测误差(MAE=0.135,MSE=0.027),从而实现了从威胁情报到可执行控制与可量化结果的闭环映射。

链接: https://arxiv.org/abs/2603.12455
作者: Emad Sherif,Iryna Yevseyeva,Vitor Basto-Fernandes,Allan Cook
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The escalating frequency of cyber-attacks poses significant challenges for organisations, particularly small enterprises constrained by limited in-house expertise, insufficient knowledge, and financial resources. This research presents a novel framework that leverages Natural Language Processing to address these challenges through automated mapping of cyber incidents to adversary techniques. We introduce the Cyber Catalog, a knowledge base that systematically integrates CIS Critical Security Controls, MITRE ATT&CK techniques, and SMART metrics. This integrated resource enables organisations to connect threat intelligence directly to actionable controls and measurable outcomes. To operationalise the framework, we fine-tuned all-mpnet-base-v2, a highly regarded sentence-transformers model for converting text into numerical vectors, on an augmented dataset comprising 74,986 incident-technique pairs to enhance semantic similarity between cyber incidents and MITRE ATT&CK techniques. Our fine-tuned model achieved a Spearman correlation of 0.7894 and a Pearson correlation of 0.8756, demonstrating substantial improvements over top baseline models including all-mpnet-base-v2, all-distilroberta-v1, and all-MiniLM-L12-v2. Furthermore, our model exhibited significantly lower prediction errors (MAE = 0.135, MSE = 0.027) than all baseline models, confirming superior accuracy and consistency. The Cyber Catalog, training dataset, trained model, and implementation code are made publicly available to facilitate further research and enable practical deployment in resource-constrained environments. This work bridges the gap between threat intelligence and operational security management, providing an actionable tool for systematic cyber incident response and evidence-based cyber risk management.
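事件到技术的语义匹配,核心是嵌入向量间的余弦相似度排序,可用如下示意说明(向量此处代替 sentence-transformers 模型的编码输出,仅作演示):

```python
import numpy as np

def rank_techniques(incident_vec, technique_vecs):
    """按与事件嵌入的余弦相似度对 ATT&CK 技术嵌入降序排序,返回索引。
    incident_vec: (d,);technique_vecs: (n, d)。"""
    a = incident_vec / np.linalg.norm(incident_vec)
    B = technique_vecs / np.linalg.norm(technique_vecs, axis=1, keepdims=True)
    return np.argsort(-(B @ a))  # 相似度越高越靠前
```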

[AI-46] Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

【速读】:该论文旨在解决科学发现中候选物选择策略缺乏预算感知且可验证的评估框架的问题,尤其是在生成式 AI(Generative AI)如大语言模型(LLM)产生高似然但不可靠的科学假设时,如何公平比较不同提案方法的性能。解决方案的关键在于提出一个形式化验证的指标——预算敏感发现评分(Budget-Sensitive Discovery Score, BSDS),其通过联合惩罚假发现率(lambda-weighted FDR)和过度回避(gamma-weighted coverage gap)来量化每种策略在不同预算下的表现;进一步地,BSDS 的预算平均形式——发现质量评分(Discovery Quality Score, DQS)提供了一个单一统计量,防止任何提案者通过在特定预算点上表现优异来人为抬高整体评价。该框架已成功应用于药物发现场景下对 39 种提案方法(包括 LLM 和传统机器学习模型)的系统评估,结果表明:现有训练好的随机森林基线(Greedy-ML)优于所有 LLM 配置,且该结论在多个分子数据集和多种参数设置下具有鲁棒性。

链接: https://arxiv.org/abs/2603.12349
作者: Abhinaba Basu,Pavan Chakraborty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies – a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric – 20 theorems machine-checked by the Lean 4 proof assistant – that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, we apply BSDS/DQS to: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? We evaluate 39 proposers – 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations – using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau = 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs. 
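下面给出 BSDS/DQS 的一种可能实现(摘要仅说明其由 lambda 加权 FDR 与 gamma 加权覆盖缺口组成;具体函数形式与权重均为本文示意的假设,并非论文经 Lean 验证的定义):

```python
def bsds(selected, actives, budget, lam=1.0, gam=1.0):
    """某一预算水平下的 Budget-Sensitive Discovery Score(示意形式):
    对选中集合中的假发现率与未用满预算的覆盖缺口加权惩罚,取负值。"""
    k = len(selected)
    fdr = sum(1 for s in selected if s not in actives) / k if k else 0.0
    coverage_gap = max(0.0, (budget - k) / budget) if budget else 0.0
    return -(lam * fdr + gam * coverage_gap)

def dqs(scores_per_budget):
    """预算平均的汇总量:对各预算水平的 BSDS 取均值,
    使提案者无法靠在单一预算点上表现好来抬高总分。"""
    return sum(scores_per_budget) / len(scores_per_budget)
```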

[AI-47] Optimizing Task Completion Time Updates Using POMDPs

【速读】:该论文旨在解决项目管理中任务完成时间公告的动态更新问题,即如何在保证公告准确性的同时最小化因频繁更新导致的干系人信任损失和重规划成本。其关键在于将公告决策建模为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),并利用混合可观测马尔可夫决策过程(Mixed Observability MDP, MOMDP)框架,对可观测状态(如当前时间和历史公告)进行高效处理,从而优化控制策略。通过设计兼顾公告误差与更新频率的奖励函数,论文生成了基于信念状态演化的自适应反馈控制器,显著提升了公告准确性和稳定性,相比基线策略最多减少75%的非必要更新。

链接: https://arxiv.org/abs/2603.12340
作者: Duncan Eddy,Esen Yel,Emma Passmore,Niles Egan,Grayson Armour,Dylan M. Asmar,Mykel J. Kochenderfer
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures, submitted to American Control Conference 2026

点击查看摘要

Abstract:Managing announced task completion times is a fundamental control problem in project management. While extensive research exists on estimating task durations and task scheduling, the problem of when and how to update completion times communicated to stakeholders remains understudied. Organizations must balance announcement accuracy against the costs of frequent timeline updates, which can erode stakeholder trust and trigger costly replanning. Despite the prevalence of this problem, current approaches rely on static predictions or ad-hoc policies that fail to account for the sequential nature of announcement management. In this paper, we formulate the task announcement problem as a Partially Observable Markov Decision Process (POMDP) where the control policy must decide when to update announced completion times based on noisy observations of true task completion. Since most state variables (current time and previous announcements) are fully observable, we leverage the Mixed Observability MDP (MOMDP) framework to enable more efficient policy optimization. Our reward structure captures the dual costs of announcement errors and update frequency, enabling synthesis of optimal announcement control policies. Using off-the-shelf solvers, we generate policies that act as feedback controllers, adaptively managing announcements based on belief state evolution. Simulation results demonstrate significant improvements in both accuracy and announcement stability compared to baseline strategies, achieving up to 75% reduction in unnecessary updates while maintaining or improving prediction accuracy.
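"公告误差与更新频率"的双重代价,以及一个作为对比基线的阈值策略,可用如下示意说明(权重与容差均为演示值,非论文校准参数):

```python
def announcement_reward(announced, true_completion, updated,
                        err_weight=1.0, update_cost=0.5):
    """双代价奖励:公告误差惩罚 + 每次更新的固定代价(演示权重)。"""
    return -err_weight * abs(announced - true_completion) - update_cost * updated

def threshold_policy(belief_mean, announced, tolerance):
    """一个简单基线:仅当信念均值偏离当前公告超过容差时才重新公告,
    返回 (新公告, 是否更新);POMDP 最优策略与此类基线对比。"""
    if abs(belief_mean - announced) > tolerance:
        return belief_mean, True
    return announced, False
```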

[AI-48] Maximum Entropy Exploration Without the Rollouts

【速读】:该论文旨在解决强化学习中高效探索(efficient exploration)的核心挑战,特别是在缺乏外部奖励函数时,如何通过预训练目标有效收集数据。传统方法通常依赖于多次策略内回放(on-policy rollouts)来估计状态访问频率,计算成本较高。其解决方案的关键在于提出一种基于内在平均奖励(intrinsic average-reward)的公式化方法,其中奖励由访问分布本身决定,从而使得最优策略能最大化稳态熵。进一步地,作者引入熵正则化版本,利用谱特性——即相关稳态分布可从问题相关的转移矩阵的主特征向量中计算得出——设计出EVE(EigenVector-based Exploration)算法,该算法无需显式回放或分布估计,而是通过类似值迭代的方式进行迭代更新求解。为处理原始未正则化目标,还采用后验策略迭代(posterior-policy iteration, PPI)策略,确保熵单调提升并收敛。理论证明和实验表明,EVE在确定性网格世界环境中能高效生成高稳态熵策略,探索性能优于基于回放的基线方法。

链接: https://arxiv.org/abs/2603.12325
作者: Jacob Adamczyk,Adam Kamoski,Rahul V. Kulkarni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficient exploration remains a central challenge in reinforcement learning, serving as a useful pretraining objective for data collection, particularly when an external reward function is unavailable. A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution, thereby encouraging uniform long-run coverage of the state space. Many existing exploration approaches require estimating state visitation frequencies through repeated on-policy rollouts, which can be computationally expensive. In this work, we instead consider an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. An entropy-regularized version of this objective admits a spectral characterization: the relevant stationary distributions can be computed from the dominant eigenvectors of a problem-dependent transition matrix. This insight leads to a novel algorithm for solving the maximum entropy exploration problem, EVE (EigenVector-based Exploration), which avoids explicit rollouts and distribution estimation, instead computing the solution through iterative updates, similar to a value-based approach. To address the original unregularized objective, we employ a posterior-policy iteration (PPI) approach, which monotonically improves the entropy and converges in value. We prove convergence of EVE under standard assumptions and demonstrate empirically that it efficiently produces policies with high steady-state entropy, achieving competitive exploration performance relative to rollout-based baselines in deterministic grid-world environments.

[AI-49] Thermodynamics of Reinforcement Learning Curricula

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中课程学习(Curriculum Learning)的优化问题,即如何设计有效的任务序列以提升训练效率与性能。其解决方案的关键在于引入非平衡热力学理论,将奖励参数视为任务流形(task manifold)上的坐标,并通过最小化超额热力学功(excess thermodynamic work)来确定最优课程路径,该路径对应于任务空间中的测地线(geodesics)。基于此几何框架,作者提出了一种名为“MEW”(Minimum Excess Work)的算法,用于在最大熵强化学习中推导出温度退火(temperature annealing)的合理调度策略。

链接: https://arxiv.org/abs/2603.12324
作者: Jacob Adamczyk,Juan Sebastian Rojas,Rahul V. Kulkarni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at SciForDL Workshop at ICLR 2026

点击查看摘要

Abstract:Connections between statistical mechanics and machine learning have repeatedly proven fruitful, providing insight into optimization, generalization, and representation learning. In this work, we follow this tradition by leveraging results from non-equilibrium thermodynamics to formalize curriculum learning in reinforcement learning (RL). In particular, we propose a geometric framework for RL by interpreting reward parameters as coordinates on a task manifold. We show that, by minimizing the excess thermodynamic work, optimal curricula correspond to geodesics in this task space. As an application of this framework, we provide an algorithm, “MEW” (Minimum Excess Work), to derive a principled schedule for temperature annealing in maximum-entropy RL.

[AI-50] HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal Understanding

【速读】:该论文旨在解决当前深度学习模型在因果推理方面的根本性缺陷,即缺乏对干预(intervention)、反事实(counterfactual)和底层机制的建模能力,导致系统在分布偏移下表现脆弱且无法回答“如果……会怎样”类问题。解决方案的关键在于提出一种统一框架——分层因果原语动态组合网络(Hierarchical Causal Primitive Dynamic Composition Network, HCP-DCNet),其核心创新包括:将因果场景分解为四层抽象的可重用类型化因果原语(物理、功能、事件和规则),并通过双通道路由网络动态构建任务特定的可微分因果执行图(Causal Execution Graphs, CEGs);同时引入因果干预驱动的元进化策略,通过约束马尔可夫决策过程实现系统的自主自我优化。该方法在理论层面保证了类型安全组合、路由收敛性和因果动态的通用逼近能力,并在模拟物理与社会环境中显著优于现有最先进基线,在因果发现、反事实推理和组合泛化方面展现出优越性能。

链接: https://arxiv.org/abs/2603.12305
作者: Ming Lei,Shufan Wu,Christophe Baehr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures, submitted to a journal and under review

点击查看摘要

Abstract:The ability to understand and reason about cause and effect – encompassing interventions, counterfactuals, and underlying mechanisms – is a cornerstone of robust artificial intelligence. While deep learning excels at pattern recognition, it fundamentally lacks a model of causality, making systems brittle under distribution shifts and unable to answer "what-if" questions. This paper introduces the Hierarchical Causal Primitive Dynamic Composition Network (HCP-DCNet), a unified framework that bridges continuous physical dynamics with discrete symbolic causal inference. Departing from monolithic representations, HCP-DCNet decomposes causal scenes into reusable, typed causal primitives organized into four abstraction layers: physical, functional, event, and rule. A dual-channel routing network dynamically composes these primitives into task-specific, fully differentiable Causal Execution Graphs (CEGs). Crucially, the system employs a causal-intervention-driven meta-evolution strategy, enabling autonomous self-improvement through a constrained Markov decision process. We establish rigorous theoretical guarantees, including type-safe composition, routing convergence, and universal approximation of causal dynamics. Extensive experiments across simulated physical and social environments demonstrate that HCP-DCNet significantly outperforms state-of-the-art baselines in causal discovery, counterfactual reasoning, and compositional generalization. This work provides a principled, scalable, and interpretable architecture for building AI systems with human-like causal abstraction and continual self-refinement capabilities.

[AI-51] A Geometrically-Grounded Drive for MDL-Based Optimization in Deep Learning

【速读】:该论文旨在解决深度神经网络训练过程中模型复杂度与数据拟合能力之间难以平衡的问题,即如何在保证高精度的同时实现自动化的模型简化和泛化能力提升。其解决方案的关键在于将最小描述长度(Minimum Description Length, MDL)原则从传统的模型选择标准重构为训练动态中的主动驱动机制,通过引入一个基于几何流(Ricci flow)的认知流形(cognitive manifold),并设计一种由任务损失梯度调制的新型MDL驱动项(MDL Drive term),使模型在优化过程中自发压缩内部表示、实现数据保真与模型简洁性的协同演化。该方法不仅建立了严格的理论框架,包括描述长度单调递减、拓扑相变有限性及普适临界行为等性质,还提出了具有O(N log N)计算复杂度的高效算法,为构建更自主、可解释且泛化能力强的AI系统提供了信息论与几何深度学习融合的新路径。

链接: https://arxiv.org/abs/2603.12304
作者: Ming Lei,Shufan Wu,Christophe Baehr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 9 figures, submitted to a journal and under review

点击查看摘要

Abstract:This paper introduces a novel optimization framework that fundamentally integrates the Minimum Description Length (MDL) principle into the training dynamics of deep neural networks. Moving beyond its conventional role as a model selection criterion, we reformulate MDL as an active, adaptive driving force within the optimization process itself. The core of our method is a geometrically-grounded cognitive manifold whose evolution is governed by a *coupled Ricci flow*, enriched with a novel *MDL Drive* term derived from first principles. This drive, modulated by the task-loss gradient, creates a seamless harmony between data fidelity and model simplification, actively compressing the internal representation during training. We establish a comprehensive theoretical foundation, proving key properties including the monotonic decrease of description length, a finite number of topological phase transitions via a geometric surgery protocol, and the emergence of universal critical behavior. Furthermore, we provide a practical, computationally efficient algorithm with O(N log N) per-iteration complexity, alongside guarantees for numerical stability and exponential convergence under convexity assumptions. Empirical validation on synthetic regression and classification tasks confirms the theoretical predictions, demonstrating the algorithm's efficacy in achieving robust generalization and autonomous model simplification. This work provides a principled path toward more autonomous, generalizable, and interpretable AI systems by unifying geometric deep learning with information-theoretic principles.
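作为背景补充,摘要中"描述长度单调递减"可对照经典的两部码 MDL 目标来理解(下式是 MDL 的标准通用形式,并非论文中由 Ricci 流推导的具体 Drive 项):

```latex
% Two-part MDL: total code length = model cost + data cost given the model.
% Minimizing it trades model simplicity against data fit.
L(D) = \underbrace{L(M)}_{\text{model complexity}}
     + \underbrace{L(D \mid M)}_{\text{data fit}},
\qquad
M^{*} = \arg\min_{M}\,\bigl[\, L(M) + L(D \mid M) \,\bigr]
```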

[AI-52] Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency

【速读】:该论文旨在解决现有激活工程(activation engineering)方法在控制大语言模型(Large Language Models, LLMs)时存在的两个核心问题:一是从静态激活差异中提取的引导向量易受高维噪声干扰,二是层间语义漂移导致捕捉到的是伪相关而非目标意图。解决方案的关键在于提出一种无需训练的框架——全局进化精炼引导(Global Evolutionary Refined Steering, GER-steer),其核心思想是利用网络表示演化过程中的几何稳定性作为全局信号,对原始引导向量进行校正,从而有效分离出鲁棒的语义意图与正交噪声伪影,实现跨层一致且泛化能力强的模型对齐。

链接: https://arxiv.org/abs/2603.12298
作者: Xinyan Jiang,Wenjing Yu,Di Wang,Lijie Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Activation engineering enables precise control over Large Language Models (LLMs) without the computational cost of fine-tuning. However, existing methods deriving vectors from static activation differences are susceptible to high-dimensional noise and layer-wise semantic drift, often capturing spurious correlations rather than the target intent. To address this, we propose Global Evolutionary Refined Steering (GER-steer), a training-free framework grounded in the geometric stability of the network's representation evolution. GER-steer exploits this global signal to rectify raw steering vectors, effectively decoupling robust semantic intent from orthogonal artifacts. Extensive evaluations confirm that GER-steer consistently outperforms baselines, delivering superior efficacy and generalization without layer-specific tuning, establishing a universal solution for reliable model alignment.
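下面用 numpy 给出一个"激活差分引导向量 + 跨层一致性校正"的极简示意(仅为概念演示,并非 GER-steer 的原始实现;其中以各层单位化引导向量的均值作为全局一致方向,属于本文之外的简化假设):

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    # Difference-of-means steering vector for one layer;
    # pos_acts / neg_acts: (n_samples, hidden_dim) activations.
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def rectify_across_layers(vectors):
    # Keep only the component shared across layers, assuming the target
    # semantic direction is stable while noise is layer-specific
    # (a simplified stand-in for the paper's global rectification).
    units = [v / np.linalg.norm(v) for v in vectors]
    consensus = np.mean(units, axis=0)
    consensus /= np.linalg.norm(consensus)
    # Project every raw vector onto the cross-layer consensus direction.
    return [float(v @ consensus) * consensus for v in vectors]

rng = np.random.default_rng(0)
true_dir = rng.normal(size=64)                                   # hypothetical target intent
raw = [2.0 * true_dir + rng.normal(size=64) for _ in range(8)]   # noisy per-layer vectors
clean = rectify_across_layers(raw)
```

校正后的向量平均而言与目标方向的余弦相似度高于原始向量,体现"分离语义意图与正交噪声"的思路。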

[AI-53] Synthetic Data Generation for Brain-Computer Interfaces: Overview Benchmarking and Future Directions

【速读】:该论文旨在解决脑机接口(Brain-Computer Interface, BCI)系统发展中面临的训练数据稀缺、异质性强及隐私敏感等问题,这些问题严重制约了模型性能与泛化能力的提升。其核心解决方案是通过生成合成但生理上合理的脑信号来扩充数据集,从而缓解数据不足并增强模型容量。关键在于系统性地对现有脑信号生成方法进行分类(分为基于知识、特征、模型和翻译四类),并通过在四种典型BCI范式上的基准实验提供客观性能比较,为未来开发准确、数据高效且隐私友好的BCI系统奠定基础。

链接: https://arxiv.org/abs/2603.12296
作者: Ziwei Wang,Zhentao He,Xingyi He,Hongbin Wang,Tianwang Jia,Jingwei Luo,Siyang Li,Xiaoqing Chen,Dongrui Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Deep learning has achieved transformative performance across diverse domains, largely driven by the large-scale, high-quality training data. In contrast, the development of brain-computer interfaces (BCIs) is fundamentally constrained by the limited, heterogeneous, and privacy-sensitive neural recordings. Generating synthetic yet physiologically plausible brain signals has therefore emerged as a compelling way to mitigate data scarcity and enhance model capacity. This survey provides a comprehensive review of brain signal generation for BCIs, covering methodological taxonomies, benchmark experiments, evaluation metrics, and key applications. We systematically categorize existing generative algorithms into four types: knowledge-based, feature-based, model-based, and translation-based approaches. Furthermore, we benchmark existing brain signal generation approaches across four representative BCI paradigms to provide an objective performance comparison. Finally, we discuss the potentials and challenges of current generation approaches and prospect future research on accurate, data-efficient, and privacy-aware BCI systems. The benchmark codebase is publicized at this https URL.

[AI-54] From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness

【速读】:该论文旨在解决表格型机器学习中一个核心悖论:现代高维(high-D)、共线性强且含噪声的预测变量仍能实现最优性能,这与“垃圾进,垃圾出”(Garbage In, Garbage Out)的传统认知相悖。其解决方案的关键在于重构对数据质量的理解——从关注个体数据项的清洁度转向强调预测变量空间的整体架构(portfolio-level architecture),并揭示了模型容量与数据结构之间的协同作用机制。通过将预测空间中的噪声细分为“预测误差”(Predictor Error)和“结构性不确定性”(Structural Uncertainty),作者证明,在高维场景下利用含有错误的预测变量可渐近地克服两类噪声,而低维数据清洗则受限于结构性不确定性;进一步提出“主动数据中心型AI”(Proactive Data-Centric AI)以识别促进鲁棒性的关键预测因子,并阐明为何信息性共线性(Informative Collinearity)可提升收敛效率与可靠性,从而为从静态模型迁移向方法论迁移转变提供理论依据。

链接: https://arxiv.org/abs/2603.12288
作者: Terrence J. Lee-St. John,Jordan L. Lawson,Bartlomiej Piechowski-Jozwiak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 120 pages, 12 figures, 3 tables. Simulation code and documentation available at: this https URL

点击查看摘要

Abstract:Tabular machine learning presents a paradox: modern models achieve state-of-the-art performance using high-dimensional (high-D), collinear, error-prone data, defying the "Garbage In, Garbage Out" mantra. To help resolve this, we synthesize principles from Information Theory, Latent Factor Models, and Psychometrics, clarifying that predictive robustness arises not solely from data cleanliness, but from the synergy between data architecture and model capacity. Partitioning predictor-space "noise" into "Predictor Error" and "Structural Uncertainty" (informational deficits from stochastic generative mappings), we prove that leveraging high-D sets of error-prone predictors asymptotically overcomes both types of noise, whereas cleaning a low-D set is fundamentally bounded by Structural Uncertainty. We demonstrate why "Informative Collinearity" (dependencies from shared latent causes) enhances reliability and convergence efficiency, and explain why increased dimensionality reduces the latent inference burden, enabling feasibility with finite samples. To address practical constraints, we propose "Proactive Data-Centric AI" to identify predictors that enable robustness efficiently. We also derive boundaries for Systematic Error Regimes and show why models that absorb "rogue" dependencies can mitigate assumption violations. Linking latent architecture to Benign Overfitting, we offer a first step towards a unified view of robustness to Outcome Error and predictor-space noise, while also delineating when traditional DCAI's focus on label cleaning remains powerful. By redefining data quality from item-level perfection to portfolio-level architecture, we provide a theoretical rationale for "Local Factories" – learning from live, uncurated enterprise "data swamps" – supporting a deployment paradigm shift from "Model Transfer" to "Methodology Transfer" to overcome static generalizability limitations.

[AI-55] DART: Input-Difficulty-AwaRe Adaptive Threshold for Early-Exit DNNs

【速读】:该论文旨在解决早期退出深度神经网络(Early-exit Deep Neural Networks)在资源受限场景下存在的三个关键问题:现有方法依赖次优的退出策略、忽略输入难度差异,以及阈值优化独立进行导致整体效率低下。其核心解决方案是提出DART(Input-Difficulty-Aware Adaptive Threshold)框架,包含三项关键技术:(1) 轻量级输入难度估计模块,以极低计算开销量化输入复杂度;(2) 基于动态规划的联合退出策略优化算法,实现多层退出决策的整体最优;(3) 自适应系数管理机制,动态调整不同目标间的权衡。实验表明,DART在多个经典CNN模型上相比静态网络实现最高达3.3倍的速度提升和5.1倍的能耗降低,同时引入难度感知效率评分(DAES)指标验证了其在准确率、效率与鲁棒性之间更优的综合性能表现。

链接: https://arxiv.org/abs/2603.12269
作者: Parth Patne,Mahdi Taheri,Christian Herglotz,Maksim Jenihhin,Milos Krstic,Michael Hübner
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Early-exit deep neural networks enable adaptive inference by terminating computation when sufficient confidence is achieved, reducing cost for edge AI accelerators in resource-constrained settings. Existing methods, however, rely on suboptimal exit policies, ignore input difficulty, and optimize thresholds independently. This paper introduces DART (Input-Difficulty-Aware Adaptive Threshold), a framework that overcomes these limitations. DART introduces three key innovations: (1) a lightweight difficulty estimation module that quantifies input complexity with minimal computational overhead, (2) a joint exit policy optimization algorithm based on dynamic programming, and (3) an adaptive coefficient management system. Experiments on diverse DNN benchmarks (AlexNet, ResNet-18, VGG-16) demonstrate that DART achieves up to **3.3×** speedup, **5.1×** lower energy, and up to **42%** lower average power compared to static networks, while preserving competitive accuracy. Extending DART to Vision Transformers (LeViT) yields power (5.0×) and execution-time (3.6×) gains but also accuracy loss (up to 17 percent), underscoring the need for transformer-specific early-exit mechanisms. We further introduce the Difficulty-Aware Efficiency Score (DAES), a novel multi-objective metric, under which DART achieves up to a 14.8 improvement over baselines, highlighting superior accuracy, efficiency, and robustness trade-offs.
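早期退出推理中"置信度阈值随输入难度自适应"的思想可用如下极简示意说明(阈值、调整系数 0.2 以及示例 logits 均为演示用假设,并非 DART 的实际策略):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_inference(logits_per_exit, difficulty, base_thresholds):
    # logits_per_exit: one logit vector per exit head (shallow -> deep).
    # difficulty in [0, 1]: cheap input-complexity estimate; harder inputs
    # raise the confidence bar so they travel deeper into the network.
    for depth, (logits, tau) in enumerate(zip(logits_per_exit, base_thresholds)):
        probs = softmax(logits)
        if probs.max() >= tau + 0.2 * difficulty:   # difficulty-aware threshold
            return depth, int(probs.argmax())
    # No exit was confident enough: fall through to the deepest head.
    return len(logits_per_exit) - 1, int(softmax(logits_per_exit[-1]).argmax())

easy = [np.array([4.0, 0.1, 0.1]), np.array([6.0, 0.1, 0.1])]   # confident shallow logits
hard = [np.array([1.0, 0.9, 0.8]), np.array([3.0, 0.5, 0.2])]   # ambiguous shallow logits
```

简单样本在浅层退出,困难样本被更高的阈值"推"向更深的退出点。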

[AI-56] From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的AI代理在计算材料科学中执行任务时缺乏知识累积与复用的问题,即每次计算独立进行、无法保留和利用先前实验中的经验教训。解决方案的关键在于提出QMatSuite这一开源平台,其核心机制包括:AI代理以完整溯源方式记录每次执行结果,在新计算前检索历史知识,并通过专门的反思会话修正错误发现并提炼跨化合物模式,从而实现知识的持续积累与迁移,显著提升推理效率与预测准确性。

链接: https://arxiv.org/abs/2603.13191
作者: Haonan Huang
机构: 未知
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have transformed AI agents into proficient executors of computational materials science, performing a hundred simulations does not make a researcher. What distinguishes research from routine execution is the progressive accumulation of knowledge – learning which approaches fail, recognizing patterns across systems, and applying understanding to new problems. However, the prevailing paradigm in AI-driven computational science treats each execution in isolation, largely discarding hard-won insights between runs. Here we present QMatSuite, an open-source platform closing this gap. Agents record findings with full provenance, retrieve knowledge before new calculations, and in dedicated reflection sessions correct erroneous findings and synthesize observations into cross-compound patterns. In benchmarks on a six-step quantum-mechanical simulation workflow, accumulated knowledge reduces reasoning overhead by 67% and improves accuracy from 47% to 3% deviation from literature – and when transferred to an unfamiliar material, achieves 1% deviation with zero pipeline failures.

[AI-57] Clustering Astronomical Orbital Synthetic Data Using Advanced Feature Extraction and Dimensionality Reduction Techniques

【速读】:该论文旨在解决传统方法在分析土星卫星系统复杂轨道动力学时面临的挑战,尤其是面对大规模模拟数据集时,傅里叶分析和稳定性指标等手段难以有效捕捉长期轨道演化中的稳定区域、共振结构及复杂动力学行为的问题。其解决方案的关键在于构建一个基于机器学习的聚类分析流程,核心创新是采用MiniRocket算法将400个时间步的轨道序列高效映射至9,996维特征空间,从而精准提取时序模式;结合自动特征提取与降维技术,显著提升了高维数据的可解释性与聚类效果,最终实现了对土星卫星系统长期动力学演化的新认知。

链接: https://arxiv.org/abs/2603.13177
作者: Eraldo Pereira Marinho,Nelson Callegari Junior,Fabricio Aparecido Breve,Caetano Mazzoni Ranieri
机构: 未知
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for publication in Neural Computing and Applications (Springer Nature)

点击查看摘要

Abstract:The dynamics of Saturn’s satellite system offer a rich framework for studying orbital stability and resonance interactions. Traditional methods for analysing such systems, including Fourier analysis and stability metrics, struggle with the scale and complexity of modern datasets. This study introduces a machine learning-based pipeline for clustering approximately 22,300 simulated satellite orbits, addressing these challenges with advanced feature extraction and dimensionality reduction techniques. The key to this approach is using MiniRocket, which efficiently transforms 400 timesteps into a 9,996-dimensional feature space, capturing intricate temporal patterns. Additional automated feature extraction and dimensionality reduction techniques refine the data, enabling robust clustering analysis. This pipeline reveals stability regions, resonance structures, and other key behaviours in Saturn’s satellite system, providing new insights into their long-term dynamical evolution. By integrating computational tools with traditional celestial mechanics techniques, this study offers a scalable and interpretable methodology for analysing large-scale orbital datasets and advancing the exploration of planetary dynamics.
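论文用 MiniRocket 把 400 个时间步的轨道序列映射到近万维特征空间;下面给出一个 Rocket 风格随机卷积特征变换的简化示意,仅用于说明"卷积 + PPV 统计"的原理(随机核权重为本文之外的简化假设,真实 MiniRocket 使用固定权重核与膨胀卷积):

```python
import numpy as np

def rocket_like_features(X, n_kernels=100, rng=None):
    # Simplified Rocket-style transform: convolve each series with random
    # kernels and summarize with PPV (proportion of positive values).
    # Real MiniRocket instead uses a fixed kernel set with dilations/biases.
    rng = rng or np.random.default_rng(0)
    feats = []
    for _ in range(n_kernels):
        w = rng.normal(size=int(rng.choice([7, 9, 11])))
        bias = rng.normal()
        conv = np.array([np.convolve(x, w, mode="valid") + bias for x in X])
        feats.append((conv > 0).mean(axis=1))     # PPV per series
    return np.column_stack(feats)                 # (n_series, n_kernels)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 400))                    # 50 条轨道序列 × 400 个时间步(虚构数据)
F = rocket_like_features(X, n_kernels=100)        # 之后可接降维与聚类
```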

[AI-58] Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science – A Three-Cycle Action Design Science Study

【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)评估方法中存在的局限性,以及未能满足利益相关者在心理测量学与认知科学维度上的需求。其解决方案的关键在于构建一个集成的、基于云的平台——PsyCogMetrics AI Lab,通过三阶段行动设计科学研究框架(Relevance Cycle、Rigor Cycle 和 Design Cycle)实现:首先识别评估痛点并明确设计目标,其次基于波普尔可证伪性、经典测验理论和认知负荷理论等核心理论推导出严谨的设计原则,最后通过嵌套的“构建-干预-评估”循环将这些原则具体化为可验证的IT工具与评估流程,从而提供一种系统化、可重复且跨学科的LLM评估新范式。

链接: https://arxiv.org/abs/2603.13126
作者: Zhiye Jin,Yibai Li,K. D. Joshi,Xuefei (Nancy) Deng,Xiaobing (Emily) Li
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 10 pages. Prepared: April 2025; submitted: June 15, 2025; accepted: August 2025. In: Proceedings of the 59th Hawaii International Conference on System Sciences (HICSS 2026), January 2026

点击查看摘要

Abstract:This study presents the development of the PsyCogMetrics AI Lab (this http URL), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build-Intervene-Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.

[AI-59] owards unified brain-to-text decoding across speech production and perception

【速读】:该论文旨在解决跨模态(speech production and perception)脑到文本解码在汉字语言(如汉语普通话)中的统一建模问题,尤其是如何实现从神经信号中准确重建句子级别的语义内容,同时支持未见过的字符和音节。其解决方案的关键在于提出一个统一的脑到句子解码框架:首先通过分类神经信号中的拼音成分(声母和韵母)来提取音节结构,再利用经过三阶段后训练的70亿参数大语言模型(LLM)将无调拼音序列映射为中文句子;该方法不仅在仅用单字数据训练下实现了句子级解码和对未见词汇的泛化能力,还通过两阶段推理流程显著提升了性能,甚至优于数百亿参数的商用大模型。此方案突破了传统单一模态、字母文字的局限,揭示了汉语语音产生与感知的神经动力学差异,并验证了多模态神经语言解码系统的可行性。

链接: https://arxiv.org/abs/2603.12628
作者: Zhizhang Yuan,Yang Yang,Gaorui Zhang,Baowen Cheng,Zehan Wu,Yuhao Xu,Xiaoying Liu,Liang Chen,Ying Mao,Meng Li
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 37 pages, 9 figures

点击查看摘要

Abstract:Speech production and perception are the main ways humans communicate daily. Prior brain-to-text decoding studies have largely focused on a single modality and alphabetic languages. Here, we present a unified brain-to-sentence decoding framework for both speech production and perception in Mandarin Chinese. The framework exhibits strong generalization ability, enabling sentence-level decoding when trained only on single-character data and supporting characters and syllables unseen during training. In addition, it allows direct and controlled comparison of neural dynamics across modalities. Mandarin speech is decoded by first classifying syllable components in Hanyu Pinyin, namely initials and finals, from neural signals, followed by a post-trained large language model (LLM) that maps sequences of toneless Pinyin syllables to Chinese sentences. To enhance LLM decoding, we designed a three-stage post-training and two-stage inference framework based on a 7-billion-parameter LLM, achieving overall performance that exceeds larger commercial LLMs with hundreds of billions of parameters or more. In addition, several characteristics were observed in Mandarin speech production and perception: speech production involved neural responses across broader cortical regions than auditory perception; channels responsive to both modalities exhibited similar activity patterns, with speech perception showing a temporal delay relative to production; and decoding performance was broadly comparable across hemispheres. Our work not only establishes the feasibility of a unified decoding framework but also provides insights into the neural characteristics of Mandarin speech production and perception. These advances contribute to brain-to-text decoding in logosyllabic languages and pave the way toward neural language decoding systems supporting multiple modalities.

[AI-60] CLARE: Classification-based Regression for Electron Temperature Prediction

【速读】:该论文旨在解决空间天气研究中电子温度(Electron Temperature, Te)参数长期被忽视的问题,尤其是在地球等离子体层(plasmasphere)中Te预测精度不足的挑战。传统机器学习方法在处理连续输出的Te预测时存在准确性局限,且缺乏不确定性估计能力。解决方案的关键在于提出一种基于分类回归(classification-based regression)架构的新型机器学习模型CLARE,其将连续的Te输出空间离散化为150个分类区间,从而提升预测精度并自然输出预测不确定性信息。实验表明,相较于传统回归模型,CLARE在Akebono卫星数据上的预测准确率相对提升6.46%,并在已知地磁暴期间(1991年1月30日至2月7日)实现46.17%的预测准确率,验证了该方法在公开数据上构建高精度Te模型的有效性。

链接: https://arxiv.org/abs/2603.12470
作者: Michael Liang,Blake DeHaas,Naomi Maruyama,Xiangning Chu,Takumi Abe,Koh-Ichiro Oyama
机构: 未知
类目: pace Physics (physics.space-ph); Artificial Intelligence (cs.AI)
备注: 19 pages, 8 figures. Submitted to JGR: Machine Learning and Computation. Research conducted at CU Boulder LASP with support from NASA and JAXA

点击查看摘要

Abstract:Electron temperature (Te) is an important parameter governing space weather in the upper atmosphere, but has historically been underexplored in the space weather machine learning literature. We present CLARE, a machine learning model for predicting electron temperature in the Earth’s plasmasphere trained on AKEBONO (EXOS-D) satellite measurements as well as solar and geomagnetic indices. CLARE uses a classification-based regression architecture that transforms the continuous Te output space into 150 discrete classification intervals. Training the model on a classification task improves prediction accuracy by 6.46% relative compared to a traditional regression model while also outputting uncertainty estimation information on its predictions. On a held out test set from the AKEBONO data, the model’s Te predictions achieve 69.67% accuracy within 10% of the ground truth and 46.17% on a known geomagnetic storm period from January 30th to February 7th, 1991. We show that machine learning can be used to produce high-accuracy Te models on publicly available data.
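"分类式回归"的核心是把连续输出离散为 150 个区间、以类别分布的期望作为预测值、以分布离散度作为不确定性;下面是一个与具体模型无关的编码/解码示意(等宽分箱为演示假设,CLARE 的实际划分方式未在摘要中给出):

```python
import numpy as np

def make_bins(y, n_bins=150):
    # Discretize the continuous target range into equal-width intervals
    # (equal-width binning is an illustrative choice, not CLARE's spec).
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return edges, centers

def encode(y, edges):
    # Continuous value -> bin index (the classification label).
    return np.clip(np.searchsorted(edges, y, side="right") - 1, 0, len(edges) - 2)

def decode(class_probs, centers):
    # Expected value under the predicted class distribution; the spread of
    # the distribution doubles as a per-prediction uncertainty estimate.
    mean = class_probs @ centers
    std = np.sqrt(class_probs @ (centers - mean) ** 2)
    return mean, std

y = np.random.default_rng(1).uniform(500.0, 12000.0, size=1000)  # toy Te values (K)
edges, centers = make_bins(y)
labels = encode(y, edges)
# A perfectly confident classifier recovers the target up to bin width:
pred, sigma = decode(np.eye(150)[labels[0]], centers)
```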

[AI-61] he DIME Architecture: A Unified Operational Algorithm for Neural Representation Dynamics Control and Integration

【速读】:该论文旨在解决现代神经科学中缺乏一个统一的计算架构来整合感知、记忆、估值与意识等现象的问题。现有理论如预测编码、痕迹理论、神经调制模型及全局工作空间模型虽在各自领域具有解释力,但在架构层面仍难以融合。其解决方案的关键在于提出DIME(Detect-Integrate-Mark-Execute)神经架构,该架构通过四个相互作用的组件实现功能整合:痕迹(engrams)支持分布式递归神经结构和多条激活轨迹;执行线程(execution threads)实现时空轨迹上的神经过程;标记系统(marker systems)调节增益、可塑性和轨迹选择;超痕迹(hyperengrams)则与操作性意识访问相关联。这一框架不仅契合海马索引、皮层递归处理、回放现象及大规模网络整合等实证证据,也为生成式AI和机器人学提供了一个以统一机制涌现表征、估值与时间序列的架构模板。

链接: https://arxiv.org/abs/2603.12286
作者: Ionel Cristian Vladu,Nicu Bizdoaca,Ionica Pirici,Tudor-Adrian Balseanu,Eduard Nicusor Bondoc
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 45 pages, 8 figures. Architectural overview of the DIME framework. Extended theoretical treatment available in companion monograph (Zenodo)

点击查看摘要

Abstract:Modern neuroscience has accumulated extensive evidence on perception, memory, prediction, valuation, and consciousness, yet still lacks an explicit operational architecture capable of integrating these phenomena within a unified computational framework. Existing theories address specific aspects of neural function: predictive coding and active inference emphasize hierarchical inference and prediction error minimization; engram theories explain memory through distributed cell assemblies; neuromodulatory accounts focus on value-dependent regulation of plasticity and behaviour; and global workspace or large-scale network models investigate mechanisms underlying conscious access. Despite their explanatory power, these approaches remain only partially integrated at the architectural level. This work introduces DIME (Detect-Integrate-Mark-Execute), a neural architecture organizing perception, memory, valuation, and conscious access within a common operational cycle. The framework includes four interacting components: engrams, distributed recurrent neural structures supporting multiple activation trajectories; execution threads, spatiotemporal trajectories implementing neural processes; marker systems, neuromodulatory and limbic mechanisms regulating gain, plasticity, and trajectory selection; and hyperengrams, large-scale integrative states associated with operational conscious access. The framework is consistent with empirical evidence from hippocampal indexing, recurrent cortical processing, replay phenomena, large-scale network integration, and neuromodulatory regulation. Formulated at an abstract computational level, DIME may also inform artificial intelligence and robotics by providing an architectural template in which representation, valuation, and temporal sequencing emerge from a unified mechanism. An extended theoretical exposition is available in a companion monograph on Zenodo.

[AI-62] Predictive Analytics for Foot Ulcers Using Time-Series Temperature and Pressure Data

【速读】:该论文旨在解决糖尿病足溃疡(Diabetic Foot Ulcers, DFUs)早期预测难题,以实现更早的干预和降低发病率。其解决方案的关键在于构建一个基于可穿戴传感器的时间序列数据分析框架,利用NTC薄膜热电偶监测足部温度变化与FlexiForce压力传感器记录足底负荷分布,并结合无监督机器学习算法(Isolation Forest与K-Nearest Neighbors, KNN)识别潜在异常模式。通过严谨的数据预处理和特征工程,系统能够捕捉足部生理参数的细微波动,其中Isolation Forest对微小异常敏感,KNN则擅长检测极端偏离值,二者协同使用可提升预测准确性,从而为实时糖尿病足健康监测提供可靠依据。

链接: https://arxiv.org/abs/2603.12278
作者: Md Tanvir Hasan Turja
机构: 未知
类目: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 36 pages, 19 figures

点击查看摘要

Abstract:Diabetic foot ulcers (DFUs) are a severe complication of diabetes, often resulting in significant morbidity. This paper presents a predictive analytics framework utilizing time-series data captured by wearable foot sensors – specifically NTC thin-film thermocouples for temperature measurement and FlexiForce pressure sensors for plantar load monitoring. Data was collected from healthy subjects walking on an instrumented pathway. Unsupervised machine learning algorithms, Isolation Forest and K-Nearest Neighbors (KNN), were applied to detect anomalies that may indicate early ulcer risk. Through rigorous data preprocessing and targeted feature engineering, physiologic patterns were extracted to identify subtle changes in foot temperature and pressure. Results demonstrate Isolation Forest is sensitive to micro-anomalies, while KNN is effective in flagging extreme deviations, albeit at a higher false-positive rate. Strong correlations between temperature and pressure readings support combined sensor monitoring for improved predictive accuracy. These findings provide a basis for real-time diabetic foot health surveillance, aiming to facilitate earlier intervention and reduce DFU incidence.
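摘要中 Isolation Forest 与 KNN 的组合可用 scikit-learn 简要示意如下(温度/压力数值均为虚构的演示数据,污染率与 99 分位阈值亦属假设,并非论文设置):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy gait features per step: [mean foot temperature (deg C), peak plantar pressure (kPa)].
normal = np.column_stack([rng.normal(31.0, 0.5, 200), rng.normal(250.0, 20.0, 200)])
hotspot = np.array([[35.5, 340.0]])           # localized warming + overload (fictional)
X = np.vstack([normal, hotspot])

# Isolation Forest is sensitive to points that are easy to isolate (micro-anomalies).
iso_flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)  # -1 = anomaly

# KNN-distance scoring flags extreme deviations from the local neighborhood.
dist, _ = NearestNeighbors(n_neighbors=5).fit(X).kneighbors(X)
knn_score = dist[:, 1:].mean(axis=1)          # exclude self-distance
knn_flags = knn_score > np.percentile(knn_score, 99)
```

两种检测器对同一"热点"样本给出一致告警,呼应摘要中联合监测可提升准确性的结论。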

机器学习

[LG-0] ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training

链接: https://arxiv.org/abs/2603.13115
作者: Jie Ji,Gen Li,Kaiyuan Deng,Fatemeh Afghah,Xiaolong Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models, despite their impressive achievements, suffer from high computational costs and memory requirements, limiting their usability in resource-constrained environments. Sparse neural networks significantly alleviate these constraints by dramatically reducing parameter count and computational overhead. However, existing sparse training methods often experience chaotic and noisy gradient signals, severely hindering convergence and generalization performance, particularly at high sparsity levels. To tackle this critical challenge, we propose Zero-Order Sharpness-Aware Minimization (ZO-SAM), a novel optimization framework that strategically integrates zero-order optimization within the SAM approach. Unlike traditional SAM, ZO-SAM requires only a single backpropagation step during perturbation, selectively utilizing zero-order gradient estimations. This innovative approach reduces the backpropagation computational cost by half compared to conventional SAM, significantly lowering gradient variance and effectively eliminating associated computational overhead. By harnessing SAM’s capacity for identifying flat minima, ZO-SAM stabilizes the training process and accelerates convergence. These efficiency gains are particularly important in sparse training scenarios, where computational cost is the primary bottleneck that limits the practicality of SAM. Moreover, models trained with ZO-SAM exhibit improved robustness under distribution shift, further broadening its practicality in real-world deployments.
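ZO-SAM 的关键是在 SAM 的扰动步用零阶(有限差分)梯度估计替代一次反向传播,使每步只需一次真实反传;下面在一个二次损失上给出概念示意(随机方向数、步长等超参数均为演示假设,并非论文设置):

```python
import numpy as np

def zo_gradient(loss_fn, w, mu=1e-3, n_dirs=8, seed=0):
    # Zero-order gradient estimate via random-direction finite differences;
    # uses only forward evaluations of the loss, no backpropagation.
    rng = np.random.default_rng(seed)
    g = np.zeros_like(w)
    for _ in range(n_dirs):
        u = rng.normal(size=w.shape)
        g += (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2 * mu) * u
    return g / n_dirs

def zo_sam_step(loss_fn, grad_fn, w, lr=0.1, rho=0.05):
    # 1) Cheap zero-order estimate of the ascent direction (SAM perturbation).
    g_est = zo_gradient(loss_fn, w)
    eps = rho * g_est / (np.linalg.norm(g_est) + 1e-12)
    # 2) The single true backprop of the step, taken at the perturbed point.
    return w - lr * grad_fn(w + eps)

# Toy quadratic: loss = 0.5 * ||w||^2, true gradient = w.
loss_fn = lambda w: 0.5 * float(w @ w)
grad_fn = lambda w: w
w = np.ones(16)
for _ in range(50):
    w = zo_sam_step(loss_fn, grad_fn, w)
```

与标准 SAM 的两次反传相比,这里扰动方向只消耗前向评估,真实梯度仅在扰动点计算一次。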

[LG-1] Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors

链接: https://arxiv.org/abs/2603.13092
作者: Wei W. Xing,Kaiqi Huang,Jiazhan Liu,Hong Qiu,Shan Shen
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted by DAC2026. Initial Version

点击查看摘要

Abstract:Yield Multi-Corner Analysis validates circuits across 25+ Process-Voltage-Temperature corners, resulting in a combinatorial simulation cost of O(K × N) where K denotes corners and N exceeds 10^4 samples per corner. Existing methods face a fundamental trade-off: simple models achieve automation but fail on nonlinear circuits, while advanced AI models capture complex behaviors but require hours of hyperparameter tuning per design iteration, forming the Tuning Barrier. We break this barrier by replacing engineered priors (i.e., model specifications) with learned priors from a foundation model pre-trained on millions of regression tasks. This model performs in-context learning, instantly adapting to each circuit without tuning or retraining. Its attention mechanism automatically transfers knowledge across corners by identifying shared circuit physics between operating conditions. Combined with an automated feature selector (1152D to 48D), our method matches state-of-the-art accuracy (mean MREs as low as 0.11%) with zero tuning, reducing total validation cost by over 10×.

[LG-2] Causal Cellular Context Transfer Learning (C3TL): An Efficient Architecture for Prediction of Unseen Perturbation Effects

链接: https://arxiv.org/abs/2603.13051
作者: Michael Scholkemper,Sach Mukherjee
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 12 Pages, 3 figures, Keywords: perturbation prediction, context transfer, lightweight, machine learning

点击查看摘要

Abstract:Predicting the effects of chemical and genetic perturbations on quantitative cell states is a central challenge in computational biology, molecular medicine and drug discovery. Recent work has leveraged large-scale single-cell data and massive foundation models to address this task. However, such computational resources and extensive datasets are not always accessible in academic or clinical settings, hence limiting utility. Here we propose a lightweight framework for perturbation effect prediction that exploits the structured nature of biological interventions and specific inductive biases/invariances. Our approach leverages available information concerning perturbation effects to allow generalization to novel contexts and requires only widely-available bulk molecular data. Extensive testing, comparing predictions of context-specific perturbation effects against real, large-scale interventional experiments, demonstrates accurate prediction in new contexts. The proposed approach is competitive with SOTA foundation models but requires simpler data, much smaller model sizes and less time. Focusing on robust bulk signals and efficient architectures, we show that accurate prediction of perturbation effects is possible without proprietary hardware or very large models, hence opening up ways to leverage causal learning approaches in biomedicine generally.

[LG-3] 3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting

链接: https://arxiv.org/abs/2603.13049
作者: Jun Liu,Xiaohui Zhong,Kai Zheng,Jiarui Li,Yifei Li,Tao Zhou,Wenxu Qian,Shun Dai,Ruian Tie,Yangyang Zhao,Hao Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tropical cyclone (TC) intensity forecasting remains challenging as current numerical and AI-based weather models fail to satisfactorily represent extreme TC structure and intensity. Although intensity time-series forecasting has achieved significant advances, it outputs intensity sequences rather than the three-dimensional inner-core fine-scale structure and physical mechanisms governing TC evolution. High-resolution numerical simulations can capture these features but remain computationally expensive and inefficient for large-scale operational applications. Here we present 3DTCR, a physics-based generative framework combining physical constraints with generative AI efficiency for 3D TC structure reconstruction. Trained on a six-year, 3-km-resolution moving-domain WRF dataset, 3DTCR enables region-adaptive vortex-following reconstruction using conditional Flow Matching (CFM), optimized via latent domain adaptation and two-stage transfer learning. The framework mitigates limitations imposed by low-resolution targets and over-smoothed forecasts, improving the representation of TC inner-core structure and intensity while maintaining track stability. Results demonstrate that 3DTCR outperforms the ECMWF high-resolution forecasting system (ECMWF-HRES) in TC intensity prediction at nearly all lead times up to 5 days and reduces the RMSE of maximum WS10M by 36.5% relative to its FuXi inputs. These findings highlight 3DTCR as a physics-based generative framework that efficiently resolves fine-scale structures at lower computational cost, which may offer a promising avenue for improving TC intensity forecasting.
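3DTCR 采用条件流匹配(CFM);其训练目标可示意如下:在噪声 x0 与数据 x1 的线性插值点上,回归恒定目标速度 x1 - x0(此为 CFM 的标准线性路径形式,论文的具体条件化与网络结构未在摘要中给出):

```python
import numpy as np

def cfm_pair_loss(model, x0, x1, t):
    # Conditional Flow Matching with linear paths: interpolate noise x0
    # toward data x1 and regress the model onto the target velocity x1 - x0.
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target_v = x1 - x0
    return float(np.mean((model(x_t, t) - target_v) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 8))                   # noise samples
x1 = rng.normal(loc=3.0, size=(32, 8))          # toy "data" samples
t = rng.uniform(size=32)

perfect = lambda x_t, t: x1 - x0                # oracle velocity for this pairing
zero = lambda x_t, t: np.zeros_like(x_t)        # uninformed baseline
```

对该配对而言最优速度场恰为 x1 - x0,输出它的模型损失为零,而零速度场的损失明显偏大。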

[LG-4] OpenACMv2: An Accuracy-Constrained Co-Optimization Framework for Approximate DCiM

链接: https://arxiv.org/abs/2603.13042
作者: Yiqi Zhou,Yue Yuan,Yikai Wang,Bohao Liu,Qinxin Mei,Zhuohua Liu,Shan Shen,Wei Xing,Daying Sun,Li Li,Guozhu Liu
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted by DAC2026. Initial version

点击查看摘要

Abstract:Digital Compute-in-Memory (DCiM) accelerates neural networks by reducing data movement. Approximate DCiM can further improve power-performance-area (PPA), but demands accuracy-constrained co-optimization across coupled architecture and transistor-level choices. Building on OpenYield, we introduce Accuracy-Constrained Co-Optimization (ACCO) and present OpenACMv2, an open framework that operationalizes ACCO via two-level optimization: (1) accuracy-constrained architecture search of compressor combinations and SRAM macro parameters, driven by a fast GNN-based surrogate for PPA and error; and (2) variation- and PVT-aware transistor sizing for standard cells and SRAM bitcells using Monte Carlo. By decoupling ACCO into architecture-level exploration and circuit-level sizing, OpenACMv2 integrates classic single- and multi-objective optimizers to deliver strong PPA-accuracy tradeoffs and robust convergence. The workflow is compatible with FreePDK45 and OpenROAD, supporting reproducible evaluation and easy adoption. Experiments demonstrate significant PPA improvements under controlled accuracy budgets, enabling rapid “what-if” exploration for approximate DCiM. The framework is available on this https URL.

[LG-5] Federated Few-Shot Learning on Neuromorphic Hardware: An Empirical Study Across Physical Edge Nodes

链接: https://arxiv.org/abs/2603.13037
作者: Steven Motta,Gioele Nanni
类目: Neural and Evolutionary Computing (cs.NE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 pages, 2 figures, 10 tables. Code: this https URL

点击查看摘要

Abstract:Federated learning on neuromorphic hardware remains unexplored because on-chip spike-timing-dependent plasticity (STDP) produces binary weight updates rather than the floating-point gradients assumed by standard algorithms. We build a two-node federated system with BrainChip Akida AKD1000 processors and run approximately 1,580 experimental trials across seven analysis phases. Of four weight-exchange strategies tested, neuron-level concatenation (FedUnion) consistently preserves accuracy while element-wise weight averaging (FedAvg) destroys it (p = 0.002). Domain-adaptive fine-tuning of the upstream feature extractor accounts for most of the accuracy gains, confirming feature quality as the dominant factor. Scaling feature dimensionality from 64 to 256 yields 77.0% best-strategy federated accuracy (n=30, p < 0.001). Two independent asymmetries (wider features help federation more than individual learning, while binarization hurts federation more) point to a shared prototype complementarity mechanism: cross-node transfer scales with the distinctiveness of neuron prototypes.
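Why concatenation preserves accuracy while element-wise averaging destroys it can be seen with a tiny nearest-prototype head: when two nodes' per-class rows are not aligned in feature space, averaging them produces prototypes that belong to neither node. The numbers below are hypothetical and the classifier is a stand-in for the paper's neuron-level heads.

```python
import numpy as np

# Hypothetical nearest-prototype heads learned independently on two nodes;
# the label-to-region assignment differs across nodes, so per-class rows
# are not aligned.
protos_a = {0: np.array([1.0, 0.0]), 1: np.array([-1.0, 0.0])}
protos_b = {0: np.array([-0.9, 0.5]), 1: np.array([0.9, 0.5])}

def predict(protos, x):
    # Label of the closest prototype row; a class may own several rows.
    return min(protos, key=lambda c: min(np.linalg.norm(x - p)
                                         for p in np.atleast_2d(protos[c])))

# FedAvg-style merge: element-wise average of the per-class rows.
avg = {c: (protos_a[c] + protos_b[c]) / 2 for c in (0, 1)}

# FedUnion-style merge: concatenate rows, keeping each row's class label.
union = {c: np.stack([protos_a[c], protos_b[c]]) for c in (0, 1)}

# A class-0 sample drawn from node B's local distribution.
xb = np.array([-0.9, 0.5])
print(predict(union, xb), predict(avg, xb))  # prints: 0 1
```

The concatenated head still recognizes node B's class-0 region, while the averaged head misclassifies it, mirroring the FedUnion-vs-FedAvg gap reported above.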

[LG-6] PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

链接: https://arxiv.org/abs/2603.13026
作者: Chenlong Yin,Runpeng Geng,Yanting Wang,Jinyuan Jia
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 26 pages, 3 figures

点击查看摘要

Abstract:Prompt injection poses serious security risks to real-world LLM applications, particularly autonomous agents. Although many defenses have been proposed, their robustness against adaptive attacks remains insufficiently evaluated, potentially creating a false sense of security. In this work, we propose PISmith, a reinforcement learning (RL)-based red-teaming framework that systematically assesses existing prompt-injection defenses by training an attack LLM to optimize injected prompts in a practical black-box setting, where the attacker can only query the defended LLM and observe its outputs. We find that directly applying standard GRPO to attack strong defenses leads to sub-optimal performance due to extreme reward sparsity – most generated injected prompts are blocked by the defense, causing the policy’s entropy to collapse before discovering effective attack strategies, while the rare successes cannot be learned effectively. In response, we introduce adaptive entropy regularization and dynamic advantage weighting to sustain exploration and amplify learning from scarce successes. Extensive evaluation on 13 benchmarks demonstrates that state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks. We also compare PISmith with 7 baselines across static, search-based, and RL-based attack categories, showing that PISmith consistently achieves the highest attack success rates. Furthermore, PISmith achieves strong performance in agentic settings on InjecAgent and AgentDojo against both open-source and closed-source LLMs (e.g., GPT-4o-mini and GPT-5-nano). Our code is available at this https URL.

[LG-7] FraudFox: Adaptable Fraud Detection in the Real World

链接: https://arxiv.org/abs/2603.13014
作者: Matthew Butler,Yi Fan,Christos Faloutsos
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proposed method (FraudFox) provides solutions to adversarial attacks in a resource-constrained environment. We focus on questions like the following: How suspicious is 'Smith', trying to buy $500 shoes, on Monday 3am? How do we merge the risk scores from a handful of risk-assessment modules ('oracles') in an adversarial environment? More importantly, given historical data (orders, prices, and what happened afterwards) and business goals/restrictions, which transactions, like the 'Smith' transaction above, should we 'pass', versus send to human investigators? The business restrictions could be: 'at most x investigations are feasible', or 'at most $y lost due to fraud'. These are the two research problems we focus on in this work. One approach to the first problem ('oracle weighting') is to use Extended Kalman Filters with dynamic importance weights, to automatically and continuously update our weights for each 'oracle'. For the second problem, we show how to derive an optimal decision surface, and how to compute the Pareto-optimal set, to allow what-if questions. An important consideration is adaptation: fraudsters will change their behavior according to our past decisions; thus, we need to adapt accordingly. The resulting system, FraudFox, is scalable, adaptable to changing fraudster behavior, effective, and already in production at Amazon. FraudFox augments a fraud-prevention sub-system and has led to significant performance gains.
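The what-if analysis over business constraints amounts to computing a Pareto-optimal set over competing costs, here (number of investigations, fraud loss), both to be minimized. The candidate operating points below are made up; the dominance test is the standard one and not the paper's implementation.

```python
def pareto_front(points):
    """Return the Pareto-optimal subset when minimizing both coordinates.

    points: distinct (investigations, fraud_loss) pairs for candidate
    decision thresholds (hypothetical numbers for illustration).
    """
    front = []
    for p in points:
        # p survives unless some other point is at least as good on both axes
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points):
            front.append(p)
    return sorted(front)

candidates = [(10, 900), (20, 500), (20, 700), (40, 450), (60, 100), (60, 200)]
print(pareto_front(candidates))  # prints: [(10, 900), (20, 500), (40, 450), (60, 100)]
```

Each surviving point is a defensible operating choice; picking among them is exactly the business what-if question ("at most x investigations" selects the front point with the largest loss reduction under that budget).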

[LG-8] Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

链接: https://arxiv.org/abs/2603.12996
作者: Bumjun Kim,Dongjae Jeon,Moongyu Jeon,Albert No
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs.
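The core selection step, picking an independent set on an attention-induced dependency graph, can be sketched as a greedy procedure: threshold attention strengths into edges, then accept masked tokens in confidence order, skipping any token coupled to one already accepted. All numbers and the threshold `tau` below are hypothetical, and this simplified construction stands in for DAPD's actual graph-building rule.

```python
import numpy as np

def select_parallel_tokens(confidence, attn, tau=0.5):
    """Greedy independent set over a dependency graph induced by attention.

    confidence: per-masked-token confidence (higher = safer to unmask)
    attn: symmetric attention-strength matrix between masked tokens;
          entries above tau are treated as dependency edges.
    """
    adj = np.asarray(attn) > tau
    chosen = []
    for i in np.argsort(-np.asarray(confidence)):  # most confident first
        if all(not adj[i, j] for j in chosen):     # independent of chosen set
            chosen.append(int(i))
    return sorted(chosen)

conf = [0.9, 0.8, 0.7, 0.6]
attn = np.array([
    [0.0, 0.9, 0.1, 0.1],   # tokens 0 and 1 strongly coupled
    [0.9, 0.0, 0.1, 0.7],   # tokens 1 and 3 strongly coupled
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.7, 0.1, 0.0],
])
print(select_parallel_tokens(conf, attn))  # prints: [0, 2, 3]
```

Token 1 is deferred to a later denoising step because it is strongly coupled to token 0, while the weakly dependent tokens 0, 2, and 3 are unmasked in parallel.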

[LG-9] Retrieval-Enhanced Real Estate Appraisal ECML2024

链接: https://arxiv.org/abs/2603.12986
作者: Simon Popelier,Matthieu X. B. Sarazin,Maximilien Bohm,Mathieu Gierski,Hanna Mergui,Matthieu Ospici,Adrien Bernhardt
类目: Machine Learning (cs.LG)
*备注: Accepted at NFMCP 2024 workshop (New Frontiers in Mining Complex Patterns), held in conjunction with ECML 2024

点击查看摘要

Abstract:The Sales Comparison Approach (SCA) is one of the most popular methods for real estate appraisal. Used as a reference in real estate expertise and as one of the major types of Automatic Valuation Models (AVM), it recently gained popularity within machine learning methods. The performance of models able to use data represented as sets and graphs made it possible to adapt this methodology efficiently, yielding substantial results. SCA relies on taking past transactions (comparables) as references, selected according to their similarity with the target property’s sale. In this study, we focus on the selection of these comparables for real estate appraisal. We demonstrate that the selection of comparables used in many state-of-the-art algorithms can be significantly improved by learning a selection policy instead of imposing it. Our method relies on a hybrid vector-geographical retrieval module capable of adapting to different datasets and optimized jointly with an estimation module. We further show that the use of carefully selected comparables makes it possible to build models that require fewer comparables and fewer parameters with performance close to state-of-the-art models. All our evaluations are carried out on five datasets spanning areas in the United States, Brazil, and France.
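A fixed (non-learned) baseline of the Sales Comparison Approach helps fix ideas: retrieve the k most similar past transactions under a hybrid feature + geographic distance, then estimate the price as a distance-weighted average. Everything here, features, prices, and the `geo_weight` mixing parameter, is hypothetical; the paper's contribution is precisely to learn this retrieval policy rather than impose it as below.

```python
import numpy as np

def appraise(query, comparables, k=3, geo_weight=1.0):
    """Toy SCA: k-nearest comparables under a hybrid distance, then a
    distance-weighted price average (illustrative fixed policy)."""
    feats = np.array([c["features"] for c in comparables], dtype=float)
    geos = np.array([c["latlon"] for c in comparables], dtype=float)
    prices = np.array([c["price"] for c in comparables], dtype=float)
    # Hybrid distance: feature similarity plus weighted geographic proximity.
    d = (np.linalg.norm(feats - query["features"], axis=1)
         + geo_weight * np.linalg.norm(geos - query["latlon"], axis=1))
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)  # closer comparables weigh more
    return float(np.average(prices[idx], weights=w))

comps = [
    {"features": [100, 3], "latlon": [0.0, 0.0], "price": 300_000},
    {"features": [100, 3], "latlon": [0.1, 0.0], "price": 310_000},
    {"features": [200, 5], "latlon": [5.0, 5.0], "price": 600_000},
]
q = {"features": np.array([100.0, 3.0]), "latlon": np.array([0.0, 0.0])}
print(round(appraise(q, comps, k=2)))
```

With an exact comparable in the pool, the estimate collapses onto its price (about 300,000 here); the learned variant replaces the fixed distance with a trainable retrieval module optimized jointly with the estimator.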

[LG-10] Exact Federated Continual Unlearning for Ridge Heads on Frozen Foundation Models

链接: https://arxiv.org/abs/2603.12977
作者: Yijun Quan,Wentai Wu,Giovanni Montana
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models are commonly deployed as frozen feature extractors with a small trainable head to adapt to private, user-generated data in federated settings. The "right to be forgotten" requires removing the influence of specific samples or users from the trained model on demand. Existing federated unlearning methods target general deep models and rely on approximate reconstruction or selective retraining, making exactness costly or elusive. We study this problem in a practically relevant but under-explored regime: a frozen foundation model with a ridge-regression head. The exact optimum depends on the data only through two additive sufficient statistics, which we turn into a communication protocol supporting an arbitrary stream of add and delete requests via fixed-size messages. The server maintains a head that is, in exact arithmetic, pointwise identical to centralized retraining after every request. We provide deterministic retrain-equivalence guarantees, order and partition invariance, two server-side variants, and a Bayesian certificate of zero KL divergence. Experiments on four benchmarks confirm the guarantees: both variants match centralized ridge retraining to within 10^-9 relative Frobenius error and complete each request at orders-of-
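The additive-sufficient-statistics argument can be verified in a few lines: a ridge head depends on the data only through A = XᵀX and b = Xᵀy, which are sums over clients, so deleting a client is subtracting its contribution and re-solving. The sketch below uses random stand-in features (not the paper's code or datasets) to show that the deleted head matches centralized retraining exactly, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1e-2  # ridge penalty
d = 5

# Frozen-backbone features X_i and targets y_i for three "clients".
clients = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(3)]

# Each client sends fixed-size sufficient statistics A_i = X_i^T X_i, b_i = X_i^T y_i.
A = sum(X.T @ X for X, y in clients)
b = sum(X.T @ y for X, y in clients)
w = np.linalg.solve(A + lam * np.eye(d), b)

# Exact deletion of client 2: subtract its statistics and re-solve.
X2, y2 = clients[2]
w_del = np.linalg.solve(A - X2.T @ X2 + lam * np.eye(d), b - X2.T @ y2)

# Centralized retraining on the retained clients yields the identical head.
Xr = np.vstack([X for X, y in clients[:2]])
yr = np.concatenate([y for X, y in clients[:2]])
w_ref = np.linalg.solve(Xr.T @ Xr + lam * np.eye(d), Xr.T @ yr)
print(np.allclose(w_del, w_ref))  # prints: True
```

Because addition is commutative and associative, the same mechanism gives the order and partition invariance claimed in the abstract: any stream of add/delete requests ends at the same (A, b), hence the same head.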

[LG-11] Enhanced Drug-drug Interaction Prediction Using Adaptive Knowledge Integration

链接: https://arxiv.org/abs/2603.12885
作者: Pengfei Liu,Jun Tao,Zhixiang Ren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Drug-drug interaction event (DDIE) prediction is crucial for preventing adverse reactions and ensuring optimal therapeutic outcomes. However, existing methods often face challenges with imbalanced datasets, complex interaction mechanisms, and poor generalization to unknown drug combinations. To address these challenges, we propose a knowledge augmentation framework that adaptively infuses prior drug knowledge into a large language model (LLM). This framework utilizes reinforcement learning techniques to facilitate adaptive knowledge extraction and synthesis, thereby efficiently optimizing the strategy space to enhance the accuracy of LLMs for DDIE predictions. As a result of few-shot learning, we achieved a notable improvement compared to the baseline. This approach establishes an effective framework for scientific knowledge learning for DDIE predictions.

[LG-12] Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks

链接: https://arxiv.org/abs/2603.12875
作者: Kun Wang,Reinhard Heckel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised finetuning. However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) alignment method for train-before-test. First, RL with a single sample provides a first alignment of the model to the task format, and second, test-time RL with majority-voting reward aligns the model to the benchmark distribution. Our test-time RL alignment method aligns as well as SFT-based train-before-test, but without requiring a task-specific training set. On a domain-specific benchmark without training data, we show that direct evaluation underestimates base models, which perform substantially better once aligned, yielding a more faithful evaluation of their capabilities. Moreover, for reasoning tasks, the performance gap between fine-tuned models and their base models largely disappears after alignment, suggesting that many gains from RLVR/SFT reported in the literature are not a difference in reasoning capability, but rather artifacts of task familiarity.
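The majority-voting reward in the second stage needs no labels: sample several answers per question and reward each sample by agreement with the modal answer. A minimal sketch of that reward computation (the sampled answers are made up; the RL update around it is omitted):

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Label-free pseudo-reward for test-time RL: each sampled answer
    is rewarded 1.0 if it matches the majority answer, else 0.0."""
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

samples = ["42", "42", "41", "42", "7"]
print(majority_vote_rewards(samples))  # prints: [1.0, 1.0, 0.0, 1.0, 0.0]
```

Optimizing this reward pushes the policy toward its own consensus on the benchmark distribution, which is what aligns the model to the task without any training set.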

[LG-13] Surrogates for Physics-based and Data-driven Modelling of Parametric Systems: Review and New Perspectives

链接: https://arxiv.org/abs/2603.12870
作者: Matteo Giacomini,Pedro Díez
类目: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Surrogate models provide compact relations between user-defined input parameters and output quantities of interest, enabling the efficient evaluation of complex parametric systems in many-query settings. Such capabilities are essential in a wide range of applications, including optimisation, control, data assimilation, uncertainty quantification, and emerging digital twin technologies in various fields such as manufacturing, personalised healthcare, smart cities, and sustainability. This article reviews established methodologies for constructing surrogate models exploiting either knowledge of the governing laws and the dynamical structure of the system (physics-based) or experimental observations (data-driven), as well as hybrid approaches combining these two paradigms. By revisiting the design of a surrogate model as a functional approximation problem, existing methodologies are reviewed in terms of the choice of (i) a reduced basis and (ii) a suitable approximation criterion. The paper reviews methodologies pertaining to the field of Scientific Machine Learning, and it aims at synthesising established knowledge, recent advances, and new perspectives on: dimensionality reduction, physics-based, and data-driven surrogate modelling based on proper orthogonal decomposition, proper generalised decomposition, and artificial neural networks; multi-fidelity methods to exploit information from sources with different fidelities; adaptive sampling, enrichment, and data augmentation techniques to enhance the quality of surrogate models.
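Of the reduced-basis techniques the review covers, proper orthogonal decomposition is the easiest to sketch: collect solution snapshots as columns, take the SVD, keep the leading left singular vectors as a basis, and approximate new solutions by projection. The snapshot family below (a one-parameter sine family) is a made-up stand-in for a parametric PDE solution set.

```python
import numpy as np

# Snapshot matrix: each column is one solution of the parametric system.
x = np.linspace(0, 1, 200)
mus = np.linspace(1, 2, 40)
snapshots = np.stack([np.sin(np.pi * mu * x) for mu in mus], axis=1)

# POD basis = left singular vectors of the snapshot matrix.
U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)
r = int(np.searchsorted(energy, 1 - 1e-8)) + 1  # modes capturing ~all energy

# Project a new, unseen-parameter snapshot onto the reduced basis.
new = np.sin(np.pi * 1.37 * x)
recon = U[:, :r] @ (U[:, :r].T @ new)
err = float(np.linalg.norm(new - recon) / np.linalg.norm(new))
print(r, err)  # few modes, small relative error expected for this smooth family
```

The rapid singular-value decay of smooth parametric families is what makes such surrogates compact; the review's data-driven and hybrid methods differ mainly in how the basis and the reduced coefficients are obtained.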

[LG-14] On Linear Separability of the MNIST Handwritten Digits Dataset

链接: https://arxiv.org/abs/2603.12850
作者: Ákos Hajnal
类目: Machine Learning (cs.LG)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:The MNIST dataset containing thousands of handwritten digit images is still a fundamental benchmark for evaluating various pattern-recognition and image-classification models. Linear separability is a key concept in many statistical and machine-learning techniques. Despite the long history of the MNIST dataset and its relative simplicity in size and resolution, the question of whether the dataset is linearly separable has never been fully answered – scientific and informal sources share conflicting claims. This paper aims to provide a comprehensive empirical investigation to address this question, distinguishing pairwise and one-vs-rest separation of the training, the test and the combined sets, respectively. It reviews the theoretical approaches to assessing linear separability, alongside state-of-the-art methods and tools, then systematically examines all relevant assemblies, and reports the findings.
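A theoretical point worth making concrete: the perceptron gives a one-sided separability test, since convergence to zero mistakes is a certificate that a separating hyperplane exists, while failure within a finite budget is only evidence against separability. A minimal sketch on toy 2D data (not MNIST; MNIST-scale checks need LP or SVM solvers):

```python
import numpy as np

def is_linearly_separable(X, y, max_epochs=1000):
    """Heuristic separability test via the perceptron: an epoch with no
    mistakes certifies a separating hyperplane; exhausting the budget is
    only evidence of non-separability, not a proof."""
    X = np.hstack([np.asarray(X, float), np.ones((len(X), 1))])  # bias column
    y = np.where(np.asarray(y) > 0, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

sep = ([[0, 0], [0, 1], [2, 2], [3, 2]], [0, 0, 1, 1])  # separable
xor = ([[0, 0], [1, 1], [0, 1], [1, 0]], [0, 0, 1, 1])  # classic XOR, not
print(is_linearly_separable(*sep), is_linearly_separable(*xor))  # prints: True False
```

This asymmetry is exactly why the MNIST question stayed open: affirmative answers are cheap to certify, negative ones require exact infeasibility arguments.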

[LG-15] From AI Weather Prediction to Infrastructure Resilience: A Correction-Downscaling Framework for Tropical Cyclone Impacts

链接: https://arxiv.org/abs/2603.12828
作者: You Wu,Zhenguo Wang,Naiyu Wang
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses a missing capability in infrastructure resilience: turning fast, global AI weather forecasts into asset-scale, actionable risk. We introduce the AI-based Correction-Downscaling Framework (ACDF), which transforms coarse AI weather prediction (AIWP) into 500-m, unbiased wind fields and transmission tower/line failure probabilities for tropical cyclones. ACDF separates storm-scale bias correction from terrain-aware downscaling, preventing error propagation while restoring sub-kilometer variability that governs structural loading. Tested on 11 typhoons affecting Zhejiang, China under leave-one-storm-out evaluation, ACDF reduces station-scale wind-speed MAE by 38.8% versus Pangu-Weather, matches observation-assimilated mesoscale analyses, yet runs in 25 s per 12-h cycle on a single GPU. In the Typhoon Hagupit case, ACDF reproduced observed high-wind tails, isolated a coastal high-risk corridor, and flagged the line that failed, demonstrating actionable guidance at tower and line scales. ACDF provides an end-to-end pathway from AI global forecasts to operational, impact-based early warning for critical infrastructure.

[LG-16] A Multi-task Large Reasoning Model for Molecular Science

链接: https://arxiv.org/abs/2603.12808
作者: Pengfei Liu,Shuang Ge,Jun Tao,Zhixiang Ren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advancements in artificial intelligence for molecular science are necessitating a paradigm shift from purely data-driven predictions to knowledge-guided computational reasoning. Existing molecular models are predominantly proprietary, lacking general molecular intelligence and generalizability. This underscores the necessity for computational methods that can effectively integrate scientific logic with deep learning architectures. Here we introduce a multi-task large reasoning model designed to emulate the cognitive processes of molecular scientists through structured reasoning and reflection. Our approach incorporates multi-specialist modules to provide versatile molecular expertise and a chain-of-thought (CoT) framework enhanced by reinforcement learning infused with molecular knowledge, enabling structured and reflective reasoning. Systematic evaluations across 10 molecular tasks and 47 metrics demonstrate that our model achieves an average 50.3% improvement over the base architecture, outperforming over 20 state-of-the-art baselines, including ultra-large-parameter foundation models, despite using significantly fewer training data and computational resources. This validates that embedding explicit reasoning mechanisms enables high-efficiency learning, allowing smaller-scale models to surpass massive counterparts in both efficacy and interpretability. The practical utility of this computational framework was validated through a case study on the design of central nervous system (CNS) drug candidates, illustrating its capacity to bridge data-driven and knowledge-integrated approaches for intelligent molecular design.

[LG-17] A Fractional Fox H-Function Kernel for Support Vector Machines: Robust Classification via Weighted Transmutation Operators

链接: https://arxiv.org/abs/2603.12794
作者: Gustavo Dorrego
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:Support Vector Machines (SVMs) rely heavily on the choice of the kernel function to map data into high-dimensional feature spaces. While the Gaussian Radial Basis Function (RBF) is the industry standard, its exponential decay makes it highly susceptible to structural noise and outliers, often leading to severe overfitting in complex datasets. In this paper, we propose a novel class of non-stationary kernels derived from the fundamental solution of the generalized time-space fractional diffusion-wave equation. By leveraging a structure-preserving transmutation method over Weighted Sobolev Spaces, we introduce the Fox-Dorrego Kernel, an exact analytical Mercer kernel governed by the Fox H-function. Unlike standard kernels, our formulation incorporates an aging weight function (the “Amnesia Effect”) to penalize distant outliers and a fractional asymptotic power-law decay to allow for robust, heavy-tailed feature mapping (analogous to Lévy flights). Numerical experiments on both synthetic datasets and real-world high-dimensional radar data (Ionosphere) demonstrate that the proposed Fox-Dorrego kernel consistently outperforms the standard Gaussian RBF baseline, reducing the classification error rate by approximately 50% while maintaining structural robustness against outliers.

[LG-18] Upper Bounds for Local Learning Coefficients of Three-Layer Neural Networks

链接: https://arxiv.org/abs/2603.12785
作者: Yuki Kurumadani
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Three-layer neural networks are known to form singular learning models, and their Bayesian asymptotic behavior is governed by the learning coefficient, or real log canonical threshold. Although this quantity has been clarified for regular models and for some special singular models, broadly applicable methods for evaluating it in neural networks remain limited. Recently, a formula for the local learning coefficient of semiregular models was proposed, yielding an upper bound on the learning coefficient. However, this formula applies only to nonsingular points in the set of realization parameters and cannot be used at singular points. In particular, for three-layer neural networks, the resulting upper bound has been shown to differ substantially from learning coefficient values already known in some cases. In this paper, we derive an upper-bound formula for the local learning coefficient at singular points in three-layer neural networks. This formula can be interpreted as a counting rule under budget constraints and demand-supply constraints, and is applicable to general analytic activation functions. In particular, it covers the swish function and polynomial functions, extending previous results to a wider class of activation functions. We further show that, when the input dimension is one, the upper bound obtained here coincides with the already known learning coefficient, thereby partially resolving the discrepancy above. Our result also provides a systematic perspective on how the weight parameters of three-layer neural networks affect the learning coefficient.

[LG-19] SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design

链接: https://arxiv.org/abs/2603.12724
作者: David van Dijk,Ivan Vrkic
类目: Machine Learning (cs.LG)
*备注: 35 pages, 19 figures, 9 tables

点击查看摘要

Abstract:Many of the most important problems in science and engineering are inverse problems: given a desired outcome, find a design that achieves it. Evaluating whether a candidate meets the spec is often routine; a binding energy can be computed, a reactor yield simulated, a pharmacokinetic profile predicted. But searching a combinatorial design space for inputs that satisfy those targets is fundamentally harder. We introduce SciDesignBench, a benchmark of 520 simulator-grounded tasks across 14 scientific domains and five settings spanning single-shot design, short-horizon feedback, long-horizon refinement, and seed-design optimization. On the 10-domain shared-core subset, the best zero-shot model reaches only 29.0% success despite substantially higher parse rates. Simulator feedback helps, but the leaderboard changes with horizon: Sonnet 4.5 is strongest in one-turn de novo design, whereas Opus 4.6 is strongest after 20 turns of simulator-grounded refinement. Providing a starting seed design reshuffles the leaderboard again, demonstrating that constrained modification requires a fundamentally different capability from unconstrained de novo generation. We then introduce RLSF, a simulator-feedback training recipe. An RLSF-tuned 8B model raises single-turn success rates by 8-17 percentage points across three domains. Together, these results position simulator-grounded inverse design as both a benchmark for scientific reasoning and a practical substrate for amortizing expensive test-time compute into model weights.

[LG-20] Design-Specification Tiling for ICL-based CAD Code Generation

链接: https://arxiv.org/abs/2603.12712
作者: Yali Du,San-Zhuo Xi,Hui Sun,Ming Li
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet they underperform on domain-specific tasks such as Computer-Aided Design (CAD) code generation due to scarce training data. In-Context Learning (ICL) offers a training-free alternative through task-specific exemplars. However, existing selection strategies prioritize similarity or point-wise diversity, often producing redundant selections that fail to satisfy the compositional requirements of complex CAD design specifications. In this work, we propose knowledge sufficiency as a principled objective for exemplar selection that aims to maximally satisfy all requirements within design specifications. To realize this objective, we introduce Design-Specification Tiling (DST), which quantifies knowledge sufficiency through a surrogate tiling ratio by extracting multi-granular design components and measuring the proportion of query components covered by selected exemplars. We demonstrate that maximizing this objective constitutes submodular maximization and provide a polynomial-time greedy algorithm with a (1-1/e)-approximation guarantee. Extensive experiments demonstrate that DST substantially improves CAD code generation quality, consistently outperforming existing exemplar selection strategies in ICL.
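Because the tiling objective is a coverage function, the greedy algorithm with its (1-1/e) guarantee is short to sketch: repeatedly pick the exemplar whose components cover the most still-uncovered query components. The component names and exemplar pool below are hypothetical stand-ins for DST's multi-granular CAD design components.

```python
def greedy_tiling(query_components, exemplars, k):
    """Greedy (1 - 1/e)-approximate maximization of the tiling ratio:
    pick up to k exemplars whose component sets cover the most query
    components (a coverage function, hence submodular)."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(exemplars,
                   key=lambda e: len((set(e[1]) & query_components) - covered))
        gain = (set(best[1]) & query_components) - covered
        if not gain:          # no exemplar adds coverage; stop early
            break
        chosen.append(best[0])
        covered |= gain
    return chosen, len(covered) / len(query_components)

query = {"sketch", "extrude", "fillet", "pattern", "shell"}
pool = [
    ("ex1", ["sketch", "extrude"]),
    ("ex2", ["fillet", "pattern"]),
    ("ex3", ["sketch", "extrude", "fillet"]),
    ("ex4", ["shell"]),
]
print(greedy_tiling(query, pool, k=3))  # prints: (['ex3', 'ex2', 'ex4'], 1.0)
```

Note that greedy prefers the broader `ex3` over the similarity-style pick of redundant near-duplicates, which is exactly the failure mode of point-wise selection strategies described above.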

[LG-21] RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models

链接: https://arxiv.org/abs/2603.12694
作者: Zhenkun Shi,Jun Zhu,Dehang Wang,BoYu Chen,Qianqian Yuan,Zhitao Mao,Fan Wei,Weining Wu,Xiaoping Liao,Hongwu Ma
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:A key challenge in enzyme annotation is identifying the biochemical reactions catalyzed by proteins. Most existing methods rely on Enzyme Commission (EC) numbers as intermediaries: they first predict an EC number and then retrieve the associated reactions. This indirect strategy introduces ambiguity due to the complex many-to-many mappings among proteins, EC numbers, and reactions, and is further complicated by frequent updates to EC numbers and inconsistencies across databases. To address these challenges, we present RXNRECer, a transformer-based ensemble framework that directly predicts enzyme-catalyzed reactions without relying on EC numbers. It integrates protein language modeling and active learning to capture both high-level sequence semantics and fine-grained transformation patterns. Evaluations on curated cross-validation and temporal test sets demonstrate consistent improvements over six EC-based baselines, with gains of 16.54% in F1 score and 15.43% in accuracy. Beyond accuracy gains, the framework offers clear advantages for downstream applications, including scalable proteome-wide reaction annotation, enhanced specificity in refining generic reaction schemas, systematic annotation of previously uncurated proteins, and reliable identification of enzyme promiscuity. By incorporating large language models, it also provides interpretable rationales for predictions. These capabilities make RXNRECer a robust and versatile solution for EC-free, fine-grained enzyme function prediction, with potential applications across multiple areas of enzyme research and industrial applications.

[LG-22] Colluding LoRA: A Composite Attack on LLM Safety Alignment

链接: https://arxiv.org/abs/2603.12681
作者: Sihao Ding
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Colluding LoRA (CoLoRA), an attack in which each adapter appears benign and plausibly functional in isolation, yet their linear composition consistently compromises safety. Unlike attacks that depend on specific input triggers or prompt patterns, CoLoRA is a composition-triggered broad refusal suppression: once a particular set of adapters is loaded, the model undergoes effective alignment degradation, complying with harmful requests without requiring adversarial prompts or suffixes. This attack exploits the combinatorial blindness of current defense systems, where exhaustively scanning all compositions is computationally intractable. Across several open-weight LLMs, CoLoRA achieves benign behavior individually yet high attack success rate after composition, indicating that securing modular LLM supply-chains requires moving beyond single-module verification toward composition-aware defenses.

[LG-23] Disentangled Latent Dynamics Manifold Fusion for Solving Parameterized PDEs

链接: https://arxiv.org/abs/2603.12676
作者: Zhangyong Liang,Ji Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generalizing neural surrogate models across different PDE parameters remains difficult because changes in PDE coefficients often make learning harder and optimization less stable. The problem becomes even more severe when the model must also predict beyond the training time range. Existing methods usually cannot handle parameter generalization and temporal extrapolation at the same time. Standard parameterized models treat time as just another input and therefore fail to capture intrinsic dynamics, while recent continuous-time latent methods often rely on expensive test-time auto-decoding for each instance, which is inefficient and can disrupt continuity across the parameterized solution space. To address this, we propose Disentangled Latent Dynamics Manifold Fusion (DLDMF), a physics-informed framework that explicitly separates space, time, and parameters. Instead of unstable auto-decoding, DLDMF maps PDE parameters directly to a continuous latent embedding through a feed-forward network. This embedding initializes and conditions a latent state whose evolution is governed by a parameter-conditioned Neural ODE. We further introduce a dynamic manifold fusion mechanism that uses a shared decoder to combine spatial coordinates, parameter embeddings, and time-evolving latent states to reconstruct the corresponding spatiotemporal solution. By modeling prediction as latent dynamic evolution rather than static coordinate fitting, DLDMF reduces interference between parameter variation and temporal evolution while preserving a smooth and coherent solution manifold. As a result, it performs well on unseen parameter settings and in long-term temporal extrapolation. Experiments on several benchmark problems show that DLDMF consistently outperforms state-of-the-art baselines in accuracy, parameter generalization, and extrapolation robustness.

[LG-24] Sobolev–Ricci Curvature

链接: https://arxiv.org/abs/2603.12652
作者: Kyoichi Iwasaki,Tam Le,Hideitsu Hino
类目: Machine Learning (cs.LG)
*备注: 42 pages, 13 figures

点击查看摘要

Abstract:Ricci curvature is a fundamental concept in differential geometry for encoding local geometric structure, and its graph-based analogues have recently gained prominence as practical tools for reweighting, pruning, and reshaping network geometry. We propose Sobolev-Ricci Curvature (SRC), a graph Ricci curvature canonically induced by Sobolev transport geometry, which admits efficient evaluation via a tree-metric Sobolev structure on neighborhood measures. We establish two consistency behaviors that anchor SRC to classical transport curvature: (i) on trees endowed with the length measure, SRC recovers Ollivier-Ricci curvature (ORC) in the canonical W1 setting, and (ii) SRC vanishes in the Dirac limit, matching the flat case of measure-theoretic Ricci curvature. We demonstrate SRC as a reusable curvature primitive in two representative pipelines. We define Sobolev-Ricci Flow by replacing ORC with SRC in a Ricci-flow-style reweighting rule, and we use SRC for curvature-guided edge pruning aimed at preserving manifold structure. Overall, SRC provides a transport-based foundation for scalable curvature-driven graph transformation and manifold-oriented pruning.

[LG-25] Adaptive Diffusion Posterior Sampling for Data and Model Fusion of Complex Nonlinear Dynamical Systems

链接: https://arxiv.org/abs/2603.12635
作者: Dibyajyoti Chakraborty,Hojin Kim,Romit Maulik
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:High-fidelity numerical simulations of chaotic, high dimensional nonlinear dynamical systems are computationally expensive, necessitating the development of efficient surrogate models. Most surrogate models for such systems are deterministic, for example when neural operators are involved. However, deterministic models often fail to capture the intrinsic distributional uncertainty of chaotic systems. This work presents a surrogate modeling formulation that leverages generative machine learning, where a deep learning diffusion model is used to probabilistically forecast turbulent flows over long horizons. We introduce a multi-step autoregressive diffusion objective that significantly enhances long-rollout stability compared to standard single-step training. To handle complex, unstructured geometries, we utilize a multi-scale graph transformer architecture incorporating diffusion preconditioning and voxel-grid pooling. More importantly, our modeling framework provides a unified platform that also predicts spatiotemporally important locations for sensor placement, either via uncertainty estimates or through an error-estimation module. Finally, the observations of the ground truth state at these dynamically varying sensor locations are assimilated using diffusion posterior sampling requiring no retraining of the surrogate model. We present our methodology on two-dimensional homogeneous and isotropic turbulence and for a flow over a backwards-facing step, demonstrating its utility in forecasting, adaptive sensor placement, and data assimilation for high dimensional chaotic systems.

[LG-26] Human-AI Collaborative Autonomous Experimentation With Proxy Modeling for Comparative Observation

链接: https://arxiv.org/abs/2603.12618
作者: Arpan Biswas,Hiroshi Funakubo,Yongtao Liu
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures

点击查看摘要

Abstract:Optimization of tasks such as material characterization, synthesis, and functional properties over multi-dimensional control parameters requires a rapid strategic search through active learning such as Bayesian optimization (BO). However, high-dimensional experimental physical descriptors are complex and noisy, so deriving low-dimensional scalar metrics or objective functions from them can be erroneous. Moreover, in traditional purely data-driven autonomous exploration, such objective functions often ignore the subtle variation and key features of the physical descriptors and can therefore fail to discover unknown phenomena of the material systems. To address this, here we present a proxy-modelled Bayesian optimization (px-BO) via on-the-fly teaming between human and AI agents. Within the BO loop, instead of defining a mathematical objective function directly from the experimental data, we introduce an on-the-fly voting system in which each new experimental outcome is compared with existing experiments and human agents choose the preferred samples. These human-guided comparisons are then transformed into a proxy-based objective function via a fitted Bradley-Terry (BT) model. To minimize human interaction, the iteratively trained proxy model then also acts as an AI agent that casts surrogate votes in place of the human. Finally, these surrogate votes are periodically validated by human agents, and the corrections are learned by the proxy model on-the-fly. We demonstrate the performance of the proposed px-BO framework on simulated data and on BEPS data generated from a PTO sample. We find that our approach gives domain experts better control and an improved search compared with traditional data-driven exploration, signifying the importance of human-AI teaming in accelerated and meaningful exploration of the material space.
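The Bradley-Terry fitting step described in the abstract can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (plain gradient ascent on the BT log-likelihood, a hand-made vote list), not the authors' implementation; `fit_bradley_terry` and all hyperparameters here are hypothetical:

```python
import numpy as np

def fit_bradley_terry(n_items, comparisons, lr=0.1, steps=500):
    """Fit latent scores s so that P(i beats j) = sigmoid(s_i - s_j)."""
    s = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + np.exp(-(s[winner] - s[loser])))  # P(winner beats loser)
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        s += lr * grad / len(comparisons)
    return s - s.mean()  # BT scores are identifiable only up to a constant

# Hypothetical human votes: sample 2 preferred over 0 (twice) and over 1; 1 over 0.
votes = [(2, 0), (2, 1), (1, 0), (2, 0)]
scores = fit_bradley_terry(3, votes)
```

The fitted scores then serve as a proxy objective that a BO surrogate can be trained on, sidestepping a hand-designed scalar metric.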

[LG-27] Maximizing Incremental Information Entropy for Contrastive Learning ICLR2026

链接: https://arxiv.org/abs/2603.12594
作者: Jiansong Zhang,Zhuoqin Yang,Xu Wu,Xiaoling Luo,Peizhong Liu,Linlin Shen
类目: Machine Learning (cs.LG)
*备注: ICLR 2026 (The Fourteenth International Conference on Learning Representations) this https URL

点击查看摘要

Abstract:Contrastive learning has achieved remarkable success in self-supervised representation learning, often guided by information-theoretic objectives such as mutual information maximization. Motivated by the limitations of static augmentations and rigid invariance constraints, we propose IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly optimizes the entropy gain between augmented views while preserving semantic consistency. Our theoretical framework reframes the challenge by identifying the encoder as an information bottleneck and proposes a joint optimization of two components: a learnable transformation for entropy generation and an encoder regularizer for its preservation. Experiments on CIFAR-10/100, STL-10, and ImageNet demonstrate that IE-CL consistently improves performance under small-batch settings. Moreover, our core modules can be seamlessly integrated into existing frameworks. This work bridges theoretical principles and practice, offering a new perspective in contrastive learning.

[LG-28] A Spectral Revisit of the Distributional Bellman Operator under the Cramér Metric

链接: https://arxiv.org/abs/2603.12576
作者: Keru Wang,Yixin Deng,Yao Lyu,Stephen Redmond,Shengbo Eben Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributional reinforcement learning (DRL) studies the evolution of full return distributions under Bellman updates rather than focusing on expected values. A classical result is that the distributional Bellman operator is contractive under the Cramér metric, which corresponds to an L^2 geometry on differences of cumulative distribution functions (CDFs). While this contraction ensures stability of policy evaluation, existing analyses remain largely metric, focusing on contraction properties without elucidating the structural action of the Bellman update on distributions. In this work, we analyse distributional Bellman dynamics directly at the level of CDFs, treating the Cramér geometry as the intrinsic analytical setting. At this level, the Bellman update acts affinely on CDFs and linearly on differences between CDFs, and its contraction property yields a uniform bound on this linear action. Building on this intrinsic formulation, we construct a family of regularised spectral Hilbert representations that realise the CDF-level geometry by exact conjugation, without modifying the underlying Bellman dynamics. The regularisation affects only the geometry and vanishes in the zero-regularisation limit, recovering the native Cramér metric. This framework clarifies the operator structure underlying distributional Bellman updates and provides a foundation for further functional and operator-theoretic analyses in DRL.

[LG-29] Scaling Laws and Pathologies of Single-Layer PINNs: Network Width and PDE Nonlinearity NEURIPS2025

链接: https://arxiv.org/abs/2603.12556
作者: Faris Chaudhry
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: Accepted at the Machine Learning and Physical Sciences Workshop (NeurIPS 2025)

点击查看摘要

Abstract:We establish empirical scaling laws for Single-Layer Physics-Informed Neural Networks on canonical nonlinear PDEs. We identify a dual optimization failure: (i) a baseline pathology, where the solution error fails to decrease with network width, even at fixed nonlinearity, falling short of theoretical approximation bounds, and (ii) a compounding pathology, where this failure is exacerbated by nonlinearity. We provide quantitative evidence that a simple separable power law is insufficient, and that the scaling behavior is governed by a more complex, non-separable relationship. This failure is consistent with the concept of spectral bias, where networks struggle to learn the high-frequency solution components that intensify with nonlinearity. We show that optimization, not approximation capacity, is the primary bottleneck, and propose a methodology to empirically measure these complex scaling effects.

[LG-30] Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE NEURIPS2025

链接: https://arxiv.org/abs/2603.12552
作者: Faris Chaudhry
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted at the Optimization for Machine Learning Workshop (NeurIPS 2025)

点击查看摘要

Abstract:The InfoNCE loss in contrastive learning depends critically on a temperature parameter, yet its dynamics under fixed versus annealed schedules remain poorly understood. We provide a theoretical analysis by modeling embedding evolution under Langevin dynamics on a compact Riemannian manifold. Under mild smoothness and energy-barrier assumptions, we show that classical simulated annealing guarantees extend to this setting: slow logarithmic inverse-temperature schedules ensure convergence in probability to a set of globally optimal representations, while faster schedules risk becoming trapped in suboptimal minima. Our results establish a link between contrastive learning and simulated annealing, providing a principled basis for understanding and tuning temperature schedules.
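The slow logarithmic inverse-temperature schedule that the guarantee refers to can be illustrated on a toy discrete landscape. The paper's setting is Langevin dynamics on a compact Riemannian manifold; this discrete walk, the cooling constant, and the landscape below are only illustrative:

```python
import math
import random

def log_schedule(t, c=1.0):
    """Logarithmic cooling T_t = c / log(t + 2); classical guarantees need
    c at least as large as the largest energy barrier."""
    return c / math.log(t + 2)

def anneal(energy, neighbors, x0, steps=20000, c=2.0, seed=0):
    rng = random.Random(seed)
    x = x0
    for t in range(steps):
        temp = log_schedule(t, c)
        y = rng.choice(neighbors(x))
        d_e = energy(y) - energy(x)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if d_e <= 0 or rng.random() < math.exp(-d_e / temp):
            x = y
    return x

# Toy 1-D landscape: local minimum at state 2 (E=1), global minimum at state 8 (E=0).
E = {i: v for i, v in enumerate([5, 3, 1, 3, 4, 3, 2, 1, 0, 4])}
neighbors = lambda i: [max(i - 1, 0), min(i + 1, 9)]
x_final = anneal(lambda i: E[i], neighbors, x0=2)
```

Faster-than-logarithmic schedules cool the chain before it can cross the barrier out of the basin around state 2, which is the trapping behavior the paper's negative result describes.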

[LG-31] Deep Distance Measurement Method for Unsupervised Multivariate Time Series Similarity Retrieval ICDM

链接: https://arxiv.org/abs/2603.12544
作者: Susumu Naito,Kouta Nakata,Yasunori Taguchi
类目: Machine Learning (cs.LG)
*备注: Workshop of Artificial Intelligence for Time Series Analysis (AI4TS): Theory, Algorithms, and Applications at 2025 IEEE International Conference on Data Mining (ICDM), 2025

点击查看摘要

Abstract:We propose the Deep Distance Measurement Method (DDMM) to improve retrieval accuracy in unsupervised multivariate time series similarity retrieval. DDMM enables learning of minute differences within states in the entire time series and thereby recognition of minute differences between states, which are of interest to users in industrial plants. To achieve this, DDMM uses a learning algorithm that assigns a weight to each pair of an anchor and a positive sample, arbitrarily sampled from the entire time series, based on the Euclidean distance within the pair and learns the differences within the pairs weighted by the weights. This algorithm allows both learning minute differences within states and sampling pairs from the entire time series. Our empirical studies showed that DDMM significantly outperformed state-of-the-art time series representation learning methods on the Pulp-and-paper mill dataset and demonstrated the effectiveness of DDMM in industrial plants. Furthermore, we showed that accuracy can be further improved by linking DDMM with existing feature extraction methods through experiments with the combined model.
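The distance-based pair weighting described above might look roughly like this. This is a hedged sketch: the exponential weighting form, the bandwidth `tau`, and the squared-error pair loss are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def pair_weights(anchors, positives, tau=1.0):
    """Weight each (anchor, positive) pair by Euclidean proximity, so that
    near-identical windows (minute within-state differences) dominate."""
    d = np.linalg.norm(anchors - positives, axis=1)
    w = np.exp(-d / tau)
    return w / w.sum()

def weighted_pair_loss(za, zp, w):
    """Pull the two members of each pair together, weighted per pair."""
    per_pair = np.sum((za - zp) ** 2, axis=1)
    return float(np.sum(w * per_pair))

rng = np.random.default_rng(0)
series = rng.normal(size=(100, 4))     # toy multivariate time series (T=100, 4 vars)
idx_a = rng.integers(0, 100, size=16)  # anchors sampled from the entire series
idx_p = rng.integers(0, 100, size=16)  # positives sampled from the entire series
w = pair_weights(series[idx_a], series[idx_p])
loss = weighted_pair_loss(series[idx_a], series[idx_p], w)
```

Because the weights down-scale dissimilar pairs rather than excluding them, pairs can still be sampled from the entire time series while the loss focuses on fine within-state differences.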

[LG-32] As Language Models Scale Low-order Linear Depth Dynamics Emerge

链接: https://arxiv.org/abs/2603.12541
作者: Buddhika Nettasinghe,Geethu Joseph
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Large language models are often viewed as high-dimensional nonlinear systems and treated as black boxes. Here, we show that transformer depth dynamics admit accurate low-order linear surrogates within context. Across tasks including toxicity, irony, hate speech and sentiment, a 32-dimensional linear surrogate reproduces the layerwise sensitivity profile of GPT-2-large with near-perfect agreement, capturing how the final output shifts under additive injections at each layer. We then uncover a surprising scaling principle: for a fixed-order linear surrogate, agreement with the full model improves monotonically with model size across the GPT-2 family. This linear surrogate also enables principled multi-layer interventions that require less energy than standard heuristic schedules when applied to the full model. Together, our results reveal that as language models scale, low-order linear depth dynamics emerge within contexts, offering a systems-theoretic foundation for analyzing and controlling them.

[LG-33] A Reduction Algorithm for Markovian Contextual Linear Bandits

链接: https://arxiv.org/abs/2603.12530
作者: Kaan Buyukkalayci,Osama Hanna,Christina Fragouli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This ``contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. Motivated by applications with temporally correlated availability, we extend this perspective to Markovian contextual linear bandits, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown transition distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle, with only lower-order dependence on the mixing time.

[LG-34] Learning Pore-scale Multiphase Flow from 4D Velocimetry

链接: https://arxiv.org/abs/2603.12516
作者: Chunyang Wang,Linqi Zhu,Yuxuan Gu,Robert van der Merwe,Xin Ju,Catherine Spurin,Samuel Krevor,Rex Ying,Tobias Pfaff,Martin J. Blunt,Tom Bultreys,Gege Wen
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Multiphase flow in porous media underpins subsurface energy and environmental technologies, including geological CO _2 storage and underground hydrogen storage, yet pore-scale dynamics in realistic three-dimensional materials remain difficult to characterize and predict. Here we introduce a multimodal learning framework that infers multiphase pore-scale flow directly from time-resolved four-dimensional (4D) micro-velocimetry measurements. The model couples a graph network simulator for Lagrangian tracer-particle motion with a 3D U-Net for voxelized interface evolution. The imaged pore geometry serves as a boundary constraint to the flow velocity and the multiphase interface predictions, which are coupled and updated iteratively at each time step. Trained autoregressively on experimental sequences in capillary-dominated conditions ( Ca\approx10^-6 ), the learned surrogate captures transient, nonlocal flow perturbations and abrupt interface rearrangements (Haines jumps) over rollouts spanning seconds of physical time, while reducing hour-to-day–scale direct numerical simulations to seconds of inference. By providing rapid, experimentally informed predictions, the framework opens a route to ‘‘digital experiments’’ to replicate pore-scale physics observed in multiphase flow experiments, offering an efficient tool for exploring injection conditions and pore-geometry effects relevant to subsurface carbon and hydrogen storage.

[LG-35] Byzantine-Robust Optimization under (L_0 L_1)-Smoothness

链接: https://arxiv.org/abs/2603.12512
作者: Arman Bolatov,Samuel Horváth,Martin Takáč,Eduard Gorbunov
类目: Machine Learning (cs.LG)
*备注: 10 pages, 1 table, 4 figures, accepted to CPAL 2026

点击查看摘要

Abstract:We consider distributed optimization under Byzantine attacks in the presence of (L_0,L_1) -smoothness, a generalization of standard L -smoothness that captures functions with state-dependent gradient Lipschitz constants. We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum that achieves robustness against Byzantine workers while maintaining convergence guarantees. Our algorithm combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle both the challenges posed by (L_0,L_1) -smoothness and Byzantine adversaries. We prove that Byz-NSGDM achieves a convergence rate of O(K^-1/4) up to a Byzantine bias floor proportional to the robustness coefficient and gradient heterogeneity. Experimental validation on heterogeneous MNIST classification, synthetic (L_0,L_1) -smooth optimization, and character-level language modeling with a small GPT model demonstrates the effectiveness of our approach against various Byzantine attack strategies. An ablation study further shows that Byz-NSGDM is robust across a wide range of momentum and learning rate choices.
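A minimal sketch of one normalized-momentum server step with robust aggregation. Note the coordinate-wise median here is a simple stand-in for the paper's NNM-enhanced aggregator, and the quadratic objective, worker counts, and constants are illustrative:

```python
import numpy as np

def byz_nsgdm_step(x, momenta, grads, beta=0.9, lr=0.01):
    """Momentum update per worker, robust aggregation (coordinate-wise median
    as a stand-in for an NNM-enhanced aggregator), then a normalized step."""
    for i, g in enumerate(grads):
        momenta[i] = beta * momenta[i] + (1 - beta) * g
    agg = np.median(np.stack(momenta), axis=0)  # robust to a minority of outliers
    return x - lr * agg / (np.linalg.norm(agg) + 1e-12), momenta

# Quadratic objective f(x) = 0.5 * ||x||^2, whose gradient is x itself;
# four honest workers plus one Byzantine worker sending huge garbage.
x = np.array([2.0, -1.0])
momenta = [np.zeros(2) for _ in range(5)]
for _ in range(800):
    grads = [x.copy() for _ in range(4)] + [np.array([1e6, -1e6])]
    x, momenta = byz_nsgdm_step(x, momenta, grads)
```

Normalization keeps the step size bounded regardless of how large the (L_0,L_1)-smooth gradients grow, while the robust aggregator prevents the Byzantine momentum from steering the update.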

[LG-36] Adaptive Conditional Forest Sampling for Spectral Risk Optimisation under Decision-Dependent Uncertainty

链接: https://arxiv.org/abs/2603.12507
作者: Marcell T. Kurbucz
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 15 pages, 3 figures, 8 tables

点击查看摘要

Abstract:Minimising a spectral risk objective, defined as a convex combination of expected cost and Conditional Value-at-Risk (CVaR), is challenging when the uncertainty distribution is decision-dependent, making both surrogate modelling and simulation-based ranking sensitive to tail estimation error. We propose Adaptive Conditional Forest Sampling (ACFS), a four-phase simulation-optimisation framework that integrates Generalised Random Forests for decision-conditional distribution approximation, CEM-guided global exploration, rank-weighted focused augmentation, and surrogate-to-oracle two-stage reranking before multi-start gradient-based refinement. We evaluate ACFS on two structurally distinct data-generating processes: a decision-dependent Student-t copula and a Gaussian copula with log-normal marginals, across three penalty-weight configurations and 100 replications per setting. ACFS achieves the lowest median oracle spectral risk on the second benchmark in every configuration, with median gaps over GP-BO ranging from 6.0% to 20.0%. On the first benchmark, ACFS and GP-BO are statistically indistinguishable in median objective, but ACFS reduces cross-replication dispersion by approximately 1.8 to 1.9 times on the first benchmark and 1.7 to 2.0 times on the second, indicating materially improved run-to-run reliability. ACFS also outperforms CEM-SO, SGD-CVaR, and KDE-SO in nearly all settings, while ablation and sensitivity analyses support the contribution and robustness of the proposed design.

[LG-37] Probing Length Generalization in Mamba via Image Reconstruction

链接: https://arxiv.org/abs/2603.12499
作者: Jan Rathjens,Robin Schiewer,Laurenz Wiskott,Anand Subramoney
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mamba has attracted widespread interest as a general-purpose sequence model due to its low computational complexity and competitive performance relative to transformers. However, its performance can degrade when inference sequence lengths exceed those seen during training. We study this phenomenon using a controlled vision task in which Mamba reconstructs images from sequences of image patches. By analyzing reconstructions at different stages of sequence processing, we reveal that Mamba qualitatively adapts its behavior to the distribution of sequence lengths encountered during training, resulting in strategies that fail to generalize beyond this range. To support our analysis, we introduce a length-adaptive variant of Mamba that improves performance across training sequence lengths. Our results provide an intuitive perspective on length generalization in Mamba and suggest directions for improving the architecture.

[LG-38] Modal Logical Neural Networks for Financial AI ICLR2026

链接: https://arxiv.org/abs/2603.12487
作者: Antonin Sulc
类目: Machine Learning (cs.LG)
*备注: 4 pages, 1 figure, Accepted at ICLR 2026 FinAI

点击查看摘要

Abstract:The financial industry faces a critical dichotomy in AI adoption: deep learning often delivers strong empirical performance, while symbolic logic offers interpretability and rule adherence expected in regulated settings. We use Modal Logical Neural Networks (MLNNs) as a bridge between these worlds, integrating Kripke semantics into neural architectures to enable differentiable reasoning about necessity, possibility, time, and knowledge. We illustrate MLNNs as a differentiable ``Logic Layer’’ for finance by mapping core components, Necessity Neurons ( \Box ) and Learnable Accessibility ( A_\theta ), to regulatory guardrails, market stress testing, and collusion detection. Four case studies show how MLNN-style constraints can promote compliance in trading agents, help recover latent trust networks for market surveillance, encourage robustness under stress scenarios, and distinguish statistical belief from verified knowledge to help mitigate robo-advisory hallucinations.

[LG-39] TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

链接: https://arxiv.org/abs/2603.12465
作者: Prabhu Vellaisamy,Shreesh Tripathi,Vignesh Natarajan,Surya Santhan Thenarasu,Shawn Blanton,John P. Shen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: Accepted at IEEE ISPASS 2026. Copyright assigned to IEEE

点击查看摘要

Abstract:Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.

[LG-40] Overcoming the Modality Gap in Context-Aided Forecasting

链接: https://arxiv.org/abs/2603.12451
作者: Vincent Zhihao Zheng,Étienne Marcotte,Arjun Ashok,Andrew Robert Williams,Lijun Sun,Alexandre Drouin,Valentina Zantedeschi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Context-aided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, as verification is challenging. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world evaluation, and show clear evidence of context utilization. Our results suggest that dataset quality, rather than architectural limitations, has been the primary bottleneck in context-aided forecasting.

[LG-41] Bridging the Gap Between Security Metrics and Key Risk Indicators: An Empirical Framework for Vulnerability Prioritization

链接: https://arxiv.org/abs/2603.12450
作者: Emad Sherif,Iryna Yevseyeva,Vitor Basto-Fernandes,Allan Cook
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Organisations overwhelmingly prioritize vulnerability remediation using Common Vulnerability Scoring System (CVSS) severity scores, yet CVSS classifiers achieve an Area Under the Precision-Recall Curve (AUPRC) of 0.011 on real-world exploitation data, near random chance. We propose a composite Key Risk Indicator grounded in expected-loss decomposition, integrating dimensions of threat, impact, and exposure. We evaluated the KRI framework against the Known Exploited Vulnerabilities (KEV) catalog using a comprehensive dataset of 280,694 Common Vulnerabilities and Exposures (CVEs). KRI achieves a Receiver Operating Characteristic Area Under the Curve (ROC-AUC) of 0.927 and AUPRC of 0.223 versus 0.747 and 0.011 for CVSS (improvements of 24% and 20x, respectively). Ablation analysis shows the Exploit Prediction Scoring System (EPSS) alone achieves AUPRC 0.365, higher than full KRI (0.223), confirming that EPSS and KRI serve distinct objectives: EPSS maximizes raw exploit detection, while KRI re-orders by impact and exposure, capturing 92.3% of impact-weighted remediation value at k=500 versus 82.6% for EPSS, and surfacing 1.75x more Critical-severity exploited CVEs. KRI's net benefit exceeds EPSS whenever the severity premium exceeds 2. While EPSS serves as a robust baseline for exploit detection, the KRI framework is the superior choice for organizations seeking to align remediation efforts with tangible risk reduction.
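A toy version of scoring and ranking with a composite indicator might look as follows. The blend weights, the synthetic data, and the precision-at-k metric are illustrative assumptions, not the paper's fitted KRI:

```python
import numpy as np

def kri_score(epss, cvss, exposure, w=(0.5, 0.3, 0.2)):
    """Composite Key Risk Indicator: weighted blend of threat (EPSS),
    impact (CVSS rescaled to [0, 1]) and exposure; weights are illustrative."""
    return w[0] * epss + w[1] * (cvss / 10.0) + w[2] * exposure

def precision_at_k(scores, labels, k):
    """Fraction of exploited CVEs among the k highest-scored ones."""
    top = np.argsort(scores)[::-1][:k]
    return float(labels[top].mean())

rng = np.random.default_rng(1)
n = 1000
epss = rng.beta(0.5, 10, n)                # threat: most CVEs are rarely exploited
cvss = rng.uniform(0, 10, n)               # impact, independent of exploitation here
exposure = rng.uniform(0, 1, n)            # e.g. fraction of internet-facing assets
exploited = (rng.random(n) < epss).astype(float)  # exploitation tracks threat
kri = kri_score(epss, cvss, exposure)
p_kri = precision_at_k(kri, exploited, 50)
p_cvss = precision_at_k(cvss, exploited, 50)
```

Ranking by a severity-only score ignores the threat dimension entirely, which is the core failure mode the abstract quantifies for CVSS.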

[LG-42] KernelFoundry: Hardware-aware evolutionary GPU kernel optimization

链接: https://arxiv.org/abs/2603.12440
作者: Nina Wiedemann,Quentin Leboutet,Michael Paulitsch,Diana Wofk,Benjamin Ummenhofer
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance profiling outputs. Most existing LLM-based approaches to kernel generation rely on simple prompting and feedback loops, incorporating hardware awareness only indirectly through profiling feedback. We introduce KernelFoundry, an evolutionary framework that efficiently explores the GPU kernel design space through three key mechanisms: (1) MAP-Elites quality-diversity search with kernel-specific behavioral dimensions to sustain exploration across diverse optimization strategies; (2) meta-prompt evolution, which co-evolves prompts with kernels to uncover task-specific optimization strategies, and (3) template-based parameter optimization to tune kernels to inputs and hardware. We evaluate this framework on KernelBench, robust-kbench, and custom tasks, generating SYCL kernels as a cross-platform GPU programming model and CUDA kernels for comparison to prior work. Our approach consistently outperforms the baseline methods, achieving an average speedup of 2.3x on KernelBench for SYCL. Moreover, KernelFoundry is implemented as a distributed framework with remote access to diverse hardware, enabling rapid benchmarking and featuring a flexible user input layer that supports kernel generation for a wide range of real-world use cases beyond benchmarking.
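The MAP-Elites quality-diversity loop at the heart of the search can be sketched generically. The toy "kernel config", fitness function, and behavior descriptors below are invented for illustration; the real system evaluates LLM-generated GPU kernels with profiling feedback:

```python
import random

def map_elites(evaluate, mutate, init, bins, iters=2000, seed=0):
    """Minimal MAP-Elites loop: keep the best solution per behavior-descriptor
    cell, mutating stored elites to sustain exploration of diverse strategies."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, solution)
    for _ in range(iters):
        elites = [sol for _, sol in archive.values()]
        parent = rng.choice(elites) if elites else init()
        child = mutate(parent, rng)
        fitness, behavior = evaluate(child)
        cell = tuple(min(int(b * n), n - 1) for b, n in zip(behavior, bins))
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive

# Toy "kernel config" (tile, unroll) in [0,1]^2; best fitness near (0.7, 0.3).
def evaluate(x):
    tile, unroll = x
    return -(tile - 0.7) ** 2 - 0.1 * (unroll - 0.3) ** 2, (tile, unroll)

def mutate(x, rng):
    return tuple(min(1.0, max(0.0, v + rng.gauss(0, 0.1))) for v in x)

archive = map_elites(evaluate, mutate, lambda: (0.5, 0.5), bins=(5, 5))
```

Because the archive keeps one elite per behavior cell rather than a single global best, the search retains structurally different optimization strategies instead of collapsing onto one.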

[LG-43] SpectralGuard: Detecting Memory Collapse Attacks in State Space Models

链接: https://arxiv.org/abs/2603.12414
作者: Davi Bonetto
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 24 pages, 10 figures. Code, dataset, and demo: this https URL

点击查看摘要

Abstract:State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius rho(A-bar) of the discretized transition operator governs effective memory horizon: when an adversary drives rho toward zero through gradient-based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output-level alarms. We prove an Evasion Existence Theorem showing that for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection, then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1=0.961 against non-adaptive attackers and retains F1=0.842 under the strongest adaptive setting, with sub-15ms per-token latency. Causal interventions and cross-architecture transfer to hybrid SSM-Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.
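Monitoring the spectral radius per layer can be sketched as follows. The diagonal toy matrices and the `rho_min` threshold are illustrative; the paper's monitor operates on the input-dependent discretized transition operators of an actual SSM:

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of a transition matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def memory_alarm(layer_mats, rho_min=0.5):
    """Flag layers whose transition operator has collapsed: rho(A_bar) near
    zero means the recurrence forgets its state within a few tokens."""
    rhos = [spectral_radius(A) for A in layer_mats]
    return rhos, [i for i, r in enumerate(rhos) if r < rho_min]

healthy = np.diag([0.99, 0.95, 0.90])    # long effective memory horizon
collapsed = np.diag([0.05, 0.02, 0.01])  # memory collapses within a few tokens
rhos, flagged = memory_alarm([healthy, collapsed])
```

Because the alarm reads internal operator spectra rather than model outputs, it sidesteps the Evasion Existence Theorem's limitation on output-only defenses.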

[LG-44] Beyond Motion Imitation: Is Human Motion Data Alone Sufficient to Explain Gait Control and Biomechanics?

链接: https://arxiv.org/abs/2603.12408
作者: Xinyi Liu,Jangwhan Ahn,Edgar Lobaton,Jennie Si,He Huang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:With the growing interest in motion imitation learning (IL) for human biomechanics and wearable robotics, this study investigates how additional foot-ground interaction measures, used as reward terms, affect human gait kinematics and kinetics estimation within a reinforcement learning-based IL framework. Results indicate that accurate reproduction of forward kinematics alone does not ensure biomechanically plausible joint kinetics. Adding foot-ground contacts and contact forces to the IL reward terms enables the prediction of joint moments in forward walking simulation, which are significantly closer to those computed by inverse dynamics. This finding highlights a fundamental limitation of motion-only IL approaches, which may prioritize kinematics matching over physical consistency. Incorporating kinetic constraints, particularly ground reaction force and center of pressure information, significantly enhances the realism of internal and external kinetics. These findings suggest that, when imitation learning is applied to human-related research domains such as biomechanics and wearable robot co-design, kinetics-based reward shaping is necessary to achieve physically consistent gait representations.

[LG-45] Sinkhorn-Drifting Generative Models

链接: https://arxiv.org/abs/2603.12366
作者: Ping He,Om Khangaonkar,Hamed Pirsiavash,Yikun Bai,Soheil Kolouri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We establish a theoretical link between the recently proposed “drifting” generative dynamics and gradient flows induced by the Sinkhorn divergence. In a particle discretization, the drift field admits a cross-minus-self decomposition: an attractive term toward the target distribution and a repulsive/self-correction term toward the current model, both expressed via one-sided normalized Gibbs kernels. We show that Sinkhorn divergence yields an analogous cross-minus-self structure, but with each term defined by entropic optimal-transport couplings obtained through two-sided Sinkhorn scaling (i.e., enforcing both marginals). This provides a precise sense in which drifting acts as a surrogate for a Sinkhorn-divergence gradient flow, interpolating between one-sided normalization and full two-sided Sinkhorn scaling. Crucially, this connection resolves an identifiability gap in prior drifting formulations: leveraging the definiteness of the Sinkhorn divergence, we show that zero drift (equilibrium of the dynamics) implies that the model and target measures match. Experiments show that Sinkhorn drifting reduces sensitivity to kernel temperature and improves one-step generative quality, trading off additional training time for a more stable optimization, without altering the inference procedure used by drift methods. These theoretical gains translate to strong low-temperature improvements in practice: on FFHQ-ALAE at the lowest temperature setting we evaluate, Sinkhorn drifting reduces mean FID from 187.7 to 37.1 and mean latent EMD from 453.3 to 144.4, while on MNIST it preserves full class coverage across the temperature sweep. Project page: this https URL
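
The cross-minus-self structure with two-sided Sinkhorn scaling can be sketched on particle clouds (a minimal sketch, not the paper's training code; the uniform marginals, squared-Euclidean cost, and entropic regularization value are illustrative assumptions):

```python
import numpy as np

def sinkhorn_plan(X, Y, eps=5.0, iters=200):
    """Two-sided Sinkhorn scaling: entropic OT coupling with uniform marginals."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared-Euclidean cost
    K = np.exp(-C / eps)
    n, m = K.shape
    v = np.ones(m)
    for _ in range(iters):                              # alternately fit both marginals
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    return u[:, None] * K * v[None, :]                  # coupling matrix P

def sinkhorn_drift(X, Y, eps=5.0):
    """Cross-minus-self drift: attraction toward the target coupling minus a
    self-correction term from coupling the current model with itself."""
    n = X.shape[0]
    cross = n * (sinkhorn_plan(X, Y, eps) @ Y)          # barycentric map to target
    self_ = n * (sinkhorn_plan(X, X, eps) @ X)          # barycentric map to self
    return cross - self_

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(64, 2))                  # current model particles
Y = rng.normal(3.0, 1.0, size=(64, 2))                  # target, shifted by +3
drift = sinkhorn_drift(X, Y)
```

Because the final scaling step enforces the column marginals exactly, the mean drift equals the mean displacement Y.mean(0) - X.mean(0), so the particle cloud is pulled toward the target on average.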

[LG-46] Spatial PDE-aware Selective State-space with Nested Memory for Mobile Traffic Grid Forecasting

链接: https://arxiv.org/abs/2603.12353
作者: Zineddine Bettouche,Khalid Ali,Andreas Fischer,Andreas Kassler
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic forecasting in cellular networks is a challenging spatiotemporal prediction problem due to strong temporal dependencies, spatial heterogeneity across cells, and the need for scalability to large network deployments. Traditional cell-specific models incur prohibitive training and maintenance costs, while global models often fail to capture heterogeneous spatial dynamics. Recent spatiotemporal architectures based on attention or graph neural networks improve accuracy but introduce high computational overhead, limiting their applicability in large-scale or real-time settings. We study spatiotemporal grid forecasting, where each time step is a 2D lattice of traffic values, and predict the next grid patch using previous patches. We propose NeST-S6, a convolutional selective state-space model (SSM) with a spatial PDE-aware core, implemented in a nested learning paradigm: convolutional local spatial mixing feeds a spatial PDE-aware SSM core, while a nested-learning long-term memory is updated by a learned optimizer when one-step prediction errors indicate unmodeled dynamics. On the mobile-traffic grid (Milan dataset) at three resolutions (20^2, 50^2, 100^2), NeST-S6 attains lower errors than a strong Mamba-family baseline in both single-step and 6-step autoregressive rollouts. Under drift stress tests, our model's nested memory lowers MAE by 48-65% over a no-memory ablation. NeST-S6 also speeds full-grid reconstruction by 32 times and reduces MACs by 4.3 times compared to competitive per-pixel scanning models, while achieving 61% lower per-pixel RMSE.

[LG-47] Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models

链接: https://arxiv.org/abs/2603.12344
作者: Khiem Le,Sreejata Dey,Marcos Martínez Galindo,Vanessa Lopez,Ting Hua,Nitesh V. Chawla,Hoang Thanh Lam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular Property Prediction (MPP) is a central task in drug discovery. While Large Language Models (LLMs) show promise as generalist models for MPP, their current performance remains below the threshold for practical adoption. We propose TreeKD, a novel knowledge distillation method that transfers complementary knowledge from tree-based specialist models into LLMs. Our approach trains specialist decision trees on functional group features, then verbalizes their learned predictive rules as natural language to enable rule-augmented context learning. This enables LLMs to leverage structural insights that are difficult to extract from SMILES strings alone. We further introduce rule-consistency, a test-time scaling technique inspired by bagging that ensembles predictions across diverse rules from a Random Forest. Experiments on 22 ADMET properties from the TDC benchmark demonstrate that TreeKD substantially improves LLM performance, narrowing the gap with SOTA specialist models and advancing toward practical generalist models for molecular property prediction.
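
The "verbalize tree rules as prompt context" step can be sketched with scikit-learn's export_text (a minimal stand-in for TreeKD: the functional-group features, the toy label rule, and the prompt wrapper are all illustrative assumptions, not the paper's ADMET pipeline):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Toy functional-group count features for 200 "molecules".
X = rng.integers(0, 5, size=(200, 3))
feature_names = ["n_hydroxyl", "n_aromatic_rings", "n_halogens"]
# Hypothetical property: positive if hydroxyl-rich and not too aromatic.
y = ((X[:, 0] >= 2) & (X[:, 1] <= 2)).astype(int)

# Specialist tree trained on interpretable features ...
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# ... whose learned splits are verbalized as text for an LLM prompt.
rules = export_text(tree, feature_names=feature_names)
prompt_context = "Rules learned by a specialist model:\n" + rules
```

The resulting rule text ("if n_hydroxyl <= 1.5 then ...") carries structural signal that is hard to read off a raw SMILES string, which is the gap the distillation targets.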

[LG-48] Multi-objective Genetic Programming with Multi-view Multi-level Feature for Enhanced Protein Secondary Structure Prediction

链接: https://arxiv.org/abs/2603.12293
作者: Yining Qian,Lijie Su,Meiling Xu,Xianpeng Wang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Predicting protein secondary structure is essential for understanding protein function and advancing drug discovery. However, the intricate sequence-structure relationship poses significant challenges for accurate modeling. To address these, we propose MOGP-MMF, a multi-objective genetic programming framework that reformulates PSSP as an automated optimization task focused on feature selection and fusion. Specifically, MOGP-MMF introduces a multi-view multi-level representation strategy that integrates evolutionary, semantic, and newly introduced structural views to capture the comprehensive protein folding logic. Leveraging an enriched operator set, the framework evolves both linear and nonlinear fusion functions, effectively capturing high-order feature interactions while reducing fusion complexity. To resolve the accuracy-complexity trade-off, an improved multi-objective GP algorithm is developed, incorporating a knowledge transfer mechanism that utilizes prior evolutionary experience to guide the population toward global optima. Extensive experiments across seven benchmark datasets demonstrate that MOGP-MMF surpasses state-of-the-art methods, particularly in Q8 accuracy and structural integrity. Furthermore, MOGP-MMF generates a diverse set of non-dominated solutions, offering flexible model selection schemes for various practical application scenarios. The source code is available on GitHub: this https URL.

[LG-49] No More DeLuLu: Physics-Inspired Kernel Networks for Geometrically-Grounded Neural Computation WWW

链接: https://arxiv.org/abs/2603.12276
作者: Taha Bouhsine
类目: Machine Learning (cs.LG)
*备注: for more info check this http URL

点击查看摘要

Abstract:We introduce the yat-product, a kernel operator combining quadratic alignment with inverse-square proximity. We prove it is a Mercer kernel, analytic, Lipschitz on bounded domains, and self-regularizing, admitting a unique RKHS embedding. Neural Matter Networks (NMNs) use yat-product as the sole non-linearity, replacing conventional linear-activation-normalization blocks with a single geometrically-grounded operation. This architectural simplification preserves universal approximation while shifting normalization into the kernel itself via the denominator, rather than relying on separate normalization layers. Empirically, NMN-based classifiers match linear baselines on MNIST while exhibiting bounded prototype evolution and superposition robustness. In language modeling, Aether-GPT2 achieves lower validation loss than GPT-2 with a comparable parameter budget while using yat-based attention and MLP blocks. Our framework unifies kernel learning, gradient stability, and information geometry, establishing NMNs as a principled alternative to conventional neural architectures.

[LG-50] A Holistic Framework for Automated Configuration Recommendation for Cloud Service Monitoring

链接: https://arxiv.org/abs/2603.12268
作者: Anson Bastos,Shreeya Venneti,Anjaly Parayil,Ayush Choure,Chetan Bansal,Rujia Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliability of large-scale cloud services is critical for user satisfaction and business continuity. Despite significant investments in reliability engineering, production incidents remain inevitable, often leading to customer impact and operational overhead. In large cloud companies, multiple services are deployed across regions necessitating robust health monitoring systems. However, the current monitor configuration process is manual, largely reactive and ad hoc, resulting in gaps in coverage and redundant alerts. In this paper, we present a comprehensive study of monitor creation in Microsoft, identifying key components in the existing process. We further design a modular recommendation framework that processes the graph structured service entities to suggest optimal monitor configurations. Through extensive experimentation on historical data and user study of recommendations for production services at Microsoft, we demonstrate the efficacy of our approach in providing relevant recommendations for monitor configurations.

[LG-51] Convergence Rate of a Functional Learning Method for Contextual Stochastic Optimization

链接: https://arxiv.org/abs/2603.13048
作者: Noel Smith,Andrzej Ruszczynski
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a stochastic optimization problem involving two random variables: a context variable X and a dependent variable Y . The objective is to minimize the expected value of a nonlinear loss functional applied to the conditional expectation \mathbb{E}[f(X, Y, \beta) \mid X] , where f is a nonlinear function and \beta represents the decision variables. We focus on the practically important setting in which direct sampling from the conditional distribution of Y \mid X is infeasible, and only a stream of i.i.d. observation pairs (X^k, Y^k)_{k=0,1,2,\ldots} is available. In our approach, the conditional expectation is approximated within a prespecified parametric function class. We analyze a simultaneous learning-and-optimization algorithm that jointly estimates the conditional expectation and optimizes the outer objective, and establish that the method achieves a convergence rate of order \mathcal{O}(1/\sqrt{N}) , where N denotes the number of observed pairs.

[LG-52] Association-Aware GNN for Precoder Learning in Cell-Free Systems

链接: https://arxiv.org/abs/2603.13035
作者: Mingyu Deng,Shengqian Han
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has been widely recognized as a promising approach for optimizing multi-user multi-antenna precoders in traditional cellular systems. However, a critical distinction between cell-free and cellular systems lies in the flexibility of user equipment (UE)-access point (AP) associations. Consequently, the optimal precoder depends not only on channel state information but also on the dynamic UE-AP association status. In this paper, we propose an association-aware graph neural network (AAGNN) that explicitly incorporates association status into the precoding design. We leverage the permutation equivariance properties of the cell-free precoding policy to reduce the training complexity of AAGNN and employ an attention mechanism to enhance its generalization performance. Simulation results demonstrate that the proposed AAGNN outperforms baseline learning methods in both learning performance and generalization capabilities while maintaining low training and inference complexity.

[LG-53] A theory of learning data statistics in diffusion models from easy to hard

链接: https://arxiv.org/abs/2603.12901
作者: Lorenzo Bardone,Claudia Merger,Sebastian Goldt
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.

[LG-54] Explainable AI Using Inherently Interpretable Components for Wearable-based Health Monitoring ALT

链接: https://arxiv.org/abs/2603.12880
作者: Maurice Kuschel,Solveig Vieluf,Claus Reinsberger,Tobias Loddenkemper,Tanuj Hasija
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted to the IEEE Journal of Biomedical and Health Informatics

点击查看摘要

Abstract:The use of wearables in medicine and wellness, enabled by AI-based models, offers tremendous potential for real-time monitoring and interpretable event detection. Explainable AI (XAI) is required to assess what models have learned and build trust in model outputs, for patients, healthcare professionals, model developers, and domain experts alike. Explaining AI decisions made on time-series data recorded by wearables is especially challenging due to the data’s complex nature and temporal dependencies. Too often, explainability using interpretable features leads to performance loss. We propose a novel XAI method that combines explanation spaces and concept-based explanations to explain AI predictions on time-series data. By using Inherently Interpretable Components (IICs), which encapsulate domain-specific, interpretable concepts within a custom explanation space, we preserve the performance of models trained on time series while achieving the interpretability of concept-based explanations based on extracted features. Furthermore, we define a domain-specific set of IICs for wearable-based health monitoring and demonstrate their usability in real applications, including state assessment and epileptic seizure detection.

[LG-55] VecMol: Vector-Field Representations for 3D Molecule Generation

链接: https://arxiv.org/abs/2603.12734
作者: Yuchen Hua,Xingang Peng,Jianzhu Ma,Muhan Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative modeling of three-dimensional (3D) molecules is a fundamental yet challenging problem in drug discovery and materials science. Existing approaches typically represent molecules as 3D graphs and co-generate discrete atom types with continuous atomic coordinates, leading to intrinsic learning difficulties such as heterogeneous modality entanglement and geometry-chemistry coherence constraints. We propose VecMol, a paradigm-shifting framework that reimagines molecular representation by modeling 3D molecules as continuous vector fields over Euclidean space, where vectors point toward nearby atoms and implicitly encode molecular structure. The vector field is parameterized by a neural field and generated using a latent diffusion model, avoiding explicit graph generation and decoupling structure learning from discrete atom instantiation. Experiments on the QM9 and GEOM-Drugs benchmarks validate the feasibility of this novel approach, suggesting vector-field-based representations as a promising new direction for 3D molecular generation.
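
The core representational idea — a field of vectors pointing to nearby atoms — can be computed exactly for a toy molecule; this is the ground-truth field that VecMol's neural field would be trained to approximate (the coordinates below are illustrative, not from the paper):

```python
import numpy as np

def molecule_vector_field(queries, atoms):
    """At each query point in space, return the vector to its nearest atom:
    the implicit structure encoding described in the abstract."""
    offsets = atoms[None, :, :] - queries[:, None, :]    # (Q, A, 3) displacement
    nearest = np.linalg.norm(offsets, axis=-1).argmin(axis=1)
    return offsets[np.arange(len(queries)), nearest]     # vector toward nearest atom

atoms = np.array([[0.0, 0.0, 0.0],
                  [1.5, 0.0, 0.0]])                      # toy diatomic "molecule"
queries = np.array([[0.2, 0.0, 0.0],
                    [1.4, 0.3, 0.0]])
field = molecule_vector_field(queries, atoms)

# Following each vector from its query point lands exactly on an atom,
# which is how atoms can be re-instantiated from the field at decode time.
recovered = queries + field
```

Note how the field is continuous over Euclidean space while the atoms stay implicit, so generation never manipulates a discrete molecular graph directly.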

[LG-56] Weakly Time-Coupled Approximation of Markov Decision Processes

链接: https://arxiv.org/abs/2603.12636
作者: Negar Soheili,Selvaprabu Nadarajah,Bo Yang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finite-horizon Markov decision processes (MDPs) with high-dimensional exogenous uncertainty and endogenous states arise in operations and finance, including the valuation and exercise of Bermudan and real options, but face a scalability barrier as computational complexity grows with the horizon. A common approximation represents the value function using basis functions, but methods for fitting weights treat cross-stage optimization differently. Least squares Monte Carlo (LSM) fits weights via backward recursion and regression, avoiding joint optimization but accumulating error over the horizon. Approximate linear programming (ALP) and pathwise optimization (PO) jointly fit weights to produce upper bounds, but temporal coupling causes computational complexity to grow with the horizon. We show this coupling is an artifact of the approximation architecture, and develop a weakly time-coupled approximation (WTCA) where cross-stage dependence is independent of horizon. For any fixed basis function set, the WTCA upper bound is tighter than that of ALP and looser than that of PO, and converges to the optimal policy value as the basis family expands. We extend parallel deterministic block coordinate descent to the stochastic MDP setting exploiting weak temporal coupling. Applied to WTCA, weak coupling yields computational complexity independent of the horizon. Within equal time budget, solving WTCA accommodates more exogenous samples or basis functions than PO, yielding tighter bounds despite PO being tighter for fixed samples and basis functions. On Bermudan option and ethanol production instances, WTCA produces tighter upper bounds than PO and LSM in every instance tested, with near-optimal policies at longer horizons.

[LG-57] Batched Kernelized Bandits: Refinements and Extensions

链接: https://arxiv.org/abs/2603.12627
作者: Chenkai Ma,Keqin Chen,Jonathan Scarlett
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we consider the problem of black-box optimization with noisy feedback revealed in batches, where the unknown function to optimize has a bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We refer to this as the Batched Kernelized Bandits problem, and refine and extend existing results on regret bounds. For algorithmic upper bounds, (Li and Scarlett, 2022) shows that B=O(\log\log T) batches suffice to attain near-optimal regret, where T is the time horizon and B is the number of batches. We further refine this by (i) finding the optimal number of batches including constant factors (to within 1+o(1) ), and (ii) removing a factor of B in the regret bound. For algorithm-independent lower bounds, noticing that existing results only apply when the batch sizes are fixed in advance, we present novel lower bounds when the batch sizes are chosen adaptively, and show that adaptive batches have essentially the same minimax regret scaling as fixed batches. Furthermore, we consider a robust setting where the goal is to choose points for which the function value remains high even after an adversarial perturbation. We present the robust-BPE algorithm, and show that a suitably-defined cumulative regret notion incurs the same bound as the non-robust setting, and derive a simple regret bound significantly below that of previous work.

[LG-58] Accelerating materials discovery using foundation model based In-context active learning

链接: https://arxiv.org/abs/2603.12567
作者: Jeffrey Hu,Rongzhi Dong,Ying Feng,Ming Hu,Jianjun Hu
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:Active learning (AL) has emerged as a powerful paradigm for accelerating materials discovery by iteratively steering experiments toward the most promising candidates, reducing costly synthesis-and-characterization cycles. However, current AL relies predominantly on Gaussian Process (GP) and Random Forest (RF) surrogates with complementary limitations: GP underfits complex composition–property landscapes due to rigid kernel assumptions, while RF produces unreliable uncertainty estimates in small-data regimes, precisely where most materials datasets reside (fewer than 500 samples). Here we propose foundation-model-based In-Context Active Learning (ICAL), replacing conventional surrogates with TabPFN, a transformer-based foundation model pre-trained on millions of synthetic tasks to meta-learn a universal prior over tabular data. TabPFN performs principled Bayesian inference in a single forward pass without dataset-specific retraining, delivering well-calibrated predictive uncertainty where GP and RF fail most severely. Benchmarked against GP and RF across 10 materials datasets spanning copper alloy hardness and electrical conductivity, bulk metallic glass-forming ability, and crystal lattice thermal conductivity, TabPFN wins on 8 out of 10 datasets, achieving a mean saving of 52% in extra experiments/evaluations relative to GP and 29.77% relative to RF. Cross-validation analysis confirms that TabPFN’s advantage stems from superior uncertainty calibration, achieving the lowest Negative Log-Likelihood and Area Under the Sparsification Error curve among all surrogates. Our work demonstrates that a pre-trained foundation model can serve as a highly effective surrogate for accelerating active learning-based materials discovery.
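
The surrogate-driven active-learning loop reduces to an acquire-label-refit cycle. The sketch below uses scikit-learn's GaussianProcessRegressor as a stand-in surrogate (the paper would slot TabPFN in its place) with an upper-confidence-bound acquisition; the hidden property function, candidate pool, and beta value are all illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def acquire(model, X_pool, labelled, beta=2.0):
    """Upper-confidence-bound acquisition: exploit the mean, explore the std."""
    mu, sigma = model.predict(X_pool, return_std=True)
    ucb = mu + beta * sigma
    ucb[labelled] = -np.inf          # never re-query an already-labelled candidate
    return int(np.argmax(ucb))

rng = np.random.default_rng(0)
X_pool = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)       # candidate "materials"
f = lambda x: 9.0 - (x ** 2).sum(axis=1)                  # hidden property, peak 9 at 0

labelled = list(rng.choice(len(X_pool), size=3, replace=False))   # cold start
for _ in range(10):                                       # acquisition rounds
    surrogate = GaussianProcessRegressor(
        kernel=RBF(length_scale=1.0), optimizer=None,
        normalize_y=True, alpha=1e-6,
    ).fit(X_pool[labelled], f(X_pool[labelled]))
    labelled.append(acquire(surrogate, X_pool, labelled))

best_found = float(f(X_pool[labelled]).max())
```

The abstract's claim is precisely about this loop's weak point: with only a handful of labelled points, the surrogate's uncertainty estimates drive acquisition, so better-calibrated uncertainty (TabPFN's strength) means fewer wasted experiments.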

[LG-59] EB-RANSAC: Random Sample Consensus based on Energy-Based Model

链接: https://arxiv.org/abs/2603.12525
作者: Muneki Yasuda,Nao Watanabe,Kaiji Sekimoto
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Random sample consensus (RANSAC), which is based on repeated sampling from a given dataset, is one of the most popular robust estimation methods. This study proposes energy-based RANSAC (EB-RANSAC), an energy-based model (EBM) for robust estimation with a scheme similar to RANSAC. Like RANSAC, EB-RANSAC is applicable to a wide range of estimation problems. Unlike RANSAC, however, it does not require a troublesome sampling procedure and has only one hyperparameter. The effectiveness of EB-RANSAC is demonstrated numerically in two applications: linear regression and maximum likelihood estimation.
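
For readers unfamiliar with the baseline, the repetitive-sampling scheme that EB-RANSAC dispenses with looks like this for robust line fitting (a textbook RANSAC sketch, not the paper's EBM; the inlier tolerance and iteration budget are illustrative hyperparameters):

```python
import numpy as np

def ransac_line(x, y, n_iters=200, tol=0.5, seed=0):
    """Classical RANSAC for y = a*x + b: repeatedly fit a minimal random sample
    and keep the model with the largest inlier consensus set."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = -1, (0.0, 0.0)
    for _ in range(n_iters):
        i, j = rng.choice(len(x), size=2, replace=False)  # minimal sample: 2 points
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = int(np.sum(np.abs(y - (a * x + b)) < tol))
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (a, b)
    return best_model, best_inliers

rng = np.random.default_rng(1)
x = rng.uniform(-5, 5, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)   # true line with mild noise
y[:30] += rng.uniform(-20, 20, 30)            # 30% gross outliers

(a_hat, b_hat), n_inliers = ransac_line(x, y)
```

The nested loop over random minimal samples (plus the iteration count and tolerance it requires) is exactly the procedure the abstract describes EB-RANSAC as replacing with a single-hyperparameter energy-based formulation.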

[LG-60] FloeNet: A mass-conserving global sea ice emulator that generalizes across climates

链接: https://arxiv.org/abs/2603.12449
作者: William Gregory,Mitchell Bushuk,James Duncan,Elynn Wu,Adam Subel,Spencer K. Clark,Bill Hurlin,Oliver Watt-Meyer,Alistair Adcroft,Chris Bretherton,Laure Zanna
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 4 Figures, 18 supplementary figures

点击查看摘要

Abstract:We introduce FloeNet, a machine-learning emulator trained on the Geophysical Fluid Dynamics Laboratory global sea ice model, SIS2. FloeNet is a mass-conserving model, emulating 6-hour mass and area budget tendencies related to sea ice and snow-on-sea-ice growth, melt, and advection. We train FloeNet using simulated data from a reanalysis-forced ice-ocean simulation and test its ability to generalize to pre-industrial control and 1% CO2 climates. FloeNet outperforms a non-conservative model at reproducing sea ice and snow-on-sea-ice mean state, trends, and inter-annual variability, with volume anomaly correlations above 0.96 in the Antarctic and 0.76 in the Arctic, across all forcings. FloeNet also produces the correct thermodynamic vs dynamic response to forcing, enabling physical interpretability of emulator output. Finally, we show that FloeNet outputs high-fidelity coupling-related variables, including ice-surface skin temperature, ice-to-ocean salt flux, and melting energy fluxes. We hypothesize that FloeNet will improve polar climate processes within existing atmosphere and ocean emulators.
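
The mass-conserving budget idea can be illustrated with a toy grid update, where thermodynamic tendencies (growth, melt) change total mass but the advective tendency is projected to move mass without creating it. This is a crude global projection for illustration only; the abstract does not specify FloeNet's actual conservation mechanism, and the grid, time step, and tendency magnitudes are invented:

```python
import numpy as np

def conservative_update(mass, growth, melt, advection, dt=0.25):
    """Mass-budget step: growth and melt change total mass, while advection
    only redistributes it, so its global integral is projected to zero."""
    advection = advection - advection.mean()   # enforce zero net transport
    return mass + dt * (growth - melt + advection)

rng = np.random.default_rng(0)
mass = rng.uniform(0.0, 2.0, size=(10, 10))       # toy sea-ice mass grid
growth = rng.uniform(0.0, 0.1, size=(10, 10))     # thermodynamic growth tendency
melt = rng.uniform(0.0, 0.1, size=(10, 10))       # thermodynamic melt tendency
advection = rng.normal(0.0, 0.2, size=(10, 10))   # raw (non-conserving) prediction

new_mass = conservative_update(mass, growth, melt, advection)
# Total mass changes only through thermodynamics, never through advection.
expected_total = mass.sum() + 0.25 * (growth - melt).sum()
```

Separating the budget into thermodynamic and dynamic terms like this is also what makes the emulator's response to forcing physically interpretable, as the abstract notes.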

[LG-61] The Privacy-Utility Trade-Off of Location Tracking in Ad Personalization

链接: https://arxiv.org/abs/2603.12374
作者: Mohammad Mosaffa,Omid Rafieian
类目: Econometrics (econ.EM); Machine Learning (cs.LG)
*备注: 57 pages, 11 figures. Digital advertising, causal inference, and machine learning

点击查看摘要

Abstract:Firms collect vast amounts of behavioral and geographical data on individuals. While behavioral data captures an individual’s digital footprint, geographical data reflects their physical footprint. Given the significant privacy risks associated with combining these data sources, it is crucial to understand their respective value and whether they act as complements or substitutes in achieving firms’ business objectives. In this paper, we combine economic theory, machine learning, and causal inference to quantify the value of geographical data, the extent to which behavioral data can substitute for it, and the mechanisms through which it benefits firms. Using data from a leading in-app advertising platform in a large Asian country, we document that geographical data is most valuable in the early cold-start stage, when behavioral histories are limited. In this stage, geographical data complements behavioral data, improving targeting performance by almost 20%. As users accumulate richer behavioral histories, however, the role of geographical data shifts: it becomes largely substitutable, as behavioral data alone captures the relevant heterogeneity. These results highlight a central privacy-utility trade-off in ad personalization and inform managerial decisions about when location tracking creates value.

[LG-62] Optimal Experimental Design for Reliable Learning of History-Dependent Constitutive Laws

链接: https://arxiv.org/abs/2603.12365
作者: Kaushik Bhattacharya,Lianghao Cao,Andrew Stuart
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:History-dependent constitutive models serve as macroscopic closures for the aggregated effects of micromechanics. Their parameters are typically learned from experimental data. With a limited experimental budget, eliciting the full range of responses needed to characterize the constitutive relation can be difficult. As a result, the data can be well explained by a range of parameter choices, leading to parameter estimates that are uncertain or unreliable. To address this issue, we propose a Bayesian optimal experimental design framework to quantify, interpret, and maximize the utility of experimental designs for reliable learning of history-dependent constitutive models. In this framework, the design utility is defined as the expected reduction in parametric uncertainty or the expected information gain. This enables in silico design optimization using simulated data and reduces the cost of physical experiments for reliable parameter identification. We introduce two approximations that make this framework practical for advanced material testing with expensive forward models and high-dimensional data: (i) a Gaussian approximation of the expected information gain, and (ii) a surrogate approximation of the Fisher information matrix. The former enables efficient design optimization and interpretation, while the latter extends this approach to batched design optimization by amortizing the cost of repeated utility evaluations. Our numerical studies of uniaxial tests for viscoelastic solids show that optimized specimen geometries and loading paths yield image and force data that significantly improve parameter identifiability relative to random designs, especially for parameters associated with memory effects. 
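
The expected-information-gain utility has a closed form in the linear-Gaussian special case, which conveys the idea behind the paper's Gaussian approximation (the two designs below are toy observation operators, not the viscoelastic specimen geometries studied in the paper):

```python
import numpy as np

def gaussian_eig(G, prior_cov, noise_var):
    """Expected information gain for a linear-Gaussian experiment
    y = G @ theta + noise:  0.5 * logdet(I + G Sigma G^T / sigma^2)."""
    m = G.shape[0]
    S = np.eye(m) + (G @ prior_cov @ G.T) / noise_var
    return 0.5 * np.linalg.slogdet(S)[1]

prior_cov = np.eye(2)                      # unit-variance prior on two parameters
G_a = np.array([[1.0, 0.0],
                [1.0, 0.0]])               # design A: measures theta_1 twice
G_b = np.array([[1.0, 0.0],
                [0.0, 1.0]])               # design B: measures each parameter once
eig_a = gaussian_eig(G_a, prior_cov, noise_var=0.5)
eig_b = gaussian_eig(G_b, prior_cov, noise_var=0.5)
```

Design B scores higher because it reduces uncertainty in both parameters, mirroring the paper's finding that optimized designs improve identifiability of otherwise weakly constrained parameters.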

[LG-63] Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration

链接: https://arxiv.org/abs/2603.12351
作者: Raphiel J. Murden,Ganzhong Tian,Deqiang Qiu,Benajmin B. Risk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Collecting multiple types of data on the same set of subjects is common in modern scientific applications, including genomics, metabolomics, and neuroimaging. Joint and Individual Variation Explained (JIVE) seeks a low-rank approximation of the joint variation between two or more sets of features captured on common subjects and isolates this variation from that unique to each set of features. We develop an expectation-maximization (EM) algorithm to estimate a probabilistic model for the JIVE framework. The model extends probabilistic principal components analysis to multiple data sets. Our maximum likelihood approach simultaneously estimates joint and individual components, which can lead to greater accuracy compared to other methods. We apply ProJIVE to measures of brain morphometry and cognition in Alzheimer’s disease. ProJIVE learns biologically meaningful modes of variation, and the joint morphometry and cognition subject scores are strongly related to more expensive existing biomarkers. Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Code to reproduce the analysis is available on our GitHub page.

[LG-64] Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia

链接: https://arxiv.org/abs/2603.12316
作者: Haining Pan,Nakul Aggarwal,J. H. Pixley
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 14 pages, 15 figures

点击查看摘要

Abstract:Modern neural networks are heavily overparameterized, and pruning, which removes redundant neurons or connections, has emerged as a key approach to compressing them without sacrificing performance. However, while practical pruning methods are well developed, whether pruning induces sharp phase transitions in the neural networks and, if so, to what universality class they belong, remain open questions. To address this, we study fully-connected neural networks trained on MNIST, independently varying the dropout (i.e., removing neurons) rate at both the training and evaluation stages to map the phase diagram. We identify three distinct phases: eumentia (the network learns), dementia (the network has forgotten), and amentia (the network cannot learn), sharply distinguished by the power-law scaling of the cross-entropy loss with the training dataset size. In the eumentia phase, the algebraic decay of the loss, as documented in the machine learning literature as neural scaling laws, is from the perspective of statistical mechanics the hallmark of quasi-long-range order. We demonstrate that the transition between the eumentia and dementia phases is accompanied by scale invariance, with a diverging length scale that exhibits hallmarks of a Berezinskii-Kosterlitz-Thouless-like transition; the phase structure is robust across different network widths and depths. Our results establish that dropout-induced pruning provides a concrete setting in which neural network behavior can be understood through the lens of statistical mechanics.

附件下载

点击下载今日全部论文列表