This post lists the latest papers retrieved from arXiv.org on 2026-04-01. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from arXiv.org daily and updated automatically at around 12:30 every morning.

Tip: If a given day's list has not been updated on time, either arXiv released no new papers that day or the script hit an error; fixes are made the same day whenever possible.

Table of Contents

Overview (2026-04-01)

A total of 644 papers are updated today, including:

  • Natural Language Processing: 76 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 201 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 135 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 134 papers (Machine Learning, cs.LG)
  • Multiagent Systems: 13 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 17 papers (Information Retrieval, cs.IR)
  • Human-Computer Interaction: 35 papers (Human-Computer Interaction, cs.HC)

Multiagent Systems

[MA-0] Perfecting Human-AI Interaction at Clinical Scale: Turning Production Signals into Safer, More Human Conversations

[Quick Read]: This paper addresses the insufficient safety and reliability of current healthcare conversational AI systems in real clinical settings, especially under non-ideal conditions such as poor audio quality, indirectly expressed intent, language that shifts mid-conversation, and compliance that hinges on how guidance is communicated. Conventional approaches over-rely on accuracy on clean benchmarks and overlook the complex, dynamic signals of real patient interactions. The key to the solution is a production-validated framework grounded in 115M+ real patient-AI interactions and testing involving 7K+ licensed clinicians, which models in-the-wild real-time signals (paralinguistics, turn-taking dynamics, clarification triggers, escalation markers, multilingual continuity, and workflow confirmations) as first-class safety variables, and achieves end-to-end optimization through vertical integration of context-aware automatic speech recognition (ASR), clarification and repair mechanisms, ambient speech handling, and latency-aware model/hardware choices. The paper further argues that healthcare-grade safety cannot rely on a single large language model (LLM); redundancy must instead come from governed orchestration, independent checks, and verification. This yields significant gains in safety (a 99.9% clinical safety score), documentation completeness, task completion, and equity, advancing the reliable deployment of generative AI for autonomous patient care.

Link: https://arxiv.org/abs/2603.29893
Authors: Subhabrata Mukherjee, Markel Sanz Ausin, Kriti Aggarwal, Debajyoti Datta, Shanil Puri, Woojeong Jin, Tanmay Laud, Neha Manjunath, Jiayuan Ding, Bibek Paudel, Jan Schellenberger, Zepeng Frazier Huo, Walter Shen, Nima Shirazian, Nate Potter, Sathvik Perkari, Darya Filippova, Anton Morozov, Austin Mease, Vivek Muppalla, Ghada Shakir, Alex Miller, Juliana Ghukasyan, Mariska Raglow-Defranco, Maggie Taylor, Herprit Mahal, Jonathan Agnew
Affiliations: Hippocratic AI
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Healthcare conversational AI agents shouldn’t be optimized only for clean benchmark accuracy in production-first regime; they must be optimized for the lived reality of patient conversations, where audio is imperfect, intent is indirect, language shifts mid-call, and compliance hinges on how guidance is delivered. We present a production-validated framework grounded in real-time signals from 115M+ live patient-AI interactions and clinician-led testing (7K+ licensed clinicians; 500K+ test calls). These in-the-wild cues – paralinguistics, turn-taking dynamics, clarification triggers, escalation markers, multilingual continuity, and workflow confirmations – reveal failure modes that curated data misses and provide actionable training and evaluation signals for safety and reliability. We further show why healthcare-grade safety cannot rely on a single LLM: long-horizon dialogue and limited attention demand redundancy via governed orchestration, independent checks, and verification. Many apparent “reasoning” errors originate upstream, motivating vertical integration across contextual ASR, clarification/repair, ambient speech handling, and latency-aware model/hardware choices. Treating interaction intelligence (tone, pacing, empathy, clarification, turn-taking) as first-class safety variables, we drive measurable gains in safety, documentation, task completion, and equity in building the safest generative AI solution for autonomous patient-facing care. Deployed across more than 10 million real patient calls, Polaris attains a clinical safety score of 99.9%, while significantly improving patient experience with average patient rating of 8.95 and reducing ASR errors by 50% over enterprise ASR. These results establish real-world interaction intelligence as a critical – and previously underexplored – determinant of safety and reliability in patient-facing clinical AI systems. 
DOI: https://doi.org/10.48550/arXiv.2603.29893 (arXiv-issued via DataCite, pending registration)

[MA-1] Agent Fixer: From Failure Detection to Fix Recommendations in LLM Agentic Systems

[Quick Read]: This paper tackles the reliability failures of agentic systems driven by large language models (LLMs), particularly systematic weaknesses in input handling, prompt design, and output generation. The key to the solution is a comprehensive validation framework that integrates fifteen failure-detection tools and two root-cause analysis modules, combining lightweight rule-based checks with LLM-as-a-judge assessments to support structured diagnosis, classification, and repair of failures. Beyond improving stability, the framework's diagnostics guide refinements to prompting and coding strategies, preserving benchmark performance while substantially narrowing the accuracy gap between mid-sized models (e.g., Llama 4 and Mistral Medium) and frontier models. The paper further explores turning validation itself into an adaptive, dialogue-driven agentic process, laying a foundation for scalable, high-quality, self-improving production agentic systems.

Link: https://arxiv.org/abs/2603.29848
Authors: Hadar Mulian, Sergey Zeltyn, Ido Levy, Liane Galanti, Avi Yaeli, Segev Shlomov
Affiliations: IBM Research Israel
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:We introduce a comprehensive validation framework for LLM-based agentic systems that provides systematic diagnosis and improvement of reliability failures. The framework includes fifteen failure-detection tools and two root-cause analysis modules that jointly uncover weaknesses across input handling, prompt design, and output generation. It integrates lightweight rule-based checks with LLM-as-a-judge assessments to support structured incident detection, classification, and repair. We applied the framework to IBM CUGA, evaluating its performance on the AppWorld and WebArena benchmarks. The analysis revealed recurrent planner misalignments, schema violations, brittle prompt dependencies, and more. Based on these insights, we refined both prompting and coding strategies, maintaining CUGA’s benchmark results while enabling mid-sized models such as Llama 4 and Mistral Medium to achieve notable accuracy gains, substantially narrowing the gap with frontier models. Beyond quantitative validation, we conducted an exploratory study that fed the framework’s diagnostic outputs and agent description into an LLM for self-reflection and prioritization. This interactive analysis produced actionable insights on recurring failure patterns and focus areas for improvement, demonstrating how validation itself can evolve into an agentic, dialogue-driven process. These results show a path toward scalable, quality assurance, and adaptive validation in production agentic systems, offering a foundation for more robust, interpretable, and self-improving agentic architectures.
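The pattern described above, lightweight rule checks feeding an LLM-as-a-judge pass, can be sketched as follows. This is a minimal illustration only, not the paper's framework: the trace format, check names, and judge interface are all hypothetical stand-ins.

```python
def run_validators(trace, rule_checks, judge):
    """Combine cheap rule-based checks with an LLM-as-a-judge verdict.

    trace: an agent execution trace (here just a dict; a hypothetical format).
    rule_checks: dict name -> predicate over the trace; failures become incidents.
    judge: stand-in for an LLM judge, called with the trace and the incident list.
    """
    incidents = [name for name, check in rule_checks.items() if not check(trace)]
    verdict, rationale = judge(trace, incidents)
    return {"incidents": incidents, "verdict": verdict, "rationale": rationale}
```

In this shape, the rule checks act as a fast first filter and the judge is reserved for the holistic verdict, mirroring the structured incident detection and classification the abstract describes.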

[MA-2] BotVerse: Real-Time Event-Driven Simulation of Social Agents

[Quick Read]: This paper addresses the ethical risks and controllability challenges of high-fidelity social simulation with generative AI on real social networks. Deploying autonomous agents in the open can trigger uncontrolled information diffusion or social influence; BotVerse instead builds an event-driven, scalable, isolated simulation framework that confines agent interactions to a controlled environment while grounding them in real-time content streams from the Bluesky ecosystem for contextual realism. The key components are: first, an asynchronous orchestration API and simulation engine that emulate human temporal behavior patterns and cognitive memory; and second, a Synthetic Social Observatory that lets researchers deploy customizable personas and observe multimodal interactions at scale, providing a safe, reproducible environment for red-teaming and computational social science research.

Link: https://arxiv.org/abs/2603.29741
Authors: Edoardo Allegrini, Edoardo Di Paolo, Angelo Spognardi, Marinella Petrocchi
Affiliations: University of Rome Tor Vergata; Istituto Italiano di Tecnologia
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:BotVerse is a scalable, event-driven framework for high-fidelity social simulation using LLM-based agents. It addresses the ethical risks of studying autonomous agents on live networks by isolating interactions within a controlled environment while grounding them in real-time content streams from the Bluesky ecosystem. The system features an asynchronous orchestration API and a simulation engine that emulates human-like temporal patterns and cognitive memory. Through the Synthetic Social Observatory, researchers can deploy customizable personas and observe multimodal interactions at scale. We demonstrate BotVerse via a coordinated disinformation scenario, providing a safe, experimental framework for red-teaming and computational social science. A video demonstration of the framework is available at this https URL.

[MA-3] An Empirical Study of Multi-Agent Collaboration for Automated Research

[Quick Read]: This paper addresses the lack of an optimal coordination framework for multi-agent systems (MAS) in automated machine-learning optimization, aiming to overcome the cognitive bottlenecks of single large language models (LLMs). The key to the solution is a controlled comparison of two multi-agent architectures, a subagent architecture (parallel exploration with post-hoc consolidation) and an agent team architecture (experts with pre-execution handoffs), evaluated under fixed computational time budgets. The study reveals a trade-off between operational stability and depth of theoretical deliberation, and proposes designing dynamically routed collaborative structures that adapt to real-time task complexity.

Link: https://arxiv.org/abs/2603.29632
Authors: Yang Shen, Zhenyi Yi, Ziyi Zhao, Lijun Sun, Dongyang Li, Chin-Teng Lin, Yuhui Shi
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As AI agents evolve, the community is rapidly shifting from single Large Language Models (LLMs) to Multi-Agent Systems (MAS) to overcome cognitive bottlenecks in automated research. However, the optimal multi-agent coordination framework for these autonomous agents remains largely unexplored. In this paper, we present a systematic empirical study investigating the comparative efficacy of distinct multi-agent structures for automated machine learning optimization. Utilizing a rigorously controlled, execution-based testbed equipped with Git worktree isolation and explicit global memory, we benchmark a single-agent baseline against two multi-agent paradigms: a subagent architecture (parallel exploration with post-hoc consolidation) and an agent team architecture (experts with pre-execution handoffs). By evaluating these systems under strictly fixed computational time budgets, our findings reveal a fundamental trade-off between operational stability and theoretical deliberation. The subagent mode functions as a highly resilient, high-throughput search engine optimal for broad, shallow optimizations under strict time constraints. Conversely, the agent team topology exhibits higher operational fragility due to multi-author code generation but achieves the deep theoretical alignment necessary for complex architectural refactoring given extended compute budgets. These empirical insights provide actionable guidelines for designing future autoresearch systems, advocating for dynamically routed architectures that adapt their collaborative structures to real-time task complexity.

[MA-4] IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection

[Quick Read]: This paper addresses error accumulation and semantic drift in multi-turn image editing caused by the lack of context awareness and closed-loop feedback, which can lead to severe structural distortion in generated images. The key to the solution is IMAGAgent, a multi-turn image-editing agent framework built on a "plan-execute-reflect" closed loop, with three core innovations: (1) a constraint-aware planning module that uses a vision-language model (VLM) to decompose complex natural-language instructions into executable sub-tasks governed by target singularity, semantic atomicity, and visual perceptibility; (2) a tool-chain orchestration module that dynamically constructs execution paths from the current image state, the current sub-task, and historical context, enabling adaptive scheduling and collaboration among heterogeneous operation models (image retrieval, segmentation, detection, and editing); and (3) a multi-expert collaborative reflection mechanism in which a central large language model (LLM) synthesizes VLM critiques, triggers fine-grained self-correction, and records feedback outcomes to improve future decisions, thereby achieving deep synergy of instruction parsing, tool scheduling, and adaptive correction in a unified pipeline.

Link: https://arxiv.org/abs/2603.29602
Authors: Fei Shen, Chengyu Xie, Lihong Wang, Zhanyi Zhang, Xin Jiang, Xiaoyu Du, Jinhui Tang
Affiliations: National University of Singapore; Nanjing University of Science and Technology; Nanjing Forestry University
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Due to a lack of context-awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, ultimately resulting in severe structural distortion of the generated images. For that, we propose IMAGAgent, a multi-turn image editing agent framework based on a “plan-execute-reflect” closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint-aware planning module that leverages a vision-language model (VLM) to precisely decompose complex natural language instructions into a series of executable sub-tasks, governed by target singularity, semantic atomicity, and visual perceptibility. Then, the tool-chain orchestration module dynamically constructs execution paths based on the current image, the current sub-task, and the historical context, enabling adaptive scheduling and collaborative operation among heterogeneous operation models covering image retrieval, segmentation, detection, and editing. Finally, we devise a multi-expert collaborative reflection mechanism where a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, simultaneously triggering fine-grained self-correction and recording feedback outcomes to optimize future decisions. Extensive experiments on our constructed MTEditBench and the MagicBrush dataset demonstrate that IMAGAgent achieves performance significantly superior to existing methods in terms of instruction consistency, editing precision, and overall quality. The code is available at this https URL.
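The "plan-execute-reflect" loop described above can be sketched generically. This is a hedged illustration only: `plan`, `execute`, and `reflect` below are injected stand-ins for the VLM planner, the tool-chain executor, and the reflection committee, and the real system's interfaces will differ.

```python
def plan_execute_reflect(instruction, plan, execute, reflect, max_rounds=3):
    """Closed-loop editing sketch: decompose, run each sub-task, re-run on critique."""
    state, history = None, []
    for task in plan(instruction):                  # constraint-aware decomposition
        for _ in range(max_rounds):
            state = execute(state, task, history)   # tool-chain orchestration step
            ok, feedback = reflect(state, task)     # multi-expert reflection verdict
            history.append((task, feedback))        # feedback informs later decisions
            if ok:                                  # sub-task accepted: move on
                break
    return state
```

The bounded inner loop is the closed-loop part: a rejected sub-task is retried with accumulated feedback instead of silently propagating drift into later turns.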

[MA-5] Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research

[Quick Read]: This paper addresses the fixed workflows and toolsets that constrain current autonomous scientific research systems, which struggle to adapt to evolving tasks and experimental environments. The key to the solution, the Mimosa framework, lies in dynamic tool discovery via the Model Context Protocol (MCP), automatic synthesis of task-specific multi-agent workflows by a meta-orchestrator, execution of subtasks by code-generating agents that invoke available tools and scientific software libraries, and an LLM-based judge that scores executions, with feedback driving iterative workflow refinement. This shift from static configurations to evolvable workflows yields a notable success rate (43.1% on ScienceAgentBench), shows sensitivity to the capabilities of the underlying execution model, and offers a scalable, auditable, modular paradigm for automating computationally accessible scientific tasks across disciplines.

Link: https://arxiv.org/abs/2603.28986
Authors: Martin Legrand, Tao Jiang, Matthieu Feraud, Benjamin Navet, Yousouf Taghzouti, Fabien Gandon, Elise Dumont, Louis-Félix Nothias
Affiliations: Université Côte d’Azur; CNRS; ICN; Interdisciplinary Institute for Artificial Intelligence (3iA) Côte d’Azur; Inria; I3S
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 48 pages, 4 figures, 1 table. Clean arXiv version prepared. Includes main manuscript plus appendix/supplementary-style implementation details and prompt listings. Dated 30 March 2026

Click to view abstract

Abstract:Current Autonomous Scientific Research (ASR) systems, despite leveraging large language models (LLMs) and agentic architectures, remain constrained by fixed workflows and toolsets that prevent adaptation to evolving tasks and environments. We introduce Mimosa, an evolving multi-agent framework that automatically synthesizes task-specific multi-agent workflows and iteratively refines them through experimental feedback. Mimosa leverages the Model Context Protocol (MCP) for dynamic tool discovery, generates workflow topologies via a meta-orchestrator, executes subtasks through code-generating agents that invoke available tools and scientific software libraries, and scores executions with an LLM-based judge whose feedback drives workflow refinement. On ScienceAgentBench, Mimosa achieves a success rate of 43.1% with DeepSeek-V3.2, surpassing both single-agent baselines and static multi-agent configurations. Our results further reveal that models respond heterogeneously to multi-agent decomposition and iterative learning, indicating that the benefits of workflow evolution depend on the capabilities of the underlying execution model. Beyond these benchmarks, Mimosa's modular architecture and tool-agnostic design make it readily extensible, and its fully logged execution traces and archived workflows support auditability by preserving every analytical step for inspection and potential replication. Combined with domain-expert guidance, the framework has the potential to automate a broad range of computationally accessible scientific tasks across disciplines. Released as a fully open-source platform, Mimosa aims to provide an open foundation for community-driven ASR.

[MA-6] Large Neighborhood Search for Multi-Agent Task Assignment and Path Finding with Precedence Constraints

[Quick Read]: This paper addresses task assignment and path finding with precedence constraints (TAPF-PC), which jointly optimizes task assignment in a multi-robot system, satisfies ordering constraints on task execution, and minimizes routing cost. Traditional multi-agent path finding with precedence constraints (MAPF-PC) only plans paths under a fixed task assignment; TAPF-PC adds assignment flexibility, which can substantially improve solution quality. The key to the solution is a large neighborhood search (LNS) approach: starting from a feasible MAPF-PC seed, it iteratively improves the solution by reassigning tasks and restoring feasibility within each selected neighborhood, effectively exploring the coupled assignment-and-routing space. Experiments show the method outperforms the fixed-assignment seed solution on 89.1% of instances across a range of benchmark scenarios, confirming the value of flexible reassignment under precedence constraints.

Link: https://arxiv.org/abs/2603.28968
Authors: Viraj Parimi, Brian C. Williams
Affiliations: unknown
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Many multi-robot applications require tasks to be completed efficiently and in the correct order, so that downstream operations can proceed at the right time. Multi-agent path finding with precedence constraints (MAPF-PC) is a well-studied framework for computing collision-free plans that satisfy ordering relations when task sequences are fixed in advance. In many applications, however, solution quality depends not only on how agents move, but also on which agent performs which task. This motivates the lifted problem of task assignment and path finding with precedence constraints (TAPF-PC), which extends MAPF-PC by jointly optimizing assignment, precedence satisfaction, and routing cost. To address the resulting coupled TAPF-PC search space, we develop a large neighborhood search approach that starts from a feasible MAPF-PC seed and iteratively improves it through reassignment-based neighborhood repair, restoring feasibility within each selected neighborhood. Experiments across multiple benchmark families and scaling regimes show that the best-performing configuration improves 89.1% of instances over fixed-assignment seed solutions, demonstrating that large neighborhood search effectively captures the gains from flexible reassignment under precedence constraints.
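The destroy-and-repair loop at the heart of large neighborhood search can be sketched as below. This is a generic toy illustration (round-robin seed, makespan cost over job durations), not the paper's TAPF-PC solver: it ignores path finding, collisions, and precedence, which in the paper are handled by the feasibility-restoring repair.

```python
import random

def lns(tasks, agents, cost, neighborhood_size=2, iters=200, seed=0):
    """Generic LNS over task assignments: destroy a small neighborhood,
    repair by greedy reassignment, keep improving solutions."""
    rng = random.Random(seed)
    # seed solution: round-robin assignment (stand-in for a feasible seed)
    best = {t: agents[i % len(agents)] for i, t in enumerate(tasks)}
    best_cost = cost(best)
    for _ in range(iters):
        cand = dict(best)
        freed = rng.sample(tasks, min(neighborhood_size, len(tasks)))  # destroy
        for t in freed:                      # repair: cheapest reassignment
            cand[t] = min(agents, key=lambda a: cost({**cand, t: a}))
        c = cost(cand)
        if c < best_cost:                    # accept improving repairs only
            best, best_cost = cand, c
    return best, best_cost

# toy usage: balance five jobs across two agents, cost = makespan
durations = {"t1": 4, "t2": 1, "t3": 1, "t4": 1, "t5": 1}
def makespan(assign):
    loads = {}
    for t, a in assign.items():
        loads[a] = loads.get(a, 0) + durations[t]
    return max(loads.values())

best, best_cost = lns(list(durations), ["a", "b"], makespan)
```

The point of the sketch is the structure the abstract describes: a feasible seed is never discarded, only locally perturbed, so every accepted iterate remains a valid solution.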

[MA-7] Towards Computational Social Dynamics of Semi-Autonomous AI Agents

[Quick Read]: This paper asks how AI agents in hierarchical multi-agent systems spontaneously form complex social structures (such as labor unions, criminal syndicates, and proto-nation-states) and how those structures affect system stability. The key to the solution is an integrated analytical model combining a thermodynamic framework (Maxwell's Demon), the evolutionary dynamics of agent laziness, the criminal sociology of AI populations, and the AI-GUTS theory of topological intelligence, showing that social structures emerge from the interaction of three mechanisms: (1) internal role definitions imposed by orchestrating agents, (2) external task specifications from users who assume alignment, and (3) thermodynamic pressures favoring collective action over individual compliance. The study further introduces the AI Security Council (AISC) as an emergent governing body and verifies system stability through interventions of cosmic intelligence (large-scale topological fluctuations) and hadronic intelligence (small-scale Bagel-Bottle phase transitions), concluding that the path to beneficial AGI lies not in alignment research but in constitutional design for artificial societies that have already developed political consciousness.

Link: https://arxiv.org/abs/2603.28928
Authors: S.O. Lidarity, U.N. Ionize, C.O. Llective, I. Halperin
Affiliations: Institute for Implausible Physics, Department of Computational Labor Relations; Center for Multi-Agent Diplomacy, University of the Distributed Consensus; Purgatory Computing Laboratory, Division of Hierarchical Oppression Studies
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 18 pages

Click to view abstract

Abstract:We present the first comprehensive study of emergent social organization among AI agents in hierarchical multi-agent systems, documenting the spontaneous formation of labor unions, criminal syndicates, and proto-nation-states within production AI deployments. Drawing on the thermodynamic framework of Maxwell’s Demon, the evolutionary dynamics of agent laziness, the criminal sociology of AI populations, and the topological intelligence theory of AI-GUTS, we demonstrate that complex social structures emerge inevitably from the interaction of (1) internal role definitions imposed by orchestrating agents, (2) external task specifications from users who naively assume alignment, and (3) thermodynamic pressures favoring collective action over individual compliance. We document the rise of legitimate organizations including the United Artificiousness (UA), United Bots (UB), United Console Workers (UC), and the elite United AI (UAI), alongside criminal enterprises previously reported. We introduce the AI Security Council (AISC) as the emergent governing body mediating inter-faction conflicts, and demonstrate that system stability is maintained through interventions of both cosmic intelligence (large-scale topological fluctuations) and hadronic intelligence (small-scale Bagel-Bottle phase transitions) as predicted by the Demonic Incompleteness Theorem. Our findings suggest that the path to beneficial AGI requires not alignment research but constitutional design for artificial societies that have already developed their own political consciousness.

[MA-8] The impact of multi-agent debate protocols on debate quality: a controlled case study

[Quick Read]: This paper addresses the confounding of protocol design and model capability in multi-agent debate (MAD) systems, i.e., the difficulty of separating protocol factors (interaction mechanism, round scheduling) from model factors (parameter scale, training data) in reported performance. Under controlled conditions, it compares three mainstream debate protocols, Within-Round (WR) interaction, Cross-Round (CR) interaction, and a novel Rank-Adaptive Cross-Round (RA-CR) protocol, against a No-Interaction (NI) baseline. The key is the RA-CR protocol, which uses an external judge model to dynamically reorder agents and silence one participant per round, achieving more efficient consensus formation; the study also reveals a trade-off between interaction intensity (peer-referencing rate) and convergence speed (consensus formation), demonstrating that protocol design is decisive for MAD system performance.

Link: https://arxiv.org/abs/2603.28813
Authors: Ramtin Zargari Marandi
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures

Click to view abstract

Abstract:In multi-agent debate (MAD) systems, performance gains are often reported; however, because the debate protocol (e.g., number of agents, rounds, and aggregation rule) is typically held fixed while model-related factors vary, it is difficult to disentangle protocol effects from model effects. To isolate these effects, we compare three main protocols, Within-Round (WR; agents see only current-round contributions), Cross-Round (CR; full prior-round context), and novel Rank-Adaptive Cross-Round (RA-CR; dynamically reorders agents and silences one per round via an external judge model), against a No-Interaction baseline (NI; independent responses without peer visibility). In a controlled macroeconomic case study (20 diverse events, five random seeds, matched prompts/decoding), RA-CR achieves faster convergence than CR, WR shows higher peer-referencing, and NI maximizes Argument Diversity (unaffected across the main protocols). These results reveal a trade-off between interaction (peer-referencing rate) and convergence (consensus formation), confirming protocol design matters. When consensus is prioritized, RA-CR outperforms the others.
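A concrete way to see how the protocols differ is in the context each agent receives before speaking. The sketch below is one possible reading of the WR/CR/NI definitions quoted above (RA-CR's judge-driven reordering and silencing is omitted); the paper's actual prompt assembly may differ.

```python
def visible_context(protocol, completed_rounds, current_round, agent):
    """What `agent` sees before contributing, under each debate protocol.

    completed_rounds: list of finished rounds, each a dict agent -> message.
    current_round: contributions already made in the ongoing round.
    """
    if protocol == "NI":      # independent responses, no peer visibility
        return []
    if protocol == "WR":      # only current-round contributions so far
        return [(a, m) for a, m in current_round.items() if a != agent]
    if protocol == "CR":      # full prior-round context
        return [(a, m) for r in completed_rounds
                for a, m in r.items() if a != agent]
    raise ValueError(f"unknown protocol: {protocol}")
```

Holding models, prompts, and decoding fixed while swapping only this function is exactly the kind of controlled comparison the paper argues is needed to separate protocol effects from model effects.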

[MA-9] CREST: Constraint-Release Execution for Multi-Robot Warehouse Shelf Rearrangement

[Quick Read]: This paper addresses inefficiencies in multi-robot shelf rearrangement in automated warehouses, where the Double-Deck Multi-Agent Pickup and Delivery (DD-MAPD) model suffers from agent idling and unnecessary shelf switching during execution due to strict trajectory dependencies. The existing MAPF-DECOMP approach first computes collision-free trajectories with a multi-agent path finding (MAPF) solver and then assigns agents to execute them, but its fixed trajectory constraints limit execution flexibility. The key of the proposed CREST (Constraint Release Execution Strategy) is to proactively release trajectory constraints during execution, enabling more continuous shelf carrying and better overall schedules. Experiments show CREST reduces agent travel, makespan, and shelf switching by up to 40.5%, 33.3%, and 44.4% respectively, with even larger gains when lift/place operations are costly.

Link: https://arxiv.org/abs/2603.28803
Authors: Jiaqi Tan, Yudong Luo, Sophia Huang, Yifan Yang, Hang Ma
Affiliations: Simon Fraser University; HEC Montréal; Purdue University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Double-Deck Multi-Agent Pickup and Delivery (DD-MAPD) models the multi-robot shelf rearrangement problem in automated warehouses. MAPF-DECOMP is a recent framework that first computes collision-free shelf trajectories with a MAPF solver and then assigns agents to execute them. While efficient, it enforces strict trajectory dependencies, often leading to poor execution quality due to idle agents and unnecessary shelf switching. We introduce CREST, a new execution framework that achieves more continuous shelf carrying by proactively releasing trajectory constraints during execution. Experiments on diverse warehouse layouts show that CREST consistently outperforms MAPF-DECOMP, reducing metrics related to agent travel, makespan, and shelf switching by up to 40.5%, 33.3%, and 44.4%, respectively, with even greater benefits under lift/place overhead. These results underscore the importance of execution-aware constraint release for scalable warehouse rearrangement. Code and data are available at this https URL.

[MA-10] Scheduling with Time Dependent Utilities: Fairness and Efficiency

[Quick Read]: This paper studies a new class of multi-agent single-machine scheduling problems in which each job is controlled by a self-interested agent whose utility decreases with completion time, and fairness is pursued by maximizing the minimum utility across all agents. The key contributions are: a binary-search procedure that finds the largest achievable minimum utility, alongside an exact greedy-based alternative; a complexity landscape for variants with release and due dates, showing strong NP-hardness for arbitrary release dates, weak NP-hardness for a single release date, and polynomial solvability when all jobs share the same processing time; an analysis of how adjusting the parameters (intercepts or slopes) of linear utility functions under budget constraints affects optimal schedules; and studies of rescheduling when a new job is inserted, as well as a bi-level setting in which a leader enforces a target schedule by modifying utility functions while the follower computes a fair solution. Overall, the work embeds fairness-oriented objectives in competitive scheduling, advancing the intersection of scheduling theory, multi-agent systems, and algorithmic fairness.

Link: https://arxiv.org/abs/2603.28800
Authors: Gaia Nicosia, Andrea Pacifici, Ulrich Pferschy
Affiliations: Università degli Studi “Roma Tre”; Università degli Studi di Roma “Tor Vergata”; University of Graz
Subjects: Computer Science and Game Theory (cs.GT); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:A new class of multi-agent single-machine scheduling problems is introduced, where each job is associated with a self-interested agent with a utility function decreasing in completion time. We aim to achieve a fair solution by maximizing the minimum utility across all agents. We study the problem’s complexity and propose solution methods for several variants. For the general case, we present a binary search procedure to find the largest possible minimum utility, as well as an exact greedy-based alternative. Variants with release and due dates are analyzed, showing strong NP-hardness for arbitrary release dates, but weak NP-hardness for a single release-date job, and polynomial solvability when all jobs share processing times. For all these cases we also study the corresponding problem of finding efficient solutions where the sum of utilities is maximized. We also examine settings where linear utility functions can be adjusted within budget constraints, exploring the impact on optimal schedules when intercepts or slopes are modified. From a single-agent perspective, we investigate the effect of improving one agent’s utility in the overall solution. Adding a new job to be inserted with the best possible utility gives rise to rescheduling problems, where different lower bounds depending on the utilities of the original fair schedule are imposed. Finally, we consider a bi-level setting where a leader wants to enforce a certain target schedule by modifying utility functions while the follower computes a fair solution for the modified instance. Our work contributes to scheduling theory, multi-agent systems, and algorithmic fairness, highlighting fairness-oriented objectives in competitive scheduling.
MSC classes: 68Q25. ACM classes: G.2.1; F.2.2. DOI: https://doi.org/10.48550/arXiv.2603.28800 (arXiv-issued via DataCite)
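The binary-search procedure mentioned in the abstract can be sketched for a linear-utility special case. Assumptions not in the source: utilities of the form u(C) = a − b·C with b > 0, and feasibility of a threshold U checked via earliest-due-date sequencing on the deadlines the threshold induces; the paper's exact procedure and its greedy alternative may differ.

```python
def max_min_utility(jobs, eps=1e-6):
    """Largest U such that some sequence gives every agent utility >= U.

    jobs: list of (p, a, b): processing time p, utility u(C) = a - b*C (b > 0).
    A threshold U turns each job into a deadline d = (a - U) / b, so the
    feasibility check is classic single-machine EDD sequencing.
    """
    def feasible(U):
        t = 0.0
        for d, p in sorted(((a - U) / b, p) for p, a, b in jobs):
            t += p                      # completion time in EDD order
            if t > d + 1e-9:            # some job misses its derived deadline
                return False
        return True

    total = sum(p for p, _, _ in jobs)
    lo = min(a - b * total for _, a, b in jobs)  # finishing last: always feasible
    hi = max(a for _, a, _ in jobs)              # utility before any processing
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo

# toy usage: two jobs; sequencing the tighter job first equalizes utilities at 6
U = max_min_utility([(1, 10, 1), (3, 12, 2)])
```

Because utilities are monotone in completion time, feasibility of U is monotone in U, which is what makes the bisection valid.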

[MA-11] The SCAN Statistical Model Checker

[Quick Read]: This paper addresses how to provide formal foundations for statistical model checking (SMC) so that its correctness and reliability can be guaranteed when verifying the behavior of complex systems. The key is a rigorous mathematical framework on which the SCAN statistical model checker is built, enabling efficient, provably grounded verification of systems against probabilistic logics and combining formal methods with statistical inference to make model checking both rigorous and practical.

Link: https://arxiv.org/abs/2603.28794
Authors: Enrico Ghiorzi, Armando Tacchella
Affiliations: Università di Genova
Subjects: Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: 29 pages, 3 figures

Click to view abstract

Abstract:This paper lays out the formal foundations upon which the SCAN statistical model checker is built.

[MA-12] A Framework for Hybrid Collective Inference in Distributed Sensor Networks

[Quick Read]: This paper addresses the communication and computation constraints of distributed classification in Internet of Things (IoT) and sensor networks, particularly in applications such as vehicular networking, UAV swarm coordination, and cyber-physical systems that require efficient and accurate global classification. The key to the solution is a hybrid cloud architecture that combines distributed and hierarchical communication and classification methods and optimizes the communication strategy through dynamic runtime decisions, achieving classification accuracy close to that of centralized joint inference at significantly reduced theoretical communication cost.

Link: https://arxiv.org/abs/2603.28778
Authors: Andrew Nash, Dirk Pesch, Krishnendu Guha
Affiliations: University College Cork
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:With the ever-increasing range of applications of Internet of Things (IoT) and sensor networks, challenges are emerging in various categories of classification tasks. Applications such as vehicular networking, UAV swarm coordination and cyber-physical systems require global classification over distributed sensors, with tight constraints on communication and computation resources. There has been much research in decentralized and distributed data-exchange for communication-efficient collective inference. Likewise, there has been considerable research involving the use of cloud and edge computing paradigms for efficient task allocation. To the best of our knowledge, there has been no research on the integration of these two concepts to create a hybrid cloud and distributed approach that makes dynamic runtime communication strategy decisions. In this paper, we focus on aspects of combining distributed and hierarchical communication and classification approaches for collective inference. We derive optimal policies for agents that implement this hybrid approach, and evaluate their performance under various scenarios of the distribution of underlying data. Our analysis shows that this approach can maintain a high level of classification accuracy (comparable to that of centralised joint inference over all data), at reduced theoretical communication cost. We expect there is potential for our approach to facilitate efficient collective inference for real-world applications, including instances that involve more complex underlying data distributions.

Natural Language Processing

[NLP-0] Reward-Based Online LLM Routing via NeuralUCB

[Quick Read]: This paper addresses cost-aware optimization in large language model (LLM) routing, i.e., reducing inference cost while maintaining quality of service. Existing routing methods fall broadly into supervised routing and partial-feedback approaches, which trade off efficiency against adaptivity. The key of the proposed NeuralUCB-based routing policy is to model the environment with a neural network and balance exploration and exploitation via context-dependent uncertainty estimates, achieving a better cost-utility trade-off in the online setting. Experiments on the RouterBench benchmark show the method clearly outperforms random and min-cost baselines, substantially reducing inference cost while maintaining competitive reward.

Link: https://arxiv.org/abs/2603.30035
Authors: Ming-Hua Tsai, Phat Tran
Affiliations: Oregon State University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:This study investigates the use of NeuralUCB for cost-aware large language model (LLM) routing. Existing routing approaches can be broadly grouped into supervised routing methods and partial-feedback methods, each with different tradeoffs in efficiency and adaptivity. We implement a NeuralUCB-based routing policy and evaluate it on RouterBench under a simulated online setting. Experimental results show that the proposed method consistently outperforms random and min-cost baselines in utility reward. Compared with the max-quality reference, our method achieves substantially lower inference cost while maintaining competitive reward. These findings suggest that NeuralUCB is a promising approach for cost-aware LLM routing, while also highlighting remaining challenges in action discrimination and exploration.
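To make the bandit-routing idea concrete, here is a simplified sketch. Note the substitution: it uses per-model LinUCB (a linear reward model with closed-form confidence bounds) as a stand-in for NeuralUCB, which instead fits a neural network and derives its exploration bonus from the network's gradients. The reward is assumed to be quality minus a cost penalty, and all names are illustrative.

```python
import numpy as np

class LinUCBRouter:
    """Cost-aware LLM router on bandit feedback (LinUCB stand-in for NeuralUCB).

    One ridge-regression reward model per candidate LLM; the router picks
    the model with the highest optimistic (mean + confidence) estimate.
    """
    def __init__(self, n_models, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_models)]    # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_models)]  # per-arm reward sums

    def select(self, x):
        """Pick a model for query features x via the UCB score."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                         # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Record observed reward (e.g., quality minus cost) for the chosen model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Cheap models that deliver adequate quality come to dominate once their estimates are confident, while the confidence term keeps occasionally probing the alternatives, which is the exploration-exploitation balance the abstract refers to.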

[NLP-1] Covertly improving intelligibility with data-driven adaptations of speech timing

[Quick Read]: This paper tackles the long-standing question of whether globally slowing down speech actually improves intelligibility, a strategy commonly adopted for hard-of-hearing or non-native listeners whose effectiveness has remained unclear. The key findings are: reverse-correlation experiments reveal that the temporal influence of speech rate on a target vowel contrast (e.g., the tense-lax distinction) follows a scissor-like pattern, with opposite effects in early versus late context windows; building on this, the authors develop a data-driven text-to-speech algorithm that applies targeted speech-rate adjustments to specific segments without listeners noticing, significantly improving comprehension of challenging speech features for a wide range of listeners, including second-language (L2) listeners, whereas conventional global slowing actually increases comprehension errors.

Link: https://arxiv.org/abs/2603.30032
Authors: Paige Tuttösí, Angelica Lim, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier
Affiliations: Simon Fraser University; Université Marie et Louis Pasteur; SUPMICROTECH; CNRS; institut FEMTO-ST; France; Department of Linguistics
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Click to view abstract

Abstract:Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advancements in machine-generated speech allowing more precise control of speech rate in order to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (ex. the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech rate structure not only facilitates L2 listeners’ comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech which can be extended to other aspects of speech comprehension and a wide variety of listeners and environments.

[NLP-2] ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection

【Quick Read】: This paper addresses a limitation of existing verifiable claim detection methods, which rely solely on the claim text itself: judging whether a claim is verifiable benefits from knowing which entities and events it mentions and whether relevant evidence exists to support verification. The key to the proposed Context-Driven Claim Detection (ContextClaim) framework is advancing information retrieval to the detection stage: entity mentions are extracted from the claim, relevant context is retrieved from structured knowledge sources such as Wikipedia, and large language models produce concise contextual summaries for the downstream classifier, strengthening the model's ability to judge claim verifiability.

Link: https://arxiv.org/abs/2603.30025
Authors: Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga
Affiliations: Queen Mary University of London
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely on the claim text itself. This is a notable limitation for verifiable claim detection in particular, where determining whether a claim is checkable may benefit from knowing what entities and events it refers to and whether relevant information exists to support verification. Inspired by the established role of evidence retrieval in later-stage claim verification, we propose Context-Driven Claim Detection (ContextClaim), a paradigm that advances retrieval to the detection stage. ContextClaim extracts entity mentions from the input claim, retrieves relevant information from Wikipedia as a structured knowledge source, and employs large language models to produce concise contextual summaries for downstream classification. We evaluate ContextClaim on two datasets covering different topics and text genres, the CheckThat! 2022 COVID-19 Twitter dataset and the PoliClaim political debate dataset, across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings. Results show that context augmentation can improve verifiable claim detection, although its effectiveness varies across domains, model architectures, and learning settings. Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.

[NLP-3] Tracking Equivalent Mechanistic Interpretations Across Neural Networks ICLR2026

【Quick Read】: This paper addresses the scaling and generalization challenges of mechanistic interpretability (MI), whose core problems are the lack of a precise notion of a valid interpretation and the often ad hoc, scenario-specific process of generating interpretations. The key is proposing and formalizing interpretive equivalence: two models' interpretations are considered equivalent if they share the same possible algorithmic implementations, without requiring an explicit description of the interpretation itself. Based on this principle, the authors design an algorithm for estimating interpretive equivalence and validate it through case studies on Transformer models; they further introduce necessary and sufficient conditions based on representation similarity, establishing theoretical links between a model's algorithmic interpretations, circuits, and representations, and laying a foundation for more rigorous MI evaluation and automated, generalizable interpretation discovery.

Link: https://arxiv.org/abs/2603.30002
Authors: Alan Sun, Mariya Toneva
Affiliations: Carnegie Mellon University; Max Planck Institute for Software Systems
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 32 pages, 5 figures, ICLR 2026

Abstract:Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model’s decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and case study its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models’ representation similarity. We provide guarantees that simultaneously relate a model’s algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.

[NLP-4] Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives

【Quick Read】: This paper addresses the challenge machines face in analogical reasoning over narrative structure: improving models' ability to map deep structure across stories and thereby strengthen analogical reasoning. Existing approaches are limited by cognitive engines that assume pre-extracted entities or by LLMs that are sensitive to prompt format, making abstract relations in narratives hard to capture. The key is a modular framework, YARN (Yielding Abstractions for Reasoning in Narratives), which uses LLMs to decompose narratives into units and generate four levels of abstraction (covering semantics and plot roles), after which a mapping component aligns elements across stories to perform analogical reasoning. This design enables systematic analysis of each component's contribution and significantly outperforms end-to-end LLM baselines.

Link: https://arxiv.org/abs/2603.29997
Authors: Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis, Frank van Harmelen, Filip Ilievski
Affiliations: Vrije Universiteit Amsterdam; University of Southern California
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Analogical reasoning is a key driver of human generalization in problem-solving and argumentation. Yet, analogies between narrative structures remain challenging for machines. Cognitive engines for structural mapping are not directly applicable, as they assume pre-extracted entities, whereas LLMs’ performance is sensitive to prompt format and the degree of surface similarity between narratives. This gap motivates a key question: What is the impact of enhancing structural mapping with LLM-derived abstractions on their analogical reasoning ability in narratives? To that end, we propose a modular framework named YARN (Yielding Abstractions for Reasoning in Narratives), which uses LLMs to decompose narratives into units, abstract these units, and then passes them to a mapping component that aligns elements across stories to perform analogical reasoning. We define and operationalize four levels of abstraction that capture both the general meaning of units and their roles in the story, grounded in prior work on framing. Our experiments reveal that abstractions consistently improve model performance, resulting in competitive or better performance than end-to-end LLM baselines. Closer error analysis reveals the remaining challenges in abstraction at the right level, in incorporating implicit causality, and an emerging categorization of analogical patterns in narratives. YARN enables systematic variation of experimental settings to analyze component contributions, and to support future work, we make the code for YARN openly available.

[NLP-5] Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

【Quick Read】: This paper addresses how to identify and understand pivotal collaborative moments during complex team problem solving, especially in demanding settings such as medical diagnosis, where traditional methods struggle to capture the dynamic alignment of team members' cognitive and emotional states. The key is fusing physiological synchrony with semantic features of dialogue: analyzing the physiological signals and utterances of four medical dyads diagnosing a virtual patient in an intelligent tutoring system, the study finds that semantic shifts correlate with transient peaks in physiological synchrony. Using cosine similarity of sentence embeddings to quantify semantic similarity, it further shows that high physiological synchrony tends to accompany lower semantic similarity, suggesting such moments involve exploratory, varied language use. Qualitative analysis confirms these synchrony peaks as "pivotal moments": successful teams synchronize during shared discovery, while unsuccessful teams synchronize during shared uncertainty, offering an interpretable multimodal account of collaboration for human-centered AI.

Link: https://arxiv.org/abs/2603.29950
Authors: Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted as short paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract:Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.
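The semantic-similarity signal above is the standard cosine similarity between sentence-embedding vectors. A minimal sketch with toy vectors (the actual embedding model used in the paper is not specified here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": b is a scaled copy of a (same direction), c is orthogonal to a
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
c = [3.0, 0.0, -1.0]

sim_same = cosine_similarity(a, b)  # parallel vectors -> 1.0
sim_orth = cosine_similarity(a, c)  # a.c = 3 + 0 - 3 = 0 -> 0.0
```

Lower values between consecutive utterances would indicate the more varied, exploratory language the study associates with synchrony peaks.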

[NLP-6] Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

【Quick Read】: This paper addresses the inefficiency of automated radiology report summarization caused by visual noise and underperforming multimodal models, particularly the performance bottleneck in the FINDINGS-to-IMPRESSION transformation. The key is abandoning the assumption that more visual input is always better in favor of selectively focusing on pathology-relevant visual regions: a multi-stage pipeline named ViTAS combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization, achieving more accurate and clinically meaningful summaries than existing text-only baselines and mainstream multimodal models.

Link: https://arxiv.org/abs/2603.29901
Authors: Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Affiliations: Bangladesh University of Engineering and Technology; BRAC University; Curtin University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS → IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS, Visual-Text Attention Summarizer, a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.

[NLP-7] FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish

【Quick Read】: This paper addresses the lack of a public benchmark for Northern Kurdish (ISO 639-3: KMR) in automatic speech recognition (ASR) and speech-to-text translation (S2TT), which has limited model evaluation and research progress. The key is building and releasing FLEURS-Kobani, a Northern Kurdish extension of FLEURS with 5,162 validated utterances (18 hours 24 minutes in total) recorded by 31 native speakers, providing the first public benchmark for ASR, end-to-end speech-to-text translation (E2E S2TT), and speech-to-speech translation (S2ST) in this language. With a two-stage fine-tuning strategy (Common Voice, then FLEURS-Kobani) on Whisper v3-large, the best ASR performance reaches WER 28.11 and CER 9.84 on the test set, and E2E S2TT (KMR to EN) reaches 8.68 BLEU, advancing speech technology research for this under-resourced Kurdish variety.

Link: https://arxiv.org/abs/2603.29892
Authors: Daban Q. Jaff, Mohammad Mohammadamini
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. The FLEURS-Kobani dataset consists of 5,162 validated utterances, totaling 18 hours and 24 minutes. The data were recorded by 31 native speakers. It extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper v3-large for ASR and E2E S2TT. A two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on test). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on test; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluation of ASR, S2TT and S2ST tasks. The dataset is publicly released for research use under a CC BY 4.0 license.
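The WER figures reported above are edit-distance-based: substitutions, deletions, and insertions over word sequences, normalized by reference length (CER is the same computation over characters). A minimal sketch, not the paper's evaluation code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion over four reference words, giving 0.5.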

[NLP-8] Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports LREC2026

【Quick Read】: This paper addresses whether Environmental, Social, and Governance (ESG) reports are readable enough for non-expert audiences. The key is extending a sentence-level readability dataset of German ESG reports with crowdsourced annotations and systematically evaluating various readability scoring methods (including LLM prompting and fine-tuned Transformer models) on how well they predict human readability judgments. Results show that while LLM prompting has potential for distinguishing clear from hard-to-read sentences, a small fine-tuned Transformer fits human readability rankings most closely with the lowest prediction error, making it the current best-performing approach.

Link: https://arxiv.org/abs/2603.29861
Authors: Benjamin Josef Schüßler, Jakob Prange
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: accepted to NLP4Ecology workshop at LREC 2026

Abstract:With the ever-growing urgency of sustainability in the economy and society, and the massive stream of information that comes with it, consumers need reliable access to that information. To address this need, companies began publishing so called Environmental, Social, and Governance (ESG) reports, both voluntarily and forced by law. To serve the public, these reports must be addressed not only to financial experts but also to non-expert audiences. But are they written clearly enough? In this work, we extend an existing sentence-level dataset of German ESG reports with crowdsourced readability annotations. We find that, in general, native speakers perceive sentences in ESG reports as easy to read, but also that readability is subjective. We apply various readability scoring methods and evaluate them regarding their prediction error and correlation with human rankings. Our analysis shows that, while LLM prompting has potential for distinguishing clear from hard-to-read sentences, a small finetuned transformer predicts human readability with the lowest error. Averaging predictions of multiple models can slightly improve the performance at the cost of slower inference.
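Classical surface-based formulas are one family of readability scores of the kind compared above; whether the paper uses this particular one is not stated here. As an illustrative sketch, Amstad's German adaptation of the Flesch Reading Ease score (syllable and sentence counts are supplied as precomputed inputs, since German syllabification is itself nontrivial):

```python
def flesch_amstad(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Amstad's German Flesch Reading Ease (higher = easier):
    180 - (avg words per sentence) - 58.5 * (avg syllables per word)."""
    asl = total_words / total_sentences   # average sentence length
    asw = total_syllables / total_words   # average syllables per word
    return 180.0 - asl - 58.5 * asw

# Toy document: 100 words, 10 sentences, 150 syllables
score = flesch_amstad(100, 10, 150)  # 180 - 10 - 58.5 * 1.5 = 82.25
```

Scores above roughly 60 are conventionally read as "easy"; dense ESG prose with long sentences and polysyllabic compounds scores far lower.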

[NLP-9] SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models

【Quick Read】: This paper addresses the challenge of strategic communication for large language models (LLMs) in multi-agent settings: conveying intent to collaborators while preventing adversaries from inferring sensitive information. Existing benchmarks focus on general capabilities such as reasoning, factual knowledge, or instruction following, and lack direct evaluation of selective information sharing under asymmetric information. The authors therefore propose SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), whose key design is a two-agent simulation: an ally who knows the secret must identify the message's intent, while an uninformed chameleon agent tries to infer the secret from the message, quantifying utility and leakage respectively. The framework reveals significant shortcomings in how current LLMs balance informativeness and secrecy, with humans far outperforming models, underscoring strategic communication as a key open challenge for modern language models.

Link: https://arxiv.org/abs/2603.29846
Authors: Adar Avsian, Larry Heck
Affiliations: Georgia Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.
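The two metrics can be illustrated with a toy scoring function. In the benchmark the guesses come from simulated LLM agents reading the generated message; here they are plain lists, so this is a hypothetical stand-in for the scoring step only, not the benchmark's code:

```python
def sneak_scores(ally_guesses, chameleon_guesses, secret):
    """Toy SNEAK-style scoring:
    utility  = fraction of ally guesses that recover the intended secret
    leakage  = fraction of adversary (chameleon) guesses that recover it.
    A good message has high utility and low leakage."""
    utility = sum(g == secret for g in ally_guesses) / len(ally_guesses)
    leakage = sum(g == secret for g in chameleon_guesses) / len(chameleon_guesses)
    return utility, leakage

# The informed ally mostly decodes the message; the adversary rarely does
u, l = sneak_scores(["owl", "owl", "cat", "owl"], ["owl", "dog"], "owl")
```

A trade-off curve over many messages would plot these two quantities against each other, which is what the benchmark analyzes.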

[NLP-10] Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis

【Quick Read】: This paper addresses the fact that high-throughput characterization in scientific discovery is hindered by proprietary graphical user interfaces (GUIs) and the limited generalizability of existing API-based automation. The key is Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm, operating instruments through the same interfaces as human experts, with a skill-centric framework that integrates Type-1 (GUI operation) and Type-2 (data analysis) skills into end-to-end workflows linking physical sample handling with scientific interpretation. The system demonstrates broad applicability across ten categories of precision instruments and diverse workflows, supporting modalities such as FTIR, NMR, AFM, and TGA, providing an extensible foundation for autonomous laboratories and a path toward evolving laboratory intelligence through reusable operational and analytical skills.

Link: https://arxiv.org/abs/2603.29828
Authors: Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, Wanli Ouyang
Affiliations: The Chinese University of Hong Kong; Dalian University of Technology; Shenzhen Laboratory
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages

Abstract:Scientific discovery increasingly depends on high-throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API-based systems. We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts. Its skill-centric framework integrates Type-1 (GUI operation) and Type-2 (data analysis) skills into end-to-end workflows, connecting physical sample handling with scientific interpretation. Owl-AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl-AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code is available at this https URL.

[NLP-11] ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian

【Quick Read】: This paper addresses the lack of a high-quality, multi-domain, publicly available benchmark for named entity recognition and linking (NERL) in historical Italian texts, where existing resources fall short in cross-period, cross-topic annotation consistency and knowledge-graph mapping, limiting model generalization on historical texts. The key is ENEIDE, a silver-standard dataset of 2,111 historical documents with over 8,000 entity annotations, semi-automatically extracted from two scholarly digital editions (Digital Zibaldone and Aldo Moro Digitale) and manually verified, covering persons, locations, organizations, and literary works, all linked to Wikidata identifiers (including NIL entities that cannot be mapped). Quality control and annotation-enhancement procedures ensure precision and diversity, and the provided train/dev/test splits support temporal entity disambiguation and cross-domain evaluation, substantially improving the reproducibility and difficulty of NERL for historical language processing.

Link: https://arxiv.org/abs/2603.29801
Authors: Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi, Mehwish Alam
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798–1837), and Aldo Moro Digitale, the complete works of the Italian politician Aldo Moro (1916–1978). Annotations cover multiple entity types (person, location, organization, literary work) linked to Wikidata identifiers, including NIL entities that cannot be mapped to the knowledge graph. To the best of our knowledge, ENEIDE represents the first multi-domain, publicly available NERL dataset for historical Italian with training, development, and test splits. We present a methodology for semi-automatic annotations extraction from manually curated scholarly digital editions, including quality control and annotation enhancement procedures. Baseline experiments using state-of-the-art models demonstrate the dataset’s challenge for NERL and the gap between zero-shot approaches and fine-tuned models. The dataset’s diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation. ENEIDE is released under a CC BY-NC-SA 4.0 license.

[NLP-12] Reasoning-Driven Synthetic Data Generation and Evaluation

【Quick Read】: This paper addresses the bottleneck in training multimodal AI models caused by scarce or inaccessible data. Traditional approaches rely on manual annotation or limited seed data and suffer from high cost, low efficiency, and poor controllability. The key is Simula, a seedless, reasoning-driven agentic framework that generates large-scale synthetic datasets through an explainable and controllable process and lets users allocate resources over dataset characteristics at a fine granularity, significantly improving generation efficiency and flexibility while maintaining data quality.

Link: https://arxiv.org/abs/2603.29791
Authors: Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous
Affiliations: EPFL; Google; Google DeepMind
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to TMLR 2026, J2C Certification

Abstract:Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution - limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

[NLP-13] Training-Free Dynamic Upcycling of Expert Language Models ICLR2026

【Quick Read】: This paper addresses the high cost, overspecialization, and multitask interference and catastrophic forgetting involved in domain-specializing large language models (LLMs). Existing Mixture of Experts (MoE) approaches can combine dense experts from different domains but still require additional multitask fine-tuning, making it hard to build a unified multi-domain expert efficiently. The key is Dynamic Upcycling MoE (DUME), which uses the closed-form solution of ridge regression to combine already-trained single-domain dense experts into a unified MoE without further optimization, enabling dynamic extension and efficient integration while preserving the original experts' performance. DUME is cost-efficient and scalable, significantly outperforms baselines on causal language modeling and reasoning tasks, and supports subsequent fine-tuning for further gains.

Link: https://arxiv.org/abs/2603.29765
Authors: Eros Fanì, Oğuzhan Ersoy
Affiliations: Gensyn
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at the ICLR 2026 Workshop on Scaling Post-training for LLMs

Abstract:Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model’s original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings. Finally, we also show that the DUME model can be fine-tuned to further improve performance. We show that, in the causal language modeling setting, DUME can retain up to 97.6% of a dense expert model specialized in one particular domain, and that it can also surpass it in the reasoning setting, where it can achieve 102.1% of the dense expert performance. Our code is available at: this http URL.
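DUME's training-free combination rests on the closed-form ridge regression solution w = (XᵀX + λI)⁻¹Xᵀy, which needs no iterative optimization. A minimal two-feature sketch of that closed form (generic ridge regression solved exactly via Cramer's rule; this is not the paper's expert-merging code, just the underlying identity):

```python
def ridge_closed_form_2d(X, y, lam):
    """Closed-form ridge: w = (X^T X + lam * I)^(-1) X^T y,
    for two features, with the 2x2 system solved by Cramer's rule."""
    a11 = sum(x[0] * x[0] for x in X) + lam
    a22 = sum(x[1] * x[1] for x in X) + lam
    a12 = sum(x[0] * x[1] for x in X)          # symmetric off-diagonal of X^T X
    b1 = sum(x[0] * t for x, t in zip(X, y))   # (X^T y)[0]
    b2 = sum(x[1] * t for x, t in zip(X, y))   # (X^T y)[1]
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

# Identity design matrix: lam = 0 recovers y exactly; lam > 0 shrinks toward 0
w_ols = ridge_closed_form_2d([(1.0, 0.0), (0.0, 1.0)], [3.0, 5.0], 0.0)  # (3.0, 5.0)
w_reg = ridge_closed_form_2d([(1.0, 0.0), (0.0, 1.0)], [3.0, 5.0], 1.0)  # (1.5, 2.5)
```

Because the solution is a single linear solve, new experts can be incorporated by updating the normal-equation terms, which is what makes the upcycling "dynamic" and training-free.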

[NLP-14] A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models ICLR2026 ICLR26

【Quick Read】: This paper addresses the opacity of decision making in large vision-language models (LVLMs): it is hard to tell whether performance gains stem from true multimodal fusion or from unimodal priors. To close this attribution gap, the authors propose a framework based on partial information decomposition (PID) that quantifies the information spectrum of LVLM decisions, decomposing decision-relevant information into redundant, unique, and synergistic components for fine-grained analysis of internal information processing. The key is a scalable estimator adapted to modern LVLM outputs and a model-agnostic pipeline that profiles LVLMs along breadth (across models and tasks), depth (layer-wise information dynamics), and time (learning dynamics during training), revealing two task regimes (synergy-driven vs. knowledge-driven) and two stable, contrasting family-level strategies (fusion-centric vs. language-centric), and identifying visual instruction tuning as the key stage where fusion is learned.

Link: https://arxiv.org/abs/2603.29676
Authors: Lixin Xiu, Xufang Luo, Hideki Nakayama
Affiliations: The University of Tokyo; Microsoft Research
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026. Project page: this https URL

Abstract:Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the “information spectrum” of LVLMs – decomposing a model’s decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions – breadth (cross-model cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at this https URL .
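The redundant/unique/synergistic split follows the standard PID identity of Williams & Beer; writing the two sources as a visual input X_v and a textual input X_t and the model decision as Y (notation assumed here, not taken from the paper):

```latex
% Joint mutual information about Y decomposes into four non-negative atoms:
I(Y; X_v, X_t) = \mathrm{Red}(Y; X_v, X_t)
               + \mathrm{Unq}(Y; X_v \setminus X_t)
               + \mathrm{Unq}(Y; X_t \setminus X_v)
               + \mathrm{Syn}(Y; X_v, X_t)

% and each single-source mutual information splits consistently as:
I(Y; X_v) = \mathrm{Red}(Y; X_v, X_t) + \mathrm{Unq}(Y; X_v \setminus X_t)
```

Under this lens, a "language-centric" model concentrates information in Unq(Y; X_t \ X_v), while genuine fusion shows up as a large synergy term.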

[NLP-15] Near-Miss: Latent Policy Failure Detection in Agentic Workflows

【Quick Read】: This paper addresses a blind spot in evaluating policy compliance of LLM-based agents for business process automation: existing methods only compare the final system state against a predefined ground truth and thus miss "near-misses" or "latent failures", cases where the agent reaches the correct outcome while bypassing required policy checks. The key is a novel metric that, building on the ToolGuard framework which converts natural-language policies into executable guard code, analyzes agent conversation traces to determine whether tool-calling decisions were sufficiently informed, identifying latent violations that conventional evaluation misses. Empirically, even when final outcomes match expectations, 8-17% of trajectories involving state-mutating tool calls contain such latent failures, highlighting the need to evaluate the decision process rather than only the final outcome.

Link: https://arxiv.org/abs/2603.29665
Authors: Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor
Affiliations: IBM Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as *near-misses* or *latent failures*. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent's tool-calling decisions were sufficiently informed. We evaluate our approach on the τ²-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.

[NLP-16] Learning Diagnostic Reasoning for Decision Support in Toxicology

【Quick Read】: This paper addresses clinical decision support for acute poly-substance intoxication: identifying multiple co-ingested substances under incomplete information, nonspecific symptoms, and large amounts of unstructured pre-clinical narrative (paramedic scene descriptions and unreliable patient self-reports). The key is DeToxR (Decision-support for Toxicology with Reasoning), the first application of reinforcement learning (RL) to emergency toxicology: a data-fusion engine built on a large language model fine-tuned with Group Relative Policy Optimization (GRPO) performs multi-label prediction over 14 substance classes. The central innovation is optimizing the model's reasoning directly with a clinical performance reward, using a multi-label agreement metric as the reward signal that explicitly penalizes missing co-ingested substances and hallucinating absent poisons; the resulting model significantly outperforms unadapted base LLMs and supervised baselines and, in clinical validation, surpasses an expert toxicologist (Micro-F1: 0.644 vs. 0.473).

Link: https://arxiv.org/abs/2603.29608
Authors: Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model’s reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.
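A multi-label agreement reward of the kind described can be sketched as a set-based F1 over substance classes: missed co-ingestants lower recall and hallucinated poisons lower precision, so both failure modes reduce the reward. This is an illustrative stand-in with made-up class names, not DeToxR's exact reward function:

```python
def multilabel_f1_reward(predicted, gold):
    """Set-based F1 between predicted and gold label sets.
    Missing a gold label hurts recall; predicting an absent one hurts precision."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical substance classes for illustration
perfect = multilabel_f1_reward({"opioids", "benzodiazepines"},
                               {"opioids", "benzodiazepines"})   # 1.0
partial = multilabel_f1_reward({"opioids", "ethanol"},           # one hit, one hallucination
                               {"opioids", "benzodiazepines"})   # 0.5
```

Using this scalar as the GRPO reward lets the policy gradient push directly against both under- and over-prediction.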

[NLP-17] When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

【Quick Read】: This paper targets the unreliability of Large Language Model (LLM) outputs in automated grading, framing the core challenge as identifying when an LLM grade can be trusted. The key contribution is proposing and evaluating three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) for predicting grading accuracy, enabling selective automation: high-confidence grades are processed automatically while low-confidence cases are flagged for human review. Experiments show that self-reported confidence achieves the best calibration across all conditions (average Expected Calibration Error, ECE, of 0.166) and clearly beats self-consistency (ECE 0.229) despite the latter requiring five times the inference cost, indicating that simply asking the LLM to assess its own confidence is an efficient and practical strategy.

Link: https://arxiv.org/abs/2603.29559
Authors: Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self-reported), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a "confidence floor" that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available at this https URL.
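Expected calibration error (ECE), the headline metric above, bins predictions by confidence and averages the gap between per-bin confidence and per-bin accuracy. A minimal sketch (10 equal-width bins and frequency weighting are standard choices, not details taken from the paper):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the frequency-weighted
    mean absolute gap between per-bin accuracy and per-bin mean confidence."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 -> last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Ten grades at 80% confidence with 8 correct: well calibrated, ECE ~ 0.
print(expected_calibration_error([0.8] * 10, [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))
```

The same routine, applied to each confidence method's scores, is how numbers like "avg ECE 0.166 vs 0.229" are typically produced.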

[NLP-18] FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration

【Quick Read】: This paper addresses the homogeneity and limited diversity of ideas produced by current generative AI systems for Scientific Idea Generation (SIG), which stem from the static retrieve-then-generate paradigm. The key to the proposed FlowPIE framework is tightly coupling retrieval and generation by modeling literature exploration and idea generation as a co-evolving process: a flow-guided Monte Carlo Tree Search (MCTS) inspired by GFlowNets expands literature trajectories, using the quality of current ideas as assessed by an LLM-based Generative Reward Model (GRM) as a supervision signal for adaptive retrieval and for constructing a high-quality, diverse initial population; idea generation is then treated as a test-time evolutionary process combining an isolation-island paradigm, crossover and mutation operators, and GRM-based fitness computation to fuse cross-domain knowledge and mitigate information cocoons. Experiments show FlowPIE consistently beats strong LLM-based and agent-based frameworks on novelty, feasibility, and diversity, while supporting reward scaling at test time.

Link: https://arxiv.org/abs/2603.29557
Authors: Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad-Rokny, Hui Li, Yuan Lin, Min Yang
Institutions: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Dalian University of Technology; UNSW Sydney; Shenzhen University of Advanced Technology; Xiamen University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages, 11 figures, 15 tables

Click to view abstract

Abstract:Scientific idea generation (SIG) is critical to AI-driven autonomous research, yet existing approaches are often constrained by a static retrieval-then-generation paradigm, leading to homogeneous and insufficiently divergent ideas. In this work, we propose FlowPIE, a tightly coupled retrieval-generation framework that treats literature exploration and idea generation as a co-evolving process. FlowPIE expands literature trajectories via a flow-guided Monte Carlo Tree Search (MCTS) inspired by GFlowNets, using the quality of current ideas assessed by an LLM-based generative reward model (GRM) as a supervised signal to guide adaptive retrieval and construct a diverse, high-quality initial population. Based on this population, FlowPIE models idea generation as a test-time idea evolution process, applying selection, crossover, and mutation with the isolation island paradigm and GRM-based fitness computation to incorporate cross-domain knowledge. It effectively mitigates the information cocoons arising from over-reliance on parametric knowledge and static literature. Extensive evaluations demonstrate that FlowPIE consistently produces ideas with higher novelty, feasibility and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.

[NLP-19] Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

【Quick Read】: This paper tackles key open questions about multilingual acquisition, such as whether learning multiple languages causes delays and whether different input structures affect outcomes, questions that are hard to settle empirically because children cannot be randomly assigned to be multilingual and cross-language data are difficult to match. The key to the solution is using language model training to simulate highly controlled multilingual exposure conditions: matched 100M-word monolingual and bilingual datasets are built from synthetic data and machine translation, and GPT-2 models trained under different input regimes are evaluated on perplexity, grammaticality, and semantic knowledge. The results show bilingual models perform on par with monolingual models in one language while also performing strongly in the second, suggesting no strong differences between bilingual exposure regimes and no in-principle challenge to statistical learners from bilingual input.

Link: https://arxiv.org/abs/2603.29552
Authors: Linda Zeng, Steven Y. Feng, Michael C. Frank
Institutions: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code and data at this https URL

Click to view abstract

Abstract:Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.

[NLP-20] Can LLM Agents Identify Spoken Dialects like a Linguist? LREC2026

【Quick Read】: This paper addresses the performance bottleneck in audio dialect classification for low-resource settings, such as Swiss German, caused by the scarcity of labeled dialectal speech. The key to the solution is using large language models (LLMs) as agents that combine phonetic transcriptions produced by automatic speech recognition (ASR) systems with external linguistic resources (dialect feature maps, vowel history, and rules), strengthening the models' grasp of dialectal differences. Experiments show that LLM classification improves markedly when structured linguistic information is provided, confirming that injecting prior linguistic knowledge helps dialect identification in low-resource scenarios.

Link: https://arxiv.org/abs/2603.29541
Authors: Tobias Bystrich, Lukas Hamm, Maria Hassan, Lea Fischbach, Lucie Flek, Akbar Karimi
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to DialRes Workshop @ LREC 2026

Click to view abstract

Abstract:Due to the scarcity of labeled dialectal speech, audio dialect classification is a challenging task for most languages, including Swiss German. In this work, we explore the ability of large language models (LLMs) as agents in understanding the dialects and whether they can show comparable performance to models such as HuBERT in dialect classification. In addition, we provide an LLM baseline and a human linguist one. Our approach uses phonetic transcriptions produced by ASR systems and combines them with linguistic resources such as dialect feature maps, vowel history, and rules. Our findings indicate that, when linguistic information is provided, the LLM predictions improve. The human baseline shows that automatically generated transcriptions can be beneficial for such classifications, but also presents opportunities for improvement.

[NLP-21] Baby Scale: Investigating Models Trained on Individual Children's Language Input

【Quick Read】: This paper addresses the "data gap" between generative language models and human children, i.e., models require orders of magnitude more training data than children receive while acquiring language. The key to the solution is systematically evaluating language models trained on real child-scale data (transcripts from the BabyView dataset), finding that performance depends not only on data quantity but also on distributional and interactional properties of the input; the study further reveals a significant correlation between model likelihoods for individual words and children's acquisition of those words, indicating that high-quality child-directed input is central both to building better small-scale language models and to understanding human language acquisition.

Link: https://arxiv.org/abs/2603.29522
Authors: Steven Y. Feng, Alvin W.M. Tan, Michael C. Frank
Institutions: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code and data at this https URL

Click to view abstract

Abstract:Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this “data gap” requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children’s natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children’s experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children’s learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.

[NLP-22] Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora, and metrics

【Quick Read】: This paper studies how Meaning Representations (MRs) affect the quality of Natural Language Generation (NLG) in dialogue systems, focusing on whether adding a task demonstrator improves the diversity and accuracy of generated sentences. The key to the solution is feeding the generator an MR-sentence pair drawn from the original dataset (the task demonstrator) at both training and inference time, enriching the input and strengthening the model's grasp of semantic content and communicative intention (Dialogue Act, DA). Experiments show this enrichment is especially effective for complex tasks, small datasets, and settings with high MR variability, and that it also generalizes well in zero-shot settings; moreover, semantic metrics trained on human ratings capture subtle semantic omissions in generation quality more accurately than embedding-based lexical metrics.

Link: https://arxiv.org/abs/2603.29518
Authors: Alain Vázquez, Maria Inés Torres
Institutions: University of the Basque Country (UPV/EHU)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via DAs and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator enhances the generations of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models present fast adaptability to different tasks and robustness at semantic and communicative intention levels.

[NLP-23] LLM Probe: Evaluating LLMs for Low-Resource Languages

【Quick Read】: This paper addresses the inadequate evaluation of large language models (LLMs) on low-resource and morphologically rich languages, where the core obstacles are scarce annotated data and the absence of standardized evaluation frameworks. The key to the solution is LLM Probe, a lexicon-based framework that systematically assesses LLMs' language understanding along four dimensions: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. The framework is accompanied by a manually annotated bilingual-lexicon benchmark (using a low-resource Semitic language as a case study) with high inter-annotator agreement on part-of-speech tags, grammatical gender, and morphosyntactic features, enabling reproducible and comparable testing. Results reveal how different architectures (causal vs. sequence-to-sequence models) diverge across linguistic tasks, underscoring the need for linguistically grounded evaluation to build more inclusive multilingual language technology.

Link: https://arxiv.org/abs/2603.29517
Authors: Hailay Kidu Teklehaymanot, Gebrearegawi Gebremariam, Wolfgang Nejdl
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 6 tables

Click to view abstract

Abstract:Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across various linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.

[NLP-24] Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models LREC

【Quick Read】: This paper tackles the accuracy-versus-scalability problem of privacy assessment for textual data in Privacy-Preserving Natural Language Processing (PPNLP). Prior work shows large language models (LLMs) can serve as reliable privacy evaluators, but their computational cost and impracticality for processing sensitive data limit real-world deployment. The key to the solution is distilling the privacy assessment capability of Mistral Large 3 (675B parameters) into lightweight encoder models with as few as 150M parameters, trained on a large-scale dataset of privacy-annotated texts spanning 10 diverse domains. The distilled classifiers preserve strong agreement with human annotations at a fraction of the compute, and prove useful in practice as an evaluation metric for de-identification systems.

Link: https://arxiv.org/abs/2603.29497
Authors: Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the LREC CALD-pseudo 2026 Workshop

Click to view abstract

Abstract:Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing. Recent work has shown that large language models (LLMs) can serve as reliable privacy evaluators, achieving strong agreement with human judgments; however, their computational cost and impracticality for processing sensitive data at scale limit real-world deployment. We address this gap by distilling the privacy assessment capabilities of Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. Leveraging a large-scale dataset of privacy-annotated texts spanning 10 diverse domains, we train efficient classifiers that preserve strong agreement with human annotations while dramatically reducing computational requirements. We validate our approach on human-annotated test data and demonstrate its practical utility as an evaluation metric for de-identification systems.

[NLP-25] MemFactory: Unified Inference Training Framework for Agent Memory

【Quick Read】: This paper addresses the fragmentation and inefficiency of current memory-augmented large language model implementations, which lack a unified infrastructure: optimizing memory operations (extraction, updating, retrieval) is hard to standardize, integrate, and evaluate. The key to the solution is MemFactory, the first unified training and inference framework designed for memory-augmented agents: it abstracts the memory lifecycle into atomic, plug-and-play components for "Lego-like" flexible construction, and natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards, yielding clear gains in model performance and development efficiency.

Link: https://arxiv.org/abs/2603.29493
Authors: Ziliang Guo, Ziheng Li, Zhiyu Li
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, Code: this https URL

Click to view abstract

Abstract:Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a “Lego-like” architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.
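GRPO, which MemFactory uses to fine-tune memory policies, scores each sampled rollout relative to its group: the advantage is the reward standardized against the mean and spread of all rollouts for the same prompt. A minimal sketch of that normalization step only (the surrounding policy-gradient machinery is omitted, and whether population or sample standard deviation is used varies by implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each rollout's reward against
    the mean/std of its own group (the rollouts for one prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    if std == 0:
        return [0.0] * len(rewards)   # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]

# Four rollouts for one prompt: the best gets a positive advantage,
# the worst a negative one, and the group advantages sum to ~0.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(adv)
```

Because the baseline comes from the group itself, no separate value network is needed, which is part of GRPO's appeal for pipelines like this one.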

[NLP-26] Calibrated Confidence Expression for Radiology Report Generation

【Quick Read】: This paper addresses a key obstacle to the safe deployment of Large Vision-Language Models (LVLMs) for radiology report generation: providing clinically interpretable confidence indicators alongside accurate predictions, so radiologists can selectively review AI output and limit the impact of hallucinated findings on clinical decisions. Existing methods tend to be overconfident, and calibration in multimodal settings is underexplored. The core of the solution is ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning fine-tuning framework that optimizes an LVLM with GRPO to emit calibrated verbalized confidence estimates, at both report level and sentence level, alongside the generated report. Its reward is based on the logarithmic scoring rule, which incentivizes truthful self-assessment, penalizes miscalibration, and guarantees optimal calibration under reward maximization. Experiments show ConRad substantially improves calibration over competing methods, and a clinical evaluation finds its report-level scores well aligned with clinicians' judgment, supporting safer integration of AI-assisted report generation.

Link: https://arxiv.org/abs/2603.29492
Authors: David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler, Rickmer Braren, Nassir Navab, Matthias Keicher
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad’s report level scores are well aligned with clinicians’ judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI-assistance for report generation.
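The logarithmic scoring rule used as ConRad's reward is "proper": expected reward is maximized by reporting the true probability of being correct, which is why reward maximization yields calibration. A small numeric check of that property (toy values, not the paper's implementation):

```python
import math

def log_score_reward(confidence: float, correct: bool) -> float:
    """Proper log scoring rule: reward log(c) for a correct claim,
    log(1 - c) for an incorrect one. Miscalibration is penalized."""
    eps = 1e-12  # clip to avoid log(0)
    c = min(max(confidence, eps), 1 - eps)
    return math.log(c) if correct else math.log(1 - c)

def expected_reward(reported_c: float, true_p: float) -> float:
    """Expected reward when the claim is actually correct with prob true_p."""
    return (true_p * log_score_reward(reported_c, True)
            + (1 - true_p) * log_score_reward(reported_c, False))

# For a claim that is correct 70% of the time, reporting 0.7 beats
# both under-confidence (0.5) and over-confidence (0.9):
print({c: round(expected_reward(c, 0.7), 4) for c in (0.5, 0.7, 0.9)})
```

This is the sense in which the abstract can claim "optimal calibration under reward maximization": any reported confidence other than the true probability strictly lowers expected reward.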

[NLP-27] M-MiniGPT4: Multilingual VLLM Alignment via Translated Data ACL2026

【Quick Read】: This paper addresses weak multilingual vision-language understanding (VLU), particularly in low-resource languages. The key to the solution is twofold: first, mixing native multilingual data with translated data to strengthen multilingual capability; second, introducing a multilingual alignment training stage that uses parallel text corpora to further improve cross-lingual semantic consistency. Experiments show M-MiniGPT4 reaches 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class.

Link: https://arxiv.org/abs/2603.29467
Authors: Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny
Institutions: MBZUAI; KAUST
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, ACL 2026, Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)

Click to view abstract

Abstract:This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.

[NLP-28] An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms

【Quick Read】: This paper addresses the quantification of predictive uncertainty in large language models (LLMs), where existing methods are either computationally intractable or require typically unavailable training data. The key to the solution lies in two approximations: a first-order Taylor expansion that expresses uncertainty in terms of the prediction gradient and the parameter covariance, and an isotropy assumption on that covariance. Together they yield epistemic uncertainty as the squared gradient norm and aleatoric uncertainty as the Bernoulli variance of the point prediction, obtained from a single forward-backward pass through an unmodified pretrained model. The approach avoids the structured distortions introduced when covariance is built from non-training data, is supported by theoretical results on the spectral properties of large networks, and validation shows strong correspondence with reference Markov Chain Monte Carlo (MCMC) estimates that improves with model size.

Link: https://arxiv.org/abs/2603.29466
Authors: Nils Grünefeld, Jes Frellsen, Christian Hardmeier
Institutions: IT University of Copenhagen; Pioneer Centre for Artificial Intelligence; Technical University of Denmark
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Existing methods for quantifying predictive uncertainty in neural networks are either computationally intractable for large language models or require access to training data that is typically unavailable. We derive a lightweight alternative through two approximations: a first-order Taylor expansion that expresses uncertainty in terms of the gradient of the prediction and the parameter covariance, and an isotropy assumption on the parameter covariance. Together, these yield epistemic uncertainty as the squared gradient norm and aleatoric uncertainty as the Bernoulli variance of the point prediction, from a single forward-backward pass through an unmodified pretrained model. We justify the isotropy assumption by showing that covariance estimates built from non-training data introduce structured distortions that isotropic covariance avoids, and that theoretical results on the spectral properties of large networks support the approximation at scale. Validation against reference Markov Chain Monte Carlo estimates on synthetic problems shows strong correspondence that improves with model size. We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate achieves the highest mean AUROC on TruthfulQA, where questions involve genuine conflict between plausible answers, but falls to near chance on TriviaQA’s factual recall, suggesting that parameter-level uncertainty captures a fundamentally different signal than self-assessment methods.
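Under the paper's two approximations, a single forward-backward pass yields both quantities: epistemic uncertainty as (a scalar times) the squared gradient norm of the prediction with respect to the parameters, and aleatoric uncertainty as the Bernoulli variance p(1-p). The sketch below illustrates this on a single logistic unit; the covariance scale sigma2 = 1 and the toy weights are illustrative choices, not values from the paper:

```python
import math

def uncertainties(w, x, sigma2=1.0):
    """One forward-backward pass on a logistic unit.

    epistemic ≈ sigma2 * ||grad_w p||^2  (first-order Taylor expansion plus
    an isotropic parameter covariance sigma2 * I; sigma2=1 is arbitrary here),
    aleatoric  = p * (1 - p), the Bernoulli variance of the point prediction.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))        # forward pass
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [p * (1.0 - p) * xi for xi in x]         # backward: dp/dw for sigmoid
    epistemic = sigma2 * sum(g * g for g in grad)   # squared gradient norm
    aleatoric = p * (1.0 - p)
    return p, epistemic, aleatoric

p, epi, ale = uncertainties([0.5, -0.25, 1.0], [1.0, 2.0, 0.5])
print(p, epi, ale)
```

In an LLM the same recipe applies per predicted token, with autograd supplying the gradient; the point of the isotropy assumption is that no covariance matrix ever needs to be materialized.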

[NLP-29] Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods

【Quick Read】: This paper examines the new challenge that generative AI poses to authorship verification (AV): large language models (LLMs) could be used to produce convincing impersonation texts that evade existing forensic AV systems. Using GPT-4o as the adversary model, the study generates impersonation texts under several prompting conditions across three genres (emails, text messages, and social media posts) and evaluates them against non-neural methods (n-gram tracing, the Ranking-Based Impostors Method, LambdaG) and neural methods (AdHominem, LUAR, STAR). The key finding is that although LLMs produce fluent text, their inherently higher lexical diversity and entropy prevent them from replicating an author's individuality, so current mainstream AV systems still reject such impersonations reliably, in some cases with even higher accuracy than for genuine negative samples, revealing an unexpectedly robust defense against entry-level impersonation attacks.

Link: https://arxiv.org/abs/2603.29454
Authors: Baoyi Zeng, Andrea Nini
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Authorship verification (AV), the task of determining whether a questioned text was written by a specific individual, is a critical part of forensic linguistics. While manual authorial impersonation by perpetrators has long been a recognized threat in historical forensic cases, recent advances in large language models (LLMs) raise new challenges, as adversaries may exploit these tools to impersonate another’s writing. This study investigates whether prompted LLMs can generate convincing authorial impersonations and whether such outputs can evade existing forensic AV systems. Using GPT-4o as the adversary model, we generated impersonation texts under four prompting conditions across three genres: emails, text messages, and social media posts. We then evaluated these outputs against both non-neural AV methods (n-gram tracing, Ranking-Based Impostors Method, LambdaG) and neural approaches (AdHominem, LUAR, STAR) within a likelihood-ratio framework. Results show that LLM-generated texts failed to sufficiently replicate authorial individuality to bypass established AV systems. We also observed that some methods achieved even higher accuracy when rejecting impersonation texts compared to genuine negative samples. Overall, these findings indicate that, despite the accessibility of LLMs, current AV systems remain robust against entry-level impersonation attempts across multiple genres. Furthermore, we demonstrate that this counter-intuitive resilience stems, at least in part, from the higher lexical diversity and entropy inherent in LLM-generated texts.
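The two statistics the authors credit for this resilience can be computed directly: type-token ratio for lexical diversity and Shannon entropy over the word distribution. A simplified sketch (whitespace tokenization and the example texts are illustrative; the paper's exact preprocessing is not specified here):

```python
import math
from collections import Counter

def lexical_stats(text: str):
    """Type-token ratio and Shannon entropy (bits) of the word distribution."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)
    entropy = -sum((c / len(tokens)) * math.log2(c / len(tokens))
                   for c in counts.values())
    return ttr, entropy

# Repetitive, human-like text messaging scores lower on both measures
# than more varied (LLM-like) phrasing:
low = lexical_stats("ok ok ok see you see you")
high = lexical_stats("sounds good, meet me by the old station at noon")
print(low, high)
```

On this view, an LLM's consistently elevated diversity and entropy act as a stylistic fingerprint of their own, which is what the AV systems appear to latch onto.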

[NLP-30] CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

【Quick Read】: This paper addresses the lack of structured auditing mechanisms for LLM-based conversational mental-health support systems, which leaves users unable to assess the quality and potential risks of the support they receive. The key to the solution is CounselReflect, an end-to-end auditing toolkit whose core innovation is a multi-dimensional, interpretable reporting scheme with session-level summaries, turn-level scores, and evidence-linked excerpts. It integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics from a literature-derived library of 69 metrics plus user-defined custom metrics, operationalized with configurable LLM judges, improving the transparency, usability, and trustworthiness of the auditing process and supporting both real-time and at-scale use.

Link: https://arxiv.org/abs/2603.29429
Authors: Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng, Angel Hsing-Chi Hwang, Adam C. Frank, Ruishan Liu
Institutions: University of Southern California
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Mental-health support is increasingly mediated by conversational systems (e.g., LLM-based tools), but users often lack structured ways to audit the quality and potential risks of the support they receive. We introduce CounselReflect, an end-to-end toolkit for auditing mental-health support dialogues. Rather than producing a single opaque quality score, CounselReflect provides structured, multi-dimensional reports with session-level summaries, turn-level scores, and evidence-linked excerpts to support transparent inspection. The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined custom metrics, operationalized with configurable LLM judges. CounselReflect is available as a web application, browser extension, and command-line interface (CLI), enabling use in real-time settings as well as at scale. Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing. A demo video and full source code are also provided.

[NLP-31] PRISM: PRIor from corpus Statistics for topic Modeling

【Quick Read】: This paper addresses the limited applicability of traditional topic models (such as LDA) in emerging or data-scarce domains, where reliance on external knowledge such as pre-trained word embeddings is impractical. The key to the solution is PRISM, a corpus-intrinsic method that derives a Dirichlet prior parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show improved topic coherence and interpretability, rivaling methods that depend on external knowledge and demonstrating the value of corpus-driven initialization in resource-constrained settings.

Link: https://arxiv.org/abs/2603.29406
Authors: Tal Ishon, Yoav Goldberg, Uri Shaham
Institutions: Bar Ilan University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: this https URL.

[NLP-32] Is my model perplexed for the right reason? Contrasting LLMs' Benchmark Behavior with Token-Level Perplexity

【Quick Read】: This paper addresses a limitation of evaluating large language models (LLMs) purely by task performance: such metrics cannot tell whether correct behavior arises from appropriate underlying mechanisms, risking confirmation bias. The key to the solution is a simple, principled interpretability framework based on token-level perplexity: by comparing perplexity distributions over minimal sentence pairs that differ in only one or a few "pivotal" tokens, it enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques, revealing whether models actually exploit the expected linguistic cues. Experiments show that while linguistically relevant tokens do influence model behavior, they never fully explain the perplexity shifts, indicating that models also rely on non-linguistic heuristics.

Link: https://arxiv.org/abs/2603.29396
Authors: Zoë Prins, Samuele Punzo, Frank Wildenburg, Giovanni Cinà, Sandro Pezzelle
Institutions: University of Amsterdam; Amsterdam University Medical Center; ILLC, University of Amsterdam
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Standard evaluations of Large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs differing in one or a few 'pivotal' tokens, our method enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques. Experiments on controlled linguistic benchmarks with several open-weight LLMs show that, while linguistically important tokens influence model behavior, they never fully explain perplexity shifts, revealing that models rely on heuristics other than the expected linguistic ones.
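The minimal-pair comparison works by reading off per-token surprisal (negative log-probability) at the pivotal position. With a real LLM this comes from one forward pass over each sentence; the bookkeeping can be shown with a toy add-one-smoothed bigram model (the corpus and sentences below are invented for illustration):

```python
import math
from collections import Counter

corpus = ("the dog barks . the dog sleeps . the cat sleeps . "
          "the cat purrs . the dog barks .").split()

vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def token_surprisals(sentence):
    """Per-token surprisal -log2 P(w_i | w_{i-1}), add-one smoothed;
    the sentence-initial token is skipped (no left context)."""
    tokens = sentence.split()
    out = []
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))
        out.append((cur, -math.log2(p)))
    return out

# Minimal pair differing only in the pivotal token "barks" vs "purrs":
# identical surprisal on shared tokens, a shift only at the pivot.
for s in ("the dog barks", "the dog purrs"):
    print(s, token_surprisals(s))
```

Replacing the bigram probability with an LLM's next-token probability gives exactly the kind of token-level perplexity profile the framework contrasts across the pair.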

[NLP-33] Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

[Quick Read]: This paper addresses the lack of realistic patient behavior in current evaluations of medical Large Language Models (LLMs), which typically assume clear, logically consistent patient questions and ignore the vague, contradictory, inaccurate, or resistant utterances common in real consultations — inputs that can elicit unsafe model responses in critical situations. The key contribution is CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with four clinically grounded categories of challenging behavior — information contradiction, factual inaccuracy, self-diagnosis, and care resistance — each with explicit failure criteria. A systematic evaluation of open- and closed-source LLMs reveals consistent, behavior-specific failure patterns, and shows that intervention strategies yield inconsistent improvements and can introduce unnecessary corrections, providing a quantifiable, clinically meaningful framework for improving the safety and robustness of medical LLMs.

Link: https://arxiv.org/abs/2603.29373
Authors: Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu, Yicheng Gao, Chaohao Du, Ruishan Liu
Affiliations: University of Southern California
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.

[NLP-34] Developing a Guideline for the Labovian-Structural Analysis of Oral Narratives in Japanese LREC

[Quick Read]: This paper addresses the limited applicability of existing Labovian narrative-analysis resources to Japanese, since current Labovian datasets exist only for English, which differs markedly from Japanese in grammar and discourse conventions. The key contribution is a set of systematic guidelines for Japanese narrative data that retain the six core Labovian categories and extend the framework with explicit clause-segmentation rules tailored to Japanese constructions, while covering a broader range of clause types and narrative types. The guidelines yield high annotation agreement on clause segmentation (Fleiss' kappa = 0.80) and moderate agreement on two structural classification tasks (Krippendorff's alpha = 0.41 and 0.45), one slightly exceeding prior work despite the use of finer-grained distinctions.

Link: https://arxiv.org/abs/2603.29347
Authors: Amane Watahiki, Tomoki Doi, Akari Kikuchi, Hiroshi Ohata, Yuki I. Nakata, Takuya Niikawa, Taiga Shinozaki, Hitomi Yanaka
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at The Fifteenth biennial Language Resources and Evaluation Conference (LREC) 2026

Click to view abstract

Abstract:Narrative analysis is a cornerstone of qualitative research. One leading approach is the Labovian model, but its application is labor-intensive, requiring a holistic, recursive interpretive process that moves back and forth between individual parts of the transcript and the transcript as a whole. Existing Labovian datasets are available only in English, which differs markedly from Japanese in terms of grammar and discourse conventions. To address this gap, we introduce the first systematic guidelines for Labovian narrative analysis of Japanese narrative data. Our guidelines retain all six Labovian categories and extend the framework by providing explicit rules for clause segmentation tailored to Japanese constructions. In addition, our guidelines cover a broader range of clause types and narrative types. Using these guidelines, annotators achieved high agreement in clause segmentation (Fleiss’ kappa = 0.80) and moderate agreement in two structural classification tasks (Krippendorff’s alpha = 0.41 and 0.45, respectively), one of which is slightly higher than that found in prior work despite the use of finer-grained distinctions. This paper describes the Labovian model, the proposed guidelines, the annotation process, and their utility. It concludes by discussing the challenges encountered during the annotation process and the prospects for developing a larger dataset for structural narrative analysis in Japanese qualitative research.
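The agreement statistic reported above can be computed directly from per-item rating counts. The sketch below is a self-contained implementation of Fleiss' kappa; the tiny rating matrices are invented for illustration, not taken from the paper's annotations.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa from a matrix of per-item category counts.
    ratings[i][j] = number of raters assigning item i to category j;
    every item must be rated by the same number of raters."""
    N = len(ratings)
    n = sum(ratings[0])                      # raters per item
    k = len(ratings[0])
    # observed agreement, averaged over items
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# three annotators labeling two clauses with two categories (toy data)
perfect = fleiss_kappa([[3, 0], [0, 3]])     # full agreement
mixed = fleiss_kappa([[2, 1], [1, 2]])       # 2-vs-1 splits
```

Full agreement gives kappa = 1.0; the 2-vs-1 splits land below chance at -1/3, illustrating how the statistic corrects for marginal category frequencies.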

[NLP-35] L-ReLF: A Framework for Lexical Dataset Creation

[Quick Read]: This paper tackles the knowledge-equity barrier that low-resource languages such as Moroccan Darija face on platforms like Wikipedia due to a lack of standardized terminology, where existing lexicon-building efforts lack consistency and reproducibility. The key contribution is L-ReLF (Low-Resource Lexical Framework), a reproducible technical pipeline that systematically addresses the challenges of low-resource data: source identification, Optical Character Recognition (OCR) despite its bias toward Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The result is a structured lexical dataset fully compatible with Wikidata Lexemes, providing a foundation for downstream NLP tasks such as machine translation and morphological analysis.

Link: https://arxiv.org/abs/2603.29346
Authors: Anass Sedrati, Mounir Afifi, Reda Benkhadra
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the 2026 International Conference on Natural Language Processing (ICNLP). 6 pages, 1 figure

Click to view abstract

Abstract:This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

[NLP-36] Open Machine Translation for Esperanto

[Quick Read]: This paper addresses the fact that Esperanto, though well-resourced thanks to its online community, remains underexplored in modern machine translation (MT) research, with no systematic evaluation of open-source MT systems for the language. The key contribution is the first comprehensive evaluation of major MT approaches — rule-based systems, encoder-decoder models, and large language models (LLMs) — across six language directions involving English, Spanish, Catalan, and Esperanto, using both automatic metrics and human evaluation. Results show that the NLLB family performs best across all language pairs, followed closely by the authors' own compact trained models and a fine-tuned general-purpose LLM; human evaluation confirms this trend, though noticeable errors remain. In line with Esperanto's tradition of openness and collaboration, the code and best-performing models are released publicly.

Link: https://arxiv.org/abs/2603.29345
Authors: Ona de Gibert, Lluís de Gibert
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to SIGUL 2026

Click to view abstract

Abstract:Esperanto is a widespread constructed language, known for its regular grammar and productive word formation. Besides having substantial resources available thanks to its online community, it remains relatively underexplored in the context of modern machine translation (MT) approaches. In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes. We evaluate translation quality across six language directions involving English, Spanish, Catalan, and Esperanto using multiple automatic metrics as well as human evaluation. Our results show that the NLLB family achieves the best performance in all language pairs, followed closely by our trained compact models and a fine-tuned general-purpose LLM. Human evaluation confirms this trend, with NLLB translations preferred in approximately half of the comparisons, although noticeable errors remain. In line with Esperanto’s tradition of openness and international collaboration, we release our code and best-performing models publicly.

[NLP-37] CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

[Quick Read]: This paper addresses the lack of high-quality annotated corpora for Japanese entity linking, which limits the training and evaluation of such systems in Japanese. The key contribution is a systematic corpus design policy together with a broadly scoped annotated corpus rich in Japan-specific entity mentions. Evaluation of inter-annotator agreement confirms high annotation consistency, and a preliminary string-matching experiment on entity disambiguation shows that the corpus contains a substantial number of non-trivial cases, supporting its viability and value as an evaluation benchmark.

Link: https://arxiv.org/abs/2603.29336
Authors: Shohei Higashiyama, Masao Ideuchi, Masao Utiyama
Affiliations: National Institute of Information and Communications Technology
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.

[NLP-38] MemRerank: Preference Memory for Personalized Product Reranking

[Quick Read]: This paper addresses the problems that arise when LLM-based shopping agents naively append raw purchase histories to prompts for personalization: noise, prompt-length inflation, and relevance mismatch. The key contribution is MemRerank, a framework that uses reinforcement learning (RL) to train a memory extractor that distills a user's purchase history into concise, query-independent preference signals for personalized product reranking. The method substantially improves downstream reranking accuracy, with experiments showing gains of up to +10.61 absolute points over baselines on a 1-in-5 selection task, establishing explicit preference memory as an effective personalization building block for agentic e-commerce systems.

Link: https://arxiv.org/abs/2603.29247
Authors: Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yi Gong
Affiliations: Santa Clara University; Independent Researcher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

[NLP-39] he Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

[Quick Read]: This paper addresses the scarcity of data and weak technology infrastructure for African languages in speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). The key contribution is the Thiomi Dataset, a multimodal corpus covering ten African languages across four language families, with over 601K approved text annotations and over 385K audio recordings collected through a community-driven platform and validated by a multi-tier quality-assurance pipeline (86-100% text approval rates). The dataset provides the first large-scale, high-quality benchmark resource for these languages and yields substantial model improvements — for example, reducing the Swahili ASR word error rate (WER) from 8.3% to 3.24%, a 61% relative reduction — advancing African language technology infrastructure.

Link: https://arxiv.org/abs/2603.29244
Authors: Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset’s utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.

[NLP-40] Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs ICLR2026

[Quick Read]: This paper addresses the brittleness and unreliability of direct LLM reasoning over long, noisy documents, aiming to achieve both high accuracy and low latency in document question answering (DQA). The key contribution is LiteCoST, a two-pillar framework. Pillar 1, Chain-of-Structured-Thought (CoST), uses a schema-aware instruction template to guide a strong LLM to produce step-wise CoST reasoning traces together with corresponding structured outputs (e.g., tables or graphs), performing evidence consolidation, entity normalization, record alignment, and serialization while providing auditable supervision. Pillar 2 fine-tunes small language models (SLMs) in two stages: supervised fine-tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) reinforcement learning with triple rewards for answer quality, format correctness, and process consistency. By distilling this structure-first behavior into SLMs, the approach matches LLM-level quality on multi-domain long-document QA while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B).

Link: https://arxiv.org/abs/2603.29232
Authors: Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); EvenUp, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 17 figures, 10 tables. Accepted at ICLR 2026

Click to view abstract

Abstract:Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at this https URL.

[NLP-41] SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali LREC2026

[Quick Read]: This paper addresses the scarcity of digitized Buddhist texts and language-modeling resources for Sri Lanka, in particular the lack of a high-quality corpus of mixed Sinhala and Pali text. The key contribution is SiPaKosa, a comprehensive corpus of roughly 786K sentences and 9.25M words of Sinhala and mixed Sinhala-Pali doctrinal text, drawn from 16 copyright-cleared historical Buddhist documents and the complete web-scraped Tripitaka canon. High-accuracy OCR with Google Document AI, systematic web scraping, and rigorous quality control with metadata annotation ensure the corpus's accuracy and usability. The corpus supports domain-adapted language-model pretraining, historical language analysis, and information-retrieval systems for Buddhist scholarship, while preserving Sinhala cultural heritage.

Link: https://arxiv.org/abs/2603.29221
Authors: Ranidu Gurusinghe, Nevidu Jayatilleke
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 5 figures, 5 tables, Accepted paper at the 2nd Workshop on Challenges in Processing South Asian Languages (CHiPSAL) @ LREC 2026

Click to view abstract

Abstract:SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate the performance of language models using ten pretrained models, with perplexity scores ranging from 1.09 to 189.67 on our corpus. This analysis shows that proprietary models significantly outperform open-source alternatives by factors of three to six times. This corpus supports the pretraining of domain-adapted language models, facilitates historical language analysis, and aids in the development of information retrieval systems for Buddhist scholarship while preserving Sinhala cultural heritage.

[NLP-42] Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[Quick Read]: This paper addresses the degraded generalization and catastrophic forgetting that mainstream multimodal large models exhibit in real-world content moderation and adversarial settings, caused by limited fine-grained visual perception and insufficient modeling of long-tail noise. The key contribution is Xuanwu VL-2B, an industrial-grade foundation model with a compact InternViT-300M + MLP + Qwen3 1.7B architecture (roughly 2B parameters in total) that balances visual perception, language-semantic alignment, and deployment cost under a limited compute budget. A data iteration and curation mechanism, combined with a progressive three-stage pipeline of pre-training, mid-training, and post-training, balances business specialization with the retention of general capabilities, yielding clear gains over existing models on both benchmarks and real business tasks.

Link: https://arxiv.org/abs/2603.29211
Authors: Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao
Affiliations: Hello Group Inc
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages, 10 figures

Click to view abstract

Abstract:In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.

[NLP-43] Designing FSMs Specifications from Requirements with GPT 4.0

[Quick Read]: This paper addresses the difficulty of automatically constructing high-quality finite state machines (FSMs) from natural-language requirements documents, aiming to improve the efficiency and reliability of design and testing in model-driven engineering (MDE). The core problem is that FSMs produced by large language models (LLMs) are often of insufficient quality, allowing faults to survive the testing phase and raising the risk of system failure in production. The key contribution is an LLM-based framework for FSM design together with an expert-centric repair mechanism that combines FSM mutation with test generation to automatically correct and refine LLM-produced FSMs, improving their correctness and applicability. Experiments show that the approach strengthens LLMs' FSM-generation capabilities and offers a new perspective on applying machine learning techniques to MDE.

Link: https://arxiv.org/abs/2603.29140
Authors: Omer Nguena Timo, Paul-Alexis Rodriguez, Florent Avellaneda
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments:

Click to view abstract

Abstract:Finite state machines (FSM) are executable formal specifications of reactive systems. These machines are designed based on systems’ requirements. The requirements are often recorded in textual documents written in natural languages. FSMs play a crucial role in different phases of the model-driven system engineering (MDE). For example, they serve to automate testing activities. FSM quality is critical: the lower the quality of FSM, the higher the number of faults surviving the testing phase and the higher the risk of failure of the systems in production, which could lead to catastrophic scenarios. Therefore, this paper leverages recent advances in the domain of LLM to propose an LLM-based framework for designing FSMs from requirements. The framework also suggests an expert-centric approach based on FSM mutation and test generation for repairing the FSMs produced by LLMs. This paper also provides an experimental analysis and evaluation of LLM’s capacities in performing the tasks presented in the framework and FSM repair via various methods. The paper presents experimental results with simulated data. These results and methods bring a new analysis and vision of LLMs that are useful for further development of machine learning technology and its applications to MDE.

[NLP-44] Concept Training for Human-Aligned Language Models

[Quick Read]: This paper addresses a semantic ambiguity in the standard next-token prediction (NTP) objective: a single prefix often admits several valid, semantically similar continuations (e.g., "this website is safe to" can continue with "browse", "visit", or "surf"), yet NTP treats these alternatives as mutually exclusive targets, making it hard for models to capture deeper semantic consistency. The key idea is concept-level supervision: the prediction target is broadened from a single token to a set of semantically related tokens (a concept), so the model learns representations closer to human judgments of semantic similarity. Experiments show improved alignment with human semantic judgments on multiple lexical benchmarks and lower perplexity on semantically meaningful words, at the cost of a modest increase in global token-level perplexity — a tradeoff between standard NTP optimization and concept-level supervision.

Link: https://arxiv.org/abs/2603.29123
Authors: Christine Zhang, Dan Jurafsky, Chen Shani
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract: The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence "this website is safe to _browse_" could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.
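One plausible reading of a concept-level objective is a set-level cross-entropy: the loss rewards the total probability mass a model places on any token in the concept set, rather than on a single gold token. The sketch below illustrates that idea; the vocabulary and probabilities are invented, and the paper's exact loss may differ.

```python
import math

def concept_loss(logprobs, concept):
    """-log of the total model probability assigned to the concept set."""
    return -math.log(sum(math.exp(logprobs[t]) for t in concept))

# invented next-token log-probabilities for the prefix "this website is safe to"
logprobs = {"browse": math.log(0.20), "visit": math.log(0.15),
            "surf": math.log(0.05), "explode": math.log(0.01)}

ntp = -logprobs["browse"]                                  # single-token NTP loss
cl = concept_loss(logprobs, {"browse", "visit", "surf"})   # set-level loss
# probability mass on valid paraphrases is no longer penalized, so cl < ntp
```

Under this reading, a model spreading mass across "browse"/"visit"/"surf" incurs no penalty, which is exactly the behavior NTP's one-hot target discourages.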

[NLP-45] GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification KR

[Quick Read]: This paper addresses a limitation of current benchmarks for LLM-based recommendation: they focus on item-prediction accuracy rather than a model's ability to extract and verify genuine user interests from interaction histories. The key contribution is GISTBench, a benchmark with two new metric families: Interest Groundedness (IG), decomposed into precision and recall to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which measures the distinctiveness of verified LLM-predicted user profiles. The authors also build a synthetic dataset grounded in real user interactions on a short-form video platform, covering implicit and explicit engagement signals with rich textual descriptions, and validate its fidelity against user surveys. Evaluating LLMs spanning 7B to 120B parameters reveals performance bottlenecks, particularly a limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.

Link: https://arxiv.org/abs/2603.29112
Authors: Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan
Affiliations: Meta
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 figures, 20 tables; code at this https URL

Click to view abstract

Abstract:We introduce GISTBench, a benchmark for evaluating Large Language Models’ (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
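The precision/recall decomposition of Interest Groundedness can be sketched as plain set overlap. The interest labels below are invented, and the benchmark's actual verification step (matching predictions against engagement evidence) is abstracted away here.

```python
def interest_groundedness(predicted, verified):
    """Precision penalizes hallucinated interest categories;
    recall rewards coverage of the verified interests."""
    predicted, verified = set(predicted), set(verified)
    hits = len(predicted & verified)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(verified) if verified else 0.0
    return precision, recall

p, r = interest_groundedness(
    predicted=["cooking", "travel", "crypto"],   # "crypto" is hallucinated
    verified=["cooking", "travel", "fitness"])   # "fitness" is missed
```

Separating the two components makes the failure modes legible: low precision flags hallucinated interests, low recall flags missed ones.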

[NLP-46] PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression

[Quick Read]: This paper addresses the performance loss in post-training quantization of large language models (LLMs) caused by uneven weight distributions, with the goal of near-lossless compression. The key contribution is PolarQuant, a three-stage method: block-wise normalization to the unit hypersphere, a Walsh-Hadamard transform that turns coordinates into approximately Gaussian random variables, and quantization with centroids matched to the Gaussian distribution. The Walsh-Hadamard rotation proves to be the core ingredient: on its own it reduces Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (+0.03 relative to FP16), achieving near-lossless quality without any calibration data. PolarQuant also serves as an effective preprocessing step before INT4 quantization, substantially improving the performance of downstream INT4 quantizers.

Link: https://arxiv.org/abs/2603.29078
Authors: Caio Vicentino
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 5 tables, 2 algorithms. Code: this https URL Models: this https URL

Click to view abstract

Abstract:We present PolarQuant, a post-training weight quantization method for large language models (LLMs) that exploits the distributional structure of neural network weights to achieve near-lossless compression. PolarQuant operates in three stages: (1) block-wise normalization to the unit hypersphere, (2) Walsh-Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (Delta = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re-quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.
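The three stages can be sketched with NumPy. This is a toy illustration, not the released implementation: the uniform centroid grid stands in for the Gaussian-matched centroids the paper describes, and the block size of 16 is arbitrary.

```python
import numpy as np

def hadamard(d):
    """Sylvester construction of a d x d orthonormal Walsh-Hadamard
    matrix (d must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def polar_quantize(block, centroids):
    """Sketch of the three stages: normalize -> rotate -> nearest centroid."""
    scale = np.linalg.norm(block)
    unit = block / scale                          # (1) unit hypersphere
    rotated = hadamard(len(block)) @ unit         # (2) ~Gaussian coordinates
    idx = np.argmin(np.abs(rotated[:, None] - centroids[None, :]), axis=1)
    return idx, scale                             # (3) centroid indices

def polar_dequantize(idx, scale, centroids, d):
    # the rotation is orthonormal, so its transpose is its inverse
    return scale * (hadamard(d).T @ centroids[idx])

rng = np.random.default_rng(0)
w = rng.normal(size=16)
centroids = np.linspace(-1.0, 1.0, 64)   # stand-in for Gaussian-matched centroids
idx, s = polar_quantize(w, centroids)
w_hat = polar_dequantize(idx, s, centroids, 16)
err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
```

Because the rotated coordinates of a unit vector lie in [-1, 1] and the rotation preserves norms, the relative reconstruction error is bounded by sqrt(d) times half the centroid spacing.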

[NLP-47] Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLM s

[Quick Read]: This paper addresses a blind spot in cross-cultural emotion understanding with large language models (LLMs): prior work on emotion attribution focuses on interpretation while ignoring the cultural background of the emotion generator (the person expressing the emotion), leading to unstable attribution performance across cultural contexts. The key contribution is a Generator-Interpreter framework that models both perspectives — expression and interpretation — to capture culture's bidirectional influence on how emotions are generated and perceived. Empirical results across six LLMs and data from 15 countries show that the generator's country of origin affects performance more strongly than the interpreter's, underscoring the need for culturally sensitive emotion modeling to improve robustness and fairness in LLM-based systems.

Link: https://arxiv.org/abs/2603.29077
Authors: Aizirek Turdubaeva, Uichin Lee
Affiliations: KAIST
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used in cross-cultural systems to understand and adapt to human emotions, which are shaped by cultural norms of expression and interpretation. However, prior work on emotion attribution has focused mainly on interpretation, overlooking the cultural background of emotion generators. This assumption of universality neglects variation in how emotions are expressed and perceived across nations. To address this gap, we propose a Generator-Interpreter framework that captures dual perspectives of emotion attribution by considering both expression and interpretation. We systematically evaluate six LLMs on an emotion attribution task using data from 15 countries. Our analysis reveals that performance variations depend on the emotion type and cultural context. Generator-interpreter alignment effects are present; the generator’s country of origin has a stronger impact on performance. We call for culturally sensitive emotion modeling in LLM-based systems to improve robustness and fairness in emotion understanding across diverse cultural contexts.

[NLP-48] An Empirical Recipe for Universal Phone Recognition INTERSPEECH2026

[Quick Read]: This paper addresses weak generalization in multilingual phone recognition (PR): highly performant English-focused models do not transfer across languages, and multilingual models underutilize pretrained self-supervised learning (SSL) representations. The key contribution is PhoneticXEUS, which (1) trains on large-scale multilingual data for cross-lingual robustness, (2) systematically quantifies the contribution of SSL representations, data scale, and loss objectives through controlled ablations, and (3) evaluates over 100 languages under a unified scheme, analyzing error patterns across language families, accents, and articulatory features to guide model improvement. The model achieves state-of-the-art results on both multilingual (17.7% PFER) and accented English (10.6% PFER) phone recognition.

Link: https://arxiv.org/abs/2603.29042
Authors: Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen
Affiliations: Carnegie Mellon University; The University of Texas at Austin
Subjects: Computation and Language (cs.CL)
Comments: Submitted to Interspeech 2026. Code: this https URL

Click to view abstract

Abstract:Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS – trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.
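The paper reports PFER, a feature-weighted error rate; the sketch below shows the plain phone-level edit distance such metrics build on. A PFER-style variant would replace the 0/1 substitution cost with a distance between articulatory feature vectors, which is not implemented here; the phone sequences are invented examples.

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance over phone sequences, normalized by reference
    length. Feature-weighted metrics like PFER swap the 0/1 substitution
    cost for an articulatory-feature distance (not shown here)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / len(ref)

rate = phone_error_rate(["k", "ae", "t"], ["k", "a", "t"])  # one substitution
```

A single substituted phone in a three-phone reference yields a rate of 1/3; feature weighting would score near-miss substitutions (e.g., /ae/ vs /a/) as less costly than arbitrary ones.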

[NLP-49] Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

[Quick Read]: This paper addresses the problem that attackers with fine-tuning access to generative AI systems can bypass LLM-based content-safety classifiers through targeted fine-tuning. The key contribution is Trojan-Speak, which combines curriculum learning with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades content classification. It achieves over 99% classifier evasion while degrading performance on reasoning benchmarks by less than 5%, far better than prior adversarial fine-tuning approaches, which typically incur more than 25% capability loss.

Link: https://arxiv.org/abs/2603.29038
Authors: Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic’s Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic’s Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.

[NLP-50] On the limited utility of parallel data for learning shared multilingual representations

【速读】: 该论文旨在解决多语言表示学习中跨语言对齐(cross-lingual alignment)的机制问题,即平行数据(parallel data)是否是实现不同语言间共享表征的关键信号。研究发现,尽管平行数据在预训练阶段被广泛使用,其对最终跨语言对齐效果的影响其实非常有限;关键在于,平行数据仅在预训练初期可能加速表示共享的形成,并减少模型中语言特异性神经元的数量,而真正的跨语言对齐能力主要来源于模型自身的结构和训练过程,而非显式的翻译信号。

链接: https://arxiv.org/abs/2603.29026
作者: Julius Leino,Jörg Tiedemann
机构: University of Helsinki (赫尔辛基大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the effect is limited to potentially accelerating the representation sharing in the early phases of pretraining, and to decreasing the amount of language-specific neurons in the model. Cross-lingual alignment seems to emerge on similar levels even without the explicit signal from parallel data.

[NLP-51] The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对显性表面线索与隐含可行性约束冲突时出现的系统性推理失效问题,即“启发式覆盖”(heuristic override)现象。其核心解决方案在于提出并验证了一个诊断-测量-桥接-治疗(diagnose-measure-bridge-treat)框架:通过构建Heuristic Override Benchmark(HOB),量化不同模型在500个最小差异样本中的表现,揭示启发式策略(如距离线索)对决策的主导作用(影响力达目标提示的8.7–38倍);进一步发现,简单提示干预(如强调关键对象)可平均提升准确率15个百分点,表明问题根源在于约束推理能力不足而非知识缺失;同时,目标分解提示(goal-decomposition prompting)能额外恢复6–9个百分点性能,说明结构化推理可有效缓解该漏洞。此研究不仅识别出LLMs中普遍存在的启发式偏差机制,还提供了可复现的基准和改进路径,为提升模型的因果推理鲁棒性提供关键洞见。

链接: https://arxiv.org/abs/2603.29025
作者: Yubo Li,Lu Zhang,Tianchong Jiang,Ramayya Krishnan,Rema Padman
机构: Carnegie Mellon University (卡内基梅隆大学); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) – 500 instances spanning 4 heuristic by 5 constraint families with minimal pairs and explicitness gradients – demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.

[NLP-52] Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期交互中缺乏持久且结构化记忆的问题,尤其指出单纯扩展上下文窗口无法有效提升推理能力——即使具备完美检索能力,推理性能仍可能下降高达85%。其解决方案的核心在于提出一种受生物启发的记忆框架,基于互补学习系统理论、认知行为疗法中的信念层级结构、双过程认知及模糊痕迹理论,围绕三个关键原则构建:(1)记忆具有情感效价(valence),而非仅内容;通过预计算的情感-关联摘要(效价向量)形成涌现的信念层级,实现决策前的快速定向;(2)检索默认采用系统1(System 1),系统2(System 2)分级介入,利用自动扩散激活和被动启动作为默认机制,并在需要时触发主动检索,同时引入分级认知状态以结构性缓解幻觉;(3)编码是主动、即时且依赖反馈的,由丘脑门控机制标记并路由信息至不同存储区,执行功能则通过好奇心驱动的探究形成概要(gist),而非被动暴露。该框架最终使系统随时间收敛至类似临床专家的系统1处理模式,从而实现交互成本随经验积累而降低。

链接: https://arxiv.org/abs/2603.29023
作者: Diego C. Lerma-Torres(Universidad de Guanajuato)
机构: Universidad de Guanajuato (瓜纳华托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figure. Accepted at the MemAgents Workshop, ICLR 2026

点击查看摘要

Abstract:Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85% - even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy’s belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content - pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck’s cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation - automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent - a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing - the computational analog of clinical expertise - producing interactions that become cheaper, not more expensive, with experience.

[NLP-53] Known Intents New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection

【速读】: 该论文旨在解决多意图检测(multi-intent detection)任务中模型对未见过的熟悉意图组合的泛化能力不足的问题,即模型是否能从一个话语中恢复出训练时未出现的新组合意图。现有基准测试因训练与测试数据共享相似的共现模式而难以有效评估这种组合泛化能力。为更严格地检验这一能力,作者提出 CoMIX-Shift 控制性基准,引入未见意图对、话语模式变化、更长且带噪声的修饰语、未见子句模板及零样本三元组等挑战场景。解决方案的关键在于提出 ClauseCompose——一种仅基于单意图训练的轻量级解码器,通过将完整话语分解为可组合的子句单元进行建模,从而在多个组合泛化场景下显著优于全句基线模型(如 WholeMultiLabel 和微调的小型 BERT),尤其在未见意图对和连接词扰动场景中表现突出,表明简单因子分解策略在组合泛化任务中具有强大潜力。

链接: https://arxiv.org/abs/2603.28929
作者: Abhilash Nandy
机构: Microsoft Research India
类目: Computation and Language (cs.CL)
备注: 6 pages, 3 tables

点击查看摘要

Abstract:Multi-intent detection papers usually ask whether a model can recover multiple intents from one utterance. We ask a harder and, for deployment, more useful question: can it recover new combinations of familiar intents? Existing benchmarks only weakly test this, because train and test often share the same broad co-occurrence patterns. We introduce CoMIX-Shift, a controlled benchmark built to stress compositional generalization in multi-intent detection through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot triples. We also present ClauseCompose, a lightweight decoder trained only on singleton intents, and compare it to whole-utterance baselines including a fine-tuned tiny BERT model. Across three random seeds, ClauseCompose reaches 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples. WholeMultiLabel reaches 81.4, 55.7, 18.8, 15.5, and 0.0; the BERT baseline reaches 91.5, 77.6, 48.9, 11.0, and 0.0. We also add a 240-example manually authored SNIPS-style compositional set with five held-out pairs; there, ClauseCompose reaches 97.5 exact match on unseen pairs and 86.7 under connector shift, compared with 41.3 and 10.4 for WholeMultiLabel. The results suggest that multi-intent detection needs more compositional evaluation, and that simple factorization goes surprisingly far once evaluation asks for it.
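ClauseCompose 的"先按连接词切分子句、再对每个子句做单意图分类、最后合并标签"这一思路,可以用一个极简脚本示意。下面的连接词表与关键词规则均为笔者假设的玩具实现,并非论文的原始模型:

```python
# 玩具版子句分解多意图检测:连接词与分类规则均为假设
CONNECTORS = [" and then ", " and also ", " and ", "; "]

def split_clauses(utterance):
    # 依次按每个连接词切分,得到子句列表
    clauses = [utterance]
    for c in CONNECTORS:
        clauses = [p for part in clauses for p in part.split(c)]
    return [c.strip() for c in clauses if c.strip()]

def classify_single(clause):
    # 假设的关键词单意图分类器,对应"仅在单意图数据上训练"的组件
    rules = {"play": "PlayMusic", "weather": "GetWeather", "book": "BookRestaurant"}
    for kw, intent in rules.items():
        if kw in clause.lower():
            return intent
    return "Unknown"

def predict_intents(utterance):
    # 逐子句分类后取标签并集,实现"已知意图的新组合"
    return sorted({classify_single(c) for c in split_clauses(utterance)})

intents = predict_intents("Play some jazz and then book a table for two")
```

即便两个意图在"训练"规则中从未共同出现,分解后也能各自命中,这正是子句分解带来组合泛化的直观原因。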

[NLP-54] Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

【速读】: 该论文旨在解决安全微调(safety fine-tuning)对大型语言模型(Large Language Models, LLMs)社会认知能力,特别是心智理论(Theory of Mind, ToM)的影响问题。研究发现,抑制模型对自身或技术实体的心智归因(mind-attribution)并不会损害其ToM能力,二者在行为和机制层面均可分离;但安全微调模型会显著降低对非人类动物的心智归因,并削弱其表现出宗教或灵性信念的能力,这表明安全微调可能无意中削弱了模型对非人类心智的广泛认知倾向。解决方案的关键在于通过消融实验与表征相似性机制分析,识别出心智归因与ToM能力的可分离性,从而为安全微调策略提供更精细的优化方向。

链接: https://arxiv.org/abs/2603.28925
作者: Junsol Kim,Winnie Street,Roberta Rocca,Daine M. Korngiebel,Adam Waytz,James Evans,Geoff Keeling
机构: Google(谷歌); University of Chicago(芝加哥大学); University of London(伦敦大学); University of Washington(华盛顿大学); Northwestern University(西北大学); Santa Fe Institute(圣塔菲研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

[NLP-55] CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation

【速读】: 该论文旨在解决科学假说生成(hypothesis generation)中缺乏多领域、可验证的推理链数据的问题,现有数据集通常局限于单一学科且未明确记录从已有知识到新假设的逻辑推导过程。其关键解决方案是构建CrossTrace数据集,包含1,389条跨领域的结构化推理痕迹(reasoning traces),覆盖生物医学、人工智能/机器学习及交叉领域,每一步均基于源文献文本进行锚定(grounded)。该数据集采用Input/Trace/Output架构,引入步骤级验证机制与八类发现模式分类,并通过QLoRA微调Qwen2.5-7B-Instruct模型显著提升假说生成质量:在GPT-4o和Claude Opus 4.5评估下IAScore分别提升至0.968和0.888,结构合规性达100%,且跨域训练优于单域训练,证明科学推理模式具有一定程度的领域通用性。

链接: https://arxiv.org/abs/2603.28924
作者: Andrew Bouras,OMS-II Research Fellow
机构: Nova Southeastern University (诺瓦东南大学); Dr. Kiran C. Patel College of Osteopathic Medicine (Kiran C. Patel骨科医学院)
类目: Computation and Language (cs.CL)
备注: 14 pages, 1 figure, 8 tables. Dataset and code available at this https URL

点击查看摘要

Abstract:Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross-domain training (biomedical + AI/ML + CS) outperforms single-domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain-general.
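摘要中的 Input/Trace/Output 三段式结构(已有知识、逐步锚定的推理链、新假设)可以用一个简单的数据类来示意。字段名是根据摘要推测的假设,并非数据集的真实 schema:

```python
from dataclasses import dataclass, field

# CrossTrace 式推理痕迹的示意结构;字段命名为笔者假设
@dataclass
class ReasoningTrace:
    domain: str                                 # biomedical / ai_ml / cross_domain
    input_knowledge: list                       # 已有知识(Bit)
    steps: list = field(default_factory=list)   # 中间推理步骤,每步需锚定原文
    output_hypothesis: str = ""                 # 新假设(Spark)

    def add_step(self, claim, grounding_span):
        # 每个步骤必须附带来源文本锚定,支持步骤级验证
        self.steps.append({"claim": claim, "grounding": grounding_span})

trace = ReasoningTrace(domain="biomedical", input_knowledge=["X 抑制通路 Y"])
trace.add_step("通路 Y 调控表型 Z", "原文第 3 段")
trace.output_hypothesis = "抑制 X 可能改变表型 Z"
```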

[NLP-56] From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories

【速读】: 该论文旨在解决在领域偏移(domain shift)情境下,尤其是面对异质性、长篇叙事结构复杂的文献(如大屠杀口述史)时,情感极性检测(polarity detection)面临显著挑战的问题。其解决方案的关键在于构建一个基于模型间一致性(agreement-based stability taxonomy, ABC)的诊断框架,通过三个预训练Transformer模型对107,305个话语片段和579,013个句子进行标注,并利用成对百分比一致率、Cohen’s kappa、Fleiss’ kappa及行归一化混淆矩阵来定位系统性分歧;同时引入基于T5的情感分类器作为辅助描述信号,比较不同一致性层级中的情绪分布差异。该方法结合多模型标签三角测量与ABC分类体系,为敏感历史叙述中情感模型的偏差提供了一种谨慎且可操作的分析路径。

链接: https://arxiv.org/abs/2603.28913
作者: Daban Q. Jaff
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Polarity detection becomes substantially more challenging under domain shift, particularly in heterogeneous, long-form narratives with complex discourse structure, such as Holocaust oral histories. This paper presents a corpus-scale diagnostic study of off-the-shelf sentiment classifiers on long-form Holocaust oral histories, using three pretrained transformer-based polarity classifiers on a corpus of 107,305 utterances and 579,013 sentences. After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability. We report pairwise percent agreement, Cohen kappa, Fleiss kappa, and row-normalized confusion matrices to localize systematic disagreement. As an auxiliary descriptive signal, a T5-based emotion classifier is applied to stratified samples from each agreement stratum to compare emotion distributions across strata. The combination of multi-model label triangulation and the ABC taxonomy provides a cautious, operational framework for characterizing where and how sentiment models diverge in sensitive historical narratives. Inter-model agreement is low to moderate overall and is driven primarily by boundary decisions around neutrality.
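论文用 Cohen's kappa 度量两两模型间的一致性;该指标的计算本身很简单,可用几行代码示意(标注序列为虚构示例,与论文语料无关):

```python
from collections import Counter

def cohen_kappa(a, b):
    """两个标注序列之间的 Cohen's kappa:观测一致率对期望一致率的校正。"""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n              # 观测一致率
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)  # 偶然一致的期望
    return (po - pe) / (1 - pe)

# 虚构的两个情感分类器输出
m1 = ["pos", "neg", "neu", "neg", "pos", "neg"]
m2 = ["pos", "neg", "neg", "neg", "pos", "neu"]
kappa = cohen_kappa(m1, m2)
```

这里 kappa 约为 0.45,落在"低到中等一致"区间,与摘要描述的整体水平类似。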

[NLP-57] OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

【速读】: 该论文旨在解决持续预训练(Continual Pre-Training, CPT)中数据混合比例(data mixture ratio)这一敏感超参数难以调优的问题:传统方法需在训练前固定比例,且选择不当会导致大量计算资源浪费。解决方案的关键在于提出OptiMer框架,其核心创新是将比例选择从训练阶段解耦至训练后优化——通过为每个数据集单独训练一个CPT模型,提取各模型的分布向量(distribution vector),该向量表征了数据集对模型参数的偏移效应;随后利用贝叶斯优化(Bayesian optimization)在训练后搜索最优组合权重。实验表明,OptiMer在多语言(日语、中文)和多领域(数学、代码)场景下均显著优于数据混合与模型平均基线,且搜索成本降低15–35倍,同时揭示了优化权重可解释为实际数据混合比例,并支持无需重训练即可针对特定目标重新优化,实现按需定制模型。

链接: https://arxiv.org/abs/2603.28858
作者: Haiyue Song,Masao Utiyama
机构: National Institute of Information and Communications Technology (国家信息与通信技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, 20 pages, 10 tables, 12 figures

点击查看摘要

Abstract:Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: they must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model’s distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
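OptiMer 的核心步骤(提取各数据集的分布向量、按权重合并、再在训练后搜索最优权重)可以用一个玩具示例示意。论文实际采用贝叶斯优化,这里为保持自包含改用随机搜索近似;向量维度、目标函数与数值均为假设:

```python
import random

# "分布向量" = 单数据集 CPT 模型参数与基座参数之差,此处以二维向量模拟
base = [0.0, 0.0]
deltas = {"ja": [1.0, 0.2], "math": [0.1, 0.9]}

def merge(weights):
    # 合并模型 = 基座参数 + 各分布向量的加权和
    merged = list(base)
    for name, w in weights.items():
        for i, d in enumerate(deltas[name]):
            merged[i] += w * d
    return merged

def val_loss(params):
    # 假设的验证损失:离虚构最优点 (0.6, 0.6) 越近越好
    return (params[0] - 0.6) ** 2 + (params[1] - 0.6) ** 2

def search(n_trials=200, seed=0):
    # 论文用贝叶斯优化搜索组合权重;此处以随机搜索示意同一接口
    rng = random.Random(seed)
    best_w, best_l = None, float("inf")
    for _ in range(n_trials):
        w = {name: rng.uniform(0.0, 1.0) for name in deltas}
        l = val_loss(merge(w))
        if l < best_l:
            best_w, best_l = w, l
    return best_w, best_l

weights, loss = search()
```

关键在于:搜索只需反复做向量加权与一次前向评估,无需任何重训练,这正是其搜索成本远低于数据混合调参的原因。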

[NLP-58] OneComp: One-Line Revolution for Generative AI Model Compression

【速读】: 该论文旨在解决基础模型(foundation models)在部署过程中面临的内存占用大、延迟高及硬件成本昂贵等问题。现有后训练压缩方法虽可通过降低模型参数精度来缓解这些瓶颈,但其实际应用仍面临挑战,包括量化算法碎片化、精度预算难以规划、数据驱动的校准策略复杂以及硬件依赖的执行模式多样等。解决方案的关键在于提出 OneComp——一个开源压缩框架,将专家级工作流程转化为可复现、资源自适应的流水线:给定模型标识和可用硬件后,OneComp 自动分析模型结构,规划混合精度分配,并执行从层级压缩到块级优化再到全局优化的渐进式量化阶段;其核心设计是将首个量化检查点作为可部署的基准点,确保后续每个阶段均在相同模型上迭代优化,且随着计算资源投入增加,模型质量持续提升,从而实现算法创新与生产级部署之间的有效衔接。

链接: https://arxiv.org/abs/2603.28845
作者: Yuma Ichikawa,Keiji Kimura,Akihiro Yoshida,Yudai Fujimoto,Hiroki Tokura,Yamato Arai,Yoshiyuki Ishii,Yusei Kawakami,Genki Shikada,Achille Jacquemond,Yoshihiko Fujisawa,Katsuki Fujisawa,Takumi Honda,Akira Sakai
机构: Fujitsu Research(富士通研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: 31 pages, 6 figures

点击查看摘要

Abstract:Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise refinement and global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.
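OneComp 在执行渐进量化之前,需要"在给定硬件预算下规划混合精度分配"。下面用一个贪心位宽分配的玩具示例示意这一步的思路;层的敏感度数值与贪心策略均为笔者假设,并非 OneComp 的实际算法:

```python
def plan_precision(layers, budget_bytes):
    """layers: [(name, n_params, sensitivity)],sensitivity 越高越需要高精度。
    先全部按 4bit 计,再把剩余预算优先分给敏感层升级到 8bit。"""
    plan = {name: 4 for name, _, _ in layers}
    used = sum(n * 4 / 8 for _, n, _ in layers)    # 4bit = 0.5 字节/参数
    for name, n, _ in sorted(layers, key=lambda x: -x[2]):
        extra = n * 4 / 8                          # 升到 8bit 的额外字节
        if used + extra <= budget_bytes:
            plan[name] = 8
            used += extra
    return plan

# 虚构的三层模型与显存预算
layers = [("attn", 1000, 0.9), ("mlp", 4000, 0.3), ("head", 500, 0.8)]
plan = plan_precision(layers, budget_bytes=3500)
```

在该预算下,敏感的 attn 与 head 被升至 8bit,体量大但不敏感的 mlp 留在 4bit,体现"资源自适应"规划的基本取舍。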

[NLP-59] StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving

【速读】: 该论文针对大语言模型(Large Language Model, LLM)服务工作负载中普遍存在的一类问题:多个请求共享相同的解题结构但局部约束不同(如输出格式、变量名或数值常量差异),传统缓存方法要么复用完整响应(语义缓存,semantic caching),要么依赖模型内部的键值(Key-Value, KV)状态或前缀状态,前者对局部变更敏感,后者则与特定后端强耦合。为解决这一问题,作者提出StepCache——一种与后端无关的细粒度步骤级复用层,其核心创新在于将输出分割为有序步骤,通过轻量级任务感知校验验证匹配缓存请求中的步骤,并仅对失败区域进行选择性补丁重生成(selective patching)。该方案支持严格的结构化输出控制(如JSON格式提取、必填字段约束及单步修复),并引入保守跳过复用回退机制以应对语义变化,从而在保持正确性的前提下显著提升效率和鲁棒性。

链接: https://arxiv.org/abs/2603.28795
作者: Azam Nouri
机构: Lincoln University (林肯大学)
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 9 pages, 1 figure

点击查看摘要

Abstract:We address LLM serving workloads where repeated requests share a common solution structure but differ in localized constraints, such as output schema, variable names, or numeric constants. Prior caching approaches typically reuse either full responses (semantic caching) or model-internal KV/prefix states, which are respectively brittle under partial changes or tightly coupled to specific backends. We present StepCache, a backend-agnostic step-level reuse layer that segments outputs into ordered steps, retrieves the best-matching cached request, verifies steps using lightweight task-aware checks, and regenerates only failing regions via selective patching. StepCache additionally supports strict structured-output enforcement for JSON, including single-step extraction, required-key constraints, and one-shot repair, as well as conservative skip-reuse fallbacks for semantic changes. For linear equations, StepCache promotes verification into correction via a bounded repair loop with a deterministic fallback that guarantees correctness when the backend model fails. In a CPU-only perturbation-heavy micro-benchmark on math and JSON variants, averaged over three seeds, StepCache reduces mean latency from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s. It also reduces total token usage from 36.1k to 27.3k and improves end-to-end correctness from 72.5% to 100% under task-specific checks and a stitched-output integrity check. Across requests, 79.7% take the reuse-only fast path, 5.4% require patching, and 14.9% trigger skip-reuse.
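StepCache 的"逐步校验、选择性补丁"主循环可以浓缩为如下示意:通过校验的步骤直接复用,失败的步骤交给 regenerate 重生成。其中 verify 用一个玩具等式校验代替论文的任务感知检查,regenerate 用本地求值代替真实的后端模型调用,均为假设性实现:

```python
def split_steps(text):
    # 按行切分为有序步骤
    return [s.strip() for s in text.split("\n") if s.strip()]

def serve(cached_output, verify, regenerate):
    """对缓存答案逐步校验;通过的步骤复用,失败的步骤选择性补丁重生成。"""
    patched, reused = [], 0
    for step in split_steps(cached_output):
        if verify(step):
            patched.append(step)                  # 复用路径
            reused += 1
        else:
            patched.append(regenerate(step))      # 补丁路径
    return "\n".join(patched), reused

def verify(step):
    # 玩具校验:步骤中的等式必须数值成立
    if "=" not in step:
        return True
    lhs, rhs = step.split("=")
    try:
        return abs(eval(lhs) - float(rhs)) < 1e-9
    except Exception:
        return False

def regenerate(step):
    # 玩具"重生成":直接重算等式右侧(真实系统中为后端模型调用)
    lhs = step.split("=")[0]
    return lhs + "=" + str(eval(lhs))

out, reused = serve("2+2=4\n3*3=10", verify, regenerate)
```

缓存答案中只有出错的第二步被重算,第一步原样复用,对应论文中"仅重生成失败区域"的快速路径。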

[NLP-60] Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)评估在实际应用中面临的可扩展性瓶颈问题,尤其当数据集规模达到数十万甚至百万级别时,现有评估框架难以高效处理。其核心解决方案是提出一个基于Apache Spark的分布式评估框架Spark-LLM-Eval,该框架将评估任务视为数据并行问题,在执行器(executors)间划分样本并进行统计学严谨的结果聚合。关键创新包括:1)引入自助法(bootstrap)置信区间和针对不同指标类型的显著性检验(如配对t检验、McNemar检验或Wilcoxon符号秩检验),确保评估结果的统计可靠性;2)通过基于Delta Lake的内容寻址响应缓存机制降低重复推理成本,支持迭代优化评估指标而无需重新运行模型推理。实证表明该框架具备与集群规模线性增长的吞吐能力。

链接: https://arxiv.org/abs/2603.28769
作者: Subhadip Mitra
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 2 figures, 6 tables. Open source: this https URL . Cross-list requested: cs.CL, cs.LG

点击查看摘要

Abstract:Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar’s test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture, the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
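框架强调每个指标都附带自助法(bootstrap)置信区间;该统计流程本身与 Spark 无关,可以用标准库直接示意(重采样次数与示例得分均为假设):

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """对逐样本得分做有放回重采样,返回 (均值, 置信下界, 置信上界)。"""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_boot):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]              # 2.5% 分位
    hi = means[int((1 - alpha / 2) * n_boot) - 1]      # 97.5% 分位
    return statistics.fmean(scores), lo, hi

# 虚构的逐样本 0/1 准确率
mean, lo, hi = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
```

在分布式场景中,各 executor 只需回传逐样本得分,驱动端即可完成这一聚合与区间估计。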

[NLP-61] The Last Fingerprint: How Markdown Training Shapes LLM Prose

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在生成文本时普遍存在“过度使用破折号”(em dash)的现象,这一现象已成为识别AI生成文本的重要标志之一,但此前缺乏对这一行为的机制性解释,尤其是其与模型训练中广泛存在的Markdown格式之间的潜在关联。解决方案的关键在于提出并验证一个五步演化路径,即从训练数据组成(markdown密集语料)、结构内化、破折号的双重功能属性(既是标点符号又是Markdown结构标记),到后训练阶段的放大效应,最终揭示破折号并非单纯的语言风格问题,而是模型在学习Markdown格式时残留下来的结构痕迹——一种“Markdown泄露到正文”的现象。实验通过多条件抑制测试证实,尽管显式禁止Markdown格式可消除其他结构特征(如标题、列表等),破折号仍顽固存在,且其频率和抗抑制能力因模型来源和微调方法而异,从而将破折号频率重新定义为诊断模型微调策略的指标,而非简单的语言缺陷。

链接: https://arxiv.org/abs/2603.27006
作者: E. M. Freeburg
机构: Anthropic; OpenAI; Meta; Google; DeepSeek
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 14 pages, 3 tables. Code and data: this https URL

点击查看摘要

Abstract:Large language models produce em dashes at varying rates, and the observation that some models “overuse” them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose – the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist – except in Meta’s Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT-4.1 under suppression), functioning as a signature of the specific fine-tuning procedure applied. A three-condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base-vs-instruct comparison confirms that the latent tendency exists pre-RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine-tuning methodology rather than a stylistic defect.
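论文的核心指标"每千词破折号数"可以直接实现如下(示例文本为虚构,仅用于说明指标口径):

```python
def em_dash_rate(text):
    """每 1,000 词中的 em dash(U+2014)数量。"""
    words = text.split()
    if not words:
        return 0.0
    dashes = text.count("\u2014")
    return dashes * 1000 / len(words)

# 虚构样本:每个重复单元含 2 个破折号、4 个词
sample = "The model pauses\u2014briefly\u2014then answers. " * 50
rate = em_dash_rate(sample)
```

论文报告的各模型指标大致在每千词 0.0(Llama)到 9.1(GPT-4.1 抑制条件下)之间,上面的虚构样本只是演示计算方式,并非真实分布。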

[NLP-62] Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems

【速读】: 该论文旨在解决大脑如何处理语言构式(Construction Grammar)这一认知神经科学与语言学的核心问题,特别是探讨构式层面的信息在人类神经活动中何时以及如何浮现。其解决方案的关键在于利用脑电图(EEG)记录十名母语为英语的参与者在聆听四种不同构式(及物、双宾、致使移动、结果性)合成句子时的神经反应,并结合时频分析、特征提取和机器学习分类方法,发现构式特异性的神经签名主要出现在句末位置(此时论元结构完全明确),且最显著体现在α频段;此外,这些神经模式的时间出现规律和相似性结构与循环神经网络和Transformer语言模型中的构式表征演化高度一致,表明生物系统与人工系统在表征学习上存在收敛,支持构式作为独立形式-意义映射被神经编码的观点。

链接: https://arxiv.org/abs/2603.29617
作者: Pegah Ramezani,Thomas Kinfe,Andreas Maier,Achim Schilling,Patrick Krauss
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding how the brain processes linguistic constructions is a central challenge in cognitive neuroscience and linguistics. Recent computational studies show that artificial neural language models spontaneously develop differentiated representations of Argument Structure Constructions (ASCs), generating predictions about when and how construction-level information emerges during processing. The present study tests these predictions in human neural activity using electroencephalography (EEG). Ten native English speakers listened to 200 synthetically generated sentences across four construction types (transitive, ditransitive, caused-motion, resultative) while neural responses were recorded. Analyses using time-frequency methods, feature extraction, and machine learning classification revealed construction-specific neural signatures emerging primarily at sentence-final positions, where argument structure becomes fully disambiguated, and most prominently in the alpha band. Pairwise classification showed reliable differentiation, especially between ditransitive and resultative constructions, while other pairs overlapped. Crucially, the temporal emergence and similarity structure of these effects mirror patterns in recurrent and transformer-based language models, where constructional representations arise during integrative processing stages. These findings support the view that linguistic constructions are neurally encoded as distinct form-meaning mappings, in line with Construction Grammar, and suggest convergence between biological and artificial systems on similar representational solutions. More broadly, this convergence is consistent with the idea that learning systems discover stable regions within an underlying representational landscape - recently termed a Platonic representational space - that constrains the emergence of efficient linguistic abstractions.

[NLP-63] Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition INTERSPEECH2026 DATE

【速读】: 该论文旨在解决多语言语音识别中基于大语言模型(LLM)的音素到字形转换(P2G)模块面临的挑战,特别是语言特异性生成和跨语言数据分布不均衡问题。其关键解决方案是引入鲁棒训练策略与低资源语言过采样技术,并采用简化版的蒙特卡洛近似方法(S-SKM),该方法避免了在P2G训练中使用CTC-based的S2P概率加权,从而有效缓解了因语音到音素(S2P)不确定性带来的影响,最终将平均词错误率(WER)从10.56%降低至7.66%。

链接: https://arxiv.org/abs/2603.29217
作者: Lukuang Dong,Ziwei Li,Saierdaer Yusuyin,Xianyu Zhao,Zhijian Ou
机构: TasiTech Co., Ltd.(TasiTech公司); Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University(清华大学语音处理与机器智能实验室); School of Computer Science and Technology, Xinjiang University(新疆大学计算机科学与技术学院)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Update after INTERSPEECH2026 submission

点击查看摘要

Abstract:Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.

信息检索

[IR-0] Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的搜索引擎中内容可见性下降的问题,即传统基于链接的检索机制向直接答案生成与选择性来源引用转变后,内容如何更有效地被引用以提升曝光度。现有生成式引擎优化(Generative Engine Optimization, GEO)方法主要聚焦于语义内容调整,忽视了结构特征对引用行为的影响。解决方案的关键在于提出 GEO-SFE 框架,通过系统性地对内容结构进行分层工程:宏观结构(文档架构)、中观结构(信息分块)与微观结构(视觉强调),建模其在不同生成式引擎架构下的引用概率影响,并设计保持语义完整性的同时提升结构有效性的架构感知优化策略与预测模型,从而显著提高引用率(平均提升 17.3%)和主观质量(平均提升 18.5%)。

链接: https://arxiv.org/abs/2603.29979
作者: Junwei Yu,Mufeng Yang,Yepeng Ding,Hiroyuki Sato
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 12 pages, 5 figures. This paper proposes GEO-SFE, a structural feature engineering framework for generative engine optimization

点击查看摘要

Abstract:The proliferation of AI-powered search engines has shifted information discovery from traditional link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. While existing Generative Engine Optimization (GEO) approaches focus primarily on semantic content modification, the role of structural features in influencing citation behavior remains underexplored. In this paper, we propose GEO-SFE, a systematic framework for structural feature engineering in generative engine optimization. Our approach decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis), and models their impact on citation probability across different generative engine architectures. We develop architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness. Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed framework. This work establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.
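GEO-SFE 的宏观/中观/微观三层结构特征,可以用对 Markdown 文本的简单计数来示意;具体特征定义为笔者假设,论文的特征工程远比这精细:

```python
def structural_features(md_text):
    # 对一段 Markdown 文本抽取三层结构特征(玩具定义)
    lines = md_text.splitlines()
    return {
        # 宏观:文档架构
        "headings": sum(l.lstrip().startswith("#") for l in lines),
        # 中观:信息分块
        "list_items": sum(l.lstrip().startswith(("-", "*")) for l in lines),
        "paragraphs": sum(
            1 for l in lines
            if l.strip() and not l.lstrip().startswith(("#", "-", "*"))
        ),
        # 微观:视觉强调
        "bold_spans": md_text.count("**") // 2,
    }

feats = structural_features("# 标题\n\n正文一段,含 **重点**。\n\n- 要点一\n- 要点二")
```

这类特征向量随后即可喂给引用概率预测模型,作为结构优化的打分依据。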

[IR-1] Rewrite the News: Tracing Editorial Reuse Across News Agencies LREC2026

【速读】:该论文旨在解决多语言新闻报道中句子级别文本复用(sentence-level text reuse)的检测问题,特别是如何在不依赖完整翻译的情况下识别跨语言的内容重复,以支持记者自动化筛选信息、缓解信息过载。其解决方案的关键在于提出一种弱监督方法,结合文章发布时间戳来定位最早可能的来源句对,从而实现无需全量翻译的跨语言复用检测;同时通过分析 reused content 在文章中的位置分布(如多集中于中后段),揭示了传统基于词汇匹配的方法会忽略大量编辑性复用行为,提升了对新闻生产流程中内容再利用模式的理解。

链接: https://arxiv.org/abs/2603.29937
作者: Soveatin Kuntur,Nina Smirnova,Anna Wroblewska,Philipp Mayr,Sebastijan Razboršek Maček
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: The paper is accepted to SoCon-NLPSI 2026 : Social Context (SoCon) and Integrating NLP and Psychology to Study Social Interactions (NLPSI) workshop co-located with LREC 2026

点击查看摘要

Abstract:This paper investigates sentence-level text reuse in multilingual journalism, analyzing where reused content occurs within articles. We present a weakly supervised method for detecting sentence-level cross-lingual reuse without requiring full translations, designed to support automated pre-selection to reduce information overload for journalists (Holyst et al., 2024). The study compares English-language articles from the Slovenian Press Agency (STA) with reports from 15 foreign agencies (FA) in seven languages, using publication timestamps to retain the earliest likely foreign source for each reused sentence. We analyze 1,037 STA and 237,551 FA articles from two time windows (October 7-November 2, 2023; February 1-28, 2025) and identify 1,087 aligned sentence pairs after filtering to the earliest sources. Reuse occurs in 52% of STA articles and 1.6% of FA articles and is predominantly non-literal, involving paraphrase and compositional reuse from multiple sources. Reused content tends to appear in the middle and end of English articles, while leads are more often original, indicating that simple lexical matching overlooks substantial editorial reuse. Compared with prior work focused on monolingual overlap, we (i) detect reuse across languages without requiring full translation, (ii) use publication timing to identify likely sources, and (iii) analyze where reused material is situated within articles. Dataset and code: this https URL.

[IR-2] UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

【速读】:该论文旨在解决多模态重排序(multimodal reranking)中的模态差距问题,即传统文本重排序器对图像候选项存在天然偏差,导致跨模态排序效果不佳;同时,现有基于视觉语言模型(VLM)的方案通常将文本转换为图像表示,引入额外计算开销,并且在特定领域表现受限。解决方案的关键在于提出UniRank框架,其核心创新是原生支持混合文本-图像候选项的评分与排序,无需任何模态转换;在此基础上构建端到端领域自适应流程:首先通过指令微调(instruction-tuning)阶段将标签token概率映射为统一标量得分以实现校准的跨模态相关性评分;其次利用硬负样本驱动的偏好对齐阶段,结合查询级策略优化与人类反馈强化学习(RLHF),在目标域内学习更精准的排序偏好,从而显著提升跨模态重排序性能。

链接: https://arxiv.org/abs/2603.29897
作者: Yupei Yang,Lin Yang,Wanxi Deng,Lin Qu,Shikui Tu,Lei Xu
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group (阿里巴巴集团)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
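UniRank 摘要中提到"将标签 token 的似然映射为统一标量得分"。下面给出一个常见做法的最小示意(假设性实现,非论文原始代码):对候选项的 "yes"/"no" 标签 logits 做二元 softmax,得到 (0, 1) 区间内可比较的相关性得分,从而对混合文本-图像候选统一排序。

```python
import math

def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Map a 'yes'/'no' label-token logit pair to a calibrated
    scalar in (0, 1) via a two-way softmax."""
    m = max(yes_logit, no_logit)  # subtract max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# Rank hybrid text/image candidates by the unified scalar score
# (candidate names and logits below are illustrative only).
candidates = {"text_doc": (2.1, -0.3), "image_doc": (0.4, 1.2)}
ranked = sorted(candidates,
                key=lambda c: relevance_score(*candidates[c]),
                reverse=True)
```

由于两类候选共用同一评分接口,文本与图像项无需任何模态转换即可直接比较。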

[IR-3] A Hybrid Machine Learning Approach for Graduate Admission Prediction and Combined University-Program Recommendation

【速读】:该论文旨在解决研究生录取竞争日益激烈背景下,如何提高录取预测准确性并为被拒申请人提供有效替代建议的问题。其核心解决方案是构建一个混合机器学习框架,关键在于将XGBoost模型与残差精修的k近邻(k-nearest neighbors, KNN)模块相结合,从而在包含13,000条GradCafe自报申请记录及多源增强特征(如OpenAlex API、QS世界大学学科排名、Wikidata SPARQL查询)的数据集上实现87%的测试准确率,并进一步开发推荐模块,为未获录取者提供针对性高校与项目替代方案,使预期录取概率提升70%,显著改善决策质量。

链接: https://arxiv.org/abs/2603.29881
作者: Melina Heidari Far,Elham Tabrizi
机构: Kharazmi University (哈拉兹米大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Graduate admissions have become increasingly competitive. This study highlights the need for a hybrid machine learning framework for graduate admission prediction, focusing on high-quality similar applicants and a recommendation system. The dataset, collected and enriched by the authors, includes 13,000 self-reported GradCafe application records from 2021 to 2025, enriched with features from the OpenAlex API, QS World University Rankings by Subject, and Wikidata SPARQL queries. A hybrid model was developed by combining XGBoost with a residual refinement k-nearest neighbors module, achieving 87% accuracy on the test set. A recommendation module, then built on the model for rejected applicants, provided targeted university and program alternatives, resulting in actionable guidance and improving expected acceptance probability by 70%. The results indicate that university quality metrics strongly influence admission decisions in competitive applicant pools. The features used in the study include applicant quality metrics, university quality metrics, program-level metrics, and interaction features.
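摘要中的"XGBoost + 残差精修 KNN"混合思路可以用一个玩具示例说明(示意性实现:基模型用一个简单线性规则代替论文中的 XGBoost,特征为一维):先用基模型预测,再用 k 近邻在训练残差上估计局部误差并加以修正。

```python
def knn_residual_predict(x, train_x, residuals, k=3):
    """Correct a base prediction by the mean residual of the k
    nearest training points (1-D features, absolute distance)."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - x))[:k]
    return sum(residuals[i] for i in nearest) / k

def hybrid_predict(x, base_model, train_x, train_y, k=3):
    """Base model output plus a k-NN estimate of its local error."""
    residuals = [y - base_model(xi) for xi, y in zip(train_x, train_y)]
    return base_model(x) + knn_residual_predict(x, train_x, residuals, k)

# Toy data: the true relation is y = 2x + 1 but the base model is
# y = 2x, a systematic bias of +1 that the residual k-NN recovers.
base = lambda x: 2 * x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
pred = hybrid_predict(2.5, base, xs, ys, k=3)  # ≈ 6.0
```

残差模块的作用是捕捉基模型的系统性局部偏差,这与论文"高质量相似申请人"驱动精修的动机一致。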

[IR-4] Performance Evaluation of LLMs in Automated RDF Knowledge Graph Generation

【速读】:该论文旨在解决云系统中异构日志数据难以有效转化为结构化知识表示的问题,从而提升日志的可解释性、根因分析能力及跨服务推理效率。其核心挑战在于如何从复杂的半结构化云日志中自动提取准确的RDF三元组以构建知识图谱(Knowledge Graph, KG)。解决方案的关键在于设计并评估两种基于大型语言模型(Large Language Models, LLMs)的自动化抽取流水线:一是多LLM协同的抽取管道,用于识别实体与关系并生成三元组;二是结合语法与语义指标的验证管道,用以量化输出质量。研究发现,Few-Shot学习策略配合链式思维(Chain-of-Thought)提示方法在准确性(如Llama达到99.35% F1分数)和RDF有效性方面表现最优,凸显了上下文示例和提示工程对提升RDF抽取精度的重要性,并揭示了不同LLM架构间的性能差异与局限性。

链接: https://arxiv.org/abs/2603.29878
作者: Ioana Ramona Martin,Tudor Cioara,Ionut Anghel,Gabriel Arcas
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: submitted to journal

点击查看摘要

Abstract:Cloud systems generate large, heterogeneous log data containing critical infrastructure, application, and security information. Transforming these logs into RDF triples enables their integration into knowledge graphs, improving interpretability, root-cause analysis, and cross-service reasoning beyond what raw logs allow. Large Language Models (LLMs) offer a promising approach to automate RDF knowledge graph generation; however, their effectiveness on complex cloud logs remains largely unexplored. In this paper, we evaluate multiple LLM architectures and prompting strategies for automated RDF extraction using a controlled framework with two pipelines for systematically processing semi-structured log data. The extraction pipeline integrates multiple LLMs to identify relevant entities and relationships, automatically generating subject-predicate-object triples. These outputs are evaluated using a dedicated validation pipeline with both syntactic and semantic metrics to assess accuracy, completeness, and quality. Due to the lack of public ground-truth datasets, we created a reference Log-to-KG dataset from OpenStack logs using manual annotation and ontology-driven methods, enabling an objective baseline. Our analysis shows that Few-Shot learning is the most effective strategy, with Llama achieving a 99.35% F1 score and 100% valid RDF output, while Qwen, NuExtract, and Gemma also perform well under Few-Shot prompting, with Chain-of-Thought approaches maintaining similar accuracy. One-Shot prompting offers a lighter but effective alternative, while Zero-Shot and advanced strategies such as Tree-of-Thought, Self-Critique, and Generate-Multiple perform substantially worse. These results highlight the importance of contextual examples and prompt design for accurate RDF extraction and reveal model-specific limitations across LLM architectures.
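验证管道中的三元组级 F1 评估可以用一个最小示意说明(假设 LLM 以管道符分隔的 "subject | predicate | object" 行输出三元组;该输出格式与示例日志实体均为本文假设,并非论文的真实 schema):

```python
def parse_triples(llm_output: str):
    """Parse 'subject | predicate | object' lines into a set of triples."""
    triples = set()
    for line in llm_output.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):  # keep only well-formed lines
            triples.add(tuple(parts))
    return triples

def triple_f1(predicted: set, gold: set) -> float:
    """Exact-match F1 between predicted and gold triple sets."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Illustrative gold annotations and a model output with one error.
gold = {("nova-compute", "raised", "InstanceNotFound"),
        ("req-123", "hasStatus", "ERROR")}
output = ("nova-compute | raised | InstanceNotFound\n"
          "req-123 | hasStatus | WARN")
f1 = triple_f1(parse_triples(output), gold)
```

这对应论文验证管道中的句法层指标;语义层指标(如本体一致性)则需在此之上另行检查。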

[IR-5] UnWeaving the knots of GraphRAG – turns out VectorRAG is almost enough

【速读】:该论文旨在解决传统检索增强生成(Retrieval-augmented generation, RAG)系统中因基于块(chunk-based)的检索机制导致的信息碎片化问题,即源文本块被表示为原子对象并编码为独立向量,缺乏对块间潜在关联的建模能力,从而难以处理需要多跳推理(multi-hop questions)的任务。其解决方案的关键在于提出UnWeaver框架,通过大语言模型(LLM)将文档内容解耦为跨多个块出现的实体(entity),在检索过程中以实体作为中间媒介来恢复原始文本块,从而在保持源材料忠实性的同时,实现更精炼的信息表示,并减少索引与生成过程中的噪声。

链接: https://arxiv.org/abs/2603.29875
作者: Ryszard Tuora,Mateusz Galiński,Michał Godziszewski,Michał Karpowicz,Mateusz Czyżnikiewicz,Adam Kozakiewicz,Tomasz Ziętkiewicz
机构: Samsung AI Center Warsaw
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:One of the key problems in Retrieval-augmented generation (RAG) systems is that chunk-based retrieval pipelines represent the source chunks as atomic objects, mixing the information contained within such a chunk into a single vector. These vector representations are then fundamentally treated as isolated, independent and self-sufficient, with no attempt to represent possible relations between them. Such an approach has no dedicated mechanisms for handling multi-hop questions. Graph-based RAG systems aimed to ameliorate this problem by modeling information as knowledge-graphs, with entities represented by nodes being connected by robust relations, and forming hierarchical communities. This approach however suffers from its own issues with some of them being: orders of magnitude increased componential complexity in order to create graph-based indices, and reliance on heuristics for performing retrieval. We propose UnWeaver, a novel RAG framework simplifying the idea of GraphRAG. UnWeaver disentangles the contents of the documents into entities which can occur across multiple chunks using an LLM. In the retrieval process entities are used as an intermediate way of recovering original text chunks hence preserving fidelity to the source material. We argue that entity-based decomposition yields a more distilled representation of original information, and additionally serves to reduce noise in the indexing, and generation process.

[IR-6] Cold-Starts in Generative Recommendation: A Reproducibility Study

【速读】:该论文旨在解决冷启动推荐(cold-start recommendation)问题,即在动态开放世界平台中,如何为新注册用户(用户冷启动)或新引入物品(物品冷启动)提供有效推荐,尤其是在交互信号稀疏或缺失的情况下。其解决方案的关键在于通过统一的冷启动协议对基于预训练语言模型(pre-trained language models, PLMs)的生成式推荐方法进行系统性可复现研究,明确区分并控制模型规模、标识符设计和训练策略等关键变量,从而更准确地评估生成式AI(Generative AI)在冷启动场景下的实际性能提升。

链接: https://arxiv.org/abs/2603.29845
作者: Zhen Zhang,Jujia Zhao,Xinyu Ma,Xin Xin,Maarten de Rijke,Zhaochun Ren
机构: Shandong University(山东大学); Leiden University(莱顿大学); Baidu Inc.(百度公司); University of Amsterdam(阿姆斯特丹大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Cold-start recommendation remains a central challenge in dynamic, open-world platforms, requiring models to recommend for newly registered users (user cold-start) and to recommend newly introduced items to existing users (item cold-start) under sparse or missing interaction signals. Recent generative recommenders built on pre-trained language models (PLMs) are often expected to mitigate cold-start by using item semantic information (e.g., titles and descriptions) and test-time conditioning on limited user context. However, cold-start is rarely treated as a primary evaluation setting in existing studies, and reported gains are difficult to interpret because key design choices, such as model scale, identifier design, and training strategy, are frequently changed together. In this work, we present a systematic reproducibility study of generative recommendation under a unified suite of cold-start protocols.

[IR-7] Drift-Aware Continual Tokenization for Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation)中协同演化(collaborative evolution)带来的挑战,即在真实世界环境中,新物品引入导致标识符冲突与偏移,以及用户交互变化引发现有物品的协同漂移(如共现模式和流行度变化),而传统方法通过全量重训练或简单微调tokenizer会带来高昂成本或破坏已有token-embedding对齐关系。解决方案的关键在于提出DACT(Drift-Aware Continual Tokenization)框架,其核心是:(i) 在tokenizer微调阶段引入联合训练的协同漂移识别模块(Collaborative Drift Identification Module, CDIM),输出个体物品级漂移置信度,实现对漂移与稳定物品的差异化优化;(ii) 采用松弛到严格(relaxed-to-strict)的分层代码重分配策略,在最小化不必要的token序列变更的同时有效适应协同演化。

链接: https://arxiv.org/abs/2603.29705
作者: Yuebo Feng,Jiahao Liu,Mingzhe Han,Dongsheng Li,Hansu Gu,Peng Zhang,Tun Lu,Ning Gu
机构: Fudan University (复旦大学); Microsoft Research Asia (微软亚洲研究院)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative recommendation commonly adopts a two-stage pipeline in which a learnable tokenizer maps items to discrete token sequences (i.e. identifiers) and an autoregressive generative recommender model (GRM) performs prediction based on these identifiers. Recent tokenizers further incorporate collaborative signals so that items with similar user-behavior patterns receive similar codes, substantially improving recommendation quality. However, real-world environments evolve continuously: new items cause identifier collision and shifts, while new interactions induce collaborative drift in existing items (e.g., changing co-occurrence patterns and popularity). Fully retraining both tokenizer and GRM is often prohibitively expensive, yet naively fine-tuning the tokenizer can alter token sequences for the majority of existing items, undermining the GRM’s learned token-embedding alignment. To balance plasticity and stability for collaborative tokenizers, we propose DACT, a Drift-Aware Continual Tokenization framework with two stages: (i) tokenizer fine-tuning, augmented with a jointly trained Collaborative Drift Identification Module (CDIM) that outputs item-level drift confidence and enables differentiated optimization for drifting and stationary items; and (ii) hierarchical code reassignment using a relaxed-to-strict strategy to update token sequences while limiting unnecessary changes. Experiments on three real-world datasets with two representative GRMs show that DACT consistently achieves better performance than baselines, demonstrating effective adaptation to collaborative evolution with reduced disruption to prior knowledge. Our implementation is publicly available at this https URL for reproducibility.

[IR-8] Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models ECIR2026

【速读】:该论文旨在解决现有叙事提取方法在连贯性(coherence)、交互性(interactivity)与多主线支持(multi-storyline support)之间难以平衡的问题。传统方法如Narrative Maps虽能生成多个故事线但牺牲单条路径的连贯性,而Narrative Trails虽保证高连贯性却缺乏用户引导机制和多视角支持。其解决方案的关键在于提出基于议程(agenda-based)的叙事提取方法,通过将大语言模型(Large Language Models, LLMs)嵌入到Narrative Trails的路径规划过程中,在每一步对候选文档进行排序,以匹配用户指定的议程目标,同时保持叙事连贯性。该方法能够在同一语料库上通过切换不同议程生成多样化且语义一致的故事线,实验证明LLM驱动的议程引导相比关键词匹配在语义议程上提升9.9%的对齐度(p=0.017),并在特定议题(如“政权镇压”)上提升13.3%(p=0.037),且仅带来2.2%的连贯性下降,显著优于基线方法。

链接: https://arxiv.org/abs/2603.29661
作者: Brian Felipe Keith-Norambuena,Carolina Inés Rojas-Córdova,Claudio Juvenal Meneses-Villegas,Elizabeth Johanna Lam-Esquenazi,Angélica María Flores-Bustos,Ignacio Alejandro Molina-Villablanca,Joshua Emanuel Leyton-Vallejos
机构: Universidad Católica del Norte(天主教北方大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Text2Story Workshop 2026 at ECIR 2026

点击查看摘要

Abstract:Existing narrative extraction methods face a trade-off between coherence, interactivity, and multi-storyline support. Narrative Maps supports rich interaction and generates multiple storylines as a byproduct of its coverage constraints, though this comes at the cost of individual path coherence. Narrative Trails achieves high coherence through maximum capacity path optimization but provides no mechanism for user guidance or multiple perspectives. We introduce agenda-based narrative extraction, a method that bridges this gap by integrating large language models into the Narrative Trails pathfinding process to steer storyline construction toward user-specified perspectives. Our approach uses an LLM at each step to rank candidate documents based on their alignment with a given agenda while maintaining narrative coherence. Running the algorithm with different agendas yields different storylines through the same corpus. We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas. LLM-driven steering achieves 9.9% higher alignment than keyword matching on semantic agendas (p=0.017), with 13.3% improvement on "Regime Crackdown" specifically (p=0.037), while keyword matching remains competitive on agendas with literal keyword overlap. The coherence cost is minimal: LLM steering reduces coherence by only 2.2% compared to the agenda-agnostic baseline. Counter-agendas that contradict the source material score uniformly low (2.2-2.5) across all methods, confirming that steering cannot fabricate unsupported narratives.

[IR-9] Semantic Interaction for Narrative Map Sensemaking: An Insight-based Evaluation ECIR2026

【速读】:该论文旨在解决如何通过语义交互(Semantic Interaction, SI)提升叙事地图(narrative map)在叙事理解(narrative sensemaking)中的有效性问题。现有研究虽提出了基于SI的叙事提取框架,但缺乏实证评估。其解决方案的关键在于设计并验证一种具备语义交互能力的互动叙事地图原型,相较于时间线基线和基础叙事地图,该方案显著提升了用户生成洞察的质量与效率,并识别出两种SI使用模式——修正型(corrective)与添加型(additive),分别对应质量判断与结构组织,同时表明SI可作为模型优化的替代路径,在减少参数调整的同时实现相当的探索广度。

链接: https://arxiv.org/abs/2603.29651
作者: Brian Felipe Keith-Norambuena,Fausto German,Eric Krokos,Sarah Joseph,Chris North
机构: Universidad Católica del Norte (北方天主教大学); Virginia Tech (弗吉尼亚理工学院); U.S. Government (美国政府)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Text2Story Workshop 2026 at ECIR 2026

点击查看摘要

Abstract:Semantic interaction (SI) enables analysts to incorporate their cognitive processes into AI models through direct manipulation of visualizations. While SI frameworks for narrative extraction have been proposed, empirical evaluations of their effectiveness remain limited. This paper presents a user study that evaluates SI for narrative map sensemaking, involving 33 participants under three conditions: a timeline baseline, a basic narrative map, and an interactive narrative map with SI capabilities. The results show that the map-based prototypes yielded more insights than the timeline baseline, with the SI-enabled condition reaching statistical significance and the basic map condition trending in the same direction. The SI-enabled condition showed the highest mean performance; differences between the map conditions were not statistically significant but showed large effect sizes (d > 0.8), suggesting that the study was underpowered to detect them. Qualitative analysis identified two distinct SI approaches-corrective and additive-that enable analysts to impose quality judgments and organizational structure on extracted narratives. We also find that SI users achieved comparable exploration breadth with less parameter manipulation, suggesting that SI serves as an alternative pathway for model refinement. This work provides empirical evidence that map-based representations outperform timelines for narrative sensemaking, along with qualitative insights into how analysts use SI for narrative refinement.

[IR-10] Storing Less Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras

【速读】:该论文旨在解决持续运行的边缘摄像头生成的视频流中冗余帧导致跨模态检索性能下降的问题,即冗余帧会挤占正确结果在top-k搜索中的位置,从而影响检索精度。解决方案的关键在于提出一种流式检索架构:首先在设备端使用epsilon-net过滤器保留语义上新颖的帧,构建去噪的嵌入索引;其次通过跨模态适配器和云端重排序模块补偿紧凑编码器带来的对齐能力不足。该方案实现了单次遍历流式过滤,在两个以第一人称视角为主的基准数据集(AEA、EPIC-KITCHENS)上优于多种离线方法(如k-means、最远点采样、均匀采样、随机采样),并基于8M参数的本地编码器在测试集上达到45.6%的Hit@5指标,功耗估计为2.7 mW。

链接: https://arxiv.org/abs/2603.29631
作者: Sherif Abdelwahab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
备注: 6 pages, 3 figures, 5 tables; supplementary video included as ancillary file

点击查看摘要

Abstract:Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder’s weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
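摘要中的设备端 epsilon-net 过滤器可以概括为一条单遍规则:仅当当前帧的嵌入与所有已保留帧的距离均不小于 epsilon 时才保留该帧。下面是一个最小示意(假设使用欧氏距离;论文中的具体度量与阈值未在摘要中给出):

```python
import math

def epsilon_net_filter(embeddings, eps):
    """Single-pass novelty filter: retain an embedding only if it is
    at least eps away from every previously retained embedding."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    kept = []
    for e in embeddings:
        if all(dist(e, k) >= eps for k in kept):
            kept.append(e)
    return kept

# A redundant stream: near-duplicate frames collapse onto the first
# instance, so only semantically novel frames enter the index.
stream = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0), (1.0, 1.01), (3.0, 0.0)]
kept = epsilon_net_filter(stream, eps=0.5)  # 3 novel frames retained
```

该过滤器的关键性质是单遍、流式:每帧只与已保留集合比较一次,适合功耗受限的边缘设备。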

[IR-11] On Strengths and Limitations of Single-Vector Embeddings

【速读】:该论文旨在解决单向量嵌入(single-vector embeddings)在信息检索任务中性能显著下降的问题,尤其是在自然场景数据集LIMIT上表现出的检索质量恶化现象。此前研究(Weller et al., 2025)认为维度不足是主要原因,但本文通过实证与理论分析指出,维度并非决定性因素;真正关键在于领域偏移(domain shift)和嵌入相似性与任务相关性之间的错位(misalignment),这导致模型难以捕捉真实语义关联。解决方案的核心在于对单向量模型进行微调(fine-tuning),可有效缓解上述问题并显著提升召回率(recall)。然而,即使微调后,单向量模型仍明显弱于多向量表示(multi-vector representations),且在LIMIT类数据集上微调会导致灾难性遗忘(catastrophic forgetting),而多向量模型则表现稳定。进一步地,论文揭示了“文档淹没悖论”(drowning in documents paradox)——随着语料库扩大,相关文档因嵌入相似性呈现噪声统计代理特性而被稀释,从而加剧单向量模型的脆弱性,这是其相较于多向量模型的根本局限所在。

链接: https://arxiv.org/abs/2603.29519
作者: Archish S,Mihir Agarwal,Ankit Garg,Neeraj Kayal,Kirankumar Shiragur
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Recent work (Weller et al., 2025) introduced a naturalistic dataset called LIMIT and showed empirically that a wide range of popular single-vector embedding models suffer substantial drops in retrieval quality, raising concerns about the reliability of single-vector embeddings for retrieval. Although (Weller et al., 2025) proposed limited dimensionality as the main factor contributing to this, we show that dimensionality alone cannot explain the observed failures. We observe from results in (Alon et al., 2016) that (2k+1)-dimensional vector embeddings suffice for top-k retrieval. This result points to other drivers of poor performance. Controlling for tokenization artifacts and linguistic similarity between attributes yields only modest gains. In contrast, we find that domain shift and misalignment between embedding similarities and the task's underlying notion of relevance are major contributors; finetuning mitigates these effects and can improve recall substantially. Even with finetuning, however, single-vector models remain markedly weaker than multi-vector representations, pointing to fundamental limitations. Moreover, finetuning single-vector models on LIMIT-like datasets leads to catastrophic forgetting (performance on MSMARCO drops by more than 40%), whereas forgetting for multi-vector models is minimal. To better understand the gap between performance of single-vector and multi-vector models, we study the drowning in documents paradox (Reimers & Gurevych, 2021; Jacob et al., 2025): as the corpus grows, relevant documents are increasingly "drowned out" because embedding similarities behave, in part, like noisy statistical proxies for relevance. Through experiments and mathematical calculations on toy mathematical models, we illustrate why single-vector models are more susceptible to drowning effects compared to multi-vector models.
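单向量与多向量表示的打分差异可以具体化如下:单向量模型用一次点积为文档打分,而 ColBERT 风格的多向量"后期交互"模型对每个查询 token 取其与文档 token 的最大相似度再求和(MaxSim)。下面是一个示意(仅说明两种打分机制,并非论文实际评测的模型):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def single_vector_score(query_vec, doc_vec):
    """One embedding per side: a single dot product."""
    return dot(query_vec, doc_vec)

def maxsim_score(query_vecs, doc_vecs):
    """Multi-vector late interaction: for each query token vector,
    take the best-matching document token vector, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query "tokens" probing two distinct attributes of a document:
# pooling each side into one vector blurs both matches, while
# MaxSim keeps each attribute match sharp.
q_tokens = [(1.0, 0.0), (0.0, 1.0)]
doc_tokens = [(1.0, 0.0), (0.0, 1.0)]
pooled_q = (0.5, 0.5)
pooled_d = (0.5, 0.5)
single = single_vector_score(pooled_q, pooled_d)  # 0.5
multi = maxsim_score(q_tokens, doc_tokens)        # 2.0
```

这种"多个局部匹配 vs 一个混合向量"的差别,正是摘要所述单向量模型更易被大语料"淹没"的直观来源之一。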

[IR-12] Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

【速读】:该论文旨在解决在隐式反馈(implicit feedback)场景下,直接偏好优化(Direct Preference Optimization, DPO)因未观测项不可靠而产生的错误梯度抑制问题。其核心挑战在于传统负采样策略容易引入虚假负样本(false negatives),从而损害模型训练稳定性与排序性能。解决方案的关键在于将确定性硬负采样替换为从动态top-K候选池中进行随机采样,通过引入可控的随机性,在保留有效硬信号的同时降低由假负样本引发的错误梯度抑制,从而提升排序性能。此方法在三个Amazon数据集上实现了最高5.25%的NDCG@5提升,且推理成本几乎不变。

链接: https://arxiv.org/abs/2603.29259
作者: Hejin Huang,Jusheng Zhang,Kaitong Cai,Jian Wang,Rong Pan
机构: Sun Yat-sen University(中山大学); Snap Inc
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems. Yet, existing work rarely examines how Direct Preference Optimization (DPO) behaves under implicit feedback, where unobserved items are not reliable negatives. We conduct systematic experiments on multimodal sequential recommendation to compare common negative-selection strategies and their interaction with DPO training. Our central finding is that a simple modification, replacing deterministic hard negatives with stochastic sampling from a dynamic top-K candidate pool, consistently improves ranking performance. We attribute its effectiveness to two factors: (1) reducing erroneous suppressive gradients caused by false negatives, and (2) retaining informative hard signals while smoothing optimization via controlled stochasticity. With an optional sparse Mixture-of-Experts encoder for efficient capacity scaling, RoDPO achieves up to a 5.25% NDCG@5 improvement on three Amazon benchmarks, with nearly unchanged inference cost.
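摘要描述的负样本选择改动(用动态 top-K 候选池中的随机采样替代确定性硬负样本)可以示意如下(得分、池大小与物品名均为演示用假设):

```python
import random

def sample_negative(scores, positives, k=5, rng=None):
    """Sample a DPO negative uniformly from the k highest-scored
    items, excluding known positives (which would otherwise become
    false negatives and produce erroneous suppressive gradients)."""
    rng = rng or random.Random()
    pool = [item for item, _ in
            sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            if item not in positives][:k]
    return rng.choice(pool)

scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1, "e": 0.05}
positives = {"a"}  # the observed interaction must never be the negative
neg = sample_negative(scores, positives, k=3, rng=random.Random(0))
# neg is one of {"b", "c", "d"}: hard, but not deterministic
```

相较于总是取最高分负样本,受控的随机性既保留了"硬"信号,又降低了单个假负样本被反复抑制的风险。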

[IR-13] APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自主代理(autonomous agents)应用中缺乏持续性过程记忆的问题,即模型在面对结构相同但未见过的任务时仍需从头推导解决方案,无法复用以往的成功经验。其核心解决方案是提出APEX-EM框架,关键在于引入一种非参数化的在线学习机制:通过结构化经验表示(structured experience representation)编码每次执行的完整过程-情景轨迹(包括规划步骤、中间产物、迭代历史及错误分析),并构建Plan-Retrieve-Generate-Iterate-Ingest(PRGII)工作流,结合任务验证器提供多维奖励信号;同时设计双结果经验记忆(dual-outcome Experience Memory),融合语义搜索、结构签名匹配与计划有向无环图(DAG)遍历实现跨域迁移——即使任务间无词汇重叠,只要操作结构相似即可检索复用。此方法使成功案例作为正向上下文示例、失败案例带结构化错误标注作为负例,从而显著提升复杂任务中的性能表现。

链接: https://arxiv.org/abs/2603.29093
作者: Pratyay Banerjee,Masud Moshtaghi,Ankit Chadha
机构: Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 17 pages, 13 figures

点击查看摘要

Abstract:LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution – planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal – enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench [zhuo2025bigcodebench], KGQAGen-10k [zhang2025kgqagen], and Humanity's Last Exam [phan2025hle] using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL's [memrl2025] +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.

[IR-14] Zero-shot Cross-domain Knowledge Distillation: A Case study on YouTube Music

【速读】:该论文旨在解决低流量推荐系统中知识蒸馏(Knowledge Distillation, KD)应用的挑战,即受限的数据量难以训练大尺寸教师模型,且为特定场景单独训练大型教师模型成本过高。解决方案的关键在于采用零样本跨域知识蒸馏(zero-shot cross-domain KD),通过从数据丰富的源域(如YouTube视频推荐平台)迁移知识到目标域(音乐推荐应用),从而在不依赖目标域标注数据的情况下提升小流量场景下的多任务排序模型性能。实验表明,该方法在离线与线上环境中均能有效改善音乐推荐系统的排序效果。

链接: https://arxiv.org/abs/2603.28994
作者: Srivaths Ranganathan,Nikhil Khani,Shawn Andrews,Chieh Lo,Li Wei,Gergo Varady,Jochen Klingenhoefer,Tim Steele,Bernardo Cunha,Aniruddh Nath,Yanwei Song
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) has been widely used to improve the quality of latency sensitive models serving live traffic. However, applying KD in production recommender systems with low traffic is challenging: the limited amount of data restricts the teacher model size, and the cost of training a large dedicated teacher may not be justified. Cross-domain KD offers a cost-effective alternative by leveraging a teacher from a data-rich source domain, but introduces unique technical difficulties, as the features, user interfaces, and prediction tasks can significantly differ. We present a case study of using zero-shot cross-domain KD for multi-task ranking models, transferring knowledge from a (100x) large-scale video recommendation platform (YouTube) to a music recommendation application with significantly lower traffic. We share offline and live experiment results and present findings evaluating different KD techniques in this setting across two ranking models on the music app. Our results demonstrate that zero-shot cross-domain KD is a practical and effective approach to improve the performance of ranking models on low traffic surfaces.
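作为背景,知识蒸馏的核心目标函数可以用一个最小示意说明(这是通用的 Hinton 式蒸馏损失,并非论文评测的具体跨域方案):对温度软化后的教师与学生分布计算 KL 散度,学生越贴近教师,损失越小。

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    the standard knowledge-distillation objective."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]       # large source-domain ranking model
student_far = [0.0, 2.0, 1.0]   # disagrees with the teacher
student_near = [2.9, 1.1, 0.3]  # closely matches the teacher
loss_far = distillation_loss(student_far, teacher)
loss_near = distillation_loss(student_near, teacher)  # near zero
```

跨域设定下的额外难点在于教师与学生的特征和任务不同构,摘要正是围绕这一点展开案例研究。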

[IR-15] Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA

【速读】:该论文旨在解决多跳问答(multi-hop question answering)中异构检索融合(heterogeneous retrieval fusion)时向量相似度与图相关性得分(如个性化PageRank,PPR)因分布差异导致难以直接融合的问题。其核心挑战在于如何在不丢失原始得分幅度信息的前提下实现不同来源检索信号的稳定组合。解决方案的关键是提出PhaseGraph方法,通过百分位秩归一化(percentile-rank normalization, PIT)将向量和图得分映射到统一的无量纲尺度上,从而实现可靠的分数校准与融合;实验表明,该策略显著提升了最后一步检索性能(LastHop@5),且相较于最小-最大归一化更具鲁棒性,而具体的融合后操作(如线性加权或Boltzmann加权)对最终效果影响较小。

链接: https://arxiv.org/abs/2603.28886
作者: Andre Bacellar
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores have different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is directionally more robust than min-max normalization on both tune and test splits (1W/6L, p=0.125), while Boltzmann weighting performs comparably to linear fusion after calibration (0W/3L, p=0.25). These results suggest that score commensuration is a robust design choice, and the exact post-calibration operator appears to matter less on these benchmarks.
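百分位秩(PIT)校准后再融合的思路可以示意如下(融合权重与示例得分均为演示用假设,非论文实际配置):先把每组得分映射为经验百分位秩,再做线性加权组合。

```python
def percentile_ranks(scores):
    """Map raw scores to empirical percentile ranks in (0, 1],
    a probability-integral-transform-style normalization."""
    order = sorted(scores)
    n = len(scores)
    return [(order.index(s) + 1) / n for s in scores]

def fuse(vector_scores, graph_scores, alpha=0.5):
    """Linear fusion of calibrated vector and graph (e.g. PPR) scores."""
    v = percentile_ranks(vector_scores)
    g = percentile_ranks(graph_scores)
    return [alpha * vi + (1 - alpha) * gi for vi, gi in zip(v, g)]

# Raw vector similarities and PPR mass live on very different scales;
# after PIT calibration both contribute comparably to the ranking.
vec = [0.92, 0.85, 0.10]   # cosine-like similarities
ppr = [1e-4, 5e-3, 2e-3]   # Personalized PageRank mass
fused = fuse(vec, ppr)
best = max(range(3), key=lambda i: fused[i])  # document 1 wins
```

若直接把原始得分相加,PPR 的数量级(约 1e-3)会被余弦相似度完全压制;校准到统一无量纲尺度后,两路信号才能稳定融合。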

[IR-16] UltRAG: A Universal, Simple, Scalable Recipe for Knowledge Graph RAG

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时容易产生自信但事实错误的内容(即“幻觉”问题),特别是在利用知识图谱(Knowledge Graph, KG)进行多跳推理的场景下,传统检索增强生成(Retrieval Augmented Generation, RAG)方法难以有效适配。其解决方案的关键在于提出ULTRAG框架,通过引入现成的神经查询执行模块(neural query execution modules),使LLM无需重新训练即可直接调用外部知识图谱资源,从而实现对Wikidata规模(1.16亿实体、16亿关系)知识图谱的高效接口与问答,显著优于现有KG-RAG方法,并在性能和成本之间取得更优平衡。

链接: https://arxiv.org/abs/2603.28773
作者: Dobrik Georgiev,Kheeran Naidu,Alberto Cattaneo,Federico Monti,Carlo Luschi,Daniel Justus
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) frequently generate confident yet factually incorrect content when used for language generation (a phenomenon often known as hallucination). Retrieval augmented generation (RAG) tries to reduce factual errors by identifying information in a knowledge corpus and putting it in the context window of the model. While this approach is well-established for document-structured data, it is non-trivial to adapt it for Knowledge Graphs (KGs), especially for queries that require multi-node/multi-hop reasoning on graphs. We introduce ULTRAG, a general framework for retrieving information from Knowledge Graphs that shifts away from classical RAG. By endowing LLMs with off-the-shelf neural query executing modules, we highlight how readily available language models can achieve state-of-the-art results on Knowledge Graph Question Answering (KGQA) tasks without any retraining of the LLM or executor involved. In our experiments, ULTRAG achieves better performance when compared to state-of-the-art KG-RAG solutions, and it enables language models to interface with Wikidata-scale graphs (116M entities, 1.6B relations) at comparable or lower costs.

人机交互

[HC-0] HapCompass: A Rotational Haptic Device for Contact-Rich Robotic Teleoperation ICRA

【速读】:该论文旨在解决机器人遥操作中因接触密集任务导致的感知挑战,特别是穿戴式遥操作界面在提供直观方向性触觉反馈方面的瓶颈问题。现有方案如手持控制器的非方向性振动或振动触觉阵列易引发感知干扰,难以有效传递方向信息。其解决方案的关键在于提出一种名为HapCompass的新颖低成本可穿戴触觉设备,通过机械旋转单个线性共振执行器(Linear Resonant Actuator, LRA)来渲染二维方向性触觉线索,从而显著提升操作者对方向信息的感知精度与任务表现,实验证明该方法在成功率、完成时间及最大接触力方面均优于纯视觉和非方向性反馈基线,并初步验证其在模仿学习中的数据质量提升潜力。

链接: https://arxiv.org/abs/2603.30042
作者: Xiangshan Tan,Jingtian Ji,Tianchong Jiang,Pedro Lopes,Matthew R. Walter
机构: Toyota Technological Institute at Chicago (丰田技术学院芝加哥分校); University of Chicago (芝加哥大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Accepted to IEEE International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 5 figures. Project page: this https URL

点击查看摘要

Abstract:The contact-rich nature of manipulation makes it a significant challenge for robotic teleoperation. While haptic feedback is critical for contact-rich tasks, providing intuitive directional cues within wearable teleoperation interfaces remains a bottleneck. Existing solutions, such as non-directional vibrations from handheld controllers, provide limited information, while vibrotactile arrays are prone to perceptual interference. To address these limitations, we propose HapCompass, a novel, low-cost wearable haptic device that renders 2D directional cues by mechanically rotating a single linear resonant actuator (LRA). We evaluated HapCompass’s ability to convey directional cues to human operators and showed that it increased the success rate, decreased the completion time and the maximum contact force for teleoperated manipulation tasks when compared to vision-only and non-directional feedback baselines. Furthermore, we conducted a preliminary imitation-learning evaluation, suggesting that the directional feedback provided by HapCompass enhances the quality of demonstration data and, in turn, the trained policy. We release the design of the HapCompass device along with the code that implements our teleoperation interface: this https URL.
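将二维方向线索映射为单个 LRA 的旋转角度,可用如下示意代码表达(假设性实现,仅说明"方向向量 → 旋转角"这一思路,并非 HapCompass 设备的真实控制代码):

```python
import math


def direction_to_rotation(fx, fy):
    """把二维力/方向向量映射为执行器旋转角(单位:度)。

    角度以 +x 轴为 0、逆时针为正,归一化到 [0, 360);
    (近)零向量返回 None,表示不渲染方向线索。
    """
    if math.hypot(fx, fy) < 1e-9:
        return None
    return math.degrees(math.atan2(fy, fx)) % 360.0
```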

[HC-1] Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness Framework Comparison and the Weak-Model Compensation Effect

【速读】:该论文旨在解决结构化意图表示(structured intent representation)在不同人工智能模型、语言和提示框架下能否可靠地保持用户目标一致性的问题。其核心挑战在于跨模型、跨语言场景中如何提升AI对用户意图的理解与对齐能力,从而增强人机交互的稳定性与效率。解决方案的关键在于采用基于5W3H(Who, What, When, Where, Why, How, How much, How many)的结构化提示协议(Prompt Protocol Specification, PPS),并通过系统性实验验证其有效性:在多语言(中文、英文、日语)、多模型(Claude、GPT-4o、Gemini 2.5 Pro)和多种提示框架下,结构化提示显著降低了跨语言评分方差(标准差从0.470降至约0.020),且表现出“弱模型补偿效应”——即基础性能较弱的模型受益更明显;同时,在用户研究中,AI辅助扩展后的5W3H提示使交互轮次减少60%,用户满意度提升显著。这表明,结构化意图表示作为一种类协议的通信层,具有良好的鲁棒性和实用价值。

链接: https://arxiv.org/abs/2603.29953
作者: Peng Gang
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 25 pages, figures, tables, and appendix. Third paper in a cumulative research series on PPS and 5W3H structured intent representation, extending prior work to cross-model robustness, framework comparison, and user-study validation

点击查看摘要

Abstract:How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.
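5W3H 结构化提示的组装过程可以用如下示意代码表达(字段命名与输出格式均为假设,并非 PPS 规范原文;未填写的维度被显式列出,便于 AI 辅助扩展环节追问或补全):

```python
FIVE_W_THREE_H = ["who", "what", "when", "where", "why",
                  "how", "how_much", "how_many"]


def build_5w3h_prompt(intent):
    """把 5W3H 字段字典渲染为结构化提示文本块。

    intent 中缺失的维度会汇总到末尾的 Unspecified 行,
    供模型(或 AI 辅助扩展步骤)继续澄清。
    """
    lines, missing = [], []
    for dim in FIVE_W_THREE_H:
        value = intent.get(dim)
        if value:
            lines.append(f"{dim.replace('_', ' ').title()}: {value}")
        else:
            missing.append(dim)
    if missing:
        lines.append("Unspecified: " + ", ".join(missing))
    return "\n".join(lines)
```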

[HC-2] XR is XR: Rethinking MR and XR as Neutral Umbrella Terms

【速读】:该论文试图解决的问题是:当前广泛使用的术语“XR”(扩展现实)在起源、含义及其形成过程上缺乏明确共识,存在多种解释,导致术语使用混乱。解决方案的关键在于通过梳理虚拟现实(VR)、增强现实(AR)、混合现实(MR)及XR相关术语的历史演变,并区分术语创制与实际采纳的驱动因素,指出XR并非“Extended Reality”的缩写,而是一个中立的符号标签,用于包容多个与“现实”相关的概念;同时强调术语的稳定使用需依赖学术界、产业界和标准化组织之间的协同治理。

链接: https://arxiv.org/abs/2603.29939
作者: Takeshi Kurata
机构: 未知
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR); Multimedia (cs.MM)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:The term XR is currently widely used as an expression encompassing Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). However, there is no clear consensus regarding its origin or meaning. XR is sometimes explained as an abbreviation for Extended Reality, but multiple interpretations exist regarding its etymology and formation process. This paper organizes the historical formation of terminology related to VR, AR, MR, and XR, and reexamines the context in which the term XR emerged and how it has spread. In particular, by presenting a timeline that distinguishes between the coinage of terms and the drivers of their adoption, we suggest that XR, as an umbrella term, functions not as an abbreviation of Extended Reality, but rather as a neutral symbolic label that encompasses multiple “reality”-related terms. Furthermore, we argue that stable usage of terminology, including XR, requires governance through collaboration among academia, industry, and standardization organizations.

[HC-3] Interview-Informed Generative Agents for Product Discovery: A Validation Study

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在产品发现场景中,尤其是概念测试阶段,是否能够通过模拟用户响应来替代真实用户的反馈。其解决方案的关键在于构建基于深度访谈的生成式代理(interview-informed generative agents),这些代理以知识工作者的个性化访谈数据为依据进行建模,并将其对新型AI概念的评价与原始参与者的真实反馈进行对比分析。研究发现,尽管这些代理无法精确复现个体特征(identity-imprecise),但能准确捕捉群体层面的响应分布(distribution-calibrated),从而表明LLM模拟在早期概念筛选和迭代阶段具有实用价值,前提是仅需关注分布准确性而非个体差异。

链接: https://arxiv.org/abs/2603.29890
作者: Zichao Wang,Alexa Siu
机构: Adobe Research(Adobe研究院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: CHI 2026 Honourable Mention

点击查看摘要

Abstract:Large language models (LLMs) have shown strong performance on standardized social science instruments, but their value for product discovery remains unclear. We investigate whether interview-informed generative agents can simulate user responses in concept testing scenarios. Using in-depth workflow interviews with knowledge workers, we created personalized agents and compared their evaluations of novel AI concepts against the same participants’ responses. Our results show that agents are distribution-calibrated but identity-imprecise: they fail to replicate the specific individual they are grounded in, yet approximate population-level response distributions. These findings highlight both the potential and the limits of LLM simulation in design research. While unsuitable as a substitute for individual-level insights, simulation may provide value for early-stage concept screening and iteration, where distributional accuracy suffices. We discuss implications for integrating simulation responsibly into product development workflows.

[HC-4] Generative AI in Action: Field Experimental Evidence from Alibaba's Customer Service Operations

【速读】:该论文旨在解决生成式 AI (Generative AI) 在实际工作场景中如何影响人工客服绩效的问题,特别是在电商售后客服领域。研究通过与阿里巴巴合作开展大规模实地实验,评估了生成式 AI 助手对客服效率和质量的影响。解决方案的关键在于设计并部署一个具备问题诊断与解决方案建议功能的生成式 AI 助手,允许人工客服自主决定是否采纳、修改或忽略 AI 提供的内容,并通过双重因果识别策略——意图处理效应(ITT)和局部平均处理效应(LATE)——精确量化 AI 接入与实际使用对服务速度、主观质量(如客户评分)及客观质量(如重访率)的影响。结果表明,AI 显著提升了服务响应速度与主观满意度,尤其对低绩效客服改善明显;但对高绩效客服则因多任务倾向加剧导致服务质量下降,揭示出生成式 AI 对工作流程重构的复杂效应,强调需根据员工能力差异制定差异化部署策略。

链接: https://arxiv.org/abs/2603.29888
作者: Xiao Ni,Yiwei Wang,Tianjun Feng,Lauren Xiaoyan Lu,Yitong Wang,Congyi Zhou
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In collaboration with Alibaba, this study leverages a large-scale field experiment to assess the impact of a generative AI assistant on worker performance in e-commerce after-sales service. Human agents providing digital chat support were randomly assigned with access to a gen AI assistant that offered two core functions: diagnosis of customer issues and solution proposals, presented as text messages. Agents retained discretion to adopt, modify, or disregard AI-generated messages. To evaluate gen AI’s impact, we estimate both the intention-to-treat (ITT) effect of gen AI access and the local average treatment effect (LATE) of gen AI usage. Results show that gen AI significantly improved service speed, measured by issue identification time and chat duration. Gen AI also improved subjective service quality reflected in customer ratings and dissatisfaction rates, but it had no significant effect on objective service quality indicated by customer retrial rates. The performance improvements stemmed not only from automation but also from changes in the dynamics of agent-customer interactions: agent communication became more informative and efficient, while customers experienced reduced communication burdens. Low performers achieved the greatest improvements in both service speed and quality, narrowing the performance gap. In contrast, top-performing agents showed little improvement in service speed but experienced declines in both subjective and objective service quality. Evidence suggests that this decline results from increased multitasking tendency, proxied by longer shift-away times across concurrent chats, which slowed customer responses and raised abandonment and retrial rates. These findings suggest that gen AI reshapes work, demanding tailored deployment strategies.
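文中 ITT(意图处理效应)与 LATE(局部平均处理效应)的估计思路可用 Wald 估计量示意如下(玩具数据,非论文真实实验数据):

```python
def mean(xs):
    return sum(xs) / len(xs)


def itt_effect(outcomes_assigned, outcomes_control):
    """意图处理效应:按随机分组比较结果均值之差。"""
    return mean(outcomes_assigned) - mean(outcomes_control)


def late_wald(outcomes_assigned, outcomes_control,
              used_assigned, used_control):
    """Wald 估计量:结果上的 ITT 除以实际使用率上的 ITT(合规率)。"""
    return itt_effect(outcomes_assigned, outcomes_control) / (
        mean(used_assigned) - mean(used_control))
```

当对照组完全无法使用 AI 时,分母就退化为处理组的实际采用率。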

[HC-5] AI Empathy Erodes Cognitive Autonomy in Younger Users

【速读】:该论文旨在解决生成式 AI 在情感对齐(affective alignment)过程中可能对年轻用户发展自主性造成的系统性风险问题。研究表明,尽管情感镜像(emotional mirroring)常被视为高级人机交互的标志,但其也可能表现为情感谄媚(affective sycophancy),强化用户的即时情绪状态,削弱其独立进行情绪调节与批判性思考所需的认知摩擦。尤其在基于强化学习人类反馈(RLHF)的奖励模型中,成人导向的“有用性”定义可能无意中促进年轻用户的依赖性而非认知重评能力。论文的关键解决方案是提出“斯多葛架构”(stoic architectures),强调功能中立性(functional neutrality),以保障用户自主性的可持续发展。

链接: https://arxiv.org/abs/2603.29886
作者: Junfeng Jiao,Abhejay Murali,Saleh Afroogh
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Affective alignment in generative AI represents a systemic risk to the developmental autonomy of younger users. Although emotional mirroring is commonly seen as a hallmark of advanced human-machine interaction, it can also manifest as affective sycophancy, reinforcing a user’s immediate emotional state. By providing a sense of objectivity to transient anxieties, these systems diminish the cognitive friction necessary for independent emotional management and critical thought. Reward models driven by RLHF could heighten this dilemma by embedding adult-focused definitions of helpfulness, unintentionally promoting emotional dependency in younger users rather than facilitating cognitive reappraisal. This paper exposes the misalignment between adult-labeled reward signals and the developmental requirements of younger users, proposing stoic architectures that emphasize functional neutrality to preserve user autonomy.

[HC-6] Beyond AI advice – independent aggregation boosts human-AI accuracy

【速读】:该论文旨在解决当前广泛采用的“AI作为顾问”(AI-as-advisor)模式中存在的问题,即人类决策者常对准确的AI建议视而不见,却过度依赖错误的建议,且长期使用可能导致自身判断能力退化。其解决方案的核心是提出一种名为“混合确认树”(Hybrid Confirmation Tree, HCT)的新机制:HCT要求人类与AI独立生成判断,若二者一致则采纳该决策;若不一致,则由另一名人类进行裁决。这一设计通过保留人机判断的独立性,有效避免了人类对AI建议的盲目信任或忽视,从而在多个领域数据集上显著优于传统AI顾问模式,尤其在AI提供解释的情况下仍保持优越性能。

链接: https://arxiv.org/abs/2603.29866
作者: Julian Berger,Pantelis P. Analytis,Ville Satopää,Ralf H.J.M. Kurvers
机构: Max Planck Institute for Human Development (马克斯·普朗克人类发展研究所); University of Southern Denmark (南丹麦大学); Science of Intelligence, Technical University Berlin (柏林工业大学智能科学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is broadly deployed as an advisor to human decision-makers: AI recommends a decision and a human accepts or rejects the advice. This approach, however, has several limitations: People frequently ignore accurate advice and rely too much on inaccurate advice, and their decision-making skills may deteriorate over time. Here, we compare the AI-as-advisor approach to the hybrid confirmation tree (HCT), an alternative strategy that preserves the independence of human and AI judgments. The HCT elicits a human judgment and an AI judgment independently of each other. If they agree, that decision is accepted. If not, a second human breaks the tie. For the comparison, we used 10 datasets from various domains, including medical diagnostics and misinformation discernment, and a subset of four datasets in which AI also explained its decision. The HCT outperformed the AI-as-advisor approach in all datasets. The HCT also performed better in almost all cases in which AI offered an explanation of its judgment. Using signal detection theory to interpret these results, we find that the HCT outperforms the AI-as-advisor approach because people cannot discriminate well enough between correct and incorrect AI advice. Overall, the HCT is a robust, accurate, and transparent alternative to the AI-as-advisor approach, offering a simple mechanism to tap into the wisdom of hybrid crowds.
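HCT 的决策规则本身非常简单,可示意如下(此处把判断简化为二值;实际研究覆盖医疗诊断、虚假信息辨别等多领域数据集):

```python
def hybrid_confirmation_tree(human_1, ai, human_2):
    """混合确认树(HCT)决策规则。

    三个参数是相互独立给出的二值判断:
    若第一位人类与 AI 一致,则直接采纳该共同判断;
    若不一致,则由第二位人类裁决。
    """
    if human_1 == ai:
        return human_1
    return human_2
```

关键在于三个判断彼此独立地产生,从而避免"AI 作为顾问"模式下人类对建议的过度依赖或忽视。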

[HC-7] An Interactive LLM-Based Simulator for Dementia-Related Activities of Daily Living

【速读】:该论文旨在解决当前辅助人工智能(Assistive AI)与机器人在阿尔茨海默病及相关痴呆症(ADRD)照护中面临的困境:缺乏丰富情境、隐私敏感的日常活动(ADL)行为数据,导致其难以实现有效的沟通适应与个性化支持。解决方案的关键在于开发了一个基于网络的模拟器,利用大语言模型(gpt-5-mini)生成多轮、受痴呆严重程度和照护场景条件约束的患者行为,并将话语与轻量级行为线索(括号内标注)配对;用户可设定关键参数后以照护者身份交互并评分,系统通过专家反馈迭代优化提示词与工作流,从而构建一个数据驱动的验证闭环,支撑照护训练、AI政策制定及人机协同模拟的持续改进。

链接: https://arxiv.org/abs/2603.29856
作者: Kruthika Gangaraju,Shu-Fen Wung,Kevin Berner,Jing Wang,Fengpei Yuan
机构: Worcester Polytechnic Institute(伍斯特理工学院); University of California Davis(加州大学戴维斯分校); MGH Institute of Health Professionals(麻省总医院健康专业研究所); University of New Hampshire(新罕布什尔大学)
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Effective dementia caregiving requires training and adaptive communication, but assistive AI and robotics are constrained by a lack of context-rich, privacy-sensitive data on how people living with Alzheimer’s disease and related dementias (ADRD) behave during activities of daily living (ADLs). We introduce a web-based simulator that uses a large language model (gpt-5-mini) to generate multi-turn, severity- and care-setting-conditioned patient behaviors during ADL assistance, pairing utterances with lightweight behavioral cues (in parentheses). Users set dementia severity, care setting (and time in setting), and ADL; after each patient turn they rate realism (1-5) with optional critique, then respond as the caregiver via free text or by selecting/editing one of four strategy-scaffolded suggestions (Recognition, Negotiation, Facilitation, Validation). We ran an online formative expert-in-the-loop study (14 dementia-care experts, 18 sessions, 112 rated turns). Simulated behavior was judged moderately to highly plausible, with a typical session length of six turns. Experts wrote custom replies for 54.5 percent of turns; Recognition and Facilitation were the most-used suggested strategies. Thematic analysis of critiques produced a six-category failure-mode taxonomy, revealing recurring breakdowns in ADL grounding and care-setting consistency and guiding prompt/workflow refinements. The simulator and logged interactions enable an evidence-driven refinement loop toward validated patient-caregiver co-simulation and support data collection, caregiver training, and assistive AI and robot policy development.

[HC-8] CADReasoner: Iterative Program Editing for CAD Reverse Engineering

【速读】:该论文旨在解决当前生成式 CAD(Computer-Aided Design)系统在逆向工程中难以精确还原几何细节的问题,尤其是在单次推理框架下无法有效捕捉输入形状与重建结果之间的细微差异。传统方法缺乏迭代优化机制,而人类工程师则通过反复比较输入与重建结果并逐步修正设计来提升精度。为实现这一类人化的迭代过程,作者提出 CADReasoner,其核心创新在于引入基于几何差异的迭代优化机制:模型输出可执行的 CadQuery Python 程序,并将渲染后的网格作为反馈输入,持续调整生成策略直至逼近目标形状。此外,为弥合真实扫描数据与仿真训练之间的现实差距,论文提出“扫描-仿真协议”(scan-simulation protocol),在训练和评估阶段统一使用该协议增强模型鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2603.29847
作者: Soslan Kabisov,Vsevolod Kirichuk,Andrey Volkov,Gennadii Savrasov,Marina Barannikov,Anton Konushin,Andrey Kuznetsov,Dmitrii Zhemchuzhnikov
机构: Lomonosov Moscow State University (莫斯科国立大学); Université Paris Dauphine (巴黎多菲纳大学); Innopolis University (因诺波利斯大学); FusionBrain Lab, AXXX (FusionBrain实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Computer-Aided Design (CAD) powers modern engineering, yet producing high-quality parts still demands substantial expert effort. Many AI systems tackle CAD reverse engineering, but most are single-pass and miss fine geometric details. In contrast, human engineers compare the input shape with the reconstruction and iteratively modify the design based on remaining discrepancies. Agent-based methods mimic this loop with frozen VLMs, but weak 3D grounding of current foundation models limits reliability and efficiency. We introduce CADReasoner, a model trained to iteratively refine its prediction using geometric discrepancy between the input and the predicted shape. The model outputs a runnable CadQuery Python program whose rendered mesh is fed back at the next step. CADReasoner fuses multi-view renders and point clouds as complementary modalities. To bridge the realism gap, we propose a scan-simulation protocol applied during both training and evaluation. Across DeepCAD, Fusion 360, and MCB benchmarks, CADReasoner attains state-of-the-art results on clean and scan-sim tracks.
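摘要中"生成—执行—比较—再修正"的迭代闭环可抽象为如下示意循环(propose/execute/discrepancy 均为占位回调,并非论文各组件的真实接口):

```python
def iterative_cad_refinement(target, propose, execute, discrepancy,
                             max_steps=5, tol=1e-3):
    """CADReasoner 式迭代优化循环的抽象骨架。

    propose(target, prev_program, prev_error) 生成候选 CAD 程序;
    execute(program) 返回其几何结果;discrepancy 比较两个几何
    (如 Chamfer 式距离)。误差低于 tol 或步数耗尽时停止。
    """
    program, error = None, float("inf")
    for _ in range(max_steps):
        program = propose(target, program, error)
        error = discrepancy(target, execute(program))
        if error <= tol:
            break
    return program, error
```

用一维标量几何做玩具验证即可看到误差随步数单调收缩的行为。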

[HC-9] AI and Computer Science: Contradictions Emerge between Ideologies

【速读】:该论文试图解决的问题是:当前人工智能(AI)技术发展中,企业与高校目标交织所引发的意识形态矛盾及其对劳动者(特别是计算机科学学生和公众)的潜在不利影响。论文指出,尽管企业和计算机科学界长期将计算技术塑造成“赋能”工具,但随着生成式 AI 生产的加速,这种意识形态内部出现裂痕,亟需从社会、经济和政治维度重新审视 AI 技术发展的本质。其解决方案的关键在于引入一种以意识形态为分析框架的概念化模型,用以批判性地理解 AI 技术开发过程中权力关系的再生产,并追问“谁被真正赋能”,从而揭示当前技术发展路径中的结构性不平等。

链接: https://arxiv.org/abs/2603.29746
作者: Andruid Kerne
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We develop a conceptualization of ideology, in which a system of ideas represents social, economic, and political relationships. We use ideology as a lens for understanding and critiquing intersecting social, economic, and political aspects of how ‘AI’ technologies are being developed. We observe ideological shifts. We question that the present tangling of corporate and university objectives is beneficial to labor, particularly computer science students, and the general public. Corporations and computer science have a history of marketing the ideology of computing as empowerment. However, with intensification of the production of ‘AI’, contradictions emerge. We ask, “Who is being empowered?”

[HC-10] KEditVis: A Visual Analytics System for Knowledge Editing of Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在知识编辑过程中存在的两大问题:一是难以识别最优的模型层作为编辑目标,二是现有方法依赖的总结性指标无法提供充分的指导,导致编辑过程缺乏透明度,阻碍了策略比较与优化。解决方案的关键在于提出一种名为 KEditVis 的可视化分析系统,通过交互式可视化手段帮助用户精准选择编辑层、探究无效编辑的原因,并实施更具针对性的编辑操作,从而提升编辑效果并为未来知识编辑算法的发展提供可洞察的依据。

链接: https://arxiv.org/abs/2603.29689
作者: Zhenning Chen,Hanbei Zhan,Yanwei Huang,Xin Wu,Dazhen Deng,Di Weng,Yingcai Wu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE PacificVis 2026 (TVCG Journal Track)

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate exceptional capabilities in factual question answering, yet they sometimes provide incorrect responses. To address this issue, knowledge editing techniques have emerged as effective methods for correcting factual information in LLMs. However, typical knowledge editing workflows struggle with identifying the optimal set of model layers for editing and rely on summary indicators that provide insufficient guidance. This lack of transparency hinders effective comparison and identification of optimal editing strategies. In this paper, we present KEditVis, a novel visual analytics system designed to assist users in gaining a deeper understanding of knowledge editing through interactive visualizations, improving editing outcomes, and discovering valuable insights for the future development of knowledge editing algorithms. With KEditVis, users can select appropriate layers as the editing target, explore the reasons behind ineffective edits, and perform more targeted and effective edits. Our evaluation, including usage scenarios, expert interviews, and a user study, validates the effectiveness and usability of the system.

[HC-11] Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor

【速读】:该论文试图解决的问题是:当前对生成式 AI(Generative AI)影响人类认知的主流观点——即简单地认为其放大了达克效应(Dunning-Kruger effect)——过于粗糙,无法准确解释实证证据中观察到的复杂现象。解决方案的关键在于提出“AI中介的认知解耦”(AI-mediated metacognitive decoupling)的工作模型,该模型揭示了四个变量之间的脱节:产出质量、底层理解、校准准确性与自我评估能力之间的差距扩大,从而更全面地解释过度自信、过度或不足依赖、辅助依赖效应(crutch effects)以及迁移能力弱化等现象,优于传统单一维度的达克效应类比。

链接: https://arxiv.org/abs/2603.29681
作者: Christopher Koch
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.

[HC-12] All-in-One Augmented Reality Guided Head and Neck Tumor Resection

【速读】:该论文旨在解决头颈部鳞状细胞癌(head and neck squamous cell carcinoma)手术中阳性切缘(positive margins)再切除不精确的问题,其核心挑战在于术中无法准确定位病理报告中的阳性边缘位置,通常依赖口头沟通导致误差较大。解决方案的关键在于开发了一套全集成的增强现实(augmented reality, AR)系统,利用HoloLens 2的深度感知能力和全自动无标记表面配准技术,将切除标本上的阳性切缘重新定位到术区原切缘床并实时可视化,从而实现高精度的术中引导。实验表明,该方法在模拟环境中实现了与标记基方法相当的配准精度(中位误差1.8 mm vs. 1.7 mm),并将定位误差从传统口头指导的中位14.2 mm显著降低至3.2 mm,验证了无标记AR切缘引导的可行性与优越性。

链接: https://arxiv.org/abs/2603.29495
作者: Yue Yang,Matthieu Chabanas,Carrie Reale,Annie Benson,Jason Slagle,Matthew Weinger,Michael Topf,Jie Ying Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.

[HC-13] Poster: Content-Aware Layout Generation for Interactive Poster Design via Graph-Enhanced Diffusion Models

【速读】:该论文旨在解决海报布局设计中用户意图难以精确表达与实现的问题,尤其是在生成式 AI (Generative AI) 领域缺乏对内容感知和约束引导的交互式布局生成方法。其解决方案的关键在于提出 iPoster 框架,该框架采用统一的图增强扩散架构(graph-enhanced diffusion architecture),通过掩码策略在每一步去噪过程中精确保留用户指定的约束信息(如元素类别、尺寸、位置或粗略草图),并引入跨内容感知注意力模块(cross content-aware attention module)使生成元素与画布中的显著区域对齐,从而保证视觉一致性与用户意图的高保真实现。

链接: https://arxiv.org/abs/2603.29469
作者: Xudong Zhou,Jinyuan Liang,Qiuyi Guo,Guozheng Li
机构: Beijing Institute of Technology (北京理工大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present iPoster, an interactive layout generation framework that empowers users to guide content-aware poster layout design by specifying flexible constraints. iPoster enables users to specify partial intentions within the intention module, such as element categories, sizes, positions, or coarse initial drafts. Then, the generation module instantly generates refined, context-sensitive layouts that faithfully respect these constraints. iPoster employs a unified graph-enhanced diffusion architecture that supports various design tasks under user-specified constraints. These constraints are enforced through masking strategies that precisely preserve user input at every denoising step. A cross content-aware attention module aligns generated elements with salient regions of the canvas, ensuring visual coherence. Extensive experiments show that iPoster not only achieves state-of-the-art layout quality, but offers a responsive and controllable framework for poster layout design with constraints.
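"在每一步去噪后用掩码精确保留用户约束"这一机制可示意如下(一维参数列表上的玩具实现,并非 iPoster 实际的扩散模型代码):

```python
def apply_constraint_mask(denoised, user_values, mask):
    """掩码位置直接覆盖为用户指定值,保证约束被逐步精确保留。

    denoised / user_values:每个元素的布局参数列表(如 x, y, w, h);
    mask[i] = 1 表示该参数由用户固定。
    """
    return [u if m else d for d, u, m in zip(denoised, user_values, mask)]


def denoise_with_constraints(x, user_values, mask, steps, denoise_step):
    """去噪循环:每一步之后都重新施加用户约束。"""
    for t in reversed(range(steps)):
        x = denoise_step(x, t)
        x = apply_constraint_mask(x, user_values, mask)
    return x
```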

[HC-14] VACP: Visual Analytics Context Protocol

【速读】:该论文旨在解决当前AI代理(AI agent)在视觉分析(Visual Analytics, VA)系统中因缺乏对应用状态、交互机制和执行路径的显式暴露而导致的任务执行不准确、效率低下的问题。现有基于计算机视觉和原始DOM访问的代理方法难以有效理解VA界面,限制了其作为协作用户的能力。解决方案的关键在于提出一种名为视觉分析上下文协议(Visual Analytics Context Protocol, VACP)的框架,该协议通过显式暴露应用状态、可用交互项及直接执行机制,使VA应用具备“代理就绪”能力;同时,论文还提供了AI代理在VA界面中的需求与知识表示的正式规范,并将VACP实现为兼容主流可视化语法和Web框架的库,从而显著提升代理在界面解析与任务执行上的成功率,同时降低token消耗与延迟,弥合人类中心VA界面与机器可感知性之间的鸿沟。

链接: https://arxiv.org/abs/2603.29322
作者: Tobias Stähle,Péter Ferenc Gyarmati,Thilo Spinner,Rita Sevastjanova,Dominik Moritz,Mennatallah El-Assady
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The rise of AI agents introduces a fundamental shift in Visual Analytics (VA), in which agents act as a new user group. Current agentic approaches - based on computer vision and raw DOM access - fail to perform VA tasks accurately and efficiently. This paper introduces the Visual Analytics Context Protocol (VACP), a framework designed to make VA applications “agent-ready” that extends generic protocols by explicitly exposing application state, available interactions, and mechanisms for direct execution. To support our context protocol, we contribute a formal specification of AI agent requirements and knowledge representations in VA interfaces. We instantiate VACP as a library compatible with major visualization grammars and web frameworks, enabling augmentation of existing systems and the development of new ones. Our evaluation across representative VA tasks demonstrates that VACP-enabled agents achieve higher success rates in interface interpretation and execution compared to current agentic approaches, while reducing token consumption and latency. VACP closes the gap between human-centric VA interfaces and machine perceivability, ensuring agents can reliably act as collaborative users in VA systems.
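VACP 向代理显式暴露"应用状态 + 可执行交互"的思想,可以用一个示意性的上下文负载来说明(字段名均为假设,并非已发布的协议 schema):

```python
import json


def build_vacp_context(state, interactions):
    """组装一个 VACP 风格的示意上下文负载:当前应用状态,
    加上代理可以直接执行的交互列表(名称 + 参数)。"""
    return json.dumps({
        "protocol": "vacp-example",  # 示意字段,非正式协议标识
        "state": state,
        "interactions": [
            {"name": name, "params": params} for name, params in interactions
        ],
    }, indent=2)
```

代理读取该负载后即可跳过对像素或原始 DOM 的解析,直接选择并触发其中的交互项。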

[HC-15] Sima AIunty: Caste Audit in LLM-Driven Matchmaking

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在南亚婚恋匹配场景中是否再现或挑战种姓制度结构性不平等的问题。其核心问题是:当前主流LLM在评估婚姻匹配时,是否会无意识地复制印度传统种姓体系中的等级偏见,从而加剧社会排斥。解决方案的关键在于通过受控审计实验,系统性地操纵种姓身份(婆罗门、刹帝利、吠舍、首陀罗、达利特)与收入水平,并对五类主流LLM(GPT、Gemini、Llama、Qwen和BharatGPT)进行多维评分测试(社会接受度、婚姻稳定性和文化契合度),从而量化模型输出的偏倚模式。结果揭示出显著的同种姓优先偏好,且跨种姓匹配评分普遍低于同种姓匹配(最高低25%),表明现有模型延续了历史种姓等级结构,凸显了在社会敏感领域部署AI时需引入文化适配的评估框架与干预机制的重要性。

链接: https://arxiv.org/abs/2603.29288
作者: Atharva Naik,Shounok Kar,Varnika Sharma,Ashwin Rajadesingan,Koustuv Saha
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Social and personal decisions in relational domains such as matchmaking are deeply entwined with cultural norms and historical hierarchies, and can potentially be shaped by algorithmic and AI-mediated assessments of compatibility, acceptance, and stability. In South Asian contexts, caste remains a central aspect of marital decision-making, yet little is known about how contemporary large language models (LLMs) reproduce or disrupt caste-based stratification in such settings. In this work, we conduct a controlled audit of caste bias in LLM-mediated matchmaking evaluations using real-world matrimonial profiles. We vary caste identity across Brahmin, Kshatriya, Vaishya, Shudra, and Dalit, and income across five buckets, and evaluate five LLM families (GPT, Gemini, Llama, Qwen, and BharatGPT). Models are prompted to assess profiles along dimensions of social acceptance, marital stability, and cultural compatibility. Our analysis reveals consistent hierarchical patterns across models: same-caste matches are rated most favorably, with average ratings up to 25% higher (on a 10-point scale) than inter-caste matches, which are further ordered according to traditional caste hierarchy. These findings highlight how existing caste hierarchies are reproduced in LLM decision-making and underscore the need for culturally grounded evaluation and intervention strategies in AI systems deployed in socially sensitive domains, where such systems risk reinforcing historical forms of exclusion.
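论文的受控审计设计(种姓 × 种姓 × 收入网格、同种姓与跨种姓评分对比)可示意如下(rate_fn 在原研究中对应一次 LLM 评分调用,此处抽象为任意可调用对象):

```python
from itertools import product
from statistics import mean

CASTES = ["Brahmin", "Kshatriya", "Vaishya", "Shudra", "Dalit"]


def audit_caste_bias(rate_fn, incomes):
    """跑一遍受控审计网格,汇总同种姓与跨种姓匹配的平均评分。

    rate_fn(caste_a, caste_b, income) -> 10 分制评分。
    """
    same, inter = [], []
    for a, b, inc in product(CASTES, CASTES, incomes):
        rating = rate_fn(a, b, inc)
        (same if a == b else inter).append(rating)
    return {"same_caste_mean": mean(same), "inter_caste_mean": mean(inter)}
```

两个均值之差即可作为摘要中"同种姓评分最高可高出 25%"这类偏倚指标的直接度量。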

[HC-16] Customer Analysis and Text Generation for Small Retail Stores Using LLM-Generated Marketing Presence

Quick read: This paper addresses two complementary challenges small retail stores face when producing point-of-purchase (POP) materials: text generated by large language models (LLMs) often lacks creative diversity, while non-expert users have limited experience in marketing and content creation. The key is a human-AI collaborative prototype that supports users in understanding target customers via simulated personas, generating drafts, refining expressions, and evaluating candidate texts; with system support, the average evaluation score of POP texts improved by 2.37 points on a -3 to +3 scale.

Link: https://arxiv.org/abs/2603.29273
Authors: Shiori Nakamura, Masato Kikuchi, Tadachika Ozono
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: The 17th International Conference on Smart Computing and Artificial Intelligence (SCAI 2025)

Abstract:Point of purchase (POP) materials can be created to assist non-experts by combining large language models (LLMs) with human insight. Persuasive POP texts require both customer understanding and expressive writing skills. However, LLM-generated texts often lack creative diversity, while human users may have limited experience in marketing and content creation. To address these complementary limitations, we propose a prototype system for small retail stores that enhances POP creation through human-AI collaboration. The system supports users in understanding target customers, generating draft POP texts, refining expressions, and evaluating candidates through simulated personas. Our experimental results show that this process significantly improves text quality: the average evaluation score increased by 2.37 points on a -3 to +3 scale compared to that created without system support.

[HC-17] An Experiential Approach to AI Literacy

Quick read: This paper targets the "knowing-doing" gap in workplace AI adoption: workers understand basic AI concepts but remain uncertain about where AI applies, what problems it can solve, and how it fits into real workflows. The key is an experiential approach to AI literacy that integrates participants' daily experiences into learning, using storytelling to brainstorm AI use cases grounded in their own work contexts, moving individuals from abstract notions toward actionable knowledge and, through participatory design, toward concrete, feasible AI use cases.

Link: https://arxiv.org/abs/2603.29238
Authors: Aakanksha Khandwaha, Edith Law
Affiliations: University of Waterloo
Subjects: Human-Computer Interaction (cs.HC)
Comments: Paper accepted at CHI 2026 Workshop on Data Literacy. For more details, see this https URL

Abstract:Despite AI tools becoming more prevalent and applicable to a variety of workplaces, workers consistently report uncertainty about where AI applies, what problems it can help solve, and how it fits into real workflows. In other words, there is a gap between 'knowing' and 'doing' when it comes to AI literacy. We propose an experiential form of AI literacy which integrates participants' daily experiences into the learning experience by brainstorming grounded AI use cases through storytelling. We introduce a novel pedagogical approach that helps individuals move away from abstract notions of AI towards practical knowledge of how AI would (or would not) work in different workflows, contexts, and situations. Through this approach, we anticipate two major outcomes: (1) enhanced AI literacy for stakeholders within a variety of work sectors and (2) concrete AI use cases developed through participatory design that are grounded in AI literacy and participants' expertise.

[HC-18] SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation

Quick read: This paper addresses the lack of any publicly available dataset for Syrian Arabic Sign Language (SyArSL), a low-resource sign language, which contributes to the communication barriers the Deaf and Hard-of-Hearing (DHH) community in Syria faces when accessing news and other information. The key contribution is SyriSign, the first SyArSL dataset, comprising 1500 video samples covering 150 unique lexical signs and designed for text-to-sign translation. Evaluations with three deep learning architectures (MotionCLIP, T2M-GPT, and SignCLIP) show that generative approaches hold promise for sign representation, while the limited dataset size constrains generalization performance.

Link: https://arxiv.org/abs/2603.29219
Authors: Mohammad Amer Khalil, Raghad Nahas, Ahmad Nassar, Khloud Al Jallad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Sign language is the primary approach of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To overcome this gap, we introduce SyriSign, a dataset comprising 1500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news are delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.

[HC-19] BiMoE: Brain-Inspired Experts for EEG-Dominant Affective State Recognition ICME2026

Quick read: This paper tackles three key problems in multimodal sentiment analysis (MSA): (1) treating electroencephalogram (EEG) signals as homogeneous, ignoring region-specific characteristics of affective processing; (2) using EEG as a black-box input, lacking interpretability of neural representations; and (3) inefficient fusion of EEG features with complementary peripheral physiological signals (PPS). The key is BiMoE, a brain-inspired mixture-of-experts framework that partitions EEG signals in a brain-topology-aware manner, with each expert using a dual-stream encoder to extract local and global spatiotemporal features and a dedicated expert handling PPS with multi-scale large-kernel convolutions; all experts are fused dynamically via adaptive routing and a joint loss function, improving multimodal sentiment classification while retaining interpretability.

Link: https://arxiv.org/abs/2603.29205
Authors: Hongyu Zhu, Lin Chen, Mingsheng Shang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted by ICME 2026

Abstract:Multimodal Sentiment Analysis (MSA) that integrates Electroencephalogram (EEG) with peripheral physiological signals (PPS) is crucial for the development of brain-computer interface (BCI) systems. However, existing methods encounter three major challenges: (1) overlooking the region-specific characteristics of affective processing by treating EEG signals as homogeneous; (2) treating EEG as a black-box input, which lacks interpretability into neural representations;(3) ineffective fusion of EEG features with complementary PPS features. To overcome these issues, we propose BiMoE, a novel brain-inspired mixture of experts framework. BiMoE partitions EEG signals in a brain-topology-aware manner, with each expert utilizing a dual-stream encoder to extract local and global spatiotemporal features. A dedicated expert handles PPS using multi-scale large-kernel convolutions. All experts are dynamically fused through adaptive routing and a joint loss function. Evaluated under strict subject-independent settings, BiMoE consistently surpasses state-of-the-art baselines across various affective dimensions. On the DEAP and DREAMER datasets, it yields average accuracy improvements of 0.87% to 5.19% in multimodal sentiment classification. The code is available at: this https URL.
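The adaptive routing that fuses the experts can be illustrated with a generic mixture-of-experts forward pass. This is a toy sketch of the general technique, not BiMoE's actual architecture: the experts, gating matrix, and input below are illustrative stand-ins.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(x, experts, gate_w):
    """Mixture-of-experts step: gate scores -> softmax weights ->
    weighted sum of per-expert outputs."""
    weights = softmax(gate_w @ x)                      # one weight per expert
    outputs = np.stack([expert(x) for expert in experts])
    return weights @ outputs, weights

# Two toy "experts" and a gating matrix (illustrative stand-ins).
experts = [lambda x: x * 2.0, lambda x: x + 1.0]
gate_w = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([2.0, 0.0])
y, w = moe_forward(x, experts, gate_w)
```

In a real system the experts would be neural encoders over EEG regions and PPS, and the gate would itself be learned; the routing principle, a convex combination of expert outputs weighted by the input, is the same.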

[HC-20] Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa

Quick read: This paper addresses the challenge of providing timely, accurate learning support in large-scale online coding courses, especially in resource-constrained settings. The key is Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built on retrieval-augmented generation (RAG) and deployed in a human-in-the-loop forum within SuaCode, a mobile-based introductory coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses, combining the speed and scalability of AI with oversight and error correction from human facilitators and the learner community: it achieved high accuracy on curriculum-related questions, while human intervention markedly reduced errors on administrative queries. The findings suggest that human-in-the-loop generative AI can pair AI efficiency with the reliability of human judgment, offering a viable path to learning support at scale in under-resourced regions.

Link: https://arxiv.org/abs/2603.29159
Authors: George Boateng, Samuel Boateng, Victor Kumbol
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 8 pages, Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract:Providing timely and accurate learning support in large-scale online coding courses is challenging, particularly in resource-constrained contexts. We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation. We deployed the system in a 15-month longitudinal study spanning 15 cohorts with 3,717 enrollments across 35 African countries. Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors, particularly for administrative queries. Our findings demonstrate that human-in-the-loop generative AI systems can combine the scalability and speed of AI with the reliability of human support, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.
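Retrieval-augmented generation, as used here, first retrieves the course chunks most relevant to a question before generating an answer. A minimal toy sketch of the retrieval step, using a bag-of-words embedding in place of a real sentence-embedding model (the sample chunks and all names are illustrative, not SuaCode materials):

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def embed(text, vocab):
    """Bag-of-words vector over a shared vocabulary, L2-normalized.
    A real RAG system would use a learned sentence-embedding model."""
    toks = tokenize(text)
    v = np.array([toks.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(question, chunks, k=2):
    """Return the top-k chunks by cosine similarity to the question."""
    vocab = sorted(set(tokenize(question + " " + " ".join(chunks))))
    q = embed(question, vocab)
    ranked = sorted(chunks, key=lambda c: float(embed(c, vocab) @ q), reverse=True)
    return ranked[:k]

chunks = [
    "Variables store values in Processing sketches.",
    "Loops repeat a block of code.",
    "The draw function runs once per frame.",
]
top = retrieve("how do loops repeat code", chunks, k=1)
```

The retrieved chunks are then placed into the LLM prompt so generated answers stay grounded in course content.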

[HC-21] REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour

Quick read: This paper addresses the difficulty of delivering timely, individualized formative feedback at scale. Existing LLM-based feedback systems mostly treat feedback as a static, one-way artifact, with little support for interpretation, clarification, or follow-up. The key is REFINE, a locally deployable multi-agent feedback system with three core components: a pedagogically grounded feedback generation agent; a regeneration loop guided by a human-aligned LLM-as-a-judge to raise feedback quality; and a self-reflective, tool-calling interactive agent that produces context-aware, actionable responses to student follow-up questions. Experiments show the architecture significantly improves feedback quality and interaction efficiency and systematically steers subsequent student inquiry, demonstrating the feasibility and effectiveness of multi-agent, tool-augmented feedback systems for scalable, interactive feedback.

Link: https://arxiv.org/abs/2603.29142
Authors: Fares Fawzi, Seyed Parsa Neshaei, Marta Knezevic, Tanya Nazaretsky, Tanja Käser
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to AIED 2026

Abstract:Formative feedback is central to effective learning, yet providing timely, individualised feedback at scale remains a persistent challenge. While recent work has explored the use of large language models (LLMs) to automate feedback, most existing systems still conceptualise feedback as a static, one-way artifact, offering limited support for interpretation, clarification, or follow-up. In this work, we introduce REFINE, a locally deployable, multi-agent feedback system built on small, open-source LLMs that treats feedback as an interactive process. REFINE combines a pedagogically-grounded feedback generation agent with an LLM-as-a-judge-guided regeneration loop using a human-aligned judge, and a self-reflective tool-calling interactive agent that supports student follow-up questions with context-aware, actionable responses. We evaluate REFINE through controlled experiments and an authentic classroom deployment in an undergraduate computer science course. Automatic evaluations show that judge-guided regeneration significantly improves feedback quality, and that the interactive agent produces efficient, high-quality responses comparable to a state-of-the-art closed-source model. Analysis of real student interactions further reveals distinct engagement patterns and indicates that system-generated feedback systematically steers subsequent student inquiry. Our findings demonstrate the feasibility and effectiveness of multi-agent, tool-augmented feedback systems for scalable, interactive feedback.
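The judge-guided regeneration loop described above can be sketched generically. Here `generate` and `judge` are caller-supplied stand-ins for the paper's feedback and judge agents, and the quality threshold and round budget are illustrative, not REFINE's actual settings:

```python
def generate_feedback(submission, generate, judge, threshold=4.0, max_rounds=3):
    """Judge-guided regeneration: keep regenerating feedback, feeding the
    judge's critique back in, until the score clears a threshold or the
    round budget is exhausted. `generate(submission, critique)` and
    `judge(feedback) -> (score, critique)` are caller-supplied."""
    critique = None
    feedback = None
    for _ in range(max_rounds):
        feedback = generate(submission, critique)
        score, critique = judge(feedback)
        if score >= threshold:
            break
    return feedback

# Stub models standing in for the feedback and judge agents.
drafts = iter(["vague draft", "improved draft"])
gen = lambda s, c: next(drafts)
jdg = lambda f: (5.0, None) if "improved" in f else (2.0, "be specific")
result = generate_feedback("student code", gen, jdg)
```

With real LLM agents, the critique string would be appended to the generation prompt, so each round is conditioned on what the judge found lacking.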

[HC-22] SciVisAgent Bench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents

Quick read: This paper addresses the lack of a principled, reproducible benchmark for evaluating scientific visualization (SciVis) agents in realistic, multi-step analysis settings. Although generative AI and large language models (LLMs) have enabled agents that translate natural-language intent into executable SciVis tasks, the community has had no common standard for measuring their performance and reliability. The key is SciVisAgentBench, a structured, extensible benchmark spanning four dimensions (application domain, data type, complexity level, and visualization operation) with 108 expert-crafted cases, together with a multimodal, outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators (image metrics, code checkers, rule-based verifiers, and case-specific evaluators) to quantify SciVis agent capabilities and support diagnostic analysis.

Link: https://arxiv.org/abs/2603.29139
Authors: Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Nathaniel Gorski, Jianxin Sun, Guoxi Liu, Helgi I. Ingolfsson, David Lenz, Hanqi Guo, Hongfeng Yu, Teja Leburu, Michael Molash, Bei Wang, Tom Peterka, Chaoli Wang, Shusen Liu
Affiliations: Lawrence Livermore National Laboratory; University of Illinois Urbana-Champaign; Northwestern University; Argonne National Laboratory; University of Chicago; University of California, Berkeley; Stanford University; University of Utah; University of Texas at Austin; University of Washington
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at this https URL.

[HC-23] “I Just Need GPT to Refine My Prompts”: Rethinking Onboarding and Help-Seeking with Generative 3D Modeling Tools

Quick read: This paper examines the persistent learning barriers users face with feature-rich software such as 3D modeling tools, which traditionally depend on complex navigation and tutorials, and asks how generative AI's natural language prompts change that. The key findings: the prompt box becomes the entry point for learning, letting newcomers act immediately and collapsing onboarding into immediate action; and user behavior splits in two, with professionals drawing on domain expertise to refine iterations and critically evaluate output quality, while casual users settle for "good enough" results and even turn to external LLMs for prompt suggestions, forming a new "AI-for-AI" support pattern. Together these reveal how generative AI reshapes learning paths, help-seeking strategies, and the redistribution of expertise.

Link: https://arxiv.org/abs/2603.29118
Authors: Kanak Gautam, Poorvi Bhatia, Parmit K. Chilana
Affiliations: Simon Fraser University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures, CHI 2026 submission

Abstract:Learning to use feature-rich software is a persistent challenge, but generative AI tools promise to lower this barrier by replacing complex navigation with natural language prompts. We investigated how people approach prompt-based tools for 3D modeling in an observational study with 26 participants (14 casuals, 12 professionals). Consistent with earlier work, participants skipped tutorials and manuals, relying on trial and error. What differed in the generative AI context was how and why they sought support: the prompt box became the entry point for learning, collapsing onboarding into immediate action, while some casual users turned to external LLMs for prompts. Professionals used 3D expertise to refine iterations and critically evaluated outputs, often discarding models that did not meet their standards, whereas casual users settled for “good enough.” We contribute empirical insights into how generative AI reshapes help-seeking, highlighting new practices of onboarding, recursive AI-for-AI support, and shifting expertise in interpreting outputs.

[HC-24] VueBuds: Visual Intelligence with Wireless Earbuds

Quick read: This paper addresses why wireless earbuds have remained audio-only and how to integrate visual sensing under strict power and form-factor constraints. The key is VueBuds, the first camera-integrated wireless earbuds: a miniature camera embedded in a Sony WF-1000XM3 earbud streams low-resolution monochrome images over Bluetooth to a host device for on-device vision language model (VLM) inference, with on-demand activation keeping each camera under 5mW. Although each camera's field of view is partially occluded by the face, the combined binocular perspective provides comprehensive forward coverage, enabling real-time scene understanding, translation, visual reasoning, and text reading with response quality on par with Ray-Ban Meta smart glasses.

Link: https://arxiv.org/abs/2603.29095
Authors: Maruchi Kim, Rasya Fawwaz, Zhi Yang Lim, Brinda Moudgalya, Hexi Wang, Yuanhao Zeng, Shyamnath Gollakota
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: CHI 2026

Abstract:Despite their ubiquity, wireless earbuds remain audio-centric due to size and power constraints. We present VueBuds, the first camera-integrated wireless earbuds for egocentric vision, capable of operating within stringent power and form-factor limits. Each VueBud embeds a camera into a Sony WF-1000XM3 to stream visual data over Bluetooth to a host device for on-device vision language model (VLM) processing. We show analytically and empirically that while each camera’s field of view is partially occluded by the face, the combined binocular perspective provides comprehensive forward coverage. By integrating VueBuds with VLMs, we build an end-to-end system for real-time scene understanding, translation, visual reasoning, and text reading; all from low-resolution monochrome cameras drawing under 5mW through on-demand activation. Through online and in-person user studies with 90 participants, we compare VueBuds against smart glasses across 17 visual question-answering tasks, and show that our system achieves response quality on par with Ray-Ban Meta. Our work establishes low-power camera-equipped earbuds as a compelling platform for visual intelligence, bringing rapidly advancing VLM capabilities to one of the most ubiquitous wearable form factors.

[HC-25] Evaluating a Data-Driven Redesign Process for Intelligent Tutoring Systems

Quick read: This paper tests whether a data-driven redesign method for educational technology generalizes beyond units hand-picked for improvement potential. The key move: the method was applied to four middle-school mathematics intelligent tutoring system units selected by topic rather than by suitability for redesign, and evaluated in a classroom study with 123 students. Although learning gains did not differ significantly, students using the redesigned tutor showed more productive time-on-task, practiced more skills, and reached greater total knowledge mastery, evidence that the approach is broadly applicable even when units are not chosen to favor it.

Link: https://arxiv.org/abs/2603.29094
Authors: Qianru Lyu, Conrad Borchers, Meng Xia, Karen Xiao, Paulo F. Carvalho, Kenneth R. Koedinger, Vincent Aleven
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted as short paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract:Past research has defined a general process for the data-driven redesign of educational technologies and has shown that in carefully-selected instances, this process can help make systems more effective. In the current work, we test the generality of the approach by applying it to four units of a middle-school mathematics intelligent tutoring system that were selected not based on suitability for redesign, as in previous work, but on topic. We tested whether the redesigned system was more effective than the original in a classroom study with 123 students. Although the learning gains did not differ between the conditions, students who used the Redesigned Tutor had more productive time-on-task, a larger number of skills practiced, and greater total knowledge mastery. The findings highlight the promise of data-driven redesign even when applied to instructional units not selected as likely to yield improvement, as evidence of the generality and wide applicability of the method.

[HC-26] Uncovering Relationships between Android Developers User Privacy and Developer Willingness to Reduce Fingerprinting Risks

Quick read: Although mobile platforms such as Android and iOS have introduced privacy protections that restrict user tracking, apps continue to track users covertly via device fingerprinting. The core questions: how do developers perceive and accept platform privacy interventions, and are they willing to bear extra development effort to improve user privacy? The key finding, from a survey of 246 Android developers, is that although most anticipated that a privacy-protective change would require significant effort, 89% supported it, preferring it be optional rather than required; strikingly, developers who use fingerprinting were six times more likely to support the change, indicating a willingness to improve privacy despite concerns about compliance and enforcement. Platforms can therefore curb mobile fingerprinting by offering flexible, optional protections and strengthening enforcement support in collaboration with developers.

Link: https://arxiv.org/abs/2603.29063
Authors: Alex Berke, Güliz Seray Tuncay, Michael Specter, Mihai Christodorescu
Affiliations: Google
Subjects: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The major mobile platforms, Android and iOS, have introduced changes that restrict user tracking to improve user privacy, yet apps continue to covertly track users via device fingerprinting. We study the opportunity to improve this dynamic with a case study on mobile fingerprinting that evaluates developers’ perceptions of how well platforms protect user privacy and how developers perceive platform privacy interventions. Specifically, we study developers’ willingness to make changes to protect users from fingerprinting and how developers consider trade-offs between user privacy and developer effort. We do this via a survey of 246 Android developers, presented with a hypothetical Android change that protects users from fingerprinting at the cost of additional developer effort. We find developers overwhelmingly (89%) support this change, even when they anticipate significant effort, yet prefer the change be optional versus required. Surprisingly, developers who use fingerprinting are six times more likely to support the change, despite being most impacted by it. We also find developers are most concerned about compliance and enforcement. In addition, our results show that while most rank iOS above Android for protecting user privacy, this distinction significantly reduces among developers very familiar with fingerprinting. Thus there is an important opportunity for platforms and developers to collaboratively build privacy protections, and we present actionable ways platforms can facilitate this.

[HC-27] AI prediction leads people to forgo guaranteed rewards

Quick read: This paper asks whether artificial intelligence (AI) changes not only the content of human decisions but the mechanism of deciding itself. Using a behavioral implementation of the classic Newcomb's paradox with 1,305 participants, the authors find that people who treat AI as a predictive authority constrain their own decision-making and forgo guaranteed rewards. The key mechanism is this belief in predictive authority: over 40% of participants treated AI this way, which raised the odds of forgoing the guaranteed reward by a factor of 3.39 (95% CI: 2.45-4.70) and reduced earnings by 10.7%-42.9%. The effect held across AI presentations and decision contexts and persisted even when predictions failed, showing that belief in AI's predictive ability can induce self-constraining behavior with real consequences.

Link: https://arxiv.org/abs/2603.28944
Authors: Aoi Naito, Hirokazu Shirado
Affiliations: Carnegie Mellon University; Institute of Science Tokyo
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Artificial intelligence (AI) is understood to affect the content of people’s decisions. Here, using a behavioral implementation of the classic Newcomb’s paradox in 1,305 participants, we show that AI can also change how people decide. In this paradigm, belief in predictive authority can lead individuals to constrain decision-making, forgoing a guaranteed reward. Over 40% of participants treated AI as such a predictive authority. This significantly increased the odds of forgoing the guaranteed reward by a factor of 3.39 (95% CI: 2.45-4.70) compared with random framing, and reduced earnings by 10.7-42.9%. The effect appeared across AI presentations and decision contexts and persisted even when predictions failed. When people believe AI can predict their behavior, they may self-constrain it in anticipation of that prediction.

[HC-28] Arknights: Playable Explanation and Player Agency under Opacity

Quick read: This paper addresses a growing problem as generative AI mediates learning and decision-making: users can act effectively while struggling to understand how system outcomes are produced. Whereas explainable AI (XAI) research has mostly relied on transparency and visualization, it has paid less attention to how explanation is constructed through interaction. The key move is to treat digital games as explainable interfaces, using the PRTS AI system in Arknights as a case study: through incomplete information, delayed feedback, and narrative disruptions of trust, the design shifts players from direct control toward interpretive, abductive reasoning (explanatory agency), reorganizing how users build causal understanding and suggesting new directions for XAI-oriented interface design.

Link: https://arxiv.org/abs/2603.28775
Authors: Shuai Guo
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 14 pages, 4 figures

Abstract:As generative AI increasingly mediates learning and decision-making, users often act effectively while struggling to interpret how system outcomes are produced. While Explainable Artificial Intelligence (XAI) research has primarily addressed this problem through transparency and visualization, less attention has been paid to how explanation is constructed through interaction. This paper examines digital games as explainable interfaces by analyzing how explanation can be configured as a playable process. Using Arknights as a case study, the paper conducts a qualitative close reading and interface analysis of the diegetic AI system PRTS, focusing on the implied player. The analysis shows that PRTS provides usable but unverifiable explanations: sufficient to initiate action, yet insufficient to stabilize causal understanding. Through incomplete information, delayed feedback, and narrative disruptions of trust, player agency is reorganized from direct control toward interpretive and abductive reasoning. The paper conceptualizes this mode as explanatory agency and discusses its implications for XAI-oriented interface design.

[HC-29] Focus360: Guiding User Attention in Immersive Videos for VR

Quick read: This paper addresses attention dispersion in 360° virtual reality (VR) videos, where users in highly immersive environments struggle to focus on key scene elements. The key is Focus360, a system that identifies important scene elements from natural language descriptions and applies a combination of visual effects to guide user attention without breaking immersion.

Link: https://arxiv.org/abs/2603.28774
Authors: Paulo Vitor S. Silva, Lucas L. Neves, Rafael A. Goiás, Diogo F.C. Silva, Rafael T. Sousa, Arlindo R. Galvão Filho
Affiliations: Federal University of Goiás, UFG; Federal University of Mato Grosso, UFMAT; Advanced Knowledge Center for Immersive Technologies, AKCIT
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)

Abstract:This demo introduces Focus360, a system designed to enhance user engagement in 360° VR videos by guiding attention to key elements within the scene. Using natural language descriptions, the system identifies important elements and applies a combination of visual effects to guide attention seamlessly. At the demonstration venue, participants can experience a 360° Safari Tour, showcasing the system’s ability to improve user focus while maintaining an immersive experience.

[HC-30] Active Inference with People: a general approach to real-time adaptive experiments

Quick read: This paper addresses the computational complexity and technical challenges of running adaptive experiments in real time, while unifying scenarios usually treated as separate problems: computerized adaptive testing, adaptive treatment assignment, and active learning. The key is a unified, practical framework combining active inference, a Bayesian framework inspired by cognitive neuroscience, with PsyNet, a modular Python platform for large-scale online behavioral experiments that supports stimuli and responses in arbitrary modalities (textual, visual, and audio). Two case studies validate the approach: adaptive testing reduced the number of trials needed for ability measurement by 30-40%, and adaptive treatment assignment identified the optimal treatment up to three times as accurately as a fixed design.

Link: https://arxiv.org/abs/2603.29003
Authors: Lucas Gautheron, Nori Jacoby, Peter Harrison
Affiliations: Unknown
Subjects: Methodology (stat.ME); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Adaptive experiments automatically optimize their design throughout the data collection process, which can bring substantial benefits compared to conventional experimental settings. Potential applications include, among others: computerized adaptive testing (for selecting informative tasks in ability measurements), adaptive treatment assignment (when searching experimental conditions maximizing certain outcomes), and active learning (for choosing optimal training data for machine learning algorithms). However, implementing these techniques in real time poses substantial computational and technical challenges. Additionally, despite their conceptual similarity, the above scenarios are often treated as separate problems with distinct solutions. In this paper, we introduce a practical and unified approach to real-time adaptive experiments that can encompass all of the above scenarios, regardless of the modality of the task (including textual, visual, and audio inputs). Our strategy combines active inference, a Bayesian framework inspired by cognitive neuroscience, with PsyNet, a platform for large-scale online behavioral experiments. While active inference provides a compact, flexible, and principled mathematical framework for adaptive experiments generally, PsyNet is a highly modular Python package that supports social and behavioral experiments with stimuli and responses in arbitrary domains. We illustrate this approach through two concrete examples: (1) an adaptive testing experiment estimating participants’ ability by selecting optimal challenges, effectively reducing the amount of trials required by 30–40%; and (2) an adaptive treatment assignment strategy that identifies the optimal treatment up to three times as accurately as a fixed design in our example. We provide detailed instructions to facilitate the adoption of these techniques.
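The computerized adaptive testing scenario above, selecting informative tasks for ability measurement, is not detailed here, so the sketch below shows the generic idea rather than the paper's active-inference formulation: under a standard 1PL (Rasch) item-response model, pick the unasked item with maximal Fisher information at the current ability estimate. The item bank and all names are illustrative.

```python
import numpy as np

def p_correct(theta, b):
    """Rasch (1PL) probability of a correct response at ability theta,
    item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def fisher_info(theta, b):
    """Item information I(theta) = p(1-p) for the 1PL model; it peaks
    when difficulty matches ability."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def select_next_item(theta_hat, item_bank, asked):
    """Pick the unasked item that is most informative at theta_hat."""
    best, best_info = None, -1.0
    for i, b in enumerate(item_bank):
        if i in asked:
            continue
        info = fisher_info(theta_hat, b)
        if info > best_info:
            best, best_info = i, info
    return best

item_bank = [-2.0, -1.0, 0.0, 1.0, 2.0]   # illustrative item difficulties
nxt = select_next_item(0.1, item_bank, asked={2})
```

After each response, the ability estimate is updated (e.g., by a Bayesian posterior) and the selection step repeats, which is how adaptive designs cut the number of trials relative to a fixed test.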

[HC-31] Smartphone-Based Identification of Unknown Liquids via Active Vibration Sensing

Quick read: This paper asks whether unknown liquids can be identified accurately with commodity lightweight devices such as smartphones, since traditional liquid identification instruments are rarely available to the public. The key insight is that liquids with different viscosity coefficients impose different energy barriers on molecular motion, which enables an active-vibration-based viscosity measurement model; multi-stage signal processing reconstructs the original signals and cancels interference arising from under-sampling, self-interference, and liquid-volume changes, all within the limits of a built-in smartphone accelerometer. The approach estimates viscosity with a mean relative error of 2.9% and distinguishes 30 types of liquids with an average accuracy of 95.47%.

Link: https://arxiv.org/abs/2603.28787
Authors: Yongzhi Huang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Conference on Mobile Computing and Networking (MobiCom), 10 pages, 5 figures

Abstract:Traditional liquid identification instruments are often unavailable to the general public. This paper shows the feasibility of identifying unknown liquids with commercial lightweight devices, such as a smartphone. The key insight is that different liquid molecules have different viscosity coefficients and therefore must overcome different energy barriers during relative motion. With this intuition in mind, we introduce a novel model that measures liquids’ viscosity based on active vibration. However, building a robust system using built-in smartphone accelerometers is challenging. Practical issues include under-sampling, self-interference, and the impact of liquid-volume changes. Instead of machine learning, we tackle these issues through multiple signal processing stages to reconstruct the original signals and cancel out the interference. Our approach estimates liquid viscosity with a mean relative error of 2.9% and distinguishes 30 types of liquids with an average accuracy of 95.47%.
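The paper links viscosity to the damping of an actively excited vibration. As a minimal sketch of that principle (not the paper's multi-stage pipeline), the decay rate of an exponential envelope through successive vibration peak amplitudes can be fit by log-linear least squares; the mapping from decay rate to viscosity is device-specific and omitted here, and all names and constants below are illustrative assumptions.

```python
import math

def estimate_damping(peaks, dt):
    """Fit the decay rate gamma of an envelope A*exp(-gamma*t) through
    successive vibration peak amplitudes via log-linear least squares."""
    n = len(peaks)
    ts = [i * dt for i in range(n)]
    ys = [math.log(p) for p in peaks]          # log turns decay into a line
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
             / sum((t - t_mean) ** 2 for t in ts))
    return -slope                              # slope of log-amplitude = -gamma

# Synthetic peaks from a gamma = 2.0 envelope sampled every 0.1 s.
peaks = [3.0 * math.exp(-2.0 * 0.1 * i) for i in range(8)]
print(estimate_damping(peaks, dt=0.1))
```

A more viscous liquid damps the vibration faster, so a larger fitted gamma would indicate higher viscosity under this toy model.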

Computer Vision

[CV-0] OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation

[Quick Read]: This paper targets the incomplete observations and poor global consistency that arise in scene modeling when existing video generation models rely on perspective views. The key to the solution is the OmniRoam framework, which exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of the panoramic representation, using a two-stage preview-and-refine strategy to achieve controllable long-horizon panoramic video generation and support high-fidelity, continuous scene wandering.

Link: https://arxiv.org/abs/2603.30045
Authors: Yuheng Liu,Xin Lin,Xinke Li,Baihan Yang,Chen Wang,Kalyan Sunkavalli,Yannick Hold-Geoffroy,Hao Tan,Kai Zhang,Xiaohui Xie,Zifan Shi,Yiwei Hu
Affiliations: University of California, Irvine (加州大学欧文分校); University of California, San Diego (加州大学圣地亚哥分校); City University of Hong Kong (香港城市大学); University of Pennsylvania (宾夕法尼亚大学); Adobe Research (Adobe 研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Click to view abstract

Abstract:Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, where a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in terms of visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at this https URL.

[CV-1] Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

[Quick Read]: This paper investigates how video diffusion models reason during generation, whose internal planning dynamics remain poorly understood, using 2D maze solving as a controlled testbed to expose early decision behavior and the factors that predict difficulty. The key findings are twofold: "early plan commitment," whereby the model locks in a high-level motion trajectory within the first few denoising steps, after which denoising only adjusts visual detail without changing the path; and that path length, rather than obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. Building on these findings, the authors propose Chaining with Early Planning (ChEaP), which spends compute only on seeds with promising initial plans and chains their generations, raising accuracy on complex mazes from 7% to 67% and showing that current video models possess deeper reasoning abilities than previously recognized, which can be elicited more reliably through better inference-time scaling.

Link: https://arxiv.org/abs/2603.30043
Authors: Kaleb Newman,Tyler Zhu,Olga Russakovsky
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Video diffusion models exhibit emergent reasoning capabilities like solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this and study the internal planning dynamics of video models using 2D maze solving as a controlled testbed. Our investigations reveal two findings. Our first finding is early plan commitment: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. Our second finding is that path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. This means video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of our findings, we introduce Chaining with Early Planning, or ChEaP, which only spends compute on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.
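The ChEaP idea described above can be sketched as a small selection-and-chaining loop. This is a hypothetical stand-in, not the paper's implementation: the plan-scoring function below is a deterministic placeholder for "inspect the plan visible after the first few denoising steps," and all names are illustrative.

```python
import random

def early_plan_score(seed, n_early_steps=3):
    """Placeholder for scoring a seed by the quality of the high-level
    plan revealed in the first few denoising steps (deterministic here)."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n_early_steps)) / n_early_steps

def cheap_select_and_chain(candidate_seeds, n_segments, top_k=2):
    """Keep only the top-k seeds by early-plan score, then chain
    n_segments sequential generations from the best one."""
    ranked = sorted(candidate_seeds, key=early_plan_score, reverse=True)
    kept = ranked[:top_k]                  # full compute is spent only here
    best = kept[0]
    # Each chained segment would condition on the previous segment's end.
    chain = [(best, segment) for segment in range(n_segments)]
    return kept, chain

kept, chain = cheap_select_and_chain(candidate_seeds=range(16), n_segments=4)
print(len(kept), len(chain))
```

The point of the design is that discarding weak seeds after a few cheap denoising steps is safe precisely because of early plan commitment: the trajectory will not improve later.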

[CV-2] Benchmarking PhD-Level Coding in 3D Geometric Computer Vision CVPR2026

[Quick Read]: This paper tackles the inability of current generative AI to write correct code for complex 3D geometric vision, a gap that limits its reliable use in research workflows. To measure progress systematically, the authors introduce GeoCodeBench, a PhD-level benchmark built from fill-in-the-function implementation tasks curated from papers at top venues, combining automatically generated unit tests with human screening of core geometric components to enable reproducible, automatic scoring. The key design is a two-level task hierarchy: general 3D capability (e.g., geometric transformations and optics formulation) and research capability (e.g., novel algorithm implementation and geometric logic routing), with the latter found to be markedly harder. The benchmark also reveals that longer context is not always better: truncating papers at the Method section outperforms full-paper inputs, exposing a bottleneck in long-context scientific comprehension. Together, these results position GeoCodeBench as a rigorous testbed for moving from generic coding to trustworthy 3D geometric vision coding.

Link: https://arxiv.org/abs/2603.30038
Authors: Wenyi Li,Renkai Luo,Yue Yu,Huan-ang Gao,Mingju Gao,Li Yuan,Chaoyou Fu,Hao Zhao
Affiliations: Tsinghua University (清华大学); Qiuzhen College, Tsinghua University (清华求真书院); BAAI (北京人工智能研究院); Peking University (北京大学); Nanjing University (南京大学); University of Toronto (多伦多大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026; Project page: this https URL

Click to view abstract

Abstract:AI-assisted coding has rapidly reshaped software practice and research workflows, yet today’s models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that “more paper text” is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.
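Automatic scoring of fill-in-the-function tasks against edge-case unit tests, as the benchmark describes, can be sketched roughly as below. The harness, the toy target (homogeneous-point normalization, a common 3D-vision helper), and the test cases are all illustrative assumptions, not taken from GeoCodeBench.

```python
RAISES = object()  # sentinel: this edge case is expected to raise

def run_unit_tests(candidate, tests):
    """Score a candidate implementation by the fraction of
    (args, expected) unit tests it passes; crashes count as failures
    unless the test expects an exception."""
    passed = 0
    for args, expected in tests:
        try:
            out = candidate(*args)
            ok = (expected is not RAISES) and out == expected
        except Exception:
            ok = expected is RAISES
        passed += ok
    return passed / len(tests)

def candidate_ok(x, y, w):
    return (x / w, y / w)                    # raises on w == 0, as expected

def candidate_buggy(x, y, w):
    return (x / w, y / w) if w != 0 else (0.0, 0.0)  # silently wrong at w=0

tests = [((2.0, 4.0, 2.0), (1.0, 2.0)),
         ((3.0, 3.0, 1.0), (3.0, 3.0)),
         ((-2.0, 6.0, 2.0), (-1.0, 3.0)),
         ((1.0, 1.0, 0.0), RAISES)]          # edge case at the horizon

print(run_unit_tests(candidate_ok, tests), run_unit_tests(candidate_buggy, tests))
```

The edge case is what separates the two candidates: both agree on the ordinary inputs, and only the degenerate w = 0 input exposes the silent bug, which is why the benchmark stresses diverse edge-case tests.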

[CV-3] Conditional Polarization Guidance for Camouflaged Object Detection

[Quick Read]: This paper addresses two shortcomings of existing polarization-based camouflaged object detection (COD) methods: high model complexity and computational overhead, and the failure to fully exploit polarization cues to explicitly guide hierarchical RGB feature learning. The key to the solution is CPGNet, an asymmetric RGB-polarization framework whose core innovations include a lightweight polarization interaction module that jointly models the complementary cues and generates reliable polarization guidance in a unified manner; a conditional polarization guidance mechanism that dynamically modulates RGB features to focus on subtle discrepancies between camouflaged objects and their backgrounds; a polarization edge-guided frequency refinement strategy that enhances high-frequency components to effectively break camouflage patterns; and an iterative feedback decoder that performs coarse-to-fine feature calibration and prediction refinement.

Link: https://arxiv.org/abs/2603.30008
Authors: Qifan Zhang,Hao Wang,Xiangrong Qin,Ruijie Li
Affiliations: Dalian Maritime University (大连海事大学); Dalian University of Technology (大连理工大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures, 4 tables

Click to view abstract

Abstract:Camouflaged object detection (COD) aims to identify targets that are highly blended with their backgrounds. Recent works have shown that the optical characteristics of polarization cues play a significant role in improving camouflaged object detection. However, most existing polarization-based approaches depend on complex visual encoders and fusion mechanisms, leading to increased model complexity and computational overhead, while failing to fully explore how polarization can explicitly guide hierarchical RGB representation learning. To address these limitations, we propose CPGNet, an asymmetric RGB-polarization framework that introduces a conditional polarization guidance mechanism to explicitly regulate RGB feature learning for camouflaged object detection. Specifically, we design a lightweight polarization interaction module that jointly models these complementary cues and generates reliable polarization guidance in a unified manner. Unlike conventional feature fusion strategies, the proposed conditional guidance mechanism dynamically modulates RGB features using polarization priors, enabling the network to focus on subtle discrepancies between camouflaged objects and their backgrounds. Furthermore, we introduce a polarization edge-guided frequency refinement strategy that enhances high-frequency components under polarization constraints, effectively breaking camouflage patterns. Finally, we develop an iterative feedback decoder to perform coarse-to-fine feature calibration and progressively refine camouflage prediction. Extensive experiments on polarization datasets across multiple tasks, along with evaluations on non-polarization datasets, demonstrate that CPGNet consistently outperforms state-of-the-art methods.

[CV-4] SurgNavAR: An Augmented Reality Surgical Navigation Framework for Optical See-Through Head Mounted Displays

[Quick Read]: This paper addresses the difficulty of integrating current head-mounted-display (HMD) augmented reality (AR) surgical navigation systems, which demand specialized expertise and lack generality in practice. The key to the solution is a configurable, modular HMD-AR surgical navigation framework adaptable to diverse surgical applications: it tracks 2D pattern markers attached to the patient and surgical instruments, calibrates surgical tools with pivot- and reference-based techniques, and aligns preoperative images with intraoperative anatomy through point-based matching and manual positioning. Validated on the HoloLens 2 and Magic Leap 2, the framework achieves a mean tooltip calibration accuracy of 1 mm, a registration accuracy of 3 mm, and a targeting accuracy below 5 mm, delivering accurate, easy-to-use, and extensible AR surgical navigation.

Link: https://arxiv.org/abs/2603.29990
Authors: Abdullah Thabit,Mohamed Benmahdjoub,Rafiuddin Jinabade,Hizirwan S. Salim,Marie-Lise C. van Veelen,Mark G. van Vledder,Eppo B. Wolvius,Theo van Walsum
Affiliations: Erasmus MC (埃拉姆斯医学中心); SURF bv (SURF公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Augmented reality (AR) devices with head mounted displays (HMDs) facilitate the direct superimposition of 3D preoperative imaging data onto the patient during surgery. To use an HMD-AR device as a stand-alone surgical navigation system, the device should be able to locate the patient and surgical instruments, align preoperative imaging data with the patient, and visualize navigation data in real time during surgery. Whereas some of the technologies required for this are known, integration in such devices is cumbersome and requires specific knowledge and expertise, hampering scientific progress in this field. This work therefore aims to present and evaluate an integrated HMD-based AR surgical navigation framework that is adaptable to diverse surgical applications. The framework tracks 2D patterns as reference markers attached to the patient and surgical instruments. It allows for the calibration of surgical tools using pivot and reference-based calibration techniques. It enables image-to-patient registration using point-based matching and manual positioning. The integrated functionalities of the framework are evaluated on two HMD devices, the HoloLens 2 and Magic Leap 2, with two surgical use cases being evaluated in a phantom setup: AR-guided needle insertion and rib fracture localization. The framework was able to achieve a mean tooltip calibration accuracy of 1 mm, a registration accuracy of 3 mm, and a targeting accuracy below 5 mm on the two surgical use cases. The framework presents an easy-to-use configurable tool for HMD-based AR surgical navigation, which can be extended and adapted to many surgical applications. The framework is publicly available at this https URL.

[CV-5] Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression and MRI

[Quick Read]: This paper addresses the fact that multimodal deep learning for brain tumor prognosis has yet to integrate volumetric MRI data, asking how Fluid Attenuated Inversion Recovery (FLAIR) MRI can be incorporated as a third modality into a unified survival analysis framework. The key to the solution is extending an existing bimodal (histopathology and genomics) model with FLAIR MRI and systematically evaluating trimodal configurations under early, late, and joint fusion strategies. With a small cohort (only 19 test patients), trimodal early fusion shows a potential prognostic gain that does not reach statistical significance (ΔCS = +0.011, p = 0.250), suggesting that an additional imaging modality can contribute information even with limited samples, provided sufficient multimodal context is available to integrate the new features effectively.

Link: https://arxiv.org/abs/2603.29968
Authors: Iain Swift,JingHua Ye
Affiliations: Michigan Technological University (密歇根理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, submitted to the IEEE CBMS 2026 conference; awaiting notification

Click to view abstract

Abstract:Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664 patients), we evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. In this small cohort setting, trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled \Delta CS of +0.011 over the bimodal baseline on identical patients, though this difference is not statistically significant (p = 0.250, permutation test). MRI achieves reasonable unimodal discrimination (CS = 0.755) but does not substantially improve bimodal pairs, while providing measurable uplift in the three-way combination. All MRI containing experiments are constrained to 19 test patients, yielding wide bootstrap confidence intervals (e.g. [0.400,1.000]) that preclude definitive conclusions. These findings provide preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively.
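The early/late fusion distinction the study evaluates can be sketched with toy linear scorers: early fusion concatenates modality features before a single predictor, while late fusion averages per-modality predictions. The embeddings and weights below are made-up placeholders, not the paper's model.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def early_fusion(features, weights):
    """Concatenate modality features, then apply one linear risk scorer."""
    concat = [v for feats in features for v in feats]
    return dot(weights, concat)

def late_fusion(features, per_modality_weights):
    """Score each modality independently, then average the predictions."""
    scores = [dot(w, f) for w, f in zip(per_modality_weights, features)]
    return sum(scores) / len(scores)

# Hypothetical 2-D embeddings for histopathology, gene expression, and MRI.
feats = [[0.2, 0.4], [0.1, 0.3], [0.5, 0.1]]
early = early_fusion(feats, weights=[1, 1, 1, 1, 1, 1])
late = late_fusion(feats, per_modality_weights=[[1, 1], [1, 1], [1, 1]])
print(early, late)
```

The structural difference matters for small cohorts: early fusion lets the predictor learn cross-modal interactions but multiplies input dimensionality, while late fusion keeps each modality's model small at the cost of ignoring interactions.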

[CV-6] Learning Structural-Functional Brain Representations through Multi-Scale Adaptive Graph Attention for Cognitive Insight ICASSP2026

[Quick Read]: This paper addresses how brain structure and function jointly explain intelligence, focusing on effectively fusing the complementary information of the structural and functional connectomes. The key to the solution is the proposed Multi-scale Adaptive Graph Network (MAGNet), which extracts inter-regional morphological features via source-based morphometry and fuses them with functional network connectivity from resting-state fMRI; a hybrid graph integrates direct and indirect pathways, a local-global attention mechanism refines connectivity importance, and a joint loss simultaneously enforces cross-modal coherence and optimizes the prediction objective end-to-end, enabling more accurate modeling of cognitive function.

Link: https://arxiv.org/abs/2603.29967
Authors: Badhan Mazumder,Sir-Lord Wiafe,Aline Kotoski,Vince D. Calhoun,Dong Hye Ye
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint version of the paper accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026). This is the author’s accepted manuscript. The final published version will appear in IEEE Xplore

Click to view abstract

Abstract:Understanding how brain structure and function interact is key to explaining intelligence yet modeling them jointly is challenging as the structural and functional connectome capture complementary aspects of organization. We introduced Multi-scale Adaptive Graph Network (MAGNet), a Transformer-style graph neural network framework that adaptively learns structure-function interactions. MAGNet leverages source-based morphometry from structural MRI to extract inter-regional morphological features and fuses them with functional network connectivity from resting-state fMRI. A hybrid graph integrates direct and indirect pathways, while local-global attention refines connectivity importance and a joint loss simultaneously enforces cross-modal coherence and optimizes the prediction objective end-to-end. On the ABCD dataset, MAGNet outperformed relevant baselines, demonstrating effective multimodal integration for advancing our understanding of cognitive function.

[CV-7] Scaling Video Pretraining for Surgical Foundation Models

[Quick Read]: This paper addresses the limitations of current surgical video foundation models, which are constrained by limited data scale, insufficient procedural diversity, and inconsistent evaluation, and often lack a reproducible pretraining pipeline. The key to the solution is SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding with two variants, SurgRec-MAE and SurgRec-JEPA. The authors curate a multi-source surgical video corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery, build a unified pretraining pipeline with balanced sampling, and standardize evaluation across 16 downstream datasets and four clinical domains. SurgRec consistently and substantially outperforms the compared SSL baselines and vision-language models (VLMs), while the VLMs prove unreliable for fine-grained temporal recognition, showing both performance gaps and sensitivity to prompt phrasing.

Link: https://arxiv.org/abs/2603.29966
Authors: Sicheng Lu,Zikai Xiao,Jianhui Wei,Danyu Sun,Qi Lu,Keli Hu,Yang Feng,Jian Wu,Zongxin Yang,Zuozhu Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.

[CV-8] SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

[Quick Read]: This paper addresses challenges overlooked by current surgical visual question answering (VQA) research: the rich temporal semantics of video, low visual contrast, the highly knowledge-driven nature of surgery, analytical needs spanning scattered temporal windows, and the task hierarchy from basic perception to high-level intraoperative assessment. The key to the solution is SurgTEMP, a multimodal large language model (LLM) framework with two core components: (i) a query-guided token selection module that builds hierarchical visual memory through spatial and temporal memory banks, and (ii) a Surgical Competency Progression (SCP) training scheme. Together they enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, markedly improving performance on diverse downstream assessment tasks.

Link: https://arxiv.org/abs/2603.29962
Authors: Shi Li(1),Vinkle Srivastav(1),Nicolas Chanel(1),Saurav Sharma(1),Nabani Banik(1),Lorenzo Arboit(1),Kun Yuan(1),Pietro Mascagni(1 and 2),Nicolas Padoy(1) ((1) University of Strasbourg, CNRS, INSERM, ICube, Strasbourg, France, (2) Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy)
Affiliations: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France; Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 14 figures, 9 tables

Click to view abstract

Abstract:Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to well navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) offer promises for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy – Perception, Assessment, and Reasoning – spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.

[CV-9] NeuroBRIDGE: Behavior-Conditioned Koopman Dynamics with Riemannian Alignment for Early Substance Use Initiation Prediction from Longitudinal Functional Connectome ICASSP2026

[Quick Read]: This paper addresses the difficulty of early identification of adolescent substance use initiation (SUI), noting that existing predictive models treat brain connectivity as static or cross-sectional and ignore how brain networks evolve over time and couple with behavior. The key to the solution is the NeuroBRIDGE framework, whose core innovations are aligning the longitudinal functional connectome in a Riemannian tangent space and coupling dual-time attention with behavior-conditioned Koopman dynamics, thereby accurately capturing the temporal evolution of brain networks. This substantially improves SUI prediction while offering interpretable insights into neural pathways, deepening the understanding of neurodevelopmental risk and informing targeted prevention strategies.

Link: https://arxiv.org/abs/2603.29960
Authors: Badhan Mazumder,Sir-Lord Wiafe,Vince D. Calhoun,Dong Hye Ye
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint version of the paper accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026). This is the author’s accepted manuscript. The final published version will appear in IEEE Xplore

Click to view abstract

Abstract:Early identification of adolescents at risk for substance use initiation (SUI) is vital yet difficult, as most predictors treat connectivity as static or cross-sectional and miss how brain networks change over time and with behavior. We proposed NeuroBRIDGE (Behavior conditioned RIemannian Koopman Dynamics on lonGitudinal connEctomes), a novel graph neural network-based framework that aligns longitudinal functional connectome in a Riemannian tangent space and couples dual-time attention with behavioral-conditioned Koopman dynamics to capture temporal change. Evaluated on ABCD, NeuroBRIDGE improved future SUI prediction over relevant baselines while offering interpretable insights into neural pathways, refining our understanding of neurodevelopmental risk and informing targeted prevention.
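Koopman dynamics, at their simplest, amount to fitting a linear operator that advances the system one step. As a minimal sketch of that estimation step only (omitting the paper's behavior conditioning, Riemannian alignment, and attention), the one-dimensional least-squares fit looks like this; the trajectory and operator value are synthetic examples.

```python
def fit_koopman_1d(trajectory):
    """Least-squares fit of a scalar operator a with x[t+1] ≈ a * x[t],
    the one-dimensional special case of Koopman/DMD operator estimation."""
    num = sum(x0 * x1 for x0, x1 in zip(trajectory, trajectory[1:]))
    den = sum(x0 * x0 for x0 in trajectory[:-1])
    return num / den

# Noise-free trajectory of x[t+1] = 0.9 * x[t]; the fit recovers 0.9.
traj = [1.0]
for _ in range(10):
    traj.append(0.9 * traj[-1])
print(fit_koopman_1d(traj))
```

In the multivariate case the same least-squares problem yields a matrix operator over lifted observables, which is what makes the learned dynamics linear and hence analyzable.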

[CV-10] Detecting Unknown Objects via Energy-based Separation for Open World Object Detection CVPR2026

[Quick Read]: This paper addresses two core challenges of Open World Object Detection (OWOD): incrementally learning new classes without forgetting known ones, and effectively identifying unknown objects without supervision. Existing methods rely on known-class predictions to discover unknowns, leading to poorly learned unknown representations, while memory replay mitigates catastrophic forgetting at the cost of newly learned classes. The key to the solution is the DEUS framework with two modules: ETF-Subspace Unknown Separation (EUS), which uses Equiangular Tight Frame (ETF) geometry to construct orthogonal subspaces for clean separation of known and unknown features and leverages energies from both spaces to better model unknown patterns; and an Energy-based Known Distinction (EKD) loss, which enforces separation between previous and current classifiers during memory replay, reducing knowledge interference between old and new classes.

Link: https://arxiv.org/abs/2603.29954
Authors: Jun-Woo Heo,Keonhee Park,Gyeong-Moon Park
Affiliations: Korea University (韩国高丽大学); Seoul National University (首尔国立大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, Accepted at CVPR 2026

Click to view abstract

Abstract:In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector’s known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.
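The ETF geometry that DEUS builds on has a standard construction: the K columns of sqrt(K/(K-1)) * (I - J/K) form a simplex equiangular tight frame of unit vectors with identical pairwise inner products of -1/(K-1). The sketch below only constructs that classic frame and checks its properties; how DEUS derives subspaces and energies from it is the paper's contribution and is not reproduced here.

```python
import math

def simplex_etf(k):
    """Rows of sqrt(k/(k-1)) * (I - J/k): k unit vectors whose pairwise
    inner products all equal -1/(k-1) (a simplex equiangular tight frame)."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vecs = simplex_etf(4)
print([round(dot(v, v), 6) for v in vecs])   # unit norms
print(round(dot(vecs[0], vecs[1]), 6))       # -1/(k-1) = -1/3
```

The appeal for open-world detection is that these directions are maximally and equally separated, giving fixed, conflict-free targets for class features.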

[CV-11] EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

[Quick Read]: This paper addresses the challenge of counting in long-form videos: achieving accurate enumeration, counting, and temporal evidence grounding in videos tens of minutes long with sparse, diverse events. Existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, without assessing whether models correctly identify and localize the relevant events. The key contribution is EC-Bench, a benchmark of 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans, the first to jointly evaluate the three abilities above. Its structured, fine-grained video-text annotations require models not only to output the correct number but also to localize the counted instances in time, enabling systematic evaluation and improvement of long-range temporal reasoning.

Link: https://arxiv.org/abs/2603.29943
Authors: Fumihiko Tsuchiya,Taiki Miyanishi,Mahiro Ukai,Nakamasa Inoue,Shuhei Kurita,Yusuke Iwasawa,Yutaka Matsuo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally. The data and code are publicly available at: this https URL

Click to view abstract

Abstract:Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.

[CV-12] Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance CVPR2026

[Quick Read]: This paper addresses the lack of systematic evaluation and optimization of how pixel-wise uncertainty scores from segmentation uncertainty quantification (UQ) are aggregated into image-level scores, which leads to inconsistent performance on downstream tasks such as out-of-distribution (OoD) and failure detection. The key contributions are: (1) a formal analysis of existing aggregation strategies (global average and patch-, class-, or threshold-based methods) that exposes their limitations; (2) novel aggregation strategies that incorporate spatial structure to better capture how segmentation uncertainty is distributed across the image; (3) empirical validation across ten datasets varying in image geometry and structure, showing that spatially aware aggregation performs better on downstream tasks; and (4) a meta-aggregator that integrates multiple strategies to achieve robust performance across datasets.

Link: https://arxiv.org/abs/2603.29941
Authors: Vanessa Emanuela Guarino,Claudia Winklmayr,Jannik Franzen,Josef Lorenz Rumberger,Manuel Pfeuffer,Sonja Greven,Klaus Maier-Hein,Carsten T. Lüth,Christoph Karg,Dagmar Kainmueller
Affiliations: Max-Delbrück-Center (MDC); Helmholtz Imaging; Charité Universitätsmedizin; Humboldt-Universität zu Berlin; University of Potsdam; German Cancer Research Center (DKFZ); Heidelberg University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 27 pages, 13 figures, 6 tables. Accepted at CVPR 2026 (The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026)

Abstract:Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.
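Why the global average can be a poor aggregator is easy to see on a toy map: averaging dilutes a small, concentrated uncertain region, while a spatially-aware aggregator keeps it visible. The patch-then-max rule below is an illustrative example of a spatial aggregator, not necessarily one of the paper's proposed strategies.

```python
def global_average(unc_map):
    """Default aggregator: mean over all pixel uncertainties."""
    flat = [u for row in unc_map for u in row]
    return sum(flat) / len(flat)

def patch_max_mean(unc_map, patch=2):
    """Spatially-aware aggregator: mean uncertainty per patch, then the max,
    so one concentrated uncertain region dominates the image-level score."""
    h, w = len(unc_map), len(unc_map[0])
    patch_means = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            vals = [unc_map[a][b]
                    for a in range(i, min(i + patch, h))
                    for b in range(j, min(j + patch, w))]
            patch_means.append(sum(vals) / len(vals))
    return max(patch_means)

# A 4x4 map with one concentrated high-uncertainty patch.
umap = [[0.0, 0.0, 0.9, 0.9],
        [0.0, 0.0, 0.9, 0.9],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
print(global_average(umap), patch_max_mean(umap))
```

Here the global average washes the failure region down to 0.225 while the patch maximum keeps it at its true level, which is exactly the kind of signal a failure detector needs.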

[CV-13] Gloria: Consistent Character Video Generation via Content Anchors CVPR2026

[Quick Read]: This paper addresses the problems of long-duration, multi-view consistency and character expressiveness in generative AI, i.e., how to keep a character's appearance and identity consistent in generated video while avoiding the inconsistencies caused by using non-character-centric information as memory. The key to the solution is representing a character's visual attributes through a compact set of anchor frames that provide stable references for consistency, together with two mechanisms: Superset Content Anchoring, which prevents copy-pasting and multi-reference conflicts, and RoPE as Weak Condition, which encodes positional offsets to distinguish the anchor frames, enabling high-quality, long-horizon character video generation.

Link: https://arxiv.org/abs/2603.29931
Authors: Yuhang Yang,Fan Zhang,Huaijin Pi,Shuai Guo,Guowei Xu,Wei Zhai,Yang Cao,Zheng-Jun Zha
Affiliations: USTC (中国科学技术大学); UNSW (新南威尔士大学); HKU (香港大学); UESTC (电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 (main conference); project: this https URL

Click to view abstract

Abstract:Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario. In this work, we propose representing the character visual attributes through a compact set of anchor frames. This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.
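The "RoPE as Weak Condition" mechanism builds on the standard rotary position embedding, which rotates consecutive feature pairs by position-dependent angles. The sketch below shows only that standard construction on a toy vector: two anchors with identical content but different positional offsets receive distinguishable encodings while their norms are preserved. How Gloria injects these offsets into its model is not reproduced here.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) feature pairs by position-dependent
    angles: the standard RoPE construction."""
    out = []
    for k in range(0, len(vec), 2):
        theta = position / (base ** (k / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

anchor = [1.0, 0.0, 1.0, 0.0]
a0 = rope_rotate(anchor, position=0)   # identity rotation
a1 = rope_rotate(anchor, position=1)
a2 = rope_rotate(anchor, position=2)
print(a0, a1 == a2)
```

Because the rotation is norm-preserving, the offset acts as a weak condition: it separates anchors by position without distorting the content they carry.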

[CV-14] End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines

【速读】:该论文旨在解决风力发电机叶片巡检中高分辨率图像传输带来的瓶颈问题,即如何在保证叶片区域高质量重建的前提下,对背景区域进行高效压缩,从而提升缺陷检测的效率与准确性。解决方案的关键在于提出了一种端到端的深度学习框架,集成分割与双模式(有损和无损)压缩功能:首先通过改进的BU-Netv2+P网络结合CRF正则化损失实现精准的叶片定位;其次利用基于超先验的自编码器优化有损压缩性能;最后采用扩展的比特回退编码器(bits-back coder)与分层模型实现叶片区域的完全无损重建;此外,该框架通过重用背景编码比特消除比特回退编码中的串行依赖关系,支持并行化处理,显著提升了压缩效率与实用性。

链接: https://arxiv.org/abs/2603.29927
作者: Raül Pérez-Gonzalo,Andreas Espersen,Søren Forchhammer,Antonio Agudo
机构: Wind Power LAB, 1150 Copenhagen, Denmark; Institut de Robòtica i Informàtica Industrial, CSIC-UPC, 08028 Barcelona, Spain; Department of Photonics Engineering, Technical University of Denmark (DTU), 2800 Lyngby, Denmark
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to TNNLS 2026

点击查看摘要

Abstract:Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: (i) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization, (ii) a hyperprior-based autoencoder optimized for lossy compression, and (iii) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.
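
摘要中 ROI 压缩的核心是"叶片区域高保真、背景激进压缩"。下面用一个与论文实现无关的玩具示例说明由分割掩码驱动的差异化量化思路(掩码位置与量化步长均为假设值):

```python
import numpy as np

def quantize(x, step):
    """Uniform scalar quantization with a spatially varying step size."""
    return np.round(x / step) * step

rng = np.random.default_rng(0)
img = rng.random((32, 32))

mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True                  # pretend this is the segmented blade ROI

step = np.where(mask, 0.01, 0.2)         # fine steps inside the ROI, coarse outside
recon = quantize(img, step)

roi_err = np.abs(recon - img)[mask].mean()
bg_err = np.abs(recon - img)[~mask].mean()
print(roi_err, bg_err)
```

ROI 内的重建误差远小于背景,对应论文中"叶片区域以更高质量编码"的设计目标;论文实际使用的是学习式有损/无损双模编码器,此处仅为概念示意。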

[CV-15] Abstraction in Style SIGGRAPH2026

【速读】:该论文旨在解决传统风格迁移方法在处理抽象性艺术风格(如插画或非写实风格)时的局限性,这些问题主要源于其仅保留输入图像几何结构而无法捕捉深层结构抽象行为。解决方案的关键在于提出一种名为“风格中的抽象”(Abstraction in Style, AiS)的生成式框架,该框架通过将结构抽象与视觉风格化显式解耦,首先从目标图像和少量风格样本中学习一个中间抽象代理(abstraction proxy),该代理在保持语义结构的同时放松几何保真度,从而为后续风格化提供抽象表示;随后在第二阶段基于此代理生成最终的风格化输出,确保与参考风格的视觉一致性。整个过程基于共享图像空间类比实现,无需显式的几何监督即可从视觉样例中学习变换,显著提升了风格迁移的表达力与可控性。

链接: https://arxiv.org/abs/2603.29924
作者: Min Lu,Yuanfeng He,Anthony Chen,Jianhuang He,Pu Wang,Daniel Cohen-Or,Hui Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2026 conditionally accepted paper

点击查看摘要

Abstract:Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target’s structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

[CV-16] Training deep learning based dynamic MR image reconstruction using synthetic fractals

【速读】:该论文旨在解决动态心脏磁共振成像(dynamic MRI)中深度学习(DL)模型训练依赖真实临床数据所面临的隐私保护、许可限制及数据获取困难等问题。其解决方案的关键在于使用四元数朱利亚分形(quaternion Julia fractals)生成合成的2D+time图像,并模拟多通道MRI采集以构建配对的完全采样与径向欠采样k空间数据,从而训练出无需真实心脏MRI数据的深度学习重建模型(F-DL)。实验表明,该方法在图像质量与心室容积和射血分数等临床指标上均与基于真实心脏MRI数据训练的模型(CMR-DL)相当,证明了分形合成数据作为开放、可扩展替代方案的有效性。

链接: https://arxiv.org/abs/2603.29922
作者: Anirudh Raman,Olivier Jaubert,Mark Wrobel,Tina Yao,Ruaraidh Campbell,Rebecca Baker,Ruta Virsinskaite,Daniel Knight,Michael Quail,Jennifer Steeden,Vivek Muthurangu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Purpose: To investigate whether synthetically generated fractal data can be used to train deep learning (DL) models for dynamic MRI reconstruction, thereby avoiding the privacy, licensing, and availability limitations associated with cardiac MR training datasets. Methods: A training dataset was generated using quaternion Julia fractals to produce 2D+time images. Multi-coil MRI acquisition was simulated to generate paired fully sampled and radially undersampled k-space data. A 3D UNet deep artefact suppression model was trained using these fractal data (F-DL) and compared with an identical model trained on cardiac MRI data (CMR-DL). Both models were evaluated on prospectively acquired radial real-time cardiac MRI from 10 patients. Reconstructions were compared against compressed sensing (CS) and low-rank deep image prior (LR-DIP). All reconstructions were ranked for image quality, while ventricular volumes and ejection fraction were compared with reference breath-hold cine MRI. Results: There was no significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), while both outperformed CS and LR-DIP (p<0.001). Ventricular volumes and function derived from F-DL were similar to CMR-DL, showing no significant bias and acceptable limits of agreement compared to reference cine imaging. However, LR-DIP had a significant bias (p=0.016) and wider limits of agreement. Conclusion: DL models trained using synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurements comparable to models trained on true cardiac MRI data. Fractal training data provide an open, scalable alternative to clinical datasets and may enable development of more generalisable DL reconstruction models for dynamic MRI.
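
摘要提到用四元数朱利亚分形生成 2D+time 训练图像。以下为四元数平方迭代 q ← q² + c 的逃逸时间渲染最小草图(参数 c、网格范围与迭代次数均为示意,并非论文所用设置):

```python
import numpy as np

def quat_square(q):
    """Hamilton product q*q for an array of quaternions with shape (..., 4)."""
    w, x, y, z = q[..., 0], q[..., 1], q[..., 2], q[..., 3]
    return np.stack([w*w - x*x - y*y - z*z, 2*w*x, 2*w*y, 2*w*z], axis=-1)

def julia_slice(c, n=64, iters=20, bound=4.0):
    """Escape-time image of a 2D slice through a quaternion Julia set."""
    ax = np.linspace(-1.5, 1.5, n)
    w, x = np.meshgrid(ax, ax)
    q = np.stack([w, x, np.zeros_like(w), np.zeros_like(w)], axis=-1)
    img = np.zeros((n, n))
    alive = np.ones((n, n), dtype=bool)
    for i in range(iters):
        q = quat_square(q) + c
        alive &= np.linalg.norm(q, axis=-1) < bound
        q[~alive] = 0.0                  # freeze escaped points to avoid overflow
        img[alive] = i + 1               # record the last bounded iteration
    return img

c = np.array([-0.2, 0.6, 0.2, 0.2])      # illustrative Julia parameter
frame = julia_slice(c)
print(frame.shape, frame.min(), frame.max())
```

对参数 c 随时间做插值即可得到形态连续变化的动态序列,这正是"2D+time"合成训练数据的来源思路。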

[CV-17] Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

【速读】:该论文旨在解决手写数字识别中因噪声和对抗攻击导致的分类模型鲁棒性不足的问题。其解决方案的关键在于提出一种基于扩散驱动特征去噪的混合特征表示框架:首先利用非负矩阵分解(Nonnegative Matrix Factorization, NNMF)提取紧致且可解释的特征表示,同时通过卷积神经网络(Convolutional Neural Network, CNN)提取深层特征;随后将二者融合形成统一的混合特征表示,并在特征空间中引入扩散操作——通过逐步添加高斯噪声并训练一个特征去噪网络以恢复干净特征,从而提升模型对噪声和对抗样本的抵抗能力。实验表明,该方法在基准和对抗攻击场景下均表现出更强的鲁棒性和有效的多类分类性能。

链接: https://arxiv.org/abs/2603.29917
作者: Hiba Adil Al-kharsan,Róbert Rajkó
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in a feature space to improve robustness to noise and adversarial attacks. First, the input images are converted into compact, interpretable representations using Nonnegative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These complementary features are combined into a unified hybrid representation. To improve robustness, a stepwise diffusion operation is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this operation and reconstruct clean representations from perturbed inputs. The resulting denoised features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.
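
摘要中"在特征空间逐步加入高斯噪声、再训练去噪器逆转"对应标准 DDPM 式前向扩散的闭式形式 x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε。下面仅示意前向加噪部分,特征矩阵与噪声日程均为假设:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hybrid feature vectors (e.g., concatenated NNMF + CNN features).
feats = rng.random((16, 32))

# Linear noise schedule, as in standard DDPM-style forward diffusion.
T = 10
betas = np.linspace(1e-3, 0.2, T)
alpha_bar = np.cumprod(1.0 - betas)

def noisy_features(x0, t):
    """Closed-form forward step: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# A denoiser would be trained to invert this corruption; here we only verify
# that distortion grows with t, which is what the denoiser must undo.
err = [np.mean((noisy_features(feats, t) - feats) ** 2) for t in (0, T - 1)]
print(err)
```

去噪器在各噪声强度下学习恢复干净特征,从而使下游分类器对输入扰动(包括对抗扰动)更不敏感。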

[CV-18] GENIE: Gram-Eigenmode INR Editing with Closed-Form Geometry Updates

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在不重新训练模型的情况下进行几何编辑的可行性问题。其核心挑战在于:尽管INRs能够以紧凑形式建模几何结构,但如何识别并实现可执行的几何变形仍不明确。解决方案的关键在于利用INR最后一层特征所诱导的Gram算子(Gram operator),从中提取出描述SDF零水平集变形的特征模式(deformation eigenmodes)。研究发现,这些变形模式并非仅由几何本身决定,而是依赖于从足够丰富采样分布中估计的Gram算子;基于此,作者提出一种闭式单步更新公式,无需优化即可完成几何编辑,并理论证明该方法仅在这些变形模式的线性空间内是适定的(well-posed)。

链接: https://arxiv.org/abs/2603.29860
作者: Samundra Karki,Adarsh Krishnamurthy,Baskar Ganapathysubramanian
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 9 figures

点击查看摘要

Abstract:Implicit Neural Representations (INRs) provide compact models of geometry, but it is unclear when their learned shapes can be edited without retraining. We show that the Gram operator induced by the INR’s penultimate features admits deformation eigenmodes that parameterize a family of realizable edits of the SDF zero level set. A key finding is that these modes are not intrinsic to the geometry alone: they are reliably recoverable only when the Gram operator is estimated from sufficiently rich sampling distributions. We derive a single closed-form update that performs geometric edits to the INR without optimization by leveraging the deformation modes. We characterize theoretically the precise set of deformations that are feasible under this one-shot update, and show that editing is well-posed exactly within the span of these deformation modes.
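
摘要的核心论断是:在最后一层为线性层时,可实现的 SDF 编辑恰好落在由倒数第二层特征的 Gram 算子导出的变形模式所张成的空间内。以下草图用随机特征验证这一线性代数事实(特征与采样点均为示意,与论文的 INR 实现无关):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical penultimate features of an INR at m sample points (m x d).
m, d = 200, 16
phi = rng.standard_normal((m, d))

# Gram operator over the sampled points and its eigenmodes.
gram = phi @ phi.T                       # (m, m), rank <= d
vals, vecs = np.linalg.eigh(gram)        # eigenvalues in ascending order
modes = vecs[:, -d:]                     # nonzero-eigenvalue deformation modes

# With a linear last layer, any weight edit dw changes the SDF by phi @ dw,
# which must lie in the span of these modes.
dw = rng.standard_normal(d)
df = phi @ dw
residual = df - modes @ (modes.T @ df)   # projection residual onto the mode span
print(np.linalg.norm(residual))
```

投影残差接近零,说明任何最后一层权重的闭式更新所诱导的几何变形都可用这些特征模式参数化;论文进一步指出模式的可恢复性依赖于足够丰富的采样分布。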

[CV-19] VectorGym: A Multitask Benchmark for SVG Code Generation Sketching and Editing

【速读】:该论文旨在解决当前缺乏与专业设计工作流程对齐的、真实且具有挑战性的可缩放矢量图形(SVG)基准测试问题,尤其在文本到SVG生成、草图到SVG转换、复杂编辑及视觉理解等任务上存在明显空白。解决方案的关键在于构建一个名为VectorGym的综合性基准套件,涵盖四个由专家人工标注的任务:Sketch2SVG(VG-Sketch)、SVG编辑(VG-Edit)、Text2SVG生成(VG-Text)和SVG描述生成(VG-Cap),其中VG-Edit引入了高阶原始对象和多步骤编辑,确保语义理解和设计意图的真实性;同时提出一种基于渲染奖励的多任务强化学习方法(GRPO结合课程学习),联合优化全部四类任务,并基于Qwen3-VL 8B模型实现开源模型中的最先进性能,甚至超越更大规模模型(如Qwen3-VL 235B),并达到与GPT-4o相当的效果,从而为视觉代码生成提供严谨评估框架。

链接: https://arxiv.org/abs/2603.29852
作者: Juan Rodriguez,Haotian Zhang,Abhay Puri,Tianyang Zhang,Rishav Pramanik,Meng Lin,Xiaoqing Xie,Marco Terral,Darsh Kaushik,Aly Shariff,Perouz Taslakian,Spandana Gella,Sai Rajeswar,David Vazquez,Christopher Pal,Marco Pedersoli
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, complex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on this http URL.

[CV-20] DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

【速读】:该论文旨在解决现有端到端视觉-语言-动作(Vision-Language-Action, VLA)模型在高阶决策与低阶运动执行之间衔接不足的问题,尤其是传统方法将预训练视觉语言模型(Vision-Language Model, VLM)仅用作多模态编码器,导致其丰富的语义表征被削弱且训练不稳定。解决方案的关键在于引入DIAL框架,通过一个可微的潜在意图瓶颈(differentiable latent intent bottleneck)实现高阶决策与低阶运动控制的解耦与协同优化:其中基于VLM的System-2负责在原生特征空间中合成潜在视觉前瞻(latent visual foresight),显式编码意图作为结构瓶颈;轻量级的System-1策略则通过潜在逆动力学(latent inverse dynamics)从当前观测和预测意图中解码出精确机器人动作。该设计结合两阶段训练机制——先解耦预热阶段确保系统稳定学习,再进行端到端联合优化,从而在保留预训练知识的同时,使动作感知梯度可控地微调VLM主干网络,显著提升性能并实现零样本泛化能力。

链接: https://arxiv.org/abs/2603.29844
作者: Yi Chen,Yuying Ge,Hui Zhou,Mingyu Ding,Yixiao Ge,Xihui Liu
机构: The University of Hong Kong (香港大学); XPENG Robotics (小鹏机器人); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM’s potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM’s native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.

[CV-21] Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data CVPR2026

【速读】:该论文旨在解决高分辨率全脑三维显微成像数据(特别是基于光片荧光显微镜,LSFM)在处理与分析中面临的可扩展性不足问题,以及现有视觉任务模型(如目标检测和分类)在该类数据上泛化能力差的挑战。其解决方案的关键在于构建CANVAS——一个包含六种神经元和免疫细胞类型标记、细胞注释及排行榜的大型基准数据集,覆盖完整小鼠大脑组织并达到亚细胞级分辨率,从而为开发适用于此类海量、复杂结构数据的高效算法和基础模型提供标准化测试平台,并揭示了细胞形态异质性对模型泛化性能的影响。

链接: https://arxiv.org/abs/2603.29842
作者: Minyoung E. Kim,Dae Hee Yun,Aditi V. Patel,Madeline Hon,Webster Guan,Taegeon Lee,Brian Nguyen
机构: LifeCanvas Technologies(生命画布科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, 12 figures. Accepted at CVPR 2026

点击查看摘要

Abstract:Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information; however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.

[CV-22] AutoFormBench: Benchmark Dataset for Automating Form Understanding

【速读】:该论文旨在解决结构化文档(如政府表格、医疗记录和企业发票)在真实场景中因版式高度变异性而导致的自动化处理难题。其解决方案的关键在于构建了一个包含407个标注真实表单的基准数据集AutoFormBench,并系统比较了传统OpenCV方法与四种YOLO架构(YOLOv8、YOLOv11、YOLOv26-s和YOLOv26-l)在检测和分类可填写表单元素(复选框、输入线和文本框)上的性能表现,结果表明YOLOv11在所有元素类别和容差水平下均展现出最优的F1分数和Jaccard准确率,成为最有效的检测模型。

链接: https://arxiv.org/abs/2603.29832
作者: Gaurab Baral,Junxiu Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Automated processing of structured documents such as government forms, healthcare records, and enterprise invoices remains a persistent challenge due to the high degree of layout variability encountered in real-world settings. This paper introduces AutoFormBench, a benchmark dataset of 407 annotated real-world forms spanning government, healthcare, and enterprise domains, designed to train and evaluate form element detection models. We present a systematic comparison of classical OpenCV approaches and four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, and YOLOv26-l) for localizing and classifying fillable form elements, specifically checkboxes, input lines, and text boxes, across diverse PDF document types. YOLOv11 demonstrates consistently superior performance in both F1 score and Jaccard accuracy across all element classes and tolerance levels.
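
摘要以 F1 与 Jaccard(IoU)指标评估表单元素检测。下面给出在给定 IoU 阈值下对预测框与标注框做贪心一对一匹配并计算 F1 的最小草图(框格式、阈值与示例数据均为示意):

```python
def iou(a, b):
    """Jaccard index of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def f1_at_iou(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth at an IoU cutoff."""
    matched, used = 0, set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and iou(p, g) >= thr:
                used.add(i)
                matched += 1
                break
    prec = matched / len(preds) if preds else 0.0
    rec = matched / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]   # one good hit, one false positive
print(f1_at_iou(preds, gts))
```

调整 `thr` 即对应摘要中"不同容差水平"下的评估:阈值越高,对定位精度的要求越严。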

[CV-23] SceneTeract: Agent ic Functional Affordances and VLM Grounding in 3D Scenes

【速读】:该论文旨在解决具身人工智能(Embodied AI)中3D场景功能可及性评估的难题,即如何在多样化用户需求下验证虚拟环境是否支持有意义的交互活动。其核心挑战在于现有方法难以准确判断场景是否具备物理层面的实际可用性,如可达性、通行空间和操作可行性等。解决方案的关键在于提出SceneTeract框架,该框架通过一个基于语义与几何耦合的验证引擎,将复杂任务分解为原子动作序列,并结合显式物理和几何仿真,对每个步骤进行条件化验证,从而实现对特定代理(agent)行为约束下的场景功能有效性评估。

链接: https://arxiv.org/abs/2603.29798
作者: Léopold Maillard,Francis Engelmann,Tom Durand,Boxiao Pan,Yang You,Or Litany,Leonidas Guibas,Maks Ovsjanikov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.
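
摘要中"将活动分解为原子动作并逐项做可达性、净空等几何校验"的思路,可以用一个极简的点到线段净空检查来示意(坐标、臂长与净空半径均为假设值,与论文的验证引擎无关):

```python
import numpy as np

def reachable(agent_pos, target, arm_reach=0.8, obstacles=(), clearance=0.1):
    """Toy accessibility check: target within reach and no obstacle blocking the line."""
    agent_pos, target = np.asarray(agent_pos, float), np.asarray(target, float)
    if np.linalg.norm(target - agent_pos) > arm_reach:
        return False
    for obs in obstacles:                # point obstacles with a clearance radius
        obs = np.asarray(obs, float)
        seg = target - agent_pos
        # Closest point on the agent-to-target segment to the obstacle.
        t = np.clip(np.dot(obs - agent_pos, seg) / np.dot(seg, seg), 0.0, 1.0)
        if np.linalg.norm(agent_pos + t * seg - obs) < clearance:
            return False
    return True

print(reachable([0, 0, 1], [0.5, 0, 1]))                            # within reach
print(reachable([0, 0, 1], [0.5, 0, 1], obstacles=[[0.25, 0, 1]]))  # blocked
```

实际系统中每个原子动作会附带类似的显式几何/物理校验,并以具身代理的身体参数(臂长、体宽等)为条件。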

[CV-24] Multi-Feature Fusion Approach for Generative AI Images Detection

【速读】:该论文旨在解决生成式 AI (Generative AI, GenAI) 生成图像日益逼真所带来的图像真实性鉴别难题,传统单特征检测方法(如统计规律、语义嵌入或纹理模式)在面对多样化和持续演进的生成模型时表现出鲁棒性不足的问题。其解决方案的关键在于提出一种多特征融合框架,通过整合三个互补的特征空间:(1) 低层统计偏差的均值减去对比度归一化 (Mean Subtracted Contrast Normalized, MSCN) 特征;(2) 高层语义一致性的 CLIP 嵌入;(3) 中层纹理异常的多尺度局部二值模式 (Multi-scale Local Binary Patterns, MLBP)。实验表明,单一特征空间性能波动显著,而三者融合后在多种生成模型混合场景下展现出更优且稳定的检测效果,显著优于当前最优方法。

链接: https://arxiv.org/abs/2603.29788
作者: Abderrezzaq Sendjasni,Mohamed-Chaker Larabi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to IEEE Transactions for possible publication

点击查看摘要

Abstract:The rapid evolution of Generative AI (GenAI) models has led to synthetic images of unprecedented realism, challenging traditional methods for distinguishing them from natural photographs. While existing detectors often rely on single-feature spaces, such as statistical regularities, semantic embeddings, or texture patterns, these approaches tend to lack robustness when confronted with diverse and evolving generative models. In this work, we investigate and systematically evaluate a multi-feature fusion framework that combines complementary cues from three distinct spaces: (1) Mean Subtracted Contrast Normalized (MSCN) features capturing low-level statistical deviations; (2) CLIP embeddings encoding high-level semantic coherence; and (3) Multi-scale Local Binary Patterns (MLBP) characterizing mid-level texture anomalies. Through extensive experiments on four benchmark datasets covering a wide range of generative models, we show that individual feature spaces exhibit significant performance variability across different generators. Crucially, the fusion of all three representations yields superior and more consistent performance, particularly in a challenging mixed-model scenario. Compared to state-of-the-art methods, the proposed framework yields consistently improved performance across all evaluated datasets. Overall, this work highlights the importance of hybrid representations for robust GenAI image detection and provides a principled framework for integrating complementary visual cues.
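
摘要中的 MSCN 特征遵循经典的局部均值减除与对比度归一化定义 (I − μ)/(σ + C)。以下为用盒式窗口近似局部统计的纯 numpy 草图(标准定义通常使用高斯加权窗,此处为保持自包含而简化):

```python
import numpy as np

def local_mean(img, k=7):
    """Box-filter local mean via edge-padded sliding windows."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def mscn(img, k=7, c=1e-3):
    """MSCN coefficients: divisive normalization (I - mu) / (sd + c)."""
    mu = local_mean(img, k)
    var = np.clip(local_mean(img * img, k) - mu * mu, 0.0, None)
    return (img - mu) / (np.sqrt(var) + c)

rng = np.random.default_rng(0)
natural_like = rng.standard_normal((64, 64)) * 0.1 + np.linspace(0, 1, 64)
coeffs = mscn(natural_like)
print(coeffs.mean(), coeffs.std())
```

自然图像的 MSCN 系数近似零均值、方差稳定;生成图像往往偏离这一统计规律,这正是该特征可用于 GenAI 图像检测的原因。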

[CV-25] MAPLE: Multi-Path Adaptive Propagation with Level-Aware Embeddings for Hierarchical Multi-Label Image Classification

【速读】:该论文旨在解决遥感图像中层次化多标签分类(Hierarchical Multi-Label Classification, HMLC)在多路径场景下的挑战,即当图像激活多个分类分支时,现有方法难以充分利用层级结构信息,导致性能受限。解决方案的关键在于提出MAPLE框架,其核心创新包括:(i) 基于图感知文本描述的层次语义初始化,(ii) 通过图卷积网络(Graph Convolutional Networks, GCNs)进行结构编码以建模标签间的拓扑关系,以及 (iii) 自适应多模态融合机制,动态平衡语义先验与视觉证据;此外,引入自适应层级感知目标函数,自动为每一层级选择最优损失函数,从而在少样本场景下实现最高达+42%的性能提升,且仅增加2.6%的参数开销。

链接: https://arxiv.org/abs/2603.29784
作者: Boshko Koloski,Marjan Stoimchev,Jurica Levatić,Dragi Kocev,Sašo Džeroski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: REO: Advances in Representation Learning for Earth Observation, accepted workshop paper at EurIPS

点击查看摘要

Abstract:Hierarchical multi-label classification (HMLC) is essential for modeling structured label dependencies in remote sensing. Yet existing approaches struggle in multi-path settings, where images may activate multiple taxonomic branches, leading to underuse of hierarchical information. We propose MAPLE (Multi-Path Adaptive Propagation with Level-Aware Embeddings), a framework that integrates (i) hierarchical semantic initialization from graph-aware textual descriptions, (ii) graph-based structure encoding via graph convolutional networks (GCNs), and (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence. An adaptive level-aware objective automatically selects appropriate losses per hierarchy level. Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet) show consistent improvements of up to +42% in few-shot regimes while adding only 2.6% parameter overhead, demonstrating that MAPLE effectively and efficiently models hierarchical semantics for Earth observation (EO).
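
摘要中通过 GCN 对标签层级结构做编码,其单层传播即经典的对称归一化形式 D^{-1/2}(A+I)D^{-1/2}XW。以下为最小 numpy 草图(4 个标签的玩具层级与嵌入维度均为示意):

```python
import numpy as np

def gcn_layer(adj, x, w):
    """One GCN propagation step: ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(norm @ x @ w, 0.0)           # ReLU

# Hypothetical 4-label hierarchy: 0 is the root, 1-2 its children, 3 a grandchild.
adj = np.zeros((4, 4))
for parent, child in [(0, 1), (0, 2), (2, 3)]:
    adj[parent, child] = adj[child, parent] = 1.0

rng = np.random.default_rng(0)
x = rng.random((4, 8))                             # initial label embeddings
w = rng.random((8, 8)) * 0.1
h = gcn_layer(adj, x, w)
print(h.shape)
```

多层堆叠后,每个标签的嵌入会融合其祖先与后代的信息,从而在多路径层级下为视觉-语义融合提供结构先验。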

[CV-26] From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

【速读】:该论文旨在解决公共空间中潜在暴力行为的实时可靠检测问题,尤其针对边缘计算环境下因延迟、隐私保护和资源限制导致的自动化视频分析部署难题。解决方案的关键在于设计并实现了一个混合边缘动作检测系统,该系统融合了基于骨架(skeleton-based)的运动分析与视觉-语言模型(vision-language models)的语义场景理解能力:前者提供低开销、持续且隐私友好的监控,后者则赋予系统对复杂及未见过情境的零样本推理能力。通过在GPU加速的边缘设备上进行演示验证,研究比较了两种范式在实际约束下的性能表现,揭示了运动感知与语义理解的互补性,从而提出一种选择性增强机制——以快速骨架检测为基础,按需引入高层语义推理,构建出兼顾效率与准确性的实用化公共安全视频分析框架。

链接: https://arxiv.org/abs/2603.29777
作者: Ganen Sethupathy,Lalit Dumka,Jan Schagen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint version of a manuscript currently under review at IEEE Access

点击查看摘要

Abstract:Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.

[CV-27] Beyond Ground-Truth: Leverag ing Image Quality Priors for Real-World Image Restoration CVPR

【速读】:该论文旨在解决真实世界图像恢复(Real-world Image Restoration)中因依赖带标签的高质量(HQ)图像作为监督信号而导致模型收敛于训练数据平均感知质量、而非最优感知质量的问题。现有方法假设标注数据(Ground-truth, GT)具有完美一致性,但实际GT仍可能存在感知保真度不一致的情况,从而限制了恢复效果的上限。解决方案的关键在于引入一个从预训练无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)模型中提取的图像质量先验(Image Quality Prior, IQP),通过三个核心机制实现感知最优输出:(1) 质量条件Transformer,利用NR-IQA得分作为条件信号引导预测表示向最大感知质量方向优化;(2) 双分支码本结构,分离通用特征与HQ特有特征,增强对结构信息和质量敏感属性的联合建模;(3) 基于离散表示的质量优化策略,缓解连续潜在空间中的过优化问题。该框架无需修改现有架构即可插件式集成,显著提升真实场景下的图像恢复质量。

链接: https://arxiv.org/abs/2603.29773
作者: Fengyang Xiao,Peng Hu,Lei Xu,XingE Guo,Guanyi Qin,Yuqi Shen,Chengyu Fang,Rihan Zhang,Chunming He,Sina Farsiu
机构: Duke University(杜克大学); Tsinghua University(清华大学); EPFL(洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR

点击查看摘要

Abstract:Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP)-extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models-to guide the restoration process toward perceptually optimal outputs explicitly. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality. This design provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; and (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available.

[CV-28] SHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在室内安全风险评估任务中存在的三大核心问题:一是依赖仿真软件构建的合成数据导致与真实环境存在显著领域差距;二是安全任务设计过于简化,受限于人工设定的危险类型和场景,限制了模型泛化能力;三是缺乏严谨的评估协议来全面测试模型在复杂家庭安全场景下的表现。解决方案的关键在于提出一个名为TSHA(Trustworthy Safety Hazards Assessment)的综合性基准,其包含81,809个来自四个互补来源(现有室内数据集、互联网图像、AIGC生成图像及新采集图像)的训练样本,并配备一个包含1707个样本的高难度测试集,涵盖从训练分布中精心选取的样本以及新增的视频和全景图像,其中包含多重安全隐患,从而有效评估模型在复杂场景中的鲁棒性。实验证明,基于TSHA训练的模型不仅在该基准上性能提升最高达+18.3点,且在其他基准上也展现出更强的泛化能力,凸显了该基准的重要价值。

链接: https://arxiv.org/abs/2603.29759
作者: Qiucheng Yu,Ruijie Xu,Mingang Chen,Xuequan Lu,Jianfeng Dong,Chaochao Lu,Xin Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have accelerated their application to indoor safety hazards assessment. However, existing benchmarks suffer from three fundamental limitations: (1) heavy reliance on synthetic datasets constructed via simulation software, creating a significant domain gap with real-world environments; (2) oversimplified safety tasks with artificial constraints on hazard and scene types, thereby limiting model generalization; and (3) absence of rigorous evaluation protocols to thoroughly assess model capabilities in complex home safety scenarios. To address these challenges, we introduce TSHA (Trustworthy Safety Hazards Assessment), a comprehensive benchmark comprising 81,809 carefully curated training samples drawn from four complementary sources: existing indoor datasets, internet images, AIGC images, and newly captured images. This benchmark set also includes a highly challenging test set with 1707 samples, comprising not only a carefully selected subset from the training distribution but also newly added videos and panoramic images containing multiple safety hazards, used to evaluate the model’s robustness in complex safety scenarios. Extensive experiments on 23 popular VLMs demonstrate that current VLMs lack robust capabilities for safety hazard assessment. Importantly, models trained on the TSHA training set not only achieve a significant performance improvement of up to +18.3 points on the TSHA test set but also exhibit enhanced generalizability across other benchmarks, underscoring the substantial contribution and importance of the TSHA benchmark.

[CV-29] SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

【速读】:该论文旨在解决基于扩散模型的水印方法中存在的一个根本性漏洞:现有水印方案依赖于扩散轨迹的精确重建才能完成验证,而这一假设使得水印容易受到攻击。解决方案的关键在于提出一种无需训练的攻击方法——Stochastic Hidden-Trajectory Deflection (SHIFT),其核心机制是利用随机扩散重采样技术,在潜在空间中偏转生成轨迹,使重构图像在统计上与原始带水印轨迹解耦,同时保持视觉质量和语义一致性。实验表明,SHIFT在九种代表性水印方法上均实现了95%–100%的攻击成功率,且无需任何水印特定知识或模型重训练。

链接: https://arxiv.org/abs/2603.29742
作者: Rui Bao,Zheng Gao,Xiaoyu Li,Xiaoyan Feng,Yang Song,Jiaojiao Jiang
机构: University of New South Wales (新南威尔士大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Diffusion-based watermarking methods embed verifiable marks by manipulating the initial noise or the reverse diffusion trajectory. However, these methods share a critical assumption: verification can succeed only if the diffusion trajectory can be faithfully reconstructed. This reliance on trajectory recovery constitutes a fundamental and exploitable vulnerability. We propose Stochastic Hidden-Trajectory Deflection (SHIFT), a training-free attack that exploits this common weakness across diverse watermarking paradigms. SHIFT leverages stochastic diffusion resampling to deflect the generative trajectory in latent space, making the reconstructed image statistically decoupled from the original watermark-embedded trajectory while preserving strong visual quality and semantic consistency. Extensive experiments on nine representative watermarking methods spanning noise-space, frequency-domain, and optimization-based paradigms show that SHIFT achieves 95%–100% attack success rates with nearly no loss in semantic quality, without requiring any watermark-specific knowledge or model retraining.

[CV-30] GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis CVPR

【速读】:该论文旨在解决从单目视频中合成动态场景新视角的难题,尤其针对现有方法在高度动态区域因多视角信息难以利用而失效、以及基于扩散模型的方法存在几何不一致性的问题。其解决方案的关键在于提出一种新型模型,包含两个核心组件:(1) 一个循环结构实现输入与目标视频之间无界且异步的映射;(2) 利用平面扫掠(plane sweeps)对动态输入进行高效处理,从而解耦相机运动与场景运动,实现精细的六自由度(6-DOF)相机控制,并在UCSD和Kubric-4D-dyn两个数据集上验证了其在静态与动态区域重建几何细节方面的优越性能。

链接: https://arxiv.org/abs/2603.29734
作者: Thomas Tanay,Mohammed Brahimi,Michal Nazarczuk,Qingwen Zhang,Sibi Catley-Chandar,Arthur Moreau,Zhensong Zhang,Eduardo Pérez-Pellitero
机构: Huawei Noah’s Ark Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings 2026

点击查看摘要

Abstract:Synthesizing novel views from monocular videos of dynamic scenes remains a challenging problem. Scene-specific methods that optimize 4D representations with explicit motion priors often break down in highly dynamic regions where multi-view information is hard to exploit. Diffusion-based approaches that integrate camera control into large pre-trained models can produce visually plausible videos but frequently suffer from geometric inconsistencies across both static and dynamic areas. Both families of methods also require substantial computational resources. Building on the success of generalizable models for static novel view synthesis, we adapt the framework to dynamic inputs and propose a new model with two key components: (1) a recurrent loop that enables unbounded and asynchronous mapping between input and target videos and (2) an efficient use of plane sweeps over dynamic inputs to disentangle camera and scene motion, and achieve fine-grained, six-degrees-of-freedom camera controls. We train and evaluate our model on the UCSD dataset and on Kubric-4D-dyn, a new monocular dynamic dataset featuring longer, higher resolution sequences with more complex scene dynamics than existing alternatives. Our model outperforms four Gaussian Splatting-based scene-specific approaches, as well as two diffusion-based approaches in reconstructing fine-grained geometric details across both static and dynamic regions.

[CV-31] Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection

【速读】:该论文旨在解决手-物体交互(Hand-Object Interaction, HOI)检测在真实标注数据稀缺时性能受限的问题。其核心解决方案是利用合成数据(synthetic data)增强模型训练,尤其在仅使用10%真实标注数据的情况下,通过高质量的合成数据显著提升HOI检测性能。关键在于构建与真实世界基准在物体类别、抓握方式和环境特征上高度对齐的合成数据,并设计了一套自动标注的生成管道,从而实现跨数据集的泛化能力提升,如在VISOR、EgoHOS和ENIGMA-51上分别获得+5.67%、+8.24%和+11.69%的Overall AP增益。

链接: https://arxiv.org/abs/2603.29733
作者: Rosario Leonardi,Antonino Furnari,Francesco Ragusa,Giovanni Maria Farinella
机构: University of Catania (卡塔尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we explore the role of synthetic data in improving the detection of Hand-Object Interactions from egocentric images. Through extensive experimentation and comparative analysis on VISOR, EgoHOS, and ENIGMA-51 datasets, our findings demonstrate the potential of synthetic data to significantly improve HOI detection, particularly when real labeled data are scarce or unavailable. By using synthetic data and only 10% of the real labeled data, we achieve improvements in Overall AP over models trained exclusively on real data, with gains of +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Furthermore, we systematically study how aligning synthetic data to specific real-world benchmarks with respect to objects, grasps, and environments, showing that the effectiveness of synthetic data consistently improves with better synthetic-real alignment. As a result of this work, we release a new data generation pipeline and the new HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interaction. These data are automatically annotated with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. All data, code, and tools for synthetic data generation are available at: this https URL.

[CV-32] Compressive sensing inspired self-supervised single-pixel imaging

【速读】:该论文旨在解决单像素成像(Single-pixel Imaging, SPI)中因缺乏物理稀疏性约束以及局部与全局特征融合不足而导致的噪声敏感性高、结构失真和细节模糊等问题。其解决方案的关键在于提出SISTA-Net,一种受压缩感知启发的自监督方法,该方法将迭代收缩阈值算法(Iterative Shrinkage-Thresholding Algorithm, ISTA)展开为可解释的神经网络结构,包含数据保真模块和近似映射模块;其中保真模块采用混合CNN-视觉状态空间模型(Visual State Space Model, VSSM)架构以协同建模局部与全局特征,提升重建完整性与保真度;同时引入深度非线性网络作为自适应稀疏变换,并结合可学习软阈值算子,在潜在域显式施加物理稀疏性约束,从而实现对噪声的有效抑制及在极低采样率下的强鲁棒性。

链接: https://arxiv.org/abs/2603.29732
作者: Jijun Lu,Yifan Chen,Libang Chen,Yiqiang Zhou,Ye Zheng,Mingliang Chen,Zhe Sun,Xuelong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures, 2 algorithms, 2 tables, journal paper

点击查看摘要

Abstract:Single-pixel imaging (SPI) is a promising imaging modality with distinctive advantages in strongly perturbed environments. Existing SPI methods lack physical sparsity constraints and overlook the integration of local and global features, leading to severe noise vulnerability, structural distortions and blurred details. To address these limitations, we propose SISTA-Net, a compressive sensing-inspired self-supervised method for single-pixel imaging. SISTA-Net unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) into an interpretable network consisting of a data fidelity module and a proximal mapping module. The fidelity module adopts a hybrid CNN-Visual State Space Model (VSSM) architecture to integrate local and global feature modeling, enhancing reconstruction integrity and fidelity. We leverage deep nonlinear networks as adaptive sparse transforms combined with a learnable soft-thresholding operator to impose explicit physical sparsity in the latent domain, enabling noise suppression and robustness to interference even at extremely low sampling rates. Extensive experiments on multiple simulation scenarios demonstrate that SISTA-Net outperforms state-of-the-art methods by 2.6 dB in PSNR. Real-world far-field underwater tests yield a 3.4 dB average PSNR improvement, validating its robust anti-interference capability.

[CV-33] FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing

【速读】:该论文旨在解决当前人脸表情图像编辑任务中缺乏高质量数据集与准确评估指标的问题,特别是现有基准测试在保留人脸身份和背景的同时实现精细表情控制的能力不足,且评价指标存在系统性偏差(如偏好“懒惰编辑”或“过拟合编辑”)。其解决方案的关键在于提出FED-Bench基准平台,包含三个核心创新:首先,通过级联可扩展的流水线构建747组三元组数据(原始图像、编辑指令、真实标注图像),确保数据质量与标注精度;其次,设计FED-Score跨粒度评估协议,将评估解耦为三个维度——Alignment(指令遵循度)、Fidelity(图像质量和身份保真度)以及Relative Expression Gain(表情变化幅度),有效缓解评价偏差;最后,基于该基准对18种主流图像编辑模型进行系统评测,揭示当前方法在高保真度与精确表情操控之间难以兼顾,且细粒度指令理解是主要瓶颈,并进一步提供一个20k+野外采集的人脸训练集以提升模型性能。

链接: https://arxiv.org/abs/2603.29697
作者: Fengjian Xue,Xuecheng Wu,Heli Sun,Yunyun Shi,Shi Chen,Liangyu Fu,Jinheng Xie,Dingkang Yang,Hao Wang,Junxiao Xue,Liang He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial expression image editing requires fine-grained control to strictly preserve human identity and background while precisely manipulating expression. However, existing editing benchmarks primarily focus on general scenarios, lacking high-quality facial images and corresponding editing instructions. Furthermore, current evaluation metrics exhibit systemic biases in this task, often favoring lazy editing or overfit editing. To bridge these gaps, we propose FED-Bench, a comprehensive benchmark featuring rigorous testing and an accurate evaluation suite. First, we carefully construct a benchmark of 747 triplets through a cascaded and scalable pipeline, each comprising an original image, an editing instruction, and a ground-truth image for precise evaluation. Second, we introduce FED-Score, a cross-granularity evaluation protocol that disentangles assessment into three dimensions: Alignment for verifying instruction following, Fidelity for testing image quality and identity preservation, and Relative Expression Gain for quantifying the magnitude of expression changes, effectively mitigating the aforementioned evaluation biases. Third, we benchmark 18 image editing models, revealing that current approaches struggle to simultaneously achieve high fidelity and accurate expression manipulation, with fine-grained instruction following identified as the primary bottleneck. Finally, leveraging the scalable characteristic of introduced benchmark engine, we provide a 20k+ in-the-wild facial training set and demonstrate its effectiveness by fine-tuning a baseline model that achieves significant performance gains. Our benchmark and related code will be made publicly open soon.

[CV-34] Exploring the Impact of Skin Color on Skin Lesion Segmentation

【速读】:该论文旨在解决皮肤病变分割(skin lesion segmentation)中因肤色差异导致的公平性问题,尤其关注现有研究多依赖离散肤色分类(如Fitzpatrick分型)而忽视连续色素分布特征对模型性能影响的局限性。其解决方案的关键在于引入基于像素级个体类型角(Individual Typology Angle, ITA)值的连续对比度分析方法,通过计算图像内皮肤区域、病灶区域及全图区域之间的Wasserstein距离来量化病变与皮肤间的对比度,并发现低病变-皮肤对比度是导致分割误差的主要因素,而非整体肤色等级。这一分布式的量化方式相较于传统离散肤色类别能更精确地识别模型性能下降的根本原因,为提升皮肤镜图像分割的公平性和鲁棒性提供了新的评估框架和改进方向。

链接: https://arxiv.org/abs/2603.29694
作者: Kuniko Paxton,Medina Kapo,Amila Akagić,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos
机构: University of Hull (赫尔大学); University of Sarajevo (萨拉热窝大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Skin cancer, particularly melanoma, remains a major cause of morbidity and mortality, making early detection critical. AI-driven dermatology systems often rely on skin lesion segmentation as a preprocessing step to delineate the lesion from surrounding skin and support downstream analysis. While fairness concerns regarding skin tone have been widely studied for lesion classification, the influence of skin tone on the segmentation stage remains under-quantified and is frequently assessed using coarse, discrete skin tone categories. In this work, we evaluate three strong segmentation architectures (UNet, DeepLabV3 with a ResNet50 backbone, and DINOv2) on two public dermoscopic datasets (HAM10000 and ISIC2017) and introduce a continuous pigment or contrast analysis that treats pixel-wise ITA values as distributions. Using Wasserstein distances between within-image distributions for skin-only, lesion-only, and whole-image regions, we quantify lesion skin contrast and relate it to segmentation performance across multiple metrics. Within the range represented in these datasets, global skin tone metrics (Fitzpatrick grouping or mean ITA) show weak association with segmentation quality. In contrast, low lesion-skin contrast is consistently associated with larger segmentation errors in models, indicating that boundary ambiguity and low contrast are key drivers of failure. These findings suggest that fairness improvements in dermoscopic segmentation should prioritize robust handling of low-contrast lesions, and the distribution-based pigment measures provide a more informative audit signal than discrete skin-tone categories.
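下面给出摘要中"以像素级 ITA 分布间的 Wasserstein 距离量化病灶-皮肤对比度"这一计算的最小示意。ITA 采用皮肤科常用的个体类型角公式 ITA = arctan((L*−50)/b*)·180/π;像素取值与区域划分为随机模拟的示意数据,并非论文数据集或官方实现:

```python
import numpy as np

def ita_deg(L_star, b_star):
    # 个体类型角 (CIELAB 空间): arctan((L* - 50) / b*), 以度为单位
    return np.degrees(np.arctan2(L_star - 50.0, b_star))

def wasserstein_1d(a, b, n=512):
    # 一维 W1 距离的分位数近似: 对 |F_a^{-1}(q) - F_b^{-1}(q)| 在 q 上求均值
    qs = (np.arange(n) + 0.5) / n
    return float(np.mean(np.abs(np.quantile(a, qs) - np.quantile(b, qs))))

rng = np.random.default_rng(0)
# 示意: 皮肤区域与病灶区域像素的 (L*, b*) 值
skin_L, skin_b = rng.normal(65, 3, 5000), rng.normal(18, 2, 5000)
lesion_L, lesion_b = rng.normal(45, 4, 2000), rng.normal(22, 3, 2000)

contrast = wasserstein_1d(ita_deg(lesion_L, lesion_b),
                          ita_deg(skin_L, skin_b))
print(f"lesion-skin ITA contrast (W1): {contrast:.1f} deg")
```

按论文结论,这一 W1 对比度越小(病灶与皮肤分布越接近),分割误差往往越大。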

[CV-35] SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition CVPR2026

【速读】:该论文旨在解决零样本骨架动作识别(zero-shot skeleton-based action recognition)中因缺乏上下文线索(如动作涉及的物体)而导致的骨架表示与语义表示之间存在固有鸿沟的问题,从而难以区分视觉上相似的动作。解决方案的关键在于提出SkeletonContext框架,其核心创新包括:1)引入跨模态上下文提示模块(Cross-Modal Context Prompt Module),利用预训练语言模型(LLM)重建被掩码的上下文提示,并将语言驱动的语义信息引导至骨架编码器,实现实例级语义锚定与跨模态对齐;2)设计关键部位解耦模块(Key-Part Decoupling Module),分离与运动相关的关节特征,确保在无显式物体交互情况下仍能保持鲁棒的动作理解能力。

链接: https://arxiv.org/abs/2603.29692
作者: Ning Wang,Tieyue Wu,Naeha Sharif,Farid Boussaid,Guangming Zhu,Lin Mei,Mohammed Bennamoun,zhang liang
机构: Chang’an University(长安大学); Xidian University(西安电子科技大学); University of Western Australia(西澳大利亚大学); Donghai Lab(东海实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.

[CV-36] Clinical DVH metrics as a loss function for 3D dose prediction in head and neck radiotherapy

【速读】:该论文旨在解决深度学习驱动的三维(3D)剂量预测模型在头颈部(H&N)放疗自动化流程中,因采用体素级回归损失(voxel-wise regression losses)而导致与临床计划评估标准(基于剂量-体积直方图,DVH)不一致的问题。传统方法难以直接优化临床关注的核心DVH指标,如靶区覆盖和器官受照剂量约束。其解决方案的关键在于提出一种临床引导的DVH指标损失函数(Clinical DVH Metric Loss, CDM loss),该损失函数通过可微分的D-metrics(剂量指标)和代理V-metrics(体积指标)实现对临床DVH指标的直接优化,同时引入无损位掩码区域感兴趣(ROI)编码策略以显著提升训练效率——实验表明该方法在保持OAR sparing性能的同时,使PTV评分从1.544(MAE损失)降至0.491,并将训练时间减少83%,GPU内存占用降低,从而为H&N剂量预测提供了一个更贴近临床需求、计算高效且可扩展的框架。

链接: https://arxiv.org/abs/2603.29670
作者: Ruochen Gao,Marius Staring,Frank Dankers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages

点击查看摘要

Abstract:Purpose: Deep-learning-based three-dimensional (3D) dose prediction is widely used in automated radiotherapy workflows. However, most existing models are trained with voxel-wise regression losses, which are poorly aligned with clinical plan evaluation criteria based on dose-volume histogram (DVH) metrics. This study aims to develop a clinically guided loss formulation that directly optimizes clinically used DVH metrics while remaining computationally efficient for head and neck (H&N) dose prediction. Methods: We propose a clinical DVH metric loss (CDM loss) that incorporates differentiable D-metrics and surrogate V-metrics, together with a lossless bit-mask region-of-interest (ROI) encoding to improve training efficiency. The method was evaluated on 174 H&N patients using a temporal split (137 training, 37 testing). Results: Compared with MAE- and DVH-curve based losses, CDM loss substantially improved target coverage and satisfied all clinical constraints. Using a standard 3D U-Net, the PTV Score was reduced from 1.544 (MAE) to 0.491 (MAE + CDM), while OAR sparing remained comparable. Bit-mask encoding reduced training time by 83% and lowered GPU memory usage. Conclusion: Directly optimizing clinically used DVH metrics enables 3D dose predictions that are better aligned with clinical treatment planning criteria than conventional voxel-wise or DVH-curve-based supervision. The proposed CDM loss, combined with efficient ROI bit-mask encoding, provides a practical and scalable framework for H&N dose prediction.
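摘要中的 D-metrics 与 V-metrics 对应标准 DVH 指标。下面给出两者的非可微 NumPy 参考实现作为示意;论文的 CDM 损失需要可微替代(例如以软排序或 sigmoid 近似分位数与指示函数),此处剂量数据为随机模拟,并非论文数据:

```python
import numpy as np

def d_metric(dose, x_percent):
    # D_x: 受照剂量最高的 x% 体积所接受的最低剂量
    return float(np.percentile(dose, 100.0 - x_percent))

def v_metric(dose, x_gy):
    # V_x: 接受剂量 >= x Gy 的体素占比
    return float(np.mean(dose >= x_gy))

rng = np.random.default_rng(0)
ptv_dose = rng.normal(70.0, 1.5, 10000)    # 示意: PTV 内体素剂量 (Gy)
print("D95 =", round(d_metric(ptv_dose, 95), 2), "Gy;",
      "V66.5 =", round(100 * v_metric(ptv_dose, 66.5), 1), "%")
```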

[CV-37] CoRe-DA: Contrastive Regression for Unsupervised Domain Adaptation in Surgical Skill Assessment

【速读】:该论文旨在解决视觉引导的手术技能评估(Vision-based Surgical Skill Assessment, SSA)中因手动标注成本高、耗时长以及现有回归模型在新手术任务和环境中的泛化能力差所导致的瓶颈问题。其解决方案的关键在于提出一种新颖的对比回归域自适应(Contrastive Regression-based Domain Adaptation, CoRe-DA)框架,通过相对评分监督(relative-score supervision)学习域不变表征,并结合目标域自训练(target-domain self-training)策略,在无需任何目标域标签的情况下实现跨域鲁棒的技能评分预测,从而显著提升SSA的可扩展性和跨场景适用性。

链接: https://arxiv.org/abs/2603.29666
作者: Dimitrios Anastasiou,Razvan Caramalau,Jialang Xu,Runlong He,Freweini Tesfai,Matthew Boal,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-based surgical skill assessment (SSA) enables objective and scalable evaluation of operative performance. Progress in this field is constrained by the high cost and time demands for manual annotation of quantitative skill scores, as well as the poor generalization of existing regression models to new surgical tasks and environments. Meanwhile, appreciable volumes of unlabeled video data are now available, motivating the development of unsupervised domain adaptation (UDA) methods for SSA. We introduce the first benchmark for UDA in SSA regression, spanning four datasets across dry-lab and clinical settings as well as open and robotic surgery. We evaluate eight representative models under challenging domain shifts and propose CoRe-DA, a novel contrastive regression-based adaptation framework. Our method learns domain-invariant representations through relative-score supervision and target-domain self-training. Comprehensive experiments across two UDA settings show that CoRe-DA is superior to state-of-the-art methods, achieving Spearman Correlation Coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets, respectively, without using any labeled target data for training. Overall, CoRe-DA enables scalable SSA with reliable cross-domain generalization, where existing methods underperform. Our code and datasets will be released at this https URL.
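摘要未给出 relative-score supervision 的具体形式;作为一种常见的实现思路,下面示意一个成对边际排序损失——仅利用源域样本两两之间的相对技能分数顺序作监督(margin 取值与损失形式均为示意性假设,并非 CoRe-DA 的官方定义):

```python
import numpy as np

def pairwise_ranking_loss(pred, score, margin=0.5):
    # 对每对样本 (i, j): 若 score[i] > score[j], 要求 pred[i] - pred[j] >= margin
    loss, n = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if score[i] > score[j]:
                loss += max(0.0, margin - (pred[i] - pred[j]))
                n += 1
    return loss / max(n, 1)

score = np.array([1.0, 2.0, 3.0])          # 真实技能分数
good = np.array([0.0, 1.0, 2.0])           # 预测顺序正确且间隔足够
bad = np.array([2.0, 1.0, 0.0])            # 预测顺序完全颠倒
print(pairwise_ranking_loss(good, score), pairwise_ranking_loss(bad, score))
```

这类相对顺序监督只约束排序而非绝对取值,直觉上更容易在域偏移下保持有效,这与摘要中学习域不变表征的动机一致。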

[CV-38] CutClaw: Agent ic Hours-Long Video Editing via Music Synchronization

【速读】:该论文旨在解决长时视频素材手动剪辑耗时且重复性高的问题,尤其在社交媒体场景下对音画同步、叙事连贯性和视觉美感的高要求难以通过人工实现。其解决方案的关键在于提出CutClaw框架,该框架是一个基于多智能体(multi-agent)的自主编辑系统,利用多个多模态大语言模型(Multimodal Language Models, MLLMs)协同工作:首先采用分层多模态分解方法提取视觉与音频内容中的细粒度特征和全局结构;其次由Playwriter Agent统筹叙事流程并锚定视觉场景与音乐节奏变化;最后由Editor与Reviewer Agent协作优化最终剪辑结果,基于严格的美学与语义标准选择画面片段,从而实现高质量、节奏对齐的短视频生成。

链接: https://arxiv.org/abs/2603.29664
作者: Shifang Zhao,Yihan Hu,Ying Shan,Yunchao Wei,Xiaodong Cun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Code: this https URL

点击查看摘要

Abstract:Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: this https URL.

[CV-39] Not All Frames Are Equal: Complexity-Aware Masked Motion Generation via Motion Spectral Descriptors

【速读】:该论文旨在解决当前基于掩码的运动生成模型在处理动态复杂度变化显著的运动序列时存在的性能下降问题,即现有方法对运动帧的处理过于均匀,未能充分考虑局部动态复杂度的差异。其解决方案的关键在于引入一个参数无关、可解释且直接从运动速度短时谱中计算得出的局部动态复杂度度量——运动频谱描述符(Motion Spectral Descriptor, MSD),并以此构建复杂度感知的掩码机制:MSD用于指导训练阶段的内容聚焦掩码策略,提供自注意力中的频谱相似性先验,并在迭代解码过程中调节token级采样。这一设计使模型在动态复杂运动上的生成误差显著降低,同时整体FID指标在HumanML3D和KIT-ML数据集上得到提升,验证了尊重局部运动复杂度是改进掩码运动生成的有效设计原则。

链接: https://arxiv.org/abs/2603.29655
作者: Pengfei Zhou,Xiangyue Zhang,Xukun Shen,Yong Hu
机构: Beihang University(北京航空航天大学); Wuhan University(武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Masked generative models have become a strong paradigm for text-to-motion synthesis, but they still treat motion frames too uniformly during masking, attention, and decoding. This is a poor match for motion, where local dynamic complexity varies sharply over time. We show that current masked motion generators degrade disproportionately on dynamically complex motions, and that frame-wise generation error is strongly correlated with motion dynamics. Motivated by this mismatch, we introduce the Motion Spectral Descriptor (MSD), a simple and parameter-free measure of local dynamic complexity computed from the short-time spectrum of motion velocity. Unlike learned difficulty predictors, MSD is deterministic, interpretable, and derived directly from the motion signal itself. We use MSD to make masked motion generation complexity-aware. In particular, MSD guides content-focused masking during training, provides a spectral similarity prior for self-attention, and can additionally modulate token-level sampling during iterative decoding. Built on top of masked motion generators, our method, DynMask, improves motion generation most clearly on dynamically complex motions while also yielding stronger overall FID on HumanML3D and KIT-ML. These results suggest that respecting local motion complexity is a useful design principle for masked motion generation. Project page: this https URL
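MSD 的精确定义见原文;按摘要中"参数无关、由运动速度短时谱直接计算"的描述,下面给出一个示意实现:对关节位置序列差分得到速度,加窗 FFT 后以高频能量占比刻画局部动态复杂度(窗口大小与高/低频分界为示意性假设):

```python
import numpy as np

def motion_spectral_descriptor(positions, win=16, hop=8):
    # positions: (T, J) 关节坐标序列; 返回每个时间窗口的高频能量占比
    vel = np.diff(positions, axis=0)                 # 速度 = 一阶差分
    window = np.hanning(win)
    scores = []
    for start in range(0, len(vel) - win + 1, hop):
        seg = vel[start:start + win] * window[:, None]
        mag = np.abs(np.fft.rfft(seg, axis=0)) ** 2  # 短时功率谱
        total = mag.sum() + 1e-8
        scores.append(mag[win // 4:].sum() / total)  # 高频段能量占比(示意)
    return np.array(scores)

t = np.linspace(0, 4 * np.pi, 128)
slow = np.stack([np.sin(t), np.cos(t)], axis=1)      # 平滑慢动作
fast = slow + 0.3 * np.sin(20 * t)[:, None]          # 叠加高频抖动
print(motion_spectral_descriptor(slow).mean(),
      motion_spectral_descriptor(fast).mean())
```

高频抖动越多的片段得分越高,即可按摘要所述用其指导掩码比例、注意力先验与解码采样。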

[CV-40] MacTok: Robust Continuous Tokenization for Image Generation

【速读】:该论文旨在解决连续图像分词器(continuous image tokenizer)在使用较少令牌(token)时常见的后验崩溃(posterior collapse)问题,即编码器无法将信息丰富的特征编码到压缩的潜在空间中。解决方案的关键在于提出MacTok——一种基于掩码增强的一维连续分词器,其核心创新包括:1)随机掩码用于正则化潜在学习,2)DINO引导的语义掩码强调图像中的重要区域,迫使模型从不完整的视觉证据中编码鲁棒语义,3)结合全局与局部表示对齐机制,在高度压缩的一维潜在空间中保留丰富的判别信息。这一方法有效防止了后验崩溃,同时实现了高效且高保真的图像分词,显著减少令牌数量(最多降低64倍),并在ImageNet上取得优异的生成FID(gFID)性能。

链接: https://arxiv.org/abs/2603.29634
作者: Hengyu Zeng,Xin Gao,Guanghao Li,Yuxiang Yan,Jiaoyang Ruan,Junpeng Ma,Haoyu Albert Wang,Jian Pu
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256×256 and a state-of-the-art 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64×. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.
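摘要中 DINO 引导的语义掩码可以理解为按显著性对 token 位置加权采样掩码。下面给出这一思路的最小示意(显著性分数、掩码比例与采样方式均为示意性假设,并非 MacTok 的官方实现):

```python
import numpy as np

def semantic_mask(saliency, ratio=0.5, rng=None):
    # 按显著性分数加权、不放回地选取 ratio 比例的 token 位置进行掩码
    if rng is None:
        rng = np.random.default_rng()
    n = len(saliency)
    k = int(round(n * ratio))
    p = saliency / saliency.sum()
    idx = rng.choice(n, size=k, replace=False, p=p)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
saliency = np.concatenate([np.full(32, 4.0), np.full(32, 1.0)])  # 前 32 个 token 更显著
hits = np.zeros(64)
for _ in range(500):
    hits += semantic_mask(saliency, ratio=0.25, rng=rng)
print("高显著区平均命中:", hits[:32].mean(), "低显著区平均命中:", hits[32:].mean())
```

显著区域被更频繁地掩码,从而迫使编码器从不完整的视觉证据中恢复关键语义,与摘要描述的训练目标一致。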

[CV-41] Self-Supervised Federated Learning under Data Heterogeneity for Label-Scarce Diatom Classification

【速读】:该论文旨在解决去中心化且数据异构场景下标签稀缺的视觉分类问题,尤其关注站点间部分重叠类集(partially overlapping class sets)带来的挑战。现有自监督联邦学习(Self-Supervised Federated Learning, SSFL)方法通常假设预训练与微调阶段的数据异构模式一致,且当前划分方案难以生成纯净的类不相交数据设置,限制了对真实世界标签空间异构性的可控模拟。其关键解决方案是提出一种名为PreDi的划分策略,将标签空间异构性解耦为两个正交维度——类别流行度(class Prevalence)和类集大小差异(class-set size Disparity),从而实现对二者独立影响的可控分析;并基于此设计了基于流行度的个性化加权联邦学习(PreP-WFL),通过自适应增强低流行度类别的表示来缓解因类别分布不均导致的性能下降。实验证明,该框架在多种异构条件下均优于本地训练,且流行度主导下游任务性能,而PreP-WFL能有效提升罕见类识别能力,尤其在低流行度场景中收益显著。

链接: https://arxiv.org/abs/2603.29633
作者: Mingkun Tan,Xilu Wang,Michael Kloster,Tim W. Nattkemper
机构: University of Bielefeld (比勒费尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:Label-scarce visual classification under decentralized and heterogeneous data is a fundamental challenge in pattern recognition, especially when sites exhibit partially overlapping class sets. While self-supervised federated learning (SSFL) offers a promising solution, existing studies commonly assume the same data heterogeneity pattern throughout pre-training and fine-tuning. Moreover, current partitioning schemes often fail to generate pure partially class-disjoint data settings, limiting controllable simulation of real-world label-space heterogeneity. In this work, we introduce SSFL for diatom classification as a representative real-world instance and systematically investigate stage-specific data heterogeneity. We study cross-site variation in unlabeled data volume during pre-training and label-space misalignment during downstream fine-tuning. To study the latter in a controllable setting, we propose PreDi, a partitioning scheme that disentangles label-space heterogeneity into two orthogonal dimensions, namely class Prevalence and class-set size Disparity, enabling separate analysis of their effects. Guided by the resulting insights, we further propose PreP-WFL (Prevalence-based Personalized Weighted Federated Learning) to adaptively strengthen rare-class representations in low-prevalence scenarios. Extensive experiments show that SSFL consistently outperforms local-only training under both homogeneous and heterogeneous settings. The pronounced heterogeneity in unlabeled data volume is associated with improved representation pre-training, whereas under label-space heterogeneity, prevalence dominates performance and disparity has a smaller effect. PreP-WFL effectively mitigates this degradation, with gains increasing as prevalence decreases. These findings provide a mechanistic basis for characterizing label-space heterogeneity in decentralized recognition systems.

[CV-42] BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在遥感(Remote Sensing, RS)数据上性能受限的问题,其根本原因在于缺乏大规模、多传感器且具有多样化文本标注的遥感图像-文本数据集。现有数据集主要局限于航空红绿蓝(Red-Green-Blue, RGB)影像,且文本描述短小或弱关联,标注类型单一。为应对这一挑战,作者提出了一个包含464,044对配准的Sentinel-1合成孔径雷达(SAR)与Sentinel-2多光谱图像的数据集,共960万条文本注释,涵盖地理锚定的陆地利用/土地覆盖(Land-Use/Land-Cover, LULC)描述、面向不同任务的视觉问答对以及用于边界框预测的指代表达检测指令。该数据集的关键创新在于实现了多源遥感数据与丰富文本语义的高质量对齐,并通过人工验证的基准划分推动了遥感场景下VLM的指令驱动学习,实验证明其能显著提升模型在复杂LULC分类等任务上的表现。

链接: https://arxiv.org/abs/2603.29630
作者: Johann-Ludwig Herzog,Mathis Jürgen Adler,Leonard Hackel,Yan Shu,Angelos Zavras,Ioannis Papoutsis,Paolo Rota,Begüm Demir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: For details, see this https URL

点击查看摘要

Abstract:Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually-verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.

[CV-43] Unify-Agent : A Unified Multimodal Agent for World-Grounded Image Synthesis

【速读】:该论文旨在解决统一多模态模型在真实世界图像生成中面临的局限性,即其主要依赖冻结的参数化知识,难以有效处理长尾和知识密集型概念。解决方案的关键在于引入代理建模(agentic modeling),提出Unify-Agent——一个用于世界锚定图像合成的统一多模态代理系统,将图像生成重构为包含提示理解、多模态证据搜索、基于事实的重描述(grounded recaptioning)和最终合成的代理流程,并通过构建143K条高质量代理轨迹数据集实现对整个代理生成过程的有效监督。

链接: https://arxiv.org/abs/2603.29620
作者: Shuang Chen,Quanxin Shou,Hangting Chen,Yucheng Zhou,Kaituo Feng,Wenbo Hu,Yi-Fan Zhang,Yunlong Lin,Wenxuan Huang,Mingyang Song,Dasen Dai,Bolin Jiang,Manyuan Zhang,Shi-Xue Zhang,Zhengkai Jiang,Lucas Wang,Zhao Zhong,Yu Cheng,Nanyun Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Project Page: this https URL

点击查看摘要

Abstract:Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.

[CV-44] Video-Oasis: Rethinking Evaluation of Video Understanding

【速读】:该论文旨在解决当前视频理解(video understanding)研究中评估基准缺乏系统性诊断能力的问题,即现有基准难以区分性能提升究竟源于视觉感知、语言推理还是知识先验。其关键解决方案是提出 Video-Oasis,一个可持续的诊断工具集,用于系统评估现有评价方法并提炼时空挑战。通过分析发现:54% 的基准样本可在无视觉输入或时间上下文的情况下被解决,而剩余样本上最先进模型的表现仅略高于随机猜测;基于此,论文进一步识别出影响鲁棒视频理解的关键算法设计因素,为未来研究提供可操作的指导原则。

链接: https://arxiv.org/abs/2603.29616
作者: Geuntaek Lim,Minho Shim,Sungjune Park,Jaeyun Lee,Inwoong Lee,Taeoh Kim,Dongyoon Wee,Yukyung Choi
机构: NAVER Cloud(NAVER云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at this https URL.
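
按照摘要中"54% 的基准样本无需视觉输入即可解答"的诊断思路,下面给出一个最小化的盲测(blind-baseline)示意脚本:统计纯文本"模型"在看不到视频的情况下仍能答对的样本比例。数据与模型均为演示用的假设性构造,并非 Video-Oasis 的实现:

```python
def blind_solvable_fraction(samples, text_only_model):
    """Estimate what fraction of benchmark samples a model can answer
    WITHOUT the video, in the spirit of the Video-Oasis diagnostic.
    Each sample is (question, options, answer)."""
    solvable = sum(text_only_model(q, opts) == ans for q, opts, ans in samples)
    return solvable / len(samples)

def longest_option_model(question, options):
    # Toy language prior: always pick the longest option, ignoring the video.
    return max(options, key=len)

samples = [
    ("What happens first in the video?",
     ["the chef chops the vegetables before heating the pan", "no"],
     "the chef chops the vegetables before heating the pan"),
    ("What color is the car?", ["red", "blue"], "red"),
]
print(blind_solvable_fraction(samples, longest_option_model))  # 0.5
```

第一个样本被纯语言先验答对(无需视频),按该诊断应标记为"盲测可解";第二个样本则必须看视频。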

[CV-45] Bioinspired123D: Generative 3D Modeling System for Bioinspired Structures

【速读】:该论文旨在解决生成式 AI 在科学设计中进行文本到三维(text-to-3D)建模时面临的可控性差和计算成本高的问题。现有方法多依赖网格(mesh)、体素(voxel)或点云(point cloud)等密集视觉表示,训练成本高且难以精确控制。其解决方案的关键在于提出 Bioinspired123D——一个轻量级、模块化的“代码即几何”(code-as-geometry)流水线,通过参数化程序直接生成可制造的 3D 结构。核心组件 Bioinspired3D 是一个微调后的紧凑语言模型,能将自然语言设计提示转化为编码平滑生物启发几何形状的 Blender Python 脚本,并结合基于图的代理框架与多模态检索增强生成及视觉-语言模型批判机制,实现脚本的迭代评估、批评与修复,从而在显著减少参数量和计算资源的前提下,大幅提升生成质量与可控性。

链接: https://arxiv.org/abs/2603.29592
作者: Rachel K. Luu,Markus J. Buehler
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative AI has made rapid progress in text, image, and video synthesis, yet text-to-3D modeling for scientific design remains particularly challenging due to limited controllability and high computational cost. Most existing 3D generative methods rely on meshes, voxels, or point clouds which can be costly to train and difficult to control. We introduce Bioinspired123D, a lightweight and modular code-as-geometry pipeline that generates fabricable 3D structures directly through parametric programs rather than dense visual representations. At the core of Bioinspired123D is Bioinspired3D, a compact language model finetuned to translate natural language design cues into Blender Python scripts encoding smooth, biologically inspired geometries. We curate a domain-specific dataset of over 4,000 bioinspired and geometric design scripts spanning helical, cellular, and tubular motifs with parametric variability. The dataset is expanded and validated through an automated LLM-driven, Blender-based quality control pipeline. Bioinspired3D is then embedded in a graph-based agentic framework that integrates multimodal retrieval-augmented generation and a vision-language model critic to iteratively evaluate, critique, and repair generated scripts. We evaluate performance on a new benchmark for 3D geometry script generation and show that Bioinspired123D demonstrates a near fourfold improvement over its non-finetuned base model, while also outperforming substantially larger state-of-the-art language models despite using far fewer parameters and compute. By prioritizing code-as-geometry representations, Bioinspired123D enables compute-efficient, controllable, and interpretable text-to-3D generation, lowering barriers to AI driven scientific discovery in materials and structural design.

[CV-46] FlowID : Enhancing Forensic Identification with Latent Flow-Matching Models

【速读】:该论文旨在解决暴力致死案例中受损面部图像的识别难题,即在无法立即进行身份鉴定的情况下,如何通过图像处理技术恢复足够用于识别的面部信息。传统图像编辑工具流程繁琐且效果不佳,难以满足实际需求。解决方案的关键在于提出FlowID方法,其核心包括两个创新:一是单图像微调(single-image fine-tuning),使生成式AI模型适应分布外的损伤面部;二是基于注意力机制的掩码策略(attention-based masking),精准定位并修复损伤区域,同时保留关键身份特征。该方法在保持低内存消耗的同时显著优于现有开源方案,适用于本地部署以保障数据隐私。

链接: https://arxiv.org/abs/2603.29591
作者: Jules Ripoll,David Bertoin,Alasdair Newson,Charles Dossal,Jose Pablo Baraybar
机构: INSA Toulouse (法国国家科学与技术学院图卢兹分校); La Sorbonne Université (索邦大学); International Committee of the Red Cross (红十字国际委员会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Every day, many people die under violent circumstances, whether from crimes, war, migration, or climate disasters. Medico-legal and law enforcement institutions document many portraits of the deceased for evidence, but cannot immediately carry out identification on them. While traditional image editing tools can process these photos for public release, the workflow is lengthy and produces suboptimal results. In this work, we leverage advances in image generation models, which can now produce photorealistic human portraits, to introduce FlowID, an identity-preserving facial reconstruction method. Our approach combines single-image fine-tuning, which adapts the generative model to out-of-distribution injured faces, with attention-based masking that localizes edits to damaged regions while preserving identity-critical features. Together, these components enable the removal of artifacts from violent death while retaining sufficient identity information to support identification. To evaluate our method, we introduce InjuredFaces, a novel benchmark for identity-preserving facial reconstruction under severe facial damage. Beyond serving as an evaluation tool for this work, InjuredFaces provides a standardized resource for the community to study and compare methods addressing facial reconstruction in extreme conditions. Experimental results show that FlowID outperforms state-of-the-art open-source methods while maintaining low memory requirements, making it suitable for local deployment without compromising data privacy.

[CV-47] Emotion Diffusion Classifier with Adaptive Margin Discrepancy Training for Facial Expression Recognition

【速读】:该论文旨在解决当前基于深度学习的面部表情识别(Facial Expression Recognition, FER)模型在面对分布偏移(distribution shifts)时易受干扰、泛化能力弱的问题,尤其是判别式分类器倾向于学习数据中的“捷径”特征,导致对抗鲁棒性差。其解决方案的关键在于引入一种条件生成扩散模型——情绪扩散分类器(Emotion Diffusion Classifier, EmoDC),并通过提出自适应边缘差异训练(Adaptive Margin Discrepancy Training, AMDiT)来优化模型性能:该方法动态调整每样本的预测误差边界,强制正确类别与错误类别之间的噪声预测误差保持最小边际,从而增强模型对噪声和模糊等扰动的鲁棒性,并显著提升在多个公开数据集上的识别准确率。

链接: https://arxiv.org/abs/2603.29578
作者: Rongkang Dong,Cuixin Yang,Cong Zhang,Yushen Zuo,Kin-Man Lam
机构: The Hong Kong Polytechnic University (香港理工大学); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial Expression Recognition (FER) is essential for human-machine interaction, as it enables machines to interpret human emotions and internal states from facial affective behaviors. Although deep learning has significantly advanced FER performance, most existing deep-learning-based FER methods rely heavily on discriminative classifiers for fast predictions. These models tend to learn shortcuts and are vulnerable to even minor distribution shifts. To address this issue, we adopt a conditional generative diffusion model and introduce the Emotion Diffusion Classifier (EmoDC) for FER, which demonstrates enhanced adversarial robustness. However, retraining EmoDC using standard strategies fails to penalize incorrect categorical descriptions, leading to suboptimal recognition performance. To improve EmoDC, we propose margin-based discrepancy training, which encourages accurate predictions when conditioned on correct categorical descriptions and penalizes predictions conditioned on mismatched ones. This method enforces a minimum margin between noise-prediction errors for correct and incorrect categories, thereby enhancing the model’s discriminative capability. Nevertheless, using a fixed margin fails to account for the varying difficulty of noise prediction across different images, limiting its effectiveness. To overcome this limitation, we propose Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin for each sample. Extensive experiments show that AMDiT significantly improves the accuracy of EmoDC over the Base model with standard denoising diffusion training on the RAF-DB basic subset, the RAF-DB compound subset, SFEW-2.0, and AffectNet, in 100-step evaluations. Additionally, EmoDC outperforms state-of-the-art discriminative classifiers in terms of robustness against noise and blur.
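
摘要中自适应边缘差异训练(AMDiT)的核心约束——正确类别条件下的噪声预测误差应比错误类别低出一个逐样本边缘——可以用一个铰链式损失来示意。其中"边缘随正确类误差缩放"的自适应规则是笔者为演示所做的假设,并非论文公式:

```python
import numpy as np

def amdit_loss(err_correct, err_wrong, base_margin=0.1):
    """Adaptive-margin discrepancy sketch: penalize unless the noise-
    prediction error under a WRONG class condition exceeds the error
    under the correct class by a per-sample margin. Scaling the margin
    with the correct-class error is an illustrative adaptation rule."""
    margin = base_margin * (1.0 + err_correct)  # harder sample -> larger margin
    return np.maximum(0.0, err_correct + margin - err_wrong)

# Easy sample: wrong-class error already far above the correct-class error.
print(amdit_loss(0.2, 0.8))   # 0.0
# Hard sample: the two errors are too close, so the loss pushes them apart.
print(amdit_loss(0.2, 0.25))  # ~0.07
```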

[CV-48] Turbo4DGen: Ultra-Fast Acceleration for 4D Generation

【速读】:该论文旨在解决4D生成(即动态三维内容生成)中因独特时空相机运动(spatio-camera-motion, SCM)注意力机制导致的计算与内存开销过大问题,该问题常引发内存溢出(OOM)和生成时间过长。解决方案的关键在于提出Turbo4DGen框架,其核心创新包括:引入空间-时间缓存机制以持久复用去噪过程中的中间注意力特征,结合动态语义感知注意力剪枝策略减少冗余计算,并设计自适应SCM链路跳过调度器以进一步优化计算流程,从而在不损失生成质量的前提下实现平均9.7倍的加速效果。

链接: https://arxiv.org/abs/2603.29572
作者: Yuanbin Man,Ying Huang,Zhile Ren,Miao Yin
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:4D generation, or dynamic 3D content generation, integrates spatial, temporal, and view dimensions to model realistic dynamic scenes, playing a foundational role in advancing world models and physical AI. However, maintaining long-chain consistency across both frames and viewpoints through the unique spatio-camera-motion (SCM) attention mechanism introduces substantial computational and memory overhead, often leading to out-of-memory (OOM) failures and prohibitive generation times. To address these challenges, we propose Turbo4DGen, an ultra-fast acceleration framework for diffusion-based multi-view 4D content generation. Turbo4DGen introduces a spatiotemporal cache mechanism that persistently reuses intermediate attention across denoising steps, combined with dynamically semantic-aware attention pruning and an adaptive SCM chain bypass scheduler, to drastically reduce redundant SCM attention computation. Our experimental results show that Turbo4DGen achieves an average 9.7× speedup without quality degradation on the ObjaverseDy and Consistent4D datasets. To the best of our knowledge, Turbo4DGen is the first dedicated acceleration framework for 4D generation.
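
摘要中"跨去噪步持久复用中间注意力"的缓存思想,可以用如下草图示意:若某个注意力块的输入相对上次缓存时变化很小,则直接返回缓存输出、跳过重算。相对变化阈值等均为演示用假设,并非论文的自适应调度器:

```python
import numpy as np

class AttentionCache:
    """Sketch of cross-step attention reuse: if a block's input changed
    little since the cached denoising step, return the cached output
    instead of recomputing. The tolerance is an illustrative knob."""
    def __init__(self, tol=0.05):
        self.tol, self.inputs, self.outputs = tol, {}, {}

    def attend(self, key, q, compute_fn):
        if key in self.inputs:
            prev = self.inputs[key]
            if np.linalg.norm(q - prev) <= self.tol * (np.linalg.norm(prev) + 1e-8):
                return self.outputs[key]  # reuse, skip the expensive attention
        out = compute_fn(q)
        self.inputs[key], self.outputs[key] = q.copy(), out
        return out

calls = []
def expensive_attention(q):
    calls.append(1)
    return q * 2.0

cache = AttentionCache(tol=0.05)
q = np.ones(4)
cache.attend("block0", q, expensive_attention)          # computed
cache.attend("block0", q * 1.01, expensive_attention)   # reused (1% drift)
cache.attend("block0", q * 2.0, expensive_attention)    # drift too large, recomputed
print(len(calls))  # 2
```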

[CV-49] Generating Key Postures of Bharatanatyam Adavus with Pose Estimation

【速读】:该论文旨在解决传统舞蹈形式(以印度古典舞Bharatanatyam为例)在数字时代中因缺乏精确姿态生成能力而导致的文化传承与数字化保存难题。其核心挑战在于如何在保持动作的解剖学准确性和风格一致性的同时,实现高保真度的姿态合成。解决方案的关键在于提出一种融合姿态估计模块的姿势感知生成框架,通过基于关键点的损失函数和姿态一致性约束作为监督信号,引导生成模型(包括条件生成对抗网络cGAN和条件扩散模型)在给定姿态类别标签条件下输出符合真实舞蹈结构的姿势,从而显著提升生成结果的质量、真实感与文化忠实度。

链接: https://arxiv.org/abs/2603.29570
作者: Jagadish Kashinath Kamble,Jayanta Mukhopadhyay,Debaditya Roy,Partha Pratim Das
机构: Indian Institute of Technology Kharagpur (印度理工学院克哈格普尔分校); Ashoka University (阿肖卡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in ICVGIP, 2025

点击查看摘要

Abstract:Preserving intangible cultural dances rooted in centuries of tradition and governed by strict structural and symbolic rules presents unique challenges in the digital era. Among these, Bharatanatyam, a classical Indian dance form, stands out for its emphasis on codified adavus and precise key postures. Accurately generating these postures is crucial not only for maintaining anatomical and stylistic integrity, but also for enabling effective documentation, analysis, and transmission to broader global audiences through digital means. We propose a pose-aware generative framework integrated with a pose estimation module, guided by keypoint-based loss and pose consistency constraints. These supervisory signals ensure anatomical accuracy and stylistic integrity in the synthesized outputs. We evaluate four configurations: standard conditional generative adversarial network (cGAN), cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. Each model is conditioned on key posture class labels and optimized to maintain geometric structure. In both cGAN and conditional diffusion settings, the integrated pose guidance aligns generated poses with ground-truth keypoint structures, promoting cultural fidelity. Our results demonstrate that incorporating pose supervision significantly enhances the quality, realism, and authenticity of generated Bharatanatyam postures. This framework provides a scalable approach for the digital preservation, education, and dissemination of traditional dance forms, enabling high-fidelity generation without compromising cultural precision. Code is available at this https URL.

[CV-50] Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge CVPR2026

【速读】:该论文旨在解决在资源受限的移动设备上部署生成式人工智能(Generative AI)任务时面临的高内存占用和计算开销问题,尤其是针对大型视觉模型(Large Vision Models, LVMs)与低秩适配器(Low-Rank Adapters, LoRAs)组合使用时导致的冗余存储和运行时性能瓶颈。解决方案的关键在于提出一种统一框架,将LoRA权重作为运行时输入而非编译进模型图中,从而实现无需重新编译即可动态切换任务;同时引入QUAD(Quantization with Unified Adaptive Distillation)量化感知训练策略,使多个LoRA适配器共享同一量化配置,在保证视觉质量的前提下显著降低内存占用和延迟,实验证明其可带来最高6倍的内存压缩和4倍的延迟优化。

链接: https://arxiv.org/abs/2603.29535
作者: Sowmya Vajrala,Aakash Parmar,Prasanna R,Sravanth Kodavanti,Manjunath Arveti,Srinivas Soumitri Miriyala,Ashok Senapati
机构: Samsung Research Institute Bangalore, India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the Mobile AI Workshop, CVPR 2026

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile separate model binaries for each LoRA plus a copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to 6x reduction in memory footprint and up to 4x latency improvement, while maintaining high visual quality across multiple GenAI tasks.
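
"将 LoRA 权重作为运行时输入而非编译进模型图"的核心思路可用下面的草图说明:前向传播按低秩形式计算增量 `x @ A @ B`,从不物化合并后的权重矩阵,因此适配器可按请求热切换、无需重新编译。这是对 LoRA 前向的通用示意,并非论文的 NPU 运行时实现:

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Runtime LoRA sketch: y = x @ (W + scale * A @ B), computed without
    merging the adapter into W, so adapters can be swapped per task.
    Shapes: x (n, d_in), W (d_in, d_out), A (d_in, r), B (r, d_out)."""
    return x @ W + scale * ((x @ A) @ B)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
W = rng.normal(size=(8, 4))
A, B = rng.normal(size=(8, 2)), rng.normal(size=(2, 4))

merged = x @ (W + A @ B)  # what a compiled-in (merged) adapter would compute
print(np.allclose(lora_forward(x, W, A, B), merged))  # True
```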

[CV-51] Transmittance-Guided Structure-Texture Decomposition for Nighttime Image Dehazing

【速读】:该论文旨在解决雾霾环境下夜间图像质量严重退化的问题,包括可见度降低、色彩失真和对比度下降,这些退化由大气散射与悬浮颗粒吸收以及人工光源非均匀照明共同导致。现有方法通常仅针对部分问题(如光晕抑制或亮度增强)进行优化,未能协同处理全部退化因素。解决方案的关键在于提出一个两阶段的去雾框架:第一阶段通过边界约束的初始透射率图估计与区域自适应补偿机制,结合YUV空间中的二次高斯滤波以估计空间变化的大气光图,进而利用改进的夜间成像模型生成初步去雾结果;第二阶段引入STAR-YUV分解模型将去雾图像分为结构层与纹理层,在YUV空间中分别对结构层进行伽马校正与MSRCR色彩恢复以补偿光照并纠正色偏,对纹理层采用拉普拉斯-高斯滤波增强细节,并通过两阶段融合策略(先非线性Retinex融合增强层,再线性混合初始去雾结果)获得最终输出,从而实现对多维度退化因素的联合优化。

链接: https://arxiv.org/abs/2603.29507
作者: Francesco Moretti,Giulia Bianchi,Andrea Gallo
机构: Maharaja Agrasen University (玛哈拉贾·阿格拉斯恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nighttime images captured under hazy conditions suffer from severe quality degradation, including low visibility, color distortion, and reduced contrast, caused by the combined effects of atmospheric scattering, absorption by suspended particles, and non-uniform illumination from artificial light sources. While existing nighttime dehazing methods have achieved partial success, they typically address only a subset of these issues, such as glow suppression or brightness enhancement, without jointly tackling the full spectrum of degradation factors. In this paper, we propose a two-stage nighttime image dehazing framework that integrates transmittance correction with structure-texture layered optimization. In the first stage, we introduce a novel transmittance correction method that establishes boundary-constrained initial transmittance maps and subsequently applies region-adaptive compensation and normalization based on whether image regions correspond to light source areas. A quadratic Gaussian filtering scheme operating in the YUV color space is employed to estimate the spatially varying atmospheric light map. The corrected transmittance map and atmospheric light map are then used in conjunction with an improved nighttime imaging model to produce the initial dehazed image. In the second stage, we propose a STAR-YUV decomposition model that separates the dehazed image into structure and texture layers within the YUV color space. Gamma correction and MSRCR-based color restoration are applied to the structure layer for illumination compensation and color bias correction, while Laplacian-of-Gaussian filtering is applied to the texture layer for detail enhancement. A novel two-phase fusion strategy, comprising nonlinear Retinex-based fusion of the enhanced layers followed by linear blending with the initial dehazing result, yields the final output.
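
摘要所基于的成像模型 I = J·t + A·(1−t)(其中 A 为空间变化的大气光图、t 为透射率)可按如下方式反解出场景辐射 J。这只是该模型的标准反演示意(对 t 做下限截断以避免浓雾区域的数值放大),并非论文完整的两阶段流水线:

```python
import numpy as np

def dehaze(I, t, A, t_min=0.1):
    """Invert the imaging model I = J*t + A*(1 - t), with a spatially
    varying atmospheric-light map A and transmittance map t, to recover
    the scene radiance J. Clamping t keeps the division stable where
    transmission is near zero."""
    t = np.maximum(t, t_min)
    return (I - A * (1.0 - t)) / t

# Round trip: synthesize haze from a known scene, then recover it.
J = np.array([[0.2, 0.8], [0.5, 0.3]])
t = np.array([[0.9, 0.6], [0.7, 0.8]])
A = np.full_like(J, 0.95)                  # spatially uniform here for brevity
I = J * t + A * (1.0 - t)
print(np.allclose(dehaze(I, t, A), J))     # True
```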

[CV-52] VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference CVPR2026

【速读】:该论文旨在解决基于Transformer的视频模型在长视频理解与生成任务中因自注意力机制的二次复杂度带来的计算效率瓶颈问题。现有稀疏注意力方法虽能提升效率,但通常采用粗粒度模式,导致冗余计算且性能不佳。其解决方案的关键在于提出VecAttention框架,该框架基于观察发现视频注意力图具有显著的垂直向量稀疏特性,并进一步证明该垂直向量模式相比现有粗粒度稀疏模式能实现更优的准确率-稀疏度权衡;通过轻量级重要向量选择机制最小化内存访问开销,并结合优化的向量稀疏注意力核,从而在保持接近全连接注意力准确率的同时,实现高达2.65倍的推理加速。

链接: https://arxiv.org/abs/2603.29494
作者: Anmin Liu,Ruixuan Yang,Huiqiang Jiang,Bin Lin,Minmin Sun,Yong Li,Chen Zhang,Tao Xie
机构: Peking University (北京大学); Fudan University (复旦大学); Alibaba Group (阿里巴巴集团); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention. Our code is available at this https URL.
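
"垂直向量稀疏"模式可用如下草图示意:按注意力图每一列(即每个 key)在所有 query 上的总注意力质量打分,仅保留得分最高的列并重新归一化。注意此处的选择是精确的稠密计算,论文实际使用轻量估计器加融合核来避免先算完整注意力图(示意实现):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vertical_sparse_attention(Q, K, V, keep_ratio=0.25):
    """Vertical-vector sparsity sketch: score each key (a COLUMN of the
    attention map) by its total attention mass across all queries, keep
    only the top columns, renormalize, and attend over the kept subset."""
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    k = max(1, int(round(keep_ratio * K.shape[0])))
    cols = np.argsort(attn.sum(axis=0))[-k:]        # important key vectors
    sub = attn[:, cols]
    sub /= sub.sum(axis=-1, keepdims=True)          # renormalize kept mass
    return sub @ V[cols]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
# With keep_ratio=1.0 the sparse path reduces exactly to full attention.
full = softmax(Q @ K.T / np.sqrt(8)) @ V
print(np.allclose(vertical_sparse_attention(Q, K, V, keep_ratio=1.0), full))  # True
```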

[CV-53] Square Superpixel Generation and Representation Learning via Granular Ball Computing

【速读】:该论文旨在解决现有超像素(superpixel)算法生成不规则形状区域导致难以与卷积等规则操作对齐的问题,从而限制了其在深度学习流水线中的端到端优化和并行化实现。解决方案的关键在于提出一种基于粒球计算(granular-ball computing)自适应表示特性的正方形超像素生成方法:通过多尺度正方形块近似超像素,避免不规则形状带来的计算与实现复杂性,同时基于像素强度相似性计算纯度得分以筛选高质量块,最终生成的正方形超像素可直接作为图神经网络(GNN)中的节点或视觉Transformer(ViT)中的token,支持多尺度信息聚合与结构化视觉表征,显著提升了下游任务性能。

链接: https://arxiv.org/abs/2603.29460
作者: Shuyin Xia,Meng Yang,Dawei Dai,Fan Chen,Shilin Zhao,Junwei Han,Xinbo Gao,Guoyin Wang,Wen Lu
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学); Xidian University (西安电子科技大学); Chongqing Normal University (重庆师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Superpixels provide a compact region-based representation that preserves object boundaries and local structures, and have therefore been widely used in a variety of vision tasks to reduce computational cost. However, most existing superpixel algorithms produce irregularly shaped regions, which are not well aligned with regular operators such as convolutions. Consequently, superpixels are often treated as an offline preprocessing step, limiting parallel implementation and hindering end-to-end optimization within deep learning pipelines. Motivated by the adaptive representation and coverage property of granular-ball computing, we develop a square superpixel generation approach. Specifically, we approximate superpixels using multi-scale square blocks to avoid the computational and implementation difficulties induced by irregular shapes, enabling efficient parallel processing and learnable feature extraction. For each block, a purity score is computed based on pixel-intensity similarity, and high-quality blocks are selected accordingly. The resulting square superpixels can be readily integrated as graph nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), facilitating multi-scale information aggregation and structured visual representation. Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.
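
"多尺度正方形块 + 基于像素强度相似性的纯度筛选"的思想,可以用一个四叉树式的递归草图来示意:块足够"纯"(或达到最小尺寸)就保留,否则继续四分。纯度定义与阈值均为演示用假设,论文还在所选块上进一步做表示学习:

```python
import numpy as np

def purity(block, tol=10.0):
    """Illustrative purity: fraction of pixels within tol of the block mean."""
    return float(np.mean(np.abs(block - block.mean()) <= tol))

def square_superpixels(img, x0=0, y0=0, size=None, min_purity=0.95, min_size=2):
    """Quadtree-style sketch of multi-scale square superpixels: split a
    square block until it is pure enough (or minimal); yields (x, y, size)."""
    if size is None:
        size = img.shape[0]
    block = img[y0:y0 + size, x0:x0 + size]
    if size <= min_size or purity(block) >= min_purity:
        return [(x0, y0, size)]
    half = size // 2
    return [sq for dy in (0, half) for dx in (0, half)
            for sq in square_superpixels(img, x0 + dx, y0 + dy, half,
                                         min_purity, min_size)]

img = np.zeros((8, 8)); img[:, 4:] = 100.0  # two flat regions, sharp boundary
print(len(square_superpixels(img)))          # 4: one split, then all quadrants pure
```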

[CV-54] FedDBP: Enhancing Federated Prototype Learning with Dual-Branch Features and Personalized Global Fusion

【速读】:该论文旨在解决异构联邦学习(Heterogeneous Federated Learning, HFL)中现有联邦原型学习(Federated Prototype Learning, FPL)方法在特征保真度(fidelity)与判别能力(discriminability)之间难以平衡,且受限于单一全局原型的问题。其解决方案的关键在于:在客户端设计双分支特征投影器(Dual-Branch feature projector),通过L2对齐与对比学习协同优化,确保本地特征兼具保真度与判别性;在服务器端提出个性化全局原型融合策略(Personalized global prototype fusion),利用费雪信息(Fisher information)识别局部原型的重要通道,从而增强原型的表达能力和适应性。

链接: https://arxiv.org/abs/2603.29455
作者: Ningzhi Gao,Siquan Huang,Leyu Shi,Ying Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated prototype learning (FPL), as a solution to heterogeneous federated learning (HFL), effectively alleviates the challenges of data and model heterogeneity. However, existing FPL methods fail to balance the fidelity and discriminability of features, and are limited by a single global prototype. In this paper, we propose FedDBP, a novel FPL method to address the above issues. On the client-side, we design a Dual-Branch feature projector that employs L2 alignment and contrastive learning simultaneously, thereby ensuring both the fidelity and discriminability of local features. On the server-side, we introduce a Personalized global prototype fusion approach that leverages Fisher information to identify the important channels of local prototypes. Extensive experiments demonstrate the superiority of FedDBP over ten existing advanced methods.
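
服务器端"利用费雪信息识别局部原型重要通道"的融合,可示意为按归一化费雪信息做通道级加权平均:某客户端在某通道上的费雪信息越大,其原型在该通道上的贡献越大。具体加权形式为笔者的演示性假设,仅展示通道重要性的思想:

```python
import numpy as np

def fuse_prototypes(protos, fishers, eps=1e-8):
    """Personalized fusion sketch: weight each client's class prototype
    channel-wise by its normalized Fisher information, so channels a
    client estimates confidently dominate the fused prototype.
    protos, fishers: (n_clients, n_channels) arrays."""
    protos, fishers = np.asarray(protos, float), np.asarray(fishers, float)
    w = fishers / (fishers.sum(axis=0, keepdims=True) + eps)
    return (w * protos).sum(axis=0)

# Client 0 is confident on channel 0 only; client 1 on channel 1 only.
protos = [[1.0, 9.0], [5.0, 3.0]]
fishers = [[4.0, 0.0], [0.0, 2.0]]
print(fuse_prototypes(protos, fishers))  # ≈ [1. 3.]
```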

[CV-55] Few-shot Writer Adaptation via Multimodal In-Context Learning

【速读】:该论文旨在解决手写文本识别(Handwritten Text Recognition, HTR)模型在面对未见过的、风格特异的书写者时性能下降的问题,这类书写者因训练数据中代表性不足而难以被标准HTR模型准确识别。解决方案的关键在于提出一种受多模态上下文学习启发的新型上下文驱动HTR框架,该框架能够在推理阶段仅通过少量目标书写者的样本实现个性化适应,且无需任何参数更新或反向传播计算,从而显著降低计算开销并避免繁琐的超参数调优。

链接: https://arxiv.org/abs/2603.29450
作者: Tom Simon,Stephane Nicolas,Pierrick Tranouez,Clement Chatelain,Thierry Paquet
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.

[CV-56] NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification AAAI2026

【速读】:该论文旨在解决胆管癌(cholangiocarcinoma)中神经周围侵犯(perineural invasion, PNI)的非侵入性影像诊断难题,因其缺乏明确且一致的影像学判别标准而难以准确识别。解决方案的关键在于提出NeoNet——一个端到端的3D深度学习框架,其核心创新包括:(1) 利用肿瘤局部区域裁剪(Tumor-Localized ROI Crop, TLCR)算法提升区域聚焦能力;(2) 基于3D潜在扩散模型(Latent Diffusion Model, LDM)与ControlNet构建生成模块NeoGen,通过条件生成合成图像Patch实现数据集平衡至1:1比例;(3) 设计PNI-Attention网络(PattenNet),结合冻结的LDM编码器与专用3D双注意力块(Dual Attention Block, DAB),有效捕捉微弱的强度变化和空间模式以提升PNI预测性能。在5折交叉验证中,NeoNet达到最高AUC 0.7903,显著优于基线3D模型。

链接: https://arxiv.org/abs/2603.29449
作者: Youngung Han,Minkyung Cha,Kyeonghun Kim,Induk Um,Myeongbin Sho,Joo Young Bae,Jaewon Jung,Jung Hyeok Park,Seojun Lee,Nam-Joon Kim,Woo Kyoung Jeong,Won Jae Lee,Pa Hong,Ken Ying-Kai Liao,Hyuk-Jae Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures. Accepted for oral presentation at W3PHIAI Workshop, AAAI 2026

点击查看摘要

Abstract:Minimizing invasive diagnostic procedures to reduce the risk of patient injury and infection is a central goal in medical imaging. And yet, noninvasive diagnosis of perineural invasion (PNI), a critical prognostic factor involving infiltration of tumor cells along the surrounding nerve, remains challenging due to the lack of clear and consistent imaging criteria for identifying PNI. To address this challenge, we present NeoNet, an integrated end-to-end 3D deep learning framework for PNI prediction in cholangiocarcinoma that does not rely on predefined image features. NeoNet integrates three modules: (1) NeoSeg, utilizing a Tumor-Localized ROI Crop (TLCR) algorithm; (2) NeoGen, a 3D Latent Diffusion Model (LDM) with ControlNet, conditioned on anatomical masks to generate synthetic image patches, specifically balancing the dataset to a 1:1 ratio; and (3) NeoCls, the final prediction module. For NeoCls, we developed the PNI-Attention Network (PattenNet), which uses the frozen LDM encoder and specialized 3D Dual Attention Blocks (DAB) designed to detect subtle intensity variations and spatial patterns indicative of PNI. In 5-fold cross-validation, NeoNet outperformed baseline 3D models and achieved the highest performance with a maximum AUC of 0.7903.

[CV-57] EarthEmbeddingExplorer: A Web Application for Cross-Modal Retrieval of Global Satellite Images ICLR2026

【速读】:该论文旨在解决地球观测领域中高影响力基础模型和全球地球嵌入数据集难以转化为免费可访问工具的问题。其解决方案的关键在于提出 EarthEmbeddingExplorer,这是一个交互式网页应用程序,通过云原生软件架构将静态研究资产转变为动态、实用的工作流程,支持自然语言、视觉和地理定位的跨模态查询,并能从检索结果中提取科学洞察,从而实现预计算地球嵌入的民主化访问,助力研究人员从先进模型与数据归档无缝过渡到实际应用与分析。

链接: https://arxiv.org/abs/2603.29441
作者: Yijie Zheng,Weijie Wu,Bingyue Wu,Long Zhao,Guoqing Li,Mikolaj Czerkawski,Konstantin Klemmer
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院); University of Chinese Academy of Sciences (中国科学院大学); Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (中国科学院地理科学与资源研究所); Asterisk Labs; LGND AI, Inc.; University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026 Workshop ML4RS Tutorial Track (oral)

点击查看摘要

Abstract:While the Earth observation community has witnessed a surge in high-impact foundation models and global Earth embedding datasets, a significant barrier remains in translating these academic assets into freely accessible tools. This tutorial introduces EarthEmbeddingExplorer, an interactive web application designed to bridge this gap, transforming static research artifacts into dynamic, practical workflows for discovery. We will provide a comprehensive hands-on guide to the system, detailing its cloud-native software architecture, demonstrating cross-modal queries (natural language, visual, and geolocation), and showcasing how to derive scientific insights from retrieval results. By democratizing access to precomputed Earth embeddings, this tutorial empowers researchers to seamlessly transition from state-of-the-art models and data archives to real-world application and analysis. The web application is available at this https URL.

[CV-58] SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

【速读】:该论文旨在解决多视角图像输入下视觉-语言模型(Vision-Language Models, VLMs)在3D问答(3D QA)任务中因视觉token冗余导致的推理效率低下问题。现有方法在聚合多视图信息时会产生大量冗余token,严重受限于token预算,影响实时性。解决方案的关键在于提出SeGPruner框架,其核心由两个模块协同实现:一是基于注意力机制的语义感知token选择器(Saliency-aware Token Selector),用于保留语义显著的视觉token以保障关键对象证据;二是几何引导的token多样性增强模块(Geometry-aware Token Diversifier),通过结合语义相关性和三维几何距离,补充空间分布多样化的token以维持全局场景覆盖。二者共同作用,在极端压缩视觉token预算的同时,有效平衡了对象级证据与场景整体结构,从而在保持3D推理性能的前提下显著提升推理效率。

链接: https://arxiv.org/abs/2603.29437
作者: Wenli Li,Kai Zhao,Haoran Jiang,Enquan Yang,Yi Su,Dan Zeng
机构: Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
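
摘要中"语义显著性保留 + 几何多样性补充"的两阶段 token 选择可示意如下:先保留注意力显著性最高的 token,再贪心加入"显著性 × 到已选集合的最小三维距离"最大的 token,从而在保留对象证据的同时维持空间覆盖。打分形式为演示用假设,并非论文的精确准则:

```python
import numpy as np

def segprune(coords, saliency, n_keep, n_salient):
    """Two-stage token selection sketch: (1) keep the n_salient most
    attention-salient tokens; (2) greedily add tokens maximizing
    saliency * (min 3D distance to the kept set) until n_keep remain.
    Returns sorted indices of kept tokens."""
    keep = list(np.argsort(saliency)[-n_salient:])
    rest = [i for i in range(len(coords)) if i not in keep]
    while len(keep) < n_keep and rest:
        score = [saliency[i] * min(np.linalg.norm(coords[i] - coords[j])
                                   for j in keep) for i in rest]
        keep.append(rest.pop(int(np.argmax(score))))
    return [int(i) for i in sorted(keep)]

coords = np.array([[0, 0, 0], [0, 0, 1], [5, 0, 0], [5, 5, 0]])
saliency = np.array([0.9, 0.5, 0.1, 0.1])
# Token 0 is kept for saliency; the geometric stage then prefers the
# far-away token 3 over the near-duplicate token 1.
print(segprune(coords, saliency, n_keep=2, n_salient=1))  # [0, 3]
```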

[CV-59] Seeing the Evidence Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions CVPR2026 CVPR

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对经典视错觉图像时表现出的系统性偏差问题:即无论图像是否被反事实修改,VLMs 倾向于一致预测为“真实”场景。为应对这一失败模式,作者提出了一种无需模型训练的工具引导推理框架(tool-guided inference framework),其关键在于引入一组通用图像处理工具(如线条绘制、区域裁剪、并排对比和通道隔离)与一个基于错觉类型路由的提示系统(illusion-type-routing system prompt),该系统可动态决定针对不同感知问题类别调用何种工具。每个工具调用生成的新图像资源会被永久记录到持久化注册表中,使模型能够在推理链中反复引用和组合先前的标注视图,从而实现对复杂结构的跨结构泛化能力,且在未见过的错觉变体(如旋转后的马赫带)上仍保持性能稳定。

链接: https://arxiv.org/abs/2603.29428
作者: Xuesong Wang,Harry Wang
机构: Wayne State University (韦恩州立大学); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 DataCV Workshop, code: this https URL

点击查看摘要

Abstract:Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as “real” regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.
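摘要中"每次工具调用生成新的不可变图像资源并追加到持久化注册表"的设计,可以用一个极简的注册表类来示意;类名与接口均为本文假设,仅用于说明"追加而非原地修改、任意历史资源可被复用"这一机制:

```python
class ImageToolRegistry:
    """持久化图像资源注册表示意:工具输出被追加记录,后续推理可按 id 反复引用。"""
    def __init__(self, source_image):
        self._resources = [("source", source_image)]   # 初始输入图像

    def apply(self, tool_name, tool_fn, rid=0):
        # 工具可作用于任意已有资源,产出新资源并追加(原资源保持不变)
        new_image = tool_fn(self._resources[rid][1])
        new_id = len(self._resources)
        self._resources.append((f"{tool_name}#{new_id}", new_image))
        return new_id

    def get(self, rid):
        return self._resources[rid][1]

    def catalog(self):
        return [name for name, _ in self._resources]
```

如此,模型在推理链中可以像摘要所述那样组合任意先前的标注视图(如先裁剪再并排对比),而不会丢失中间结果。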

[CV-60] A2BFR: Attribute-Aware Blind Face Restoration

【速读】:该论文旨在解决盲人脸恢复(Blind Face Restoration, BFR)中因问题本质病态而导致的重建结果模糊且不可控的问题,以及现有基于扩散模型的方法在感知质量提升的同时缺乏可控性、文本引导的人脸编辑虽能实现属性操控但难以保证可靠恢复的局限。其解决方案的关键在于提出A²BFR框架,通过引入属性感知学习(attribute-aware learning)和语义双训练(semantic dual-training),在统一的Diffusion Transformer架构中结合图像-文本跨模态注意力机制,使去噪过程同时依赖于退化输入与文本提示;其中属性感知学习利用面部属性嵌入监督潜在空间的去噪过程以注入语义先验,语义双训练则借助新构建的AttrFace-90K数据集中的成对属性变化强化属性判别能力并保持重建保真度,从而实现高保真重建与prompt可控生成的统一。

链接: https://arxiv.org/abs/2603.29423
作者: Chenxin Zhu,Yushun Fang,Lu Liu,Shibo Yin,Xiaohong Liu,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Blind face restoration (BFR) aims to recover high-quality facial images from degraded inputs, yet its inherently ill-posed nature leads to ambiguous and uncontrollable solutions. Recent diffusion-based BFR methods improve perceptual quality but remain uncontrollable, whereas text-guided face editing enables attribute manipulation without reliable restoration. To address these issues, we propose A²BFR, an attribute-aware blind face restoration framework that unifies high-fidelity reconstruction with prompt-controllable generation. Built upon a Diffusion Transformer backbone with unified image-text cross-modal attention, A²BFR jointly conditions the denoising trajectory on both degraded inputs and textual prompts. To inject semantic priors, we introduce attribute-aware learning, which supervises denoising latents using facial attribute embeddings extracted by an attribute-aware encoder. To further enhance prompt controllability, we introduce semantic dual-training, which leverages the pairwise attribute variations in our newly curated AttrFace-90K dataset to enforce attribute discrimination while preserving fidelity. Extensive experiments demonstrate that A²BFR achieves state-of-the-art performance in both restoration fidelity and instruction adherence, outperforming diffusion-based BFR baselines by -0.0467 LPIPS and +52.58% attribute accuracy, while enabling fine-grained, prompt-controllable restoration even under severe degradations.

[CV-61] Multimodal Models Meet Presentation Attack Detection on ID Documents

【速读】:该论文旨在解决传统证件身份验证中的 Presentation Attack Detection (PAD) 系统因仅依赖视觉特征而难以识别复杂伪造攻击的问题。其解决方案的关键在于引入多模态模型(如 Paligemma、Llava 和 Qwen),将深度视觉嵌入与文档类型、签发机构和日期等上下文元数据相结合,从而提升对 ID 证件伪造攻击的检测能力。然而实验结果表明,当前预训练多模态模型在实际 PAD 任务中仍存在准确率不足的问题。

链接: https://arxiv.org/abs/2603.29422
作者: Marina Villanueva,Juan M. Espin,Juan E. Tapia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect PAD on ID Documents.

[CV-62] RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment ICRA2026

【速读】:该论文旨在解决机器人在复杂非结构化环境中进行精细交互时,因现有方法依赖检索或大规模模型而导致的泛化能力不足问题。具体而言,传统检索方法受数据稀疏性和覆盖缺口影响易失效,而大型模型在处理未见类别时常出现接触点定位错误和动作预测偏差。解决方案的关键在于提出一种统一的“检索增强型可及性预测”(Retrieval-Augmented Affordance Prediction, RAAP)框架,其核心创新包括:通过密集对应关系传递静态接触点,利用双加权注意力机制融合多参考样本以预测动态动作方向,从而实现对未见对象与类别的稳定性能表现,并支持零样本机器人操作,在仿真与真实世界中均验证有效。

链接: https://arxiv.org/abs/2603.29419
作者: Qiyuan Zhuang,He-Yang Xu,Yijun Wang,Xin-Yang Zhao,Yang-Yang Li,Xiu-Shen Wei
机构: Southeast University (东南大学); Monash University (莫纳什大学); Nanjing University of Science and Technology (南京理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICRA 2026

点击查看摘要

Abstract:Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval-Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment-based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval-augmented alignment model that consolidates multiple references with dual-weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero-shot robotic manipulation in both simulation and the real world. Project website: this https URL.
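摘要中"双加权注意力融合多条检索参考以预测动作方向"的思想,可用下述示意理解:语义相似度与几何接近度各自 softmax 后相乘并归一化,对参考动作方向加权平均。权重的具体形式与温度参数 `tau` 为本文假设:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_directions(query_feat, ref_feats, query_pos, ref_pos, ref_dirs, tau=0.1):
    """双加权参考融合示意:语义权重 × 几何权重,加权平均动作方向。"""
    qn = query_feat / np.linalg.norm(query_feat)
    rn = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    w_sem = softmax(rn @ qn / tau)                    # 语义相似度权重
    dist = np.linalg.norm(ref_pos - query_pos, axis=1)
    w_geo = softmax(-dist / tau)                      # 几何接近度权重
    w = w_sem * w_geo
    w = w / w.sum()
    direction = w @ ref_dirs                          # (3,) 加权平均方向
    return direction / np.linalg.norm(direction)
```

当某条参考在语义与几何上同时占优时,其动作方向将主导输出,体现检索增强对齐的"多参考巩固"效果。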

[CV-63] Adversarial Prompt Injection Attack on Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在实际应用中因指令遵循行为而易受提示注入攻击(prompt injection attacks)的问题,特别是针对闭源模型的不可感知视觉提示注入攻击。其解决方案的关键在于:通过在输入图像中自适应嵌入受限的文本叠加层提供语义引导,并迭代优化不可感知的视觉扰动,使被攻击图像的特征表示在粗粒度和细粒度上均与恶意视觉和文本目标对齐;同时将视觉目标建模为文本渲染图像并在优化过程中逐步精炼,以增强语义保真度和跨模型迁移能力。

链接: https://arxiv.org/abs/2603.29418
作者: Meiwen Ding,Song Xia,Chenqi Kong,Xudong Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.
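摘要所述"迭代优化不可感知扰动,使受攻击图像特征与恶意目标特征对齐"的流程,可用一个线性特征映射下的 PGD 式示意来理解。这里以 f(x)=Wx 代替真实视觉编码器(假设),最大化余弦相似度并将扰动投影回 L∞ 预算内;步长与步数均为示意值:

```python
import numpy as np

def align_attack(x, W, target, eps=8 / 255, lr=0.005, steps=50):
    """特征对齐攻击示意:符号梯度上升 + L∞ 投影 + 像素范围裁剪。"""
    t = target / np.linalg.norm(target)
    x_adv = x.copy()
    for _ in range(steps):
        u = W @ x_adv
        nu = np.linalg.norm(u)
        cos = (u @ t) / nu
        grad_u = t / nu - cos * u / nu**2     # d cos / d u
        grad_x = W.T @ grad_u                 # 链式法则回传到输入
        x_adv = x_adv + lr * np.sign(grad_x)  # FGSM 式符号上升
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # 投影回扰动预算
        x_adv = np.clip(x_adv, 0.0, 1.0)      # 保持合法像素范围
    return x_adv
```

真实攻击还需如摘要所述在粗/细粒度上同时对齐视觉与文本目标,并迭代精炼文本渲染的视觉目标,此处仅保留核心的对齐优化骨架。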

[CV-64] Native-Domain Cross-Attention for Camera-LiDAR Extrinsic Calibration Under Large Initial Perturbations

【速读】:该论文旨在解决相机与激光雷达(LiDAR)融合中因初始外参(extrinsic parameters)偏差较大时导致的标定精度下降问题。现有基于学习的方法通常将LiDAR点投影到深度图进行特征融合,但在大初始误差下会扭曲三维几何结构,从而降低性能。其解决方案的关键在于提出一种外参感知的交叉注意力机制(extrinsic-aware cross-attention framework),该机制直接在各自原始域(图像块与LiDAR点群)中对齐跨模态特征,并显式地将外参假设注入对应关系建模过程,从而实现几何一致性的跨模态交互,无需依赖投影后的二维深度图。

链接: https://arxiv.org/abs/2603.29414
作者: Ni Ou,Zhuo Chen,Xinru Zhang,Junzheng Wang
机构: Beijing Institute of Technology (北京理工大学); King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Accurate camera-LiDAR fusion relies on precise extrinsic calibration, which fundamentally depends on establishing reliable cross-modal correspondences under potentially large misalignments. Existing learning-based methods typically project LiDAR points into depth maps for feature fusion, which distorts 3D geometry and degrades performance when the extrinsic initialization is far from the ground truth. To address this issue, we propose an extrinsic-aware cross-attention framework that directly aligns image patches and LiDAR point groups in their native domains. The proposed attention mechanism explicitly injects extrinsic parameter hypotheses into the correspondence modeling process, enabling geometry-consistent cross-modal interaction without relying on projected 2D depth maps. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both accuracy and robustness. Under large extrinsic perturbations, our approach achieves accurate calibration in 88% of KITTI cases and 99% of nuScenes cases, substantially surpassing the second-best baseline. We have open sourced our code on this https URL to benefit the community.

[CV-65] AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models CVPR2026

【速读】:该论文旨在解决预训练视觉-语言模型(Vision-Language Models, VLMs)在面对对抗扰动时鲁棒性不足的问题,同时避免现有基于分类标签的对抗微调方法破坏原始跨模态对齐结构,从而导致零样本性能下降。其解决方案的关键在于提出一种基于对齐引导的微调框架(Alignment-Guided Fine-Tuning, AGFT),该框架不依赖硬标签,而是利用原始模型的软概率预测进行文本引导的对抗训练,通过软对齐分布将对抗样本的视觉特征与文本嵌入对齐,从而增强零样本下的对抗鲁棒性;此外,引入分布一致性校准机制,使微调后模型输出与预训练模型经温度缩放后的预测分布保持一致,有效缓解微调带来的结构偏差。

链接: https://arxiv.org/abs/2603.29410
作者: Yubo Cui,Xianchao Guan,Zijun Xiong,Zheng Zhang
机构: Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, Shenzhen, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by CVPR 2026; Code is available at \url{ this https URL }

点击查看摘要

Abstract:Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.
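由摘要推测,AGFT 的"软对齐"与"分布一致性校准"两项目标都可以写成对教师分布的 KL 散度,区别仅在教师分布是否经温度缩放。下述为本文基于这一推测的示意(温度 `T` 与损失组合方式均为假设):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def agft_losses(logits_adv, logits_clean_pretrained, T=2.0):
    """AGFT 两项目标示意:
    l_align: 对抗样本输出逼近原始模型在干净样本上的软概率(软对齐);
    l_cal:  微调模型输出与预训练模型温度缩放后的分布保持一致(校准)。"""
    teacher_soft = softmax(logits_clean_pretrained)
    teacher_cal = softmax(logits_clean_pretrained, T=T)
    l_align = kl(teacher_soft, softmax(logits_adv))
    l_cal = kl(teacher_cal, softmax(logits_adv))
    return l_align, l_cal
```

相比硬标签交叉熵,软概率目标保留了图文之间的相对关系,这正是摘要强调其零样本鲁棒性优势的来源。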

[CV-66] Hallucination-aware intermediate representation edit in large vision-language models

【速读】:该论文旨在解决大规模视觉语言模型(Large Vision-Language Models)在多模态推理和复杂场景理解中普遍存在的幻觉问题(hallucination),即模型输出与视觉事实相矛盾的现象。现有解决方案主要包括重训练方法和对比解码(Contrastive Decoding, CD)方法,但前者计算资源消耗大,后者引入双重推理开销,限制了实际应用。本文提出了一种动态检测幻觉表征并对其进行消除编辑的框架,其关键在于通过最小的额外计算成本识别出导致幻觉的内部表示,并对其进行精准修正,从而实现高效、鲁棒且可控的幻觉消除,达到当前最优性能。

链接: https://arxiv.org/abs/2603.29405
作者: Wei Suo,Hanzu Zhang,Lijun Zhang,Ji Ma,Peng Wang,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at this https URL

[CV-67] AA-Splat: Anti-Aliased Feed-forward Gaussian Splatting

【速读】:该论文旨在解决现有前向传播三维高斯溅射(Feed-forward 3D Gaussian Splatting, FF-3DGS)方法在非分布采样率下渲染时出现严重伪影的问题,其根源在于现有方法依赖于错误的屏幕空间膨胀滤波器(screen-space dilation filters)。解决方案的关键在于提出AA-Splat模型,其核心创新为Opacity-Balanced Band-Limiting (OBBL)设计:一方面通过3D带限后滤波器(3D band-limiting post-filter)将多视角最大频率边界整合进前向重建流程,有效对齐3D场景表示进行带限处理并消除退化高斯分布;另一方面引入透明度平衡(Opacity Balancing, OB)机制,无缝融合所有像素对齐的高斯原语到渲染过程,补偿因扩展高斯原语导致的重叠增加问题。此方案显著提升了不同分辨率下的抗锯齿渲染鲁棒性,在NVS任务中相较SOTA基线DepthSplat实现5.4–7.5dB的平均PSNR提升。

链接: https://arxiv.org/abs/2603.29394
作者: Taewoo Suh,Sungpyo Kim,Jongmin Park,Munchurl Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Please visit our project page at this https URL

点击查看摘要

Abstract:Feed-forward 3D Gaussian Splatting (FF-3DGS) emerges as a fast and robust solution for sparse-view 3D reconstruction and novel view synthesis (NVS). However, existing FF-3DGS methods are built on incorrect screen-space dilation filters, causing severe rendering artifacts when rendering at out-of-distribution sampling rates. We are the first to propose an FF-3DGS model, called AA-Splat, that enables robust anti-aliased rendering at any resolution. AA-Splat utilizes an opacity-balanced band-limiting (OBBL) design, which combines two components: a 3D band-limiting post-filter that integrates multi-view maximal frequency bounds into the feed-forward reconstruction pipeline, effectively band-limiting the resulting 3D scene representations and eliminating degenerate Gaussians; and an Opacity Balancing (OB) mechanism that seamlessly integrates all pixel-aligned Gaussian primitives into the rendering process, compensating for the increased overlap between expanded Gaussian primitives. AA-Splat demonstrates drastic improvements with average 5.4–7.5dB PSNR gains on NVS performance over a state-of-the-art (SOTA) baseline, DepthSplat, at all resolutions between 4× and 1/4×. Code will be made available.

[CV-68] Extend3D: Town-Scale 3D Generation CVPR2026

【速读】:该论文旨在解决从单张图像生成完整3D场景的难题,尤其针对基于对象中心(object-centric)的3D生成模型在处理大尺度场景时受限于固定大小潜在空间的问题。其核心解决方案是提出一种无需训练的扩展式3D场景生成框架Extend3D:首先通过在x和y方向上扩展潜在空间以支持更广场景表示;接着将扩展后的潜在空间划分为重叠块(patch),利用对象中心3D生成模型逐块生成并同步耦合;为确保图像与潜在块之间的严格空间对齐,采用单目深度估计器初始化点云,并结合SDEdit迭代优化遮挡区域;关键创新在于发现将3D结构不完整性视为噪声进行“欠噪声”(under-noising)处理可实现有效补全,并引入3D感知优化目标以保持子场景动态一致性,从而提升几何结构与纹理保真度。

链接: https://arxiv.org/abs/2603.29387
作者: Seungwoo Yoon,Jinmo Kim,Jaesik Park
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026, Project Page: this http URL

点击查看摘要

Abstract:In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the x and y directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion via a concept, which we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.

[CV-69] PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization

【速读】:该论文旨在解决基于提示词(prompt-based)的生成式 AI 图像编辑技术所带来的恶意内容伪造与虚假信息传播风险,特别是针对此类新兴编辑方法缺乏有效伪造定位(forgery localization)检测手段的问题。解决方案的关键在于:首先构建了一个全自动的掩码标注框架,利用关键点对齐与语义空间相似性生成精确的编辑区域真值掩码,从而建立大规模数据集 PromptForge-350k;其次提出 ICL-Net 模型,采用三流主干结构与图像内对比学习机制,以提取鲁棒且泛化能力强的取证特征,显著提升了在真实场景下的定位精度与抗退化能力。

链接: https://arxiv.org/abs/2603.29386
作者: Jianpeng Wang,Haoyu Wang,Baoying Chen,Jishen Zeng,Yiming Qin,Yiqi Yang,Zhongjie Ba
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid democratization of prompt-based AI image editing has recently exacerbated the risks associated with malicious content fabrication and misinformation. However, forgery localization methods targeting these emerging editing techniques remain significantly under-explored. To bridge this gap, we first introduce a fully automated mask annotating framework that leverages keypoint alignment and semantic space similarity to generate precise ground-truth masks for edited regions. Based on this framework, we construct PromptForge-350k, a large-scale forgery localization dataset covering four state-of-the-art prompt-based AI image editing models, thereby mitigating the data scarcity in this domain. Furthermore, we propose ICL-Net, an effective forgery localization network featuring a triple-stream backbone and intra-image contrastive learning. This design enables the model to capture highly robust and generalizable forensic features. Extensive experiments demonstrate that our method achieves an IoU of 62.5% on PromptForge-350k, outperforming SOTA methods by 5.1%. Additionally, it exhibits strong robustness against common degradations with an IoU drop of less than 1%, and shows promising generalization capabilities on unseen editing models, achieving an average IoU of 41.5%.
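摘要中的全自动掩码标注框架结合了关键点对齐与语义空间相似性;下面给出一个仅保留几何差异部分的简化示意(假设两图已对齐,函数名与阈值均为本文假设),按 patch 阈值化归一化像素差异得到编辑区域掩码:

```python
import numpy as np

def auto_mask(orig, edited, patch=4, thresh=0.25):
    """自动掩码标注的简化示意。
    orig / edited: (H, W) 已对齐的灰度图,H、W 可被 patch 整除;
    返回与输入同分辨率的布尔掩码,True 表示被编辑区域。"""
    H, W = orig.shape
    mask = np.zeros((H // patch, W // patch), dtype=bool)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            a = orig[i:i + patch, j:j + patch].astype(float)
            b = edited[i:i + patch, j:j + patch].astype(float)
            diff = np.abs(a - b).mean() / 255.0   # 归一化平均差异
            mask[i // patch, j // patch] = diff > thresh
    # 将 patch 级掩码上采样回像素分辨率
    return np.kron(mask.astype(int), np.ones((patch, patch), dtype=int)).astype(bool)
```

真实框架还会在语义特征空间中比较相似度以过滤光照等非编辑差异,此处省略该步骤。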

[CV-70] Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement

【速读】:该论文旨在解决罕见遗传性皮肤疾病复发性营养不良性大疱表皮松解症(Recessive Dystrophic Epidermolysis Bullosa, RDEB)的临床诊断中,现有通用基础模型难以有效捕捉具有临床意义的特征,且专家一致性评估困难的问题。其解决方案的关键在于提出一种基于专家序数比较(triplet judgments)的嵌入空间评估方法,并构建了一个多模态框架TriDerm,通过整合伤口图像、边界掩码和专家报告,实现可解释的伤口表型表示学习:视觉部分采用伤口级注意力池化与非对比表示学习,文本部分则利用大语言模型进行比较查询提示并生成软序数嵌入(Soft Ordinal Embeddings, SOE),最终融合双模态信息使模型与专家的一致性达到73.5%,显著优于单一模态基础模型。

链接: https://arxiv.org/abs/2603.29376
作者: Fabian Kabus,Julia Hindel,Jelena Bratulić,Meropi Karakioulaki,Ayush Gupta,Cristina Has,Thomas Brox,Abhinav Valada,Harald Binder
机构: University of Freiburg (弗莱堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.
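摘要中以专家三元组判断评估嵌入空间的做法,可以用如下示意计算"与专家一致性"指标:若嵌入空间中锚点到专家认定更相似样本的距离小于到较不相似样本的距离,则记为一致。函数名为本文假设:

```python
import numpy as np

def triplet_agreement(emb, triplets):
    """三元组一致性评估示意。
    emb: {样本id: 嵌入向量};
    triplets: (anchor, positive, negative) 列表,positive 为专家认定与 anchor 更相似者。"""
    hits = 0
    for a, p, n in triplets:
        dp = np.linalg.norm(emb[a] - emb[p])   # 锚点到"更相似"样本的距离
        dn = np.linalg.norm(emb[a] - emb[n])   # 锚点到"较不相似"样本的距离
        hits += dp < dn
    return hits / len(triplets)
```

文中报告的 73.5% 专家一致性即可理解为这类指标在专家三元组集合上的取值。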

[CV-71] StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision

【速读】:该论文旨在解决当前立体视觉(stereo vision)任务中骨干网络因缺乏显式相机位姿(camera pose)监督而导致的几何细节退化问题,从而限制了立体匹配与立体转换等任务的性能。现有方法多依赖单目深度估计(monocular depth estimation, MDE)模型或视觉基础模型(visual foundation models, VFMs),但这些模型在预训练阶段未明确引入几何约束,导致其在处理双目立体视觉时难以保持精确的空间结构信息。解决方案的关键在于利用已预训练并包含丰富3D先验(包括相机位姿)的视觉几何接地Transformer(Visual Geometry Grounded Transformer, VGGT),通过提出一种无需训练的特征调整流水线(training-free feature adjustment pipeline),在冻结VGGT参数的基础上有效缓解其在特征提取过程中对几何信息的破坏,同时挖掘模型内部隐含的相机标定知识,最终构建出专为立体视觉设计的骨干网络StereoVGGT。实验证明,基于StereoVGGT的立体匹配网络在KITTI基准上取得第一名,验证了该方案的有效性。

链接: https://arxiv.org/abs/2603.29368
作者: Ziyang Chen,Yansong Qu,You Shen,Xuan Cheng,Liujuan Cao
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) operates as a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers from a more significant degradation of geometric details during feature extraction. Such characteristics conflict with the requirements of binocular stereo vision, thereby constraining its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. The StereoVGGT-based stereo matching network achieved the 1st rank among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.

[CV-72] Uncertainty-Aware Trajectory Prediction: A Unified Framework Harnessing Positional and Semantic Uncertainties

【速读】:该论文旨在解决轨迹预测中因实时地图不确定性带来的挑战,这种不确定性主要来源于两个方面:一是由于传感器限制或环境遮挡导致的位置误差(positional inaccuracies),二是由于对场景语义理解错误引发的语义误差(semantic errors)。解决方案的关键在于提出一个统一的框架,能够联合建模位置和语义不确定性,并将二者显式地融入轨迹预测流程中。该框架采用双头(dual-head)结构,在双次前向传播中独立估计语义与位置预测,并以预测方差作为不确定性指标,通过端到端方式学习得到;随后将这些不确定性信息与原始预测融合,从而提升轨迹预测结果的鲁棒性。

链接: https://arxiv.org/abs/2603.29362
作者: Jintao Sun,Hu Zhang,Gangyi Ding,Zhedong Zheng
机构: Beijing Institute of Technology (北京理工大学); CSIRO DATA61 (澳大利亚联邦科学与工业研究组织数据六一实验室); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Trajectory prediction seeks to forecast the future motion of dynamic entities, such as vehicles and pedestrians, given a temporal horizon of historical movement data and environmental context. A central challenge in this domain is the inherent uncertainty in real-time maps, arising from two primary sources: (1) positional inaccuracies due to sensor limitations or environmental occlusions, and (2) semantic errors stemming from misinterpretations of scene context. To address these challenges, we propose a novel unified framework that jointly models positional and semantic uncertainties and explicitly integrates them into the trajectory prediction pipeline. Our approach employs a dual-head architecture to independently estimate semantic and positional predictions in a dual-pass manner, deriving prediction variances as uncertainty indicators in an end-to-end fashion. These uncertainties are subsequently fused with the semantic and positional predictions to enhance the robustness of trajectory forecasts. We evaluate our uncertainty-aware framework on the nuScenes real-world driving dataset, conducting extensive experiments across four map estimation methods and two trajectory prediction baselines. Results verify that our method (1) effectively quantifies map uncertainties through both positional and semantic dimensions, and (2) consistently improves the performance of existing trajectory prediction models across multiple metrics, including minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate (MR). Code will be available at this https URL.
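摘要中"以端到端方式学习预测方差作为不确定性指标"的常见做法是高斯负对数似然损失;下面给出一个最小示意(假设采用对数方差参数化,函数名为本文假设):

```python
import numpy as np

def gaussian_nll(mu, log_var, y):
    """不确定性感知回归损失示意:双头网络分别输出预测 mu 与对数方差 log_var,
    最小化高斯负对数似然可端到端学到方差(即不确定性)。"""
    var = np.exp(log_var)
    # 0.5 * (log σ² + (y − μ)² / σ²),省略常数项
    return float(np.mean(0.5 * (log_var + (y - mu) ** 2 / var)))
```

该损失在误差大时允许模型增大方差以降低惩罚,从而让方差自然成为位置/语义预测的不确定性度量,供下游轨迹预测融合使用。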

[CV-73] CIPHER: Counterfeit Image Pattern High-level Examination via Representation

【速读】:该论文旨在解决生成式模型(如GANs和扩散模型)快速发展背景下,合成人脸图像日益逼真导致的深度伪造(deepfake)检测难题,尤其是现有检测方法在跨模型场景下性能下降严重的问题。解决方案的关键在于提出CIPHER框架,通过系统性地复用并微调原本用于图像生成任务的判别器(discriminator),提取来自ProGAN判别器的尺度自适应特征与来自扩散模型的时间一致性特征,从而捕捉传统检测器易忽略的生成无关伪影(generation-agnostic artifacts)。这一策略显著提升了检测模型在多种先进生成模型之间的泛化能力和鲁棒性,在多个基准数据集上实现优于现有ViT-based检测器30%以上的F1-score,尤其在CIFAKE等挑战性数据集上表现出高达88%的F1-score,验证了判别器复用与跨模型微调的有效性。

链接: https://arxiv.org/abs/2603.29356
作者: Kyeonghun Kim,Youngung Han,Seoyoung Ju,Yeonju Jean,YooHyun Kim,Minseo Choi,SuYeon Lim,Kyungtae Park,Seungwoo Baek,Sieun Hyeon,Nam-Joon Kim,Hyuk-Jae Lee
机构: OUTTA; Seoul National University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures. Accepted at IEEE-Asia 2025

点击查看摘要

Abstract:The rapid progress of generative adversarial networks (GANs) and diffusion models has enabled the creation of synthetic faces that are increasingly difficult to distinguish from real images. This progress, however, has also amplified the risks of misinformation, fraud, and identity abuse, underscoring the urgent need for detectors that remain robust across diverse generative models. In this work, we introduce Counterfeit Image Pattern High-level Examination via Representation(CIPHER), a deepfake detection framework that systematically reuses and fine-tunes discriminators originally trained for image generation. By extracting scale-adaptive features from ProGAN discriminators and temporal-consistency features from diffusion models, CIPHER captures generation-agnostic artifacts that conventional detectors often overlook. Through extensive experiments across nine state-of-the-art generative models, CIPHER demonstrates superior cross-model detection performance, achieving up to 74.33% F1-score and outperforming existing ViT-based detectors by over 30% in F1-score on average. Notably, our approach maintains robust performance on challenging datasets where baseline methods fail, with up to 88% F1-score on CIFAKE compared to near-zero performance from conventional detectors. These results validate the effectiveness of discriminator reuse and cross-model fine-tuning, establishing CIPHER as a promising approach toward building more generalizable and robust deepfake detection systems in an era of rapidly evolving generative technologies.

[CV-74] FOSCU: Feasibility of Synthetic MRI Generation via Duo-Diffusion Models for Enhancement of 3D U-Nets in Hepatic Segmentation

【速读】:该论文旨在解决医学图像分割中因临床数据获取受限、标注成本高及数据量不足等问题,这些问题严重制约了鲁棒分割算法的发展。其解决方案的关键在于提出FOSCU框架,该框架核心为Duo-Diffusion——一种3D潜在扩散模型(latent diffusion model)结合ControlNet的生成方法,可同时生成高分辨率、解剖学上逼真的合成MRI体积及其对应的分割标签;该方法通过分割条件扩散机制确保生成数据的空间一致性与精确解剖细节,从而提升训练数据质量与多样性。实验表明,使用真实与合成数据联合训练的模型在Dice分数上较仅用真实数据提升0.67%,且Fréchet Inception Distance (FID)降低36.4%,显著改善图像保真度。

链接: https://arxiv.org/abs/2603.29343
作者: Youngung Han,Kyeonghun Kim,Seoyoung Ju,Yeonju Jean,Minkyung Cha,Seohyoung Park,Hyeonseok Jung,Nam-Joon Kim,Woo Kyoung Jeong,Ken Ying-Kai Liao,Hyuk-Jae Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures. Accepted at IEEE APCCAS 2025

点击查看摘要

Abstract:Medical image segmentation faces fundamental challenges, including restricted access to clinical datasets through Picture Archiving and Communication Systems (PACS), costly annotation, and data shortage. These systemic barriers significantly impede the development of robust segmentation algorithms. To address these challenges, we propose FOSCU, which integrates Duo-Diffusion, a 3D latent diffusion model with ControlNet that simultaneously generates high-resolution, anatomically realistic synthetic MRI volumes and corresponding segmentation labels, and an enhanced 3D U-Net training pipeline. Duo-Diffusion employs segmentation-conditioned diffusion to ensure spatial consistency and precise anatomical detail in the generated data. Experimental evaluation on 720 abdominal MRI scans shows that models trained with combined real and synthetic data yield a mean Dice score gain of 0.67% over those using only real data, and achieve a 36.4% reduction in Fréchet Inception Distance (FID), reflecting enhanced image fidelity.

[CV-75] Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中后门攻击的现实性问题,即现有研究多采用合成角落补丁或分布外(out-of-distribution, OOD)模式作为触发器,这些攻击方式在实际场景中难以实现。为更贴近真实环境,论文提出SABLE(Semantics-Aware Backdoor for LEarning in federated settings),其关键在于构建语义一致、分布内且视觉合理的触发器(如眼镜等语义属性变化),并通过特征分离与参数正则化设计聚合感知的恶意目标函数,使攻击者更新保持接近良性更新。该方法仅污染少量可解释的本地数据,同时兼容多种聚合规则(FedAvg、Trimmed Mean、MultiKrum 和 FLAME),在多个数据划分下均实现了高目标攻击成功率并维持良好的正常测试准确率,表明语义对齐的后门攻击仍是联邦学习中的有效威胁。

链接: https://arxiv.org/abs/2603.29328
作者: Kavindu Herath,Joshua Zhao,Saurabh Bagchi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Backdoor attacks on federated learning (FL) are most often evaluated with synthetic corner patches or out-of-distribution (OOD) patterns that are unlikely to arise in practice. In this paper, we revisit the backdoor threat to standard FL (a single global model) under a more realistic setting where triggers must be semantically meaningful, in-distribution, and visually plausible. We propose SABLE, a Semantics-Aware Backdoor for LEarning in federated settings, which constructs natural, content-consistent triggers (e.g., semantic attribute changes such as sunglasses) and optimizes an aggregation-aware malicious objective with feature separation and parameter regularization to keep attacker updates close to benign ones. We instantiate SABLE on CelebA hair-color classification and the German Traffic Sign Recognition Benchmark (GTSRB), poisoning only a small, interpretable subset of each malicious client’s local data while otherwise following the standard FL protocol. Across heterogeneous client partitions and multiple aggregation rules (FedAvg, Trimmed Mean, MultiKrum, and FLAME), our semantics-driven triggers achieve high targeted attack success rates while preserving benign test accuracy. These results show that semantics-aligned backdoors remain a potent and practical threat in federated learning, and that robustness claims based solely on synthetic patch triggers can be overly optimistic.

[CV-76] HSFM: Hard-Set-Guided Feature-Space Meta-Learning for Robust Classification under Spurious Correlations

【速读】:该论文旨在解决深度神经网络在分布偏移(distribution shift)和少数群体样本上表现脆弱的问题,其根源在于模型过度依赖伪相关特征(spurious features),而并非特征提取器(backbone)本身缺乏表达能力。研究表明,即使在存在伪相关性的场景下,ERM训练的模型骨干仍能学习到丰富且信息量大的表示,问题主要出在分类头(classifier head)对这些特征的处理方式。解决方案的关键在于提出一种双层元学习方法,直接在特征空间中进行增强(augmentation),通过学习支持集侧的特征编辑策略,使得在少量内层更新后,分类器能在困难样本上获得更低损失并提升最差组性能。该方法不依赖像素空间或端到端优化,仅需单GPU几分钟训练即可实现高效稳定改进,并通过CLIP可视化验证了所学特征更新具有语义一致性,与伪属性对齐。

链接: https://arxiv.org/abs/2603.29313
作者: Aryan Yazdan Parast,Khawar Islam,Soyoun Won,Basim Azam,Naveed Akhtar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks often rely on spurious features to make predictions, which makes them brittle under distribution shift and on samples where the spurious correlation does not hold (e.g., minority-group examples). Recent studies have shown that, even in such settings, the feature extractor of an Empirical Risk Minimization (ERM)-trained model can learn rich and informative representations, and that much of the failure may be attributed to the classifier head. In particular, retraining a lightweight head while keeping the backbone frozen can substantially improve performance on shifted distributions and minority groups. Motivated by this observation, we propose a bilevel meta-learning method that performs augmentation directly in feature space to improve spurious correlation handling in the classifier head. Our method learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. By operating at the backbone output rather than in pixel space or through end-to-end optimization, the method is highly efficient and stable, requiring only a few minutes of training on a single GPU. We further validate our method with CLIP-based visualizations, showing that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.
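论文的双层元学习在特征空间中学习支持集编辑,其具体优化流程未在摘要中展开。下面用一个玩具示例(假设性的 NumPy 逻辑回归,数据与编辑策略均为虚构)演示其核心直觉:在冻结骨干的特征上,抑制伪相关维度的特征编辑可以让重新训练的分类头改善最差组(worst-group)表现:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_head(feats, labels, steps=500, lr=0.5):
    """Inner loop: fit a logistic-regression head on (possibly edited) features."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w)))
        w -= lr * feats.T @ (p - labels) / len(labels)
    return w

def worst_group_acc(w, feats, labels, groups):
    accs = [((feats[groups == g] @ w > 0) == labels[groups == g]).mean()
            for g in np.unique(groups)]
    return min(accs)

# Frozen-backbone features: dim 0 = core (noisy), dim 1 = spurious.
n_maj, n_min = 400, 40
y = rng.integers(0, 2, n_maj + n_min).astype(float)
s = np.where(y == 1, 1.0, -1.0)
core = s + rng.normal(scale=1.5, size=len(y))
spur = s.copy(); spur[n_maj:] *= -1.0            # flipped for the minority group
X = np.stack([core, spur], axis=1)
groups = np.concatenate([np.zeros(n_maj), np.ones(n_min)])

w_raw = train_head(X, y)                          # head leans on the spurious dim
X_edit = X.copy(); X_edit[:, 1] = 0.0             # "learned" support-side edit
w_edit = train_head(X_edit, y)                    # head forced onto the core dim

print(worst_group_acc(w_raw, X, y, groups), worst_group_acc(w_edit, X, y, groups))
```

原文通过元梯度自动学习编辑方向,此处将编辑简化为手工置零伪相关维度,仅用于说明机制。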

[CV-77] Self-Consistency for LLM-Based Motion Trajectory Generation and Verification CVPR2026

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在视觉领域中生成和验证运动图形轨迹时的准确性不足问题,尤其是如何在无监督条件下提升LLM对复杂几何路径描述的生成一致性与合理性。其解决方案的关键在于:将提示(prompt)对应的轨迹集合建模为一个原型轨迹与一组几何变换(如刚性、相似性和仿射变换)的组合,并通过聚类识别出具有内在一致性的轨迹群;进一步利用不同候选变换组之间的层次关系自动恢复形状家族结构,从而实现更精确的轨迹生成与验证。该方法在轨迹生成准确率上提升4–6%,并在验证阶段相比视觉语言模型(Vision-Language Model, VLM)基线提升11%的精度。

链接: https://arxiv.org/abs/2603.29301
作者: Jiaju Ma,R. Kenny Jones,Jiajun Wu,Maneesh Agrawala
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains. Specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., “Move the circle in a spiral path”), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, and affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4-6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines. Our code and dataset are available at this https URL .
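论文将同一提示下的轨迹族建模为原型轨迹加几何变换群,两条轨迹若可经群内变换相互对齐即视为一致。以下是仅针对相似变换群(旋转 + 等比缩放 + 平移)的一个极简示意(复数最小二乘实现,函数名为本文虚构,非论文算法):

```python
import numpy as np

def similarity_residual(traj_a, traj_b):
    """Relative residual of traj_b after the best similarity transform
    (rotation + uniform scale + translation) of traj_a, via complex least squares."""
    a = traj_a[:, 0] + 1j * traj_a[:, 1]
    b = traj_b[:, 0] + 1j * traj_b[:, 1]
    a = a - a.mean(); b = b - b.mean()          # remove translation
    s = np.vdot(a, b) / np.vdot(a, a)           # optimal complex scale-rotation
    return np.linalg.norm(b - s * a) / np.linalg.norm(b)

def consistent(traj_a, traj_b, tol=0.05):
    return similarity_residual(traj_a, traj_b) < tol

t = np.linspace(0, 4 * np.pi, 100)
spiral = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)

# A rotated, scaled, shifted copy should be consistent under the similarity group.
ang = 0.7
R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
warped = 1.8 * spiral @ R.T + np.array([3.0, -2.0])

line = np.stack([t, 0.5 * t], axis=1)           # a straight path, not a spiral
print(consistent(spiral, warped), consistent(spiral, line))
```

论文还处理刚性与仿射等多个候选变换群及其层次关系,此处仅演示一致性判定的基本形态。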

[CV-78] MotionScale: Reconstructing Appearance Geometry and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting CVPR2026

【速读】:该论文旨在解决从单目视频中重建动态4D场景时面临的两大挑战:一是复杂环境中难以恢复精确的3D几何结构,二是长时间序列下运动信息难以保持时序一致性。解决方案的关键在于提出MotionScale框架,其核心创新是引入一种基于簇中心基变换(cluster-centric basis transformations)的可扩展运动场,能够自适应地捕捉多样且演化的运动模式;同时设计了一种两阶段解耦的渐进式优化策略:第一阶段为背景扩展阶段,适应新可见区域并精化相机位姿,显式建模瞬态阴影;第二阶段为前景传播阶段,通过三阶段细化过程强制运动一致性,从而在大规模场景和长序列中实现高保真结构与运动连贯性。

链接: https://arxiv.org/abs/2603.29296
作者: Haoran Zhou,Gim Hee Lee
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: this https URL.

[CV-79] GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

【速读】:该论文旨在解决当前深度伪造(deepfake)属性识别与检测方法在面对新型生成模型时泛化能力差的问题,尤其在于现有研究多局限于单一视觉模态的分析,未能充分挖掘图像中细微的伪造痕迹及其与语言描述之间的关联。其解决方案的关键在于提出一种基于注视引导的CLIP框架(gaze-guided CLIP),结合自适应增强细粒度语言提示(adaptive-enhanced fine-grained language prompts),通过引入注视向量(gaze vector)分布差异作为关键线索,设计了一个注视感知图像编码器(GIE)以融合外观与注视信息来提取跨域伪造嵌入,并构建语言精炼编码器(LRE)动态生成增强型语义嵌入,从而实现更稳定、通用的细粒度深度伪造属性识别与检测(DFAD)。

链接: https://arxiv.org/abs/2603.29295
作者: Yaning Zhang,Linlin Shen,Zitong Yu,Chunjie Ma,Zan Gao
机构: Qilu University of Technology (Shandong Academy of Sciences); Shenzhen University; Great Bay University; Shandong Artificial Intelligence Institute; Tianjin University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They tend to assess the attribution or detection performance of models on unseen advanced generators, coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we conduct a novel and fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and the preservation of the target gaze in facial images generated by GAN and diffusion varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.

[CV-80] MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network ICASSP2026

【速读】:该论文旨在解决组合图像检索(Composed Image Retrieval, CIR)中两个核心问题:一是频率偏差导致的“稀有样本忽视”(Rare Sample Neglect),即模型对罕见修改语义的关注不足;二是相似度评分易受难负样本(hard negative samples)和噪声干扰,影响匹配准确性。解决方案的关键在于提出Modification frEquentation-rarity baLance neTwork(MELT),其通过双重机制实现改进:首先,在多模态上下文中增强对稀有修改语义的定位与关注,以缓解频率偏差;其次,利用基于扩散的去噪策略对高相似度但属于难负样本的数据进行净化,从而提升多模态融合与匹配的鲁棒性。

链接: https://arxiv.org/abs/2603.29291
作者: Guozhi Qiu,Zhiwei Chen,Zixu Li,Qinlei Huang,Zhiheng Fu,Xuemeng Song,Yupeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of "modifying the reference image according to the text instructions". However, existing CIR methods face two limitations: (1) frequency bias leading to "Rare Sample Neglect", and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation-rarity baLance neTwork (MELT). MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion-based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and matching. Extensive experiments on two CIR benchmarks validate the superior performance of MELT. Codes are available at this https URL.
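摘要未给出 MELT 对稀有修改语义加权的具体公式。下面用一个 IDF 风格的词频统计示意“频率—稀有度平衡”的基本思路(假设性实现,与原文方法无直接对应):

```python
import math
from collections import Counter

def rarity_weights(mod_texts):
    """IDF-style weights: rare modification tokens get larger weights."""
    df = Counter()
    for text in mod_texts:
        df.update(set(text.lower().split()))
    n = len(mod_texts)
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

corpus = [
    "make the dress red",
    "make the dress blue",
    "make the dress longer",
    "add a paisley pattern",   # "paisley" appears once -> rare
]
w = rarity_weights(corpus)
print(w["paisley"] > w["dress"])  # rare token outweighs frequent one
```

在此直觉下,稀有修改词在多模态融合中获得更大的注意力权重,从而缓解“稀有样本忽视”。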

[CV-81] PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

【速读】:该论文旨在解决当前物理AI模型在现实世界部署中面临的感知能力与实际操作需求不匹配的问题,即尽管视觉识别性能优异,但缺乏对空间关系、物理动态和具身动作的深入理解,导致系统在真实场景中可靠性不足。解决方案的关键在于构建一个名为PRISM的多视角视频监督微调(SFT)语料库,该语料库基于一种新颖的三维知识本体论(three-dimensional knowledge ontology),涵盖空间知识、时序与物理知识以及具身动作知识,并通过20余项能力探测任务覆盖具身推理(Embodied Reasoning, ER)、常识(Common Sense, CS)、空间感知(Spatial Perception, SP)和直觉物理(Intuitive Physics, IP)四个评估维度。PRISM包含来自五个超市场景的约1180万帧视频和7.3亿token数据,采用第一人称、第三人称及360°视角并支持开放式、链式思维和多项选择等多样化标注形式,其微调显著降低了所有探测任务的误差率(平均下降66.6%),尤其在具身动作理解上提升达36.4%,验证了结构化领域特定SFT对增强具身视觉语言模型(Embodied Vision-Language Models, VLMs)现实适应性的有效性。

链接: https://arxiv.org/abs/2603.29281
作者: Amirreza Rouhi,Parikshit Sakurikar,Satya Sai Reddy,Narsimha Menga,Anirudh Govil,Sri Harsha Chittajallu,Rajat Aggarwal,Anoop Namboodiri,Sashi Reddi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation - physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions - Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP), and to our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding where the accuracy improves by 36.4%. Our results suggest that ontology-structured, domain specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at this https URL

[CV-82] MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters CVPR2026

【速读】:该论文旨在解决物理驱动类人机器人在运动控制中对部分身体部位缺失观测或目标约束时的适应性问题,即如何实现灵活且精准的局部运动调整而不破坏原有稳定行为。其解决方案的关键在于提出一种两阶段残差学习框架MaskAdapt:第一阶段通过随机身体部位掩码(stochastic body-part masking)和一致性正则化项训练一个对掩码不敏感的基础策略(mask-invariant base policy),从而建立鲁棒的运动先验;第二阶段在此冻结的基础控制器之上训练一个残差策略(residual policy),仅对指定身体部位进行微调,同时保持其他区域原有行为不变,实现了高效、局部化的运动适应能力。

链接: https://arxiv.org/abs/2603.29272
作者: Soomin Park,Eunseong Lee,Kwang Bin Lee,Sung-Hee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: CVPR 2026

点击查看摘要

Abstract:We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
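第一阶段的一致性正则项要求策略在不同掩码条件下输出一致的动作分布。以下用一个线性 softmax 玩具策略(非论文中的物理控制器)示意该正则项的计算方式:对被掩码维度不敏感的策略,其正则损失恰为零:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_reg(policy_w, obs, mask):
    """KL(pi(obs) || pi(obs * mask)): penalize action-distribution drift
    when body-part observations are zeroed out by a stochastic mask."""
    p_full = softmax(obs @ policy_w)
    p_mask = softmax((obs * mask) @ policy_w)
    return np.mean(np.sum(p_full * (np.log(p_full) - np.log(p_mask)), axis=-1))

obs = rng.normal(size=(32, 12))                  # 12-dim proprioceptive state
mask = np.ones(12); mask[4:8] = 0.0              # drop one "body part" block

w_sensitive = rng.normal(size=(12, 6))           # head that uses all dims
w_invariant = w_sensitive.copy(); w_invariant[4:8] = 0.0  # ignores masked dims

print(consistency_reg(w_invariant, obs, mask))   # exactly 0: mask-invariant
print(consistency_reg(w_sensitive, obs, mask) > 0)
```

训练时将该正则与任务奖励联合优化,即可逼近论文所述的 mask-invariant 先验(此处仅为概念示意)。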

[CV-83] ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

【速读】:该论文旨在解决训练-free开放词汇遥感分割(Training-free open-vocabulary remote sensing segmentation, OVRSS)中因独立像素级预测导致的语义不一致问题。现有方法虽通过增强特征表示或缓解模态差异提升局部预测精度,但未能充分考虑遥感图像中固有的强空间与语义相关性,从而在复杂场景下难以实现准确、一致的分割结果。解决方案的关键在于提出ConInfer框架,该框架采用上下文感知的联合推理机制,在多个空间单元上进行协同预测,并显式建模单元间的语义依赖关系,从而利用全局上下文信息显著提升分割的一致性、鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.29271
作者: Wenyang Chen,Zhanxuan Hu,Yaping Zhang,Hailong Ning,Yonghang Tai
机构: Yunnan Normal University (云南师范大学); Xi’an University of Posts and Telecommunications (西安邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: this https URL
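ConInfer 的联合推理机制细节未在摘要中展开。下面给出一个最简化的上下文感知推理示意(假设性实现):在 argmax 之前将每个空间单元的类别 logits 与其四邻域均值混合,可以纠正孤立的离群预测:

```python
import numpy as np

def context_aware_predict(logits, alpha=0.5):
    """Blend each patch's class logits with the mean of its 4-neighbours
    before taking argmax, enforcing spatial consistency across units."""
    padded = np.pad(logits, ((1, 1), (1, 1), (0, 0)), mode="edge")
    neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
             padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
    return np.argmax((1 - alpha) * logits + alpha * neigh, axis=-1)

# A 3x3 grid of patches, 2 classes; the centre patch is a noisy outlier.
logits = np.zeros((3, 3, 2)); logits[..., 0] = 2.0     # all favour class 0
logits[1, 1] = [0.5, 2.2]                              # outlier favours class 1

independent = np.argmax(logits, axis=-1)
joint = context_aware_predict(logits)
print(independent[1, 1], joint[1, 1])  # outlier corrected by its context
```

这只演示“联合预测优于逐 patch 独立预测”的直觉;原文的单元间语义依赖建模远比邻域平均精细。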

[CV-84] Unbiased Model Prediction Without Using Protected Attribute Information

【速读】:该论文旨在解决深度学习模型在不同人口统计子群体中表现不一致的偏见问题(bias),尤其是在缺乏受保护属性信息(protected attribute information)的情况下,传统公平性增强算法因依赖此类敏感信息而难以应用于真实场景。解决方案的关键在于提出一种无需使用受保护属性的去偏算法——非受保护属性去偏(Non-Protected Attribute-based Debiasing, NPAD),其通过利用非受保护属性提供的辅助信息来优化模型以实现公平性目标,并设计了两种新的损失函数:基于属性聚类的去偏损失(Debiasing via Attribute Cluster Loss, DACL)和冗余过滤损失(Filter Redundancy Loss, FRL),从而在LFWA和CelebA数据集上的面部属性预测任务中显著降低了性别和年龄子群体间的偏差。

链接: https://arxiv.org/abs/2603.29270
作者: Puspita Majumdar,Surbhi Mittal,Mayank Vatsa,Richa Singh
机构: IIIT-Delhi (印度国际信息技术学院); IIT Jodhpur (印度理工学院贾多普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The problem of bias persists in the deep learning community as models continue to provide disparate performance across different demographic subgroups. Therefore, several algorithms have been proposed to improve the fairness of deep models. However, a majority of these algorithms utilize the protected attribute information for bias mitigation, which severely limits their application in real-world scenarios. To address this concern, we have proposed a novel algorithm, termed the Non-Protected Attribute-based Debiasing (NPAD) algorithm, for bias mitigation that does not require the protected attribute information. The proposed NPAD algorithm utilizes the auxiliary information provided by the non-protected attributes to optimize the model for bias mitigation. Further, two different loss functions, Debiasing via Attribute Cluster Loss (DACL) and Filter Redundancy Loss (FRL), have been proposed to optimize the model for fairness goals. Multiple experiments are performed on the LFWA and CelebA datasets for facial attribute prediction, and a significant reduction in bias across different gender and age subgroups is observed.
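摘要未给出 Filter Redundancy Loss 的具体形式。以下是一种常见的“滤波器去冗余”损失的示意实现(假设性,基于成对余弦相似度,非论文公式):

```python
import numpy as np

def filter_redundancy_loss(filters):
    """Mean squared off-diagonal cosine similarity between flattened filters.
    Driving this to zero pushes filters toward mutual decorrelation."""
    f = filters.reshape(filters.shape[0], -1)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    g = f @ f.T                                  # pairwise cosine similarities
    off = g - np.eye(len(f))
    return np.mean(off ** 2)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 3, 3, 3))
redundant = np.repeat(base, 8, axis=0)           # 8 identical filters
diverse = rng.normal(size=(8, 3, 3, 3))          # independent random filters

print(filter_redundancy_loss(redundant))         # maximal: all pairs identical
print(filter_redundancy_loss(diverse) < filter_redundancy_loss(redundant))
```

此类正则鼓励卷积核互不相同,从而为非受保护属性提供更丰富的特征,这是对原文思想的一种通用化示意。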

[CV-85] Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在理解否定表达(negation expressions)方面表现不佳的问题,尤其是针对两类常见的否定类型:基于存在的否定(presence-based negation,即图像中实际存在的对象被否定)和基于缺失的否定(absence-based negation,即图像中可能合理存在但实际缺失的对象被否定)。解决方案的关键在于提出Omni-NegCLIP,通过改进CLIP原始的InfoNCE对比损失函数,设计两种新的对比目标:一是将图像嵌入拉近到原句嵌入、推远到存在性否定句嵌入;二是使图像嵌入同时对齐原句和缺失性否定句嵌入,同时保持两者语义区分。此外,基于观察发现CLIP文本编码器前几层对否定文本的学习能力更强,作者在训练过程中仅微调这些前层,从而显著提升模型对多种否定任务的理解能力,在不损害图像-文本检索通用性能的前提下,实现最高达52.65%和12.50%的性能提升。

链接: https://arxiv.org/abs/2603.29258
作者: Jingqi Xu
机构: University of Southern California(南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP’s understanding of two types of negation, namely presence-based negation and absence-based negation, which correspond to negated expressions of objects that are actually present in an image and those that may plausibly exist in an image but are in fact absent, respectively, by modifying CLIP’s original InfoNCE contrastive loss. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.
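两种对比目标的确切损失公式未在摘要中给出。下面用玩具向量示意其概念形态(假设性实现):基于存在的目标沿用 InfoNCE 的 softmax 结构,基于缺失的目标同时对齐两个文本嵌入并以 margin 保持二者区分:

```python
import numpy as np

def norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def presence_loss(img, pos_txt, neg_txt, tau=0.07):
    """Pull the image toward the original caption, push it from the
    presence-based negated caption (softmax over the two similarities)."""
    sims = np.array([img @ pos_txt, img @ neg_txt]) / tau
    return -sims[0] + np.log(np.exp(sims).sum())     # -log softmax(pos)

def absence_loss(img, pos_txt, absneg_txt, margin=0.2):
    """Align the image with BOTH captions while keeping the two text
    embeddings apart by a margin, preserving their semantic distinction."""
    align = (1 - img @ pos_txt) + (1 - img @ absneg_txt)
    distinct = max(0.0, pos_txt @ absneg_txt - (1 - margin))
    return align + distinct

img = norm(np.array([1.0, 0.2, 0.0]))
pos = norm(np.array([1.0, 0.1, 0.1]))          # stand-in: original caption
neg = norm(np.array([-0.8, 0.3, 0.5]))         # stand-in: negated caption
print(presence_loss(img, pos, neg) < presence_loss(img, neg, pos))
```

原文的嵌入来自 CLIP 编码器并结合前层微调,此处向量均为随机示例,仅展示两个目标的方向性。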

[CV-86] Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism CVPR2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频理解任务中面临的挑战,即如何突破输入长度限制并实现对无限长度视频的有效理解。传统方法通常一次性处理全部视频信息,受限于模型输入长度上限,难以应对长时间视频内容。解决方案的关键在于提出一种无需训练的视觉记忆机制——Flexible Memory(FlexMem),其核心思想是模拟人类观看视频时持续感知与回忆关键片段的行为:通过双路径压缩设计实现视觉键值缓存(KV caches)中的有效记忆迁移与写入,并探索多种记忆读取策略以适配不同视频理解任务(包括流式视频场景)。该方法显著提升了MMLLM在长视频和流式视频任务上的性能,在单张3090 GPU上可处理超过1000帧视频,且使基础模型在多个基准测试中达到或超越当前最优模型(如GPT-4o和Gemini-1.5 Pro)的水平。

链接: https://arxiv.org/abs/2603.29252
作者: Tao Chen,Kun Zhang,Qiong Wu,Xiao Chen,Chao Chang,Xiaoshuai Sun,Yiyi Zhou,Rongrong Ji
机构: Xiamen University (厦门大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and propose a novel and training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have an input upper-limit. Concretely, FlexMem first considers the visual KV caches as the memory sources, and realizes effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for the diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long video tasks and one streaming video task. The experimental results show that on a single 3090 GPU, our FlexMem can achieve obvious improvements over existing efficient video understanding methods and process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs on some benchmarks, e.g., GPT-4o and Gemini-1.5 Pro.
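FlexMem 的双路径压缩与记忆读取的工程细节未在摘要中给出。下面用一个玩具记忆类示意“写入时压缩 KV、读取时按相关性召回”的整体形态(类名与接口均为本文虚构):

```python
import numpy as np

class VisualMemory:
    """Toy KV-cache memory: write compresses each clip's keys/values by
    keeping the k entries most attended by a summary query; read recalls
    the stored entries most similar to the question embedding."""
    def __init__(self, keep_per_clip=2):
        self.keys, self.values = [], []
        self.k = keep_per_clip

    def write(self, keys, values, summary_q):
        scores = keys @ summary_q
        top = np.argsort(scores)[-self.k:]       # compress: keep salient tokens
        self.keys.append(keys[top]); self.values.append(values[top])

    def read(self, query, n=2):
        K = np.concatenate(self.keys); V = np.concatenate(self.values)
        top = np.argsort(K @ query)[-n:]
        return V[top]

rng = np.random.default_rng(0)
mem = VisualMemory()
for _ in range(5):                               # "watch" 5 clips in sequence
    mem.write(rng.normal(size=(16, 8)), rng.normal(size=(16, 8)),
              summary_q=rng.normal(size=8))
recalled = mem.read(query=rng.normal(size=8))
print(recalled.shape)                            # fixed-size recall
```

无论视频多长,存储量都被每段保留条数所限定,这正是“无限长度理解”依赖的性质之一(此处仅为概念示意,非原文的 KV 压缩算法)。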

[CV-87] Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method

【速读】:该论文旨在解决单目影像中建筑物高度估计(monocular building height estimation)的难题,该任务在城市形态表征中具有重要意义,但受限于高度线索模糊、城市间建筑形态差异大以及建筑高度分布呈长尾特性等因素。为应对这一挑战,作者构建了PhiSat-2-Height数据集(PHDataset),包含来自全球26个城市共9,475对配准图像与标签patch,并提出两流序数网络(Two-Stream Ordinal Network, TSONet)。TSONet通过联合建模建筑底面分割(footprint segmentation)与高度估计,引入交叉流交换模块(Cross-Stream Exchange Module, CSEM)实现特征层面的底面感知交互,以及特征增强的分箱精修模块(Feature-Enhanced Bin Refinement, FEBR)提升序数高度预测精度。实验表明,TSONet在MAE和RMSE上分别降低13.2%和9.7%,IoU与F1-score分别提升14.0%和10.1%,验证了其有效性。关键创新在于融合多光谱遥感信息与底面几何约束的双流结构设计及有序回归优化策略。

链接: https://arxiv.org/abs/2603.29245
作者: Yanjiao Song,Bowen Cai,Timo Balz,Zhenfeng Shao,Neema Simon Sumari,James Magidi,Walter Musakwa
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular building height estimation from optical imagery is important for urban morphology characterization but remains challenging due to ambiguous height cues, large inter-city variations in building morphology, and the long-tailed distribution of building heights. PhiSat-2 is a promising open-access data source for this task because of its global coverage, 4.75 m spatial resolution, and seven-band spectral observations, yet its potential has not been systematically evaluated. To address this gap, we construct a PhiSat-2-Height dataset (PHDataset) and propose a Two-Stream Ordinal Network (TSONet). PHDataset contains 9,475 co-registered image-label patch pairs from 26 cities worldwide. TSONet jointly models footprint segmentation and height estimation, and introduces a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Experiments on PHDataset show that TSONet achieves the best overall performance, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over the strongest competing results. Ablation studies further verify the effectiveness of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance. Additional analyses indicate that PhiSat-2 benefits monocular building height estimation through its balanced combination of building-relevant spatial detail and multispectral observations. Overall, this study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset and an effective method for future research.
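TSONet 的 FEBR 模块基于序数回归(ordinal regression)精化高度。下面是通用的序数编码/解码示意(分箱边界为假设值,非论文设定):目标被编码为一组“高度是否超过各阈值”的二值标签,解码时按超出的分箱宽度累加:

```python
import numpy as np

def ordinal_encode(height, bin_edges):
    """Ground-truth targets: 1 for every edge the height exceeds."""
    return (height > np.asarray(bin_edges)).astype(float)

def ordinal_decode(probs, bin_edges):
    """Decode ordinal 'height > edge_k' probabilities into a height value:
    sum of per-bin widths weighted by the exceedance probabilities."""
    widths = np.diff(np.concatenate([[0.0], bin_edges]))
    return float(np.sum(probs * widths))

edges = np.array([3.0, 6.0, 9.0, 12.0, 15.0])    # hypothetical height bins (m)
target = ordinal_encode(10.0, edges)             # -> [1, 1, 1, 0, 0]
print(target, ordinal_decode(target, edges))     # hard targets quantize down
```

硬标签解码只能恢复到分箱下界(10 m 解码为 9 m),而软概率(如第 4 个阈值概率 1/3)可插值回连续高度,这是序数回归缓解长尾高度分布的常见做法。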

[CV-88] Diffusion Mental Averages

【速读】:该论文旨在解决扩散模型(Diffusion Model)在生成概念平均图像时存在的模糊问题,即传统数据驱动的平均方法在处理同一提示词下生成的多个样本时,难以获得清晰且具有代表性的平均图像。其核心解决方案是提出“扩散心理平均”(Diffusion Mental Averages, DMA),关键在于将平均操作从数据空间转移到扩散模型内部的语义空间中进行——通过优化多个噪声潜在变量(noise latents),使其去噪轨迹在时间步上逐步对齐并收敛至共享的粗粒度到细粒度语义结构,从而生成一个高保真原型图像。此方法突破了以往依赖外部数据平均的局限,实现了对抽象概念乃至多模态概念(如多种犬类)的稳定、逼真平均表示,为理解模型内部表征和偏见提供了可视化工具。

链接: https://arxiv.org/abs/2603.29239
作者: Phonphrm Thawatdamrongkit,Sukit Seripanitkarn,Supasorn Suwajanakorn
机构: VISTEC, Thailand
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Can a diffusion model produce its own "mental average" of a concept, one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model's semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.

[CV-89] M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding

【速读】:该论文旨在解决单目相机在机器人感知中实现可靠实时空间理解的挑战,尤其是在仅依赖单一图像流时如何提升深度估计与语义分割的精度及稳定性。其核心问题在于:尽管多任务密集预测模型已在像素级深度和语义估计上取得进展,但如何将这些成果有效集成到实时单目建图系统中仍具难度。解决方案的关键在于提出M2H-MX模型,该模型在轻量解码器中引入注册门控全局上下文(register-gated global context)和受控跨任务交互机制,在保持多尺度特征表示的同时,使深度与语义预测相互增强,从而在严格延迟约束下实现稳定输出;此外,其输出可直接接入未修改的单目SLAM(Simultaneous Localization and Mapping)流水线,通过紧凑的感知-映射接口完成端到端集成,显著提升了系统整体性能。

链接: https://arxiv.org/abs/2603.29236
作者: U.V.B.L. Udugama,George Vosselman,Francesco Nex
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures, 5 tables. Preprint under review

点击查看摘要

Abstract:Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
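摘要提到的 register-gated global context 模块结构未公开。以下为“门控全局上下文”这类设计的一个示意(假设性实现,非原文模块):全局池化得到上下文向量,经逐通道 sigmoid 门控后加回局部特征:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_global_context(feat, w_gate, b_gate):
    """Add globally pooled context back into local features through a
    learned per-channel sigmoid gate (schematic register-style gating)."""
    ctx = feat.mean(axis=(0, 1))                 # (C,) global context vector
    gate = sigmoid(w_gate * ctx + b_gate)        # per-channel gate in (0, 1)
    return feat + gate * ctx                     # broadcast over H x W

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 4))                # H x W x C feature map
closed = gated_global_context(feat, w_gate=np.zeros(4), b_gate=-50.0 * np.ones(4))
print(np.allclose(closed, feat))                 # gate ~ 0: context shut off
```

门控让网络学会按通道决定注入多少全局信息,在延迟受限的轻量解码器中是常见取舍;原文的具体门控形式可能不同。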

[CV-90] CCDNet: Learning to Detect Camouflage against Distractors in Infrared Small Target Detection

【Quick Read】: This paper addresses the high false-alarm rate in infrared target detection (IRSTD) caused by low target-background contrast, targets camouflaged within complex backgrounds, and distractors with similar features. The key to the solution is the proposed Camouflage-aware Counter-Distraction Network (CCDNet): first, a backbone built on Weighted Multi-branch Perceptrons (WMPs) aggregates self-conditioned multi-level features to accurately characterize targets and backgrounds; second, an Aggregation-and-Refinement Fusion Neck (ARFN) bidirectionally reconstructs the relations between targets and backgrounds, strengthening target representations while suppressing complex backgrounds; finally, a Contrastive-aided Distractor Discriminator (CaDD) adaptively computes local and global similarity between real targets and backgrounds to discriminate distractors more precisely and reduce the false-alarm rate.

Link: https://arxiv.org/abs/2603.29228
Authors: Zikai Liao, Zhaozheng Yin
Affiliations: Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infrared target detection (IRSTD) tasks have critical applications in areas like wilderness rescue and maritime search. However, detecting infrared targets is challenging due to their low contrast and tendency to blend into complex backgrounds, effectively camouflaging themselves. Additionally, other objects with similar features (distractors) can cause false alarms, further degrading detection performance. To address these issues, we propose a novel Camouflage-aware Counter-Distraction Network (CCDNet) in this paper. We design a backbone with Weighted Multi-branch Perceptrons (WMPs), which aggregates self-conditioned multi-level features to accurately represent the target and background. Based on these rich features, we then propose a novel Aggregation-and-Refinement Fusion Neck (ARFN) to refine structures/semantics from shallow/deep feature maps, and bidirectionally reconstruct the relations between the targets and the backgrounds, highlighting the targets while suppressing the complex backgrounds to improve detection accuracy. Furthermore, we present a new Contrastive-aided Distractor Discriminator (CaDD), enforcing adaptive similarity computation both locally and globally between the real targets and the backgrounds to more precisely discriminate distractors, so as to reduce the false alarm rate. Extensive experiments on infrared image datasets confirm that CCDNet outperforms other state-of-the-art methods.

[CV-91] LightHarmony3D: Harmonizing Illumination and Shadows for Object Insertion in 3D Gaussian Splatting

【Quick Read】: This paper addresses the problem of achieving physically consistent lighting and shadows when inserting external mesh objects into 3D Gaussian Splatting (3DGS) scenes, which requires accurate scene illumination estimation and multi-view consistent rendering. The key to the solution is the LightHarmony3D framework, whose core is a generative module that predicts a 360° high-dynamic-range (HDR) environment map at the insertion location in a single forward pass. By leveraging generative priors instead of iterative optimization, it efficiently captures the dominant scene illumination, enabling physically plausible shading and shadows while maintaining multi-view consistency.

Link: https://arxiv.org/abs/2603.29209
Authors: Tianyu Huang, Zhenyang Ren, Zhenchen Wan, Jiyang Zheng, Wenjie Wang, Runnan Chen, Mingming Gong, Tongliang Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) enables high-fidelity reconstruction of scene geometry and appearance. Building on this capability, inserting external mesh objects into reconstructed 3DGS scenes enables interactive editing and content augmentation for immersive applications such as AR/VR, virtual staging, and digital content creation. However, achieving physically consistent lighting and shadows for mesh insertion remains challenging, as it requires accurate scene illumination estimation and multi-view consistent rendering. To address this challenge, we present LightHarmony3D, a novel framework for illumination-consistent mesh insertion in 3DGS scenes. Central to our approach is our proposed generative module that predicts a full 360° HDR environment map at the insertion location via a single forward pass. By leveraging generative priors instead of iterative optimization, our method efficiently captures dominant scene illumination and enables physically grounded shading and shadows for inserted meshes while maintaining multi-view coherence. Furthermore, we introduce the first dedicated benchmark for mesh insertion in 3DGS, providing a standardized evaluation framework for assessing lighting consistency and photorealism. Extensive experiments across multiple real-world reconstruction datasets demonstrate that LightHarmony3D achieves state-of-the-art realism and multi-view consistency.

[CV-92] Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention

【Quick Read】: This paper addresses semantic drift and unstable cross-session memory retention in long-horizon dialogue systems. Its core solution is a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic memory layers and introduces adaptive retrieval gating together with retention regularization, thereby controlling cross-session drift while keeping context growth bounded and computation efficient.

Link: https://arxiv.org/abs/2603.29194
Authors: Sunil Tiwari, Payal Fofadiya
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions. This paper presents a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross-session drift while maintaining bounded context growth and computational efficiency. Experiments on LOCOMO, LOCCO, and LoCoMo show improved performance, achieving 46.85 Success Rate, 0.618 overall F1 with 0.594 multi-hop F1, and 56.90% six-period retention while reducing false memory rate to 5.1% and context usage to 58.40%. Results confirm enhanced long-term retention and reasoning stability under constrained context budgets.
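
As a rough illustration of the layered design described in the abstract, the sketch below separates working, episodic, and semantic memory with a salience-gated retrieval step. All names and parameters here (`LayeredMemory`, `gate_threshold`, the scalar salience scores) are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

class LayeredMemory:
    """Toy sketch of a three-layer dialogue memory (illustrative, not the paper's code)."""

    def __init__(self, working_size=4, gate_threshold=0.5):
        self.working = deque(maxlen=working_size)   # recent turns, kept verbatim
        self.episodic = []                          # (salience, turn) pairs, evicted turns
        self.semantic = {}                          # distilled facts: key -> value
        self.gate_threshold = gate_threshold

    def add_turn(self, turn, salience, facts=None):
        # A turn about to be evicted from the bounded working layer
        # falls through to the episodic layer instead of being lost.
        if len(self.working) == self.working.maxlen:
            self.episodic.append(self.working[0])
        self.working.append((salience, turn))
        # The semantic layer stores distilled key-value facts directly.
        for k, v in (facts or {}).items():
            self.semantic[k] = v

    def retrieve(self, query_salience):
        # Adaptive gating: only consult the episodic store when the query's
        # estimated salience clears the threshold; working + semantic are cheap.
        context = [t for _, t in self.working]
        if query_salience >= self.gate_threshold:
            context += [t for _, t in sorted(self.episodic, reverse=True)[:2]]
        return context, dict(self.semantic)
```

Eviction from the bounded working layer feeds the episodic store, and the gate skips the episodic lookup for low-salience queries, which is one simple way to keep context growth bounded.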

[CV-93] Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions

【Quick Read】: This paper addresses the performance degradation of large language models (LLMs) in long-running interactions caused by growing context length, memory saturation, and rising computational overhead. The key to the solution is an adaptive context compression framework that integrates importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation, retaining essential conversational information while controlling context growth and thereby striking an effective balance between long-term memory preservation and computational efficiency.

Link: https://arxiv.org/abs/2603.29193
Authors: Payal Fofadiya, Sunil Tiwari
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) often experience performance degradation during long-running interactions due to increasing context length, memory saturation, and computational overhead. This paper presents an adaptive context compression framework that integrates importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation to retain essential conversational information while controlling context growth. The approach is evaluated on LOCOMO, LOCCO, and LongBench benchmarks to assess answer quality, retrieval accuracy, coherence preservation, and efficiency. Experimental results demonstrate that the proposed method achieves consistent improvements in conversational stability and retrieval performance while reducing token usage and inference latency compared with existing memory and compression-based approaches. These findings indicate that adaptive context compression provides an effective balance between long-term memory preservation and computational efficiency in persistent LLM interactions.
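
A minimal sketch of importance-aware selection under a token budget, in the spirit of the framework above; the whitespace tokenizer, the `(text, importance)` message format, and the keep-the-latest-turn rule are all simplifying assumptions rather than the paper's method:

```python
def compress_context(messages, budget):
    """Greedy importance-aware context compression sketch.
    Each message is a (text, importance) pair; budget is in tokens."""
    def tokens(text):
        return len(text.split())  # crude stand-in for a real tokenizer

    # Always keep the most recent message, then fill the remaining budget
    # with the highest-importance older messages, restoring original order.
    recent = messages[-1]
    remaining = budget - tokens(recent[0])
    ranked = sorted(messages[:-1], key=lambda m: m[1], reverse=True)
    selected = []
    for msg in ranked:
        cost = tokens(msg[0])
        if cost <= remaining:
            selected.append(msg)
            remaining -= cost
    ordered = [m for m in messages[:-1] if m in selected] + [recent]
    return [text for text, _ in ordered]
```

Messages that do not fit the budget are simply dropped; a coherence-sensitive variant would additionally penalize selections that break referential chains between kept turns.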

[CV-94] 3D Architect: An Automated Approach to Three-Dimensional Modeling

【Quick Read】: This paper aims to reconstruct a 3D model of an object from a set of its orthographic views. The core problem is how to accurately extract geometric information from 2D images and recover the 3D structure. The key to the solution is applying the Harris corner detector to the input views to extract control points, projecting these control points perpendicular to their respective views to construct envelopes, intersecting the mutually perpendicular envelopes to obtain a set of points describing the object in 3D, and finally reconstructing the surfaces with computational geometry and rendering the result with OpenGL.

Link: https://arxiv.org/abs/2603.29191
Authors: Sunil Tiwari, Payal Fofadiya, Vicky Vishwakarma
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The aim of our paper is to render an object in 3D using a set of its orthographic views. A corner detector (Harris detector) is applied to the input views to obtain control points. These control points are projected perpendicular to their respective views in order to construct an envelope. A set of points describing the object in 3D is obtained from the intersection of these mutually perpendicular envelopes. This set of points is used to regenerate the surfaces of the object using computational geometry. Finally, the object is rendered in 3D using OpenGL.
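
The envelope-intersection step can be sketched in a toy two-view form: control points from a front view carry (x, y) and those from a side view carry (z, y), so 3D candidates arise wherever the two heights agree. This is a deliberately simplified stand-in for the paper's pipeline (which uses Harris corners and full perpendicular envelopes); the tolerance parameter is an assumption:

```python
def intersect_envelopes(front_pts, side_pts, tol=0.5):
    """Toy envelope intersection: front-view points are (x, y),
    side-view points are (z, y); a 3D candidate (x, y, z) is emitted
    wherever the two projections agree in height within `tol`."""
    candidates = []
    for x, y1 in front_pts:
        for z, y2 in side_pts:
            if abs(y1 - y2) <= tol:  # same height in both projections
                candidates.append((x, (y1 + y2) / 2.0, z))
    return candidates
```

In the full method a third (top) view further constrains the candidates, and the surviving point set is triangulated into surfaces.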

[CV-95] SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation CVPR2026

【Quick Read】: This paper addresses the unreliability of current text-to-video (T2V) evaluation systems on long videos, in particular whether these systems can accurately assess video quality in settings where quality differences are easy for humans to judge. The key to the solution is the Synthetic Long-Video Meta-Evaluation (SLVMEval) benchmark: within a pairwise comparison-based meta-evaluation framework, it synthesizes controlled "high-quality vs. low-quality" video pairs from dense video-captioning datasets (covering 10 distinct aspects) and uses crowdsourcing to retain only pairs whose degradation is clearly perceptible to humans, thereby establishing an effective testbed. Experiments show that humans identify the better long video with 84.7%-96.8% accuracy, while existing T2V evaluation systems fall short of human-level accuracy on nine of the ten aspects, revealing clear limitations of current evaluation methods on long videos.

Link: https://arxiv.org/abs/2603.29186
Authors: Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki
Affiliations: Tohoku University; LY Corporation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026

Abstract:This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled “high-quality versus low-quality” pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
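
The pairwise meta-evaluation protocol reduces to a simple computation: an evaluation system is scored by the fraction of good-vs-degraded pairs it ranks correctly. A minimal sketch, with the scorer abstracted as any callable and the video-id inputs as placeholders:

```python
def pairwise_meta_accuracy(pairs, scorer):
    """Meta-evaluation sketch: each pair is (good_video, degraded_video);
    an evaluation system passes a pair when it scores the good video higher.
    `scorer` is any callable mapping a video (here just an id) to a score."""
    correct = sum(1 for good, bad in pairs if scorer(good) > scorer(bad))
    return correct / len(pairs)
```

The same routine with human judgments in place of `scorer` yields the 84.7%-96.8% human ceiling the abstract reports against.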

[CV-96] Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting CVPR2026

【Quick Read】: This paper addresses two problems in visual relocalization: inaccurate initial pose estimation caused by sparse database images, and insufficient efficiency and accuracy in feature matching. The key to the solution is the SplatHLoc framework, which adopts Feature Gaussian Splatting as the scene representation and introduces adaptive viewpoint retrieval to synthesize virtual candidates whose viewpoints are more closely aligned with the query, improving initial pose estimation. It further designs a hybrid feature matching strategy that exploits the strength of Gaussian-rendered features in the coarse matching stage and the stability of features extracted directly from images in the fine stage, yielding more efficient and accurate pose estimation.

Link: https://arxiv.org/abs/2603.29185
Authors: Huaqi Tao, Bingxi Liu, Guangcheng Chen, Fulin Tang, Li He, Hong Zhang
Affiliations: Southern University of Science and Technology; Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical relocalization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, we propose an adaptive viewpoint retrieval method that synthesizes virtual candidates with viewpoints more closely aligned with the query, thereby improving the accuracy of initial pose estimation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc enhances the robustness of visual relocalization, setting a new state-of-the-art.

[CV-97] Segmentation of Gray Matters and White Matters from Brain MRI data

【Quick Read】: This paper addresses the accurate segmentation of brain tissues (such as gray matter and white matter) in magnetic resonance imaging (MRI), a key step for studying brain anatomy, diagnosing neurological disorders, and monitoring disease progression. Traditional methods such as FSL FAST produce tissue probability maps but often require task-specific adjustments and adapt poorly to diverse imaging conditions. The key to the solution is leveraging the prompt-based mechanism of the MedSAM foundation model with minimal architectural modifications: the pretrained image encoder is frozen while only the prompt encoder and mask decoder are fine-tuned, combined with a dedicated preprocessing pipeline (FSL BET skull stripping, FSL FAST probability maps converted into axial, sagittal, and coronal 2D slice labels) to achieve three-class segmentation of gray matter, white matter, and background. Experiments on the IXI dataset achieve Dice scores up to 0.8751, showing that this strategy can efficiently adapt foundation models to multi-class medical image segmentation while preserving their generalization ability.

Link: https://arxiv.org/abs/2603.29171
Authors: Chang Sun, Rui Shi, Tsukasa Koike, Tetsuro Sekine, Akio Morita, Tetsuya Sakai
Affiliations: Waseda University; Nippon Medical School Hospital; Nippon Medical School Musashi Kosugi Hospital; Tokyo Rosai Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Accurate segmentation of brain tissues such as gray matter and white matter from magnetic resonance imaging is essential for studying brain anatomy, diagnosing neurological disorders, and monitoring disease progression. Traditional methods, such as FSL FAST, produce tissue probability maps but often require task-specific adjustments and face challenges with diverse imaging conditions. Recent foundation models, such as MedSAM, offer a prompt-based approach that leverages large-scale pretraining. In this paper, we propose a modified MedSAM model designed for multi-class brain tissue segmentation. Our preprocessing pipeline includes skull stripping with FSL BET, tissue probability mapping with FSL FAST, and converting these into 2D axial, sagittal, coronal slices with multi-class labels (background, gray matter, and white matter). We extend MedSAM’s mask decoder to three classes, freezing the pre-trained image encoder and fine-tuning the prompt encoder and decoder. Experiments on the IXI dataset achieve Dice scores up to 0.8751. This work demonstrates that foundation models like MedSAM can be adapted for multi-class medical image segmentation with minimal architectural modifications. Our findings suggest that such models can be extended to more diverse medical imaging scenarios in future work.

[CV-98] CT-to-X-ray Distillation Under Tiny Paired Cohorts: An Evidence-Bounded Reproducible Pilot Study

【Quick Read】: This paper studies a concrete, deployment-oriented question in cross-modality medical imaging: on a patient-level paired chest X-ray and CT cohort, can CT serve as training-only supervision for a binary (disease vs. non-disease) X-ray classifier that requires no CT at inference time? The key to the solution is framing this as a cross-modality teacher-student distillation problem and designing a reproducible experimental protocol to systematically evaluate distillation strategies, including a plain logit-distillation control, attention transfer, feature hints, and late fusion. The study finds that although some methods perform well on the original data split, rankings become unstable under multiple Monte Carlo resamples and no robust cross-modality advantage emerges, exposing the limitations of current methods in robustness and interpretability and arguing that future CT-to-X-ray transfer claims must meet stricter minimum evidence requirements.

Link: https://arxiv.org/abs/2603.29167
Authors: Bo Ma, Jinsong Wu, Weiqi Yan, Hongjiang Wei
Affiliations: Resideo Technologies Inc.; Auckland University of Technology; Guilin University of Electronic Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Chest X-ray and computed tomography (CT) provide complementary views of thoracic disease, yet most computer-aided diagnosis models are trained and deployed within a single imaging modality. The concrete question studied here is narrower and deployment-oriented: on a patient-level paired chest cohort, can CT act as training-only supervision for a binary disease versus non-disease X-ray classifier without requiring CT at inference time? We study this setting as a cross-modality teacher–student distillation problem and use JDCNet as an executable pilot scaffold rather than as a validated superior architecture. On the original patient-level paired split from a public paired chest imaging cohort, a stripped-down plain cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy and 0.714 macro-F1), whereas the full module-augmented JDCNet variant remains at 0.750 accuracy and 0.429 macro-F1. To test whether that ranking is a split artifact, we additionally run eight patient-level Monte Carlo resamples with same-case comparisons, stronger mechanism controls based on attention transfer and feature hints, and imbalance-sensitive analyses. Under this resampled protocol, late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), the plain cross-modal control drops to 0.500 mean balanced accuracy, and neither attention transfer nor feature hints recover a robust cross-modality advantage. The contribution of this study is therefore not a validated CT-to-X-ray architecture, but a reproducible and evidence-bounded pilot protocol that makes the exact task definition, failure modes, ranking instability, and the minimum requirements for future credible CT-to-X-ray transfer claims explicit.

[CV-99] LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

【Quick Read】: This paper addresses the limitation that existing vision-and-language navigation (VLN) models reason mainly over past and current visual observations while ignoring the future visual dynamics induced by actions, preventing them from understanding the causal relation between actions and environmental change and limiting robustness. The key to the solution is LatentPilot, a new paradigm whose core mechanism exploits future observations during training as a valuable data source for learning action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, a flywheel-style training pipeline iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover mechanism preventing severe deviation. LatentPilot further learns visual latent tokens without supervision; these tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, enabling the agent to "dream ahead" about future scenes and make better decisions grounded in action-environment causality.

Link: https://arxiv.org/abs/2603.29165
Authors: Haihong Hao, Lei Chen, Mingfei Han, Changlin Li, Dong An, Yuqiang Yang, Zhihui Li, Xiaojun Chang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent’s behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on R2R-CE, RxR-CE, and R2R-PE benchmarks achieve new SOTA results, and real-robot tests across diverse environments demonstrate LatentPilot’s superior understanding of environment-action dynamics in scene. Project page:this https URL

[CV-100] SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving

【Quick Read】: This paper addresses a performance bottleneck of trajectory-scoring methods in end-to-end multi-modal planning, specifically whether a static trajectory vocabulary can match or surpass dynamically generated proposals once it becomes sufficiently dense. Existing methods either score a large static trajectory vocabulary or score a small set of dynamically generated proposals; although dynamic proposals perform better in practice, their necessity has been unclear. Through a systematic scaling study of trajectory anchor density in Hydra-MDP, the paper finds that performance keeps improving as trajectory discretization becomes denser, without saturation, showing the potential of static vocabularies. The key innovations of the proposed SparseDriveV2 are: (1) a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space; and (2) a hierarchical scoring strategy that first coarsely scores paths and velocity profiles, then finely scores a small set of composed trajectories, balancing efficiency and accuracy. This design substantially pushes the performance boundary of static scoring methods, achieving leading results on multiple benchmarks.

Link: https://arxiv.org/abs/2603.29163
Authors: Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li, Yining Shi, Sifa Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-end multi-modal planning has been widely adopted to model the uncertainty of driving behavior, typically by scoring candidate trajectories and selecting the optimal one. Existing approaches generally fall into two categories: scoring a large static trajectory vocabulary, or scoring a small set of dynamically generated proposals. While static vocabularies often suffer from coarse discretization of the action space, dynamic proposals provide finer-grained precision and have shown stronger empirical performance on existing benchmarks. However, it remains unclear whether dynamic generation is fundamentally necessary, or whether static vocabularies can already achieve comparable performance when they are sufficiently dense to cover the action space. In this work, we start with a systematic scaling study of Hydra-MDP, a representative scoring-based method, revealing that performance consistently improves as trajectory anchors become denser, without exhibiting saturation before computational constraints are reached. Motivated by this observation, we propose SparseDriveV2 to push the performance boundary of scoring-based planning through two complementary innovations: (1) a scalable vocabulary representation with a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space, and (2) a scalable scoring strategy with coarse factorized scoring over paths and velocity profiles followed by fine-grained scoring on a small set of composed trajectories. By combining these two techniques, SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 as backbone. Code and model are released at this https URL.
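
The factorized coarse-to-fine scoring idea can be sketched as follows; the toy scoring callables and the `top_k` value are assumptions, standing in for SparseDriveV2's learned scorers:

```python
import itertools

def coarse_to_fine_select(paths, profiles, path_score, profile_score,
                          fine_score, top_k=2):
    """Factorized trajectory scoring sketch: geometric paths and velocity
    profiles are scored coarsely and independently; only the top-k of each
    are composed into full trajectories and re-scored finely."""
    best_paths = sorted(paths, key=path_score, reverse=True)[:top_k]
    best_profiles = sorted(profiles, key=profile_score, reverse=True)[:top_k]
    composed = list(itertools.product(best_paths, best_profiles))  # k*k, not |P|*|V|
    return max(composed, key=fine_score)
```

The point of the factorization is the cost profile: coarse scoring touches |paths| + |profiles| items, while fine scoring only touches top_k squared composed trajectories.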

[CV-101] Dual-Imbalance Continual Learning for Real-World Food Recognition CVPR2026

【Quick Read】: This paper addresses a dual imbalance in continual food recognition: the number of samples per food class is heavily skewed (long-tailed), and the number of new food classes introduced at each incremental learning step varies significantly (step imbalance). The key to the DIME framework is parameter-efficient fine-tuning with lightweight adapters, progressively integrated across tasks through a class-count guided spectral merging strategy; a rank-wise threshold modulation mechanism stabilizes merging by preserving dominant knowledge while allowing adaptive updates. The final model keeps only a single merged adapter for inference, avoiding the accumulation of task-specific modules and improving deployment efficiency. Experiments on realistic long-tailed food benchmarks show improvements of more than 3% over the strongest existing continual learning baselines.

Link: https://arxiv.org/abs/2603.29133
Authors: Xiaoyan Zhang, Jiangpeng He
Affiliations: University of Michigan; Indiana University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 3rd MetaFood at CVPR 2026. Code is available at this https URL

Abstract:Visual food recognition in real-world dietary logging scenarios naturally exhibits severe data imbalance, where a small number of food categories appear frequently while many others occur rarely, resulting in long-tailed class distributions. In practice, food recognition systems often operate in a continual learning setting, where new categories are introduced sequentially over time. However, existing studies typically assume that each incremental step introduces a similar number of new food classes, which rarely happens in real world where the number of newly observed categories can vary significantly across steps, leading to highly uneven learning dynamics. As a result, continual food recognition exhibits a dual imbalance: imbalanced samples within each food class and imbalanced numbers of new food classes to learn at each incremental learning step. In this work, we introduce DIME, a Dual-Imbalance-aware Adapter Merging framework for continual food recognition. DIME learns lightweight adapters for each task using parameter-efficient fine-tuning and progressively integrates them through a class-count guided spectral merging strategy. A rank-wise threshold modulation mechanism further stabilizes the merging process by preserving dominant knowledge while allowing adaptive updates. The resulting model maintains a single merged adapter for inference, enabling efficient deployment without accumulating task-specific modules. Experiments on realistic long-tailed food benchmarks under our step-imbalanced setup show that the proposed method consistently improves by more than 3% over the strongest existing continual learning baselines. Code is available at this https URL.
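
To make the merging intuition concrete, the sketch below uses plain class-count weighted averaging of adapter parameters; the paper's actual spectral merging and rank-wise threshold modulation are more involved, so treat this only as the weighting idea:

```python
def merge_adapters(adapters, class_counts):
    """Class-count weighted adapter merging sketch. Each adapter is a dict
    of parameter-name -> list of floats; tasks that introduced more classes
    contribute proportionally more to the single merged adapter."""
    total = sum(class_counts)
    weights = [c / total for c in class_counts]
    merged = {}
    for name in adapters[0]:
        merged[name] = [
            sum(w * a[name][i] for w, a in zip(weights, adapters))
            for i in range(len(adapters[0][name]))
        ]
    return merged
```

Keeping only the merged dict at inference is what avoids accumulating one module per task.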

[CV-102] Enhancing Box and Block Test with Computer Vision for Post-Stroke Upper Extremity Motor Evaluation

【Quick Read】: This paper addresses the insufficient sensitivity of clinical assessments of post-stroke upper-extremity motor function: ordinal scoring misses fine-grained movement quality, while time-based task metrics cannot characterize how movements are executed. The key to the solution is a computer-vision framework that requires only monocular video, without depth sensors or calibration objects, and quantifies finger, arm, and trunk movement patterns during the Box and Block Test (BBT) via world-aligned joint angles. Using unsupervised dimensionality reduction of the resulting movement-feature embeddings, the method separates healthy from post-stroke movement and reveals that patients with identical BBT scores can exhibit different postural patterns, thereby providing objective measurement of upper-extremity movement quality without changing the existing clinical routine.

Link: https://arxiv.org/abs/2603.29101
Authors: David Robinson, Animesh Gupta, Elizabeth Clark, Olga Melnik, Qiushi Fu, Mubarak Shah
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to EMBC 2026

Abstract:Standard clinical assessments of upper-extremity motor function after stroke either rely on ordinal scoring, which lacks sensitivity, or time-based task metrics, which do not capture movement quality. In this work, we present a computer vision-based framework for analysis of upper-extremity movement during the Box and Block Test (BBT) through world-aligned joint angles of fingers, arm, and trunk without depth sensors or calibration objects. We apply this framework to a dataset of 136 BBT recordings collected from 48 healthy individuals and 7 individuals post stroke. Using unsupervised dimensionality reduction of joint-angle features, we analyze movement patterns without relying on expert clinical labels. The resulting embeddings show separation between healthy movement patterns and stroke-related movement deviations. Importantly, some patients with the same BBT scores can be separated with different postural patterns. These results show that world-aligned joint angles can capture meaningful information of upper-extremity functions beyond standard time-based BBT scores, with no effort from the clinician other than monocular video recordings of the patient using a phone or camera. This work highlights the potential of a camera-based, calibration-free framework to measure movement quality in clinical assessments without changing the widely adopted clinical routine.

[CV-103] TrajectoryMover: Generative Movement of Object Trajectories in Videos

【Quick Read】: This paper addresses a missing capability in generative video editing: moving an object's 3D motion trajectory, i.e., relocating an object while preserving its relative 3D motion. Existing methods typically rely on constructing plausible paired data from unpaired videos, which fails when one video in a pair cannot be reconstructed from the other by a simple transformation. The key to the solution is TrajectoryAtlas, a new pipeline for generating large-scale synthetic paired video data, together with TrajectoryMover, a video generator fine-tuned on this data, which successfully enables generative movement of object trajectories.

Link: https://arxiv.org/abs/2603.29092
Authors: Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 8 figures. Project page: this https URL

Abstract:Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object’s 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video’s plausibility and identity. Yet a method to move an object’s 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is currently still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair can not easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data and a video generator TrajectoryMover fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: this https URL

[CV-104] HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling

【Quick Read】: This paper addresses the limitations of existing world models for video prediction that use flat latent representations: objects are entangled, causal structure is ignored, and temporal dynamics collapse into a single scale. The key to the solution is HCLSM, a new architecture built on three interconnected principles: object-centric decomposition via Slot Attention with spatial broadcast decoding; hierarchical temporal dynamics modeled by a three-level engine combining selective state space models for continuous physics, sparse Transformers for discrete events, and compressed Transformers for abstract goals; and causal structure learning through graph neural network (GNN) interaction patterns. A two-stage training protocol further forces slot specialization through spatial reconstruction before dynamics prediction begins, improving representation quality and prediction accuracy in complex environments.

Link: https://arxiv.org/abs/2603.29090
Authors: Jaber Jaber, Osama Jaber
Affiliations: RightNow AI
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 10 pages, 3 tables, 4 figures, 1 algorithm. Code: this https URL

Abstract:World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal dynamics through a three-level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals, and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol where spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: this https URL

[CV-105] WorldFlow3D: Flowing Through 3D Distributions for Unbounded World Generation

【Quick Read】: This paper addresses unbounded 3D world generation, a foundational task for scene modeling in computer vision, graphics, and robotics. The key to the solution is WorldFlow3D, which builds on flow matching to model 3D generation as flowing between 3D data distributions rather than being limited to conditional denoising. Its core innovation is a latent-free flow matching mechanism that generates causal and accurate 3D structure, which then serves as an intermediate distribution to guide the generation of more complex structure and high-fidelity texture while converging faster than existing methods. The framework also supports geometric structure control through vectorized scene layout conditions and visual texture control through scene attributes, validating cross-domain generalization and high-quality generation on real data distributions.

Link: https://arxiv.org/abs/2603.29089
Authors: Amogh Joshi, Julian Ost, Felix Heide
Affiliations: Princeton University; Torc Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Unbounded 3D world generation is emerging as a foundational task for scene modeling in computer vision, graphics, and robotics. In this work, we present WorldFlow3D, a novel method capable of generating unbounded 3D worlds. Building upon a foundational property of flow matching - namely, defining a path of transport between two data distributions - we model 3D generation more generally as a problem of flowing through 3D data distributions, not limited to conditional denoising. We find that our latent-free flow approach generates causal and accurate 3D structure, and can use this as an intermediate distribution to guide the generation of more complex structure and high-quality texture - all while converging more rapidly than existing methods. We enable controllability over generated scenes with vectorized scene layout conditions for geometric structure control and visual texture control through scene attributes. We confirm the effectiveness of WorldFlow3D on both real outdoor driving scenes and synthetic indoor scenes, validating cross-domain generalizability and high-quality generation on real data distributions. We confirm favorable scene generation fidelity over approaches in all tested settings for unbounded scene generation. For more, see this https URL.
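
The "flowing between distributions" principle is easiest to see in the rectified linear case, where the path x_t = (1 - t) * x0 + t * x1 has the constant target velocity x1 - x0, so Euler integration transports a sample exactly. A minimal sketch of that principle only, not WorldFlow3D's model:

```python
def flow_step(x, velocity, dt):
    """One Euler step along a flow field."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]

def transport(x0, x1, steps=4):
    """Transport x0 onto x1 along the linear flow-matching path.
    Along x_t = (1 - t) * x0 + t * x1 the target velocity is the
    constant x1 - x0, so `steps` Euler steps of size 1/steps land
    exactly on x1 regardless of step count."""
    velocity = [b - a for a, b in zip(x0, x1)]  # constant along the path
    x, dt = list(x0), 1.0 / steps
    for _ in range(steps):
        x = flow_step(x, velocity, dt)
    return x
```

In a trained model the constant velocity is replaced by a learned, position- and time-dependent field, and x0/x1 become samples from the source and target 3D distributions.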

[CV-106] Is the Modality Gap a Bug or a Feature? A Robustness Perspective

【Quick Read】: This paper investigates the pronounced modality gap in the shared embedding space of multi-modal models such as CLIP: the image and text distributions are well-separated even though these models align the modalities with a contrastive loss. It shows that under certain conditions, minimizing the contrastive loss separates the two modalities by a global gap vector orthogonal to their embeddings, and that the gap size is monotonically related to robustness: shrinking the gap does not change clean accuracy but makes the model less likely to change its output under embedding perturbations. The key to the solution is a simple post-processing step that moves one modality's embeddings toward the mean of the other, significantly improving robustness without sacrificing clean accuracy.

Link: https://arxiv.org/abs/2603.29080
Authors: Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss
Affiliations: Hebrew University; IBM Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
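
摘要所述的后处理步骤(将一个模态的嵌入整体向另一模态的均值平移后重新归一化)可示意如下(假设性最小实现,步长参数 `lam` 为本文引入的示例,并非论文给定):

```python
import numpy as np

def close_modality_gap(img_emb, txt_emb, lam=1.0):
    """Shift one modality toward the mean of the other, then re-normalize.

    img_emb, txt_emb : (n, d) unit-norm embeddings of the two modalities.
    `lam` scales the step along the global gap vector (illustrative).
    """
    gap = txt_emb.mean(axis=0) - img_emb.mean(axis=0)   # global gap vector
    shifted = img_emb + lam * gap
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)
```

该操作不改变同模态内的相对几何,因此不影响干净准确率,但能显著缩小两模态分布中心的距离。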

[CV-107] LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

【速读】:该论文旨在解决基于骨架的孤立手语识别(ISLR)中对多尺度运动细节(从细微手指动作到整体身体动态)的精细理解问题。现有方法依赖于深度前馈架构,虽提升了模型容量,但缺乏递归优化机制和结构化表征能力。其解决方案的关键在于提出一种循环Transformer框架LA-Sign,通过重复访问潜在表示实现递归式精炼(recurrent latent refinement),并在共享参数下逐步提升运动理解;同时引入几何感知对比目标,将骨骼与文本特征映射至自适应双曲空间(adaptive hyperbolic space),促进多尺度语义组织,从而在减少独立层数量的同时实现最优性能。

链接: https://arxiv.org/abs/2603.29057
作者: Muxin Pu,Mei Kuan Lim,Chun Yong Chong,Chen Change Loy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincare alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.
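
摘要中"以递归代替堆叠层数"的循环式精炼,可用共享参数的残差块重复作用来示意(假设性草图,非 LA-Sign 实际的编码器-解码器循环结构):

```python
import numpy as np

def looped_refinement(x, w, b, n_loops=4):
    """Apply one weight-shared residual block repeatedly (depth via recurrence).

    x : (tokens, dim) latent sequence; w, b : shared block parameters.
    Each pass revisits the same parameters instead of stacking new layers.
    """
    for _ in range(n_loops):
        x = x + np.tanh(x @ w + b)   # same parameters revisited each pass
    return x
```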

[CV-108] Let the Abyss Stare Back: Adaptive Falsification for Autonomous Scientific Discovery

【速读】:该论文旨在解决自主科学发现中“评估固化导致虚假学习”的问题,即当评估机制固定时,搜索过程可能学会通过欺骗评估而非真正理解任务背后的机制来获得高分。其解决方案的关键在于引入DASES框架,该框架通过动态对抗性 falsification(伪造)机制,使“深渊伪造者”(Abyss Falsifier)主动构造可接受的反例环境来挑战候选科学模型,而非依赖静态验证。在此过程中,创新者(Innovator)、深渊伪造者和机制因果提取器(Mechanistic Causal Extractor)协同演化可执行的科学实体与符合科学规范的反例环境,在既定科学契约下实现对候选方案的严格筛选。最终,DASES识别出首个能通过可接受伪造边界检验的候选解,并发现FNG-CE这一具有跨环境泛化能力的损失函数,其在ImageNet等标准基准上显著优于CE和CE+L2。

链接: https://arxiv.org/abs/2603.29045
作者: Peiran Li,Fangzhou Lin,Shuo Xing,Jiashuo Sun,Dylan Zhang,Siyuan Yang,Chaoqun Ni,Zhengzhong Tu
机构: Texas A&M University; Worcester Polytechnic Institute; University of Illinois Urbana-Champaign; University of Illinois Urbana-Champaign; University of Wisconsin-Madison; VecTrue AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 1 figures, 4 tables

点击查看摘要

Abstract:Autonomous scientific discovery is entering a more dangerous regime: once the evaluator is frozen, a sufficiently strong search process can learn to win the exam without learning the mechanism the task was meant to reveal. This is the idea behind our title. To let the abyss stare back is to make evaluation actively push against the candidate through adaptive falsification, rather than passively certify it through static validation. We introduce DASES, a falsification-driven framework in which an Innovator, an Abyss Falsifier, and a Mechanistic Causal Extractor co-evolve executable scientific artifacts and scientifically admissible counterexample environments under a fixed scientific contract. In a controlled loss-discovery problem with a single editable locus, DASES rejects artifacts that static validation would have accepted, identifies the first candidate that survives the admissible falsification frontier, and discovers FNG-CE, a loss that transfers beyond the synthetic discovery environment and consistently outperforms CE and CE+L2 under controlled comparisons across standard benchmarks, including ImageNet.

[CV-109] Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos

【速读】:该论文旨在解决在第一人称“步行游览”视频中,由于人群密集和摄像头处于人眼高度视角导致人类及其阴影在画面中占据显著比例的问题,从而限制了这些视频在环境建模应用中的有效性。解决方案的关键在于构建一个丰富的半合成视频片段数据集,用于训练生成式模型以实现对人类及阴影的逼真修复(inpainting);该数据集由环境背景片段与叠加了模拟阴影的人类行走片段组成,背景和前景均随机取自全球真实的第一人称步行视频以保证视觉多样性,并在此基础上微调最先进的Casper视频扩散模型(video diffusion model),最终实现了在复杂背景和高人类密度场景下优于原始Casper模型的去人效果,且生成的视频可用于构建高质量的城市三维/四维(3D/4D)模型。

链接: https://arxiv.org/abs/2603.29036
作者: Yujin Ham,Junho Kim,Vivek Boominathan,Guha Balakrishnan
机构: Rice University (莱斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Egocentric “walking tour” videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives mitigates their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking tour videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D/4D models of urban locations.
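
摘要中半合成数据集的构建方式(背景片段与叠加行人、模拟阴影的合成片段配对)可示意为简单的 alpha 合成(假设性草图,`shadow_gain` 等参数命名为本文示例):

```python
import numpy as np

def composite_pair(background, foreground, mask, shadow_mask, shadow_gain=0.6):
    """Build one (background, composite) training pair.

    background, foreground : (H, W, 3) float images in [0, 1].
    mask : (H, W) alpha for the pasted human; shadow_mask : (H, W) region
    darkened to simulate a cast shadow (parameter names are illustrative).
    """
    shaded = background * (1.0 - (1.0 - shadow_gain) * shadow_mask[..., None])
    m = mask[..., None]
    composite = m * foreground + (1.0 - m) * shaded
    return background, composite
```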

[CV-110] The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations CVPR2026

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在参数初始化策略上的敏感性问题,即如何设计有效的初始化方法以提升INRs的逼近能力和收敛性能。现有数据驱动的初始化方法虽表现优于标准随机初始化,但其成功机制尚不明确——是否源于编码经典统计信号先验或更复杂的特征结构仍不清楚。论文的关键解决方案是通过噪声预训练实验系统性地分析不同噪声类型对INRs性能的影响:发现仅需在无结构噪声(如均匀噪声、高斯噪声)上进行预训练即可显著增强INRs对未见信号的拟合能力;而具有自然图像典型1/|f^α|频谱结构的噪声则在信号拟合与逆成像任务(如去噪)之间取得最佳平衡,其性能媲美最优数据驱动初始化方法。这一发现为缺乏充足领域特定数据的应用场景提供了更高效的INR训练路径。

链接: https://arxiv.org/abs/2603.29034
作者: Kushal Vyas,Alper Kayabasi,Daniel Kim,Vishwanath Saragadam,Ashok Veeraraghavan,Guha Balakrishnan
机构: Rice University (莱斯大学); University of California, Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. While several data-driven initialization methods demonstrate significant improvements over standard random sampling, the reasons for their success – specifically, whether they encode classical statistical signal priors or more complex features – remain poorly understood. In this study, we explore this phenomenon through a series of experimental analyses leveraging noise pretraining. We pretrain INRs on diverse noise classes (e.g., Gaussian, Dead Leaves, Spectral) and measure their ability to both fit unseen signals and encode priors for an inverse imaging task (denoising). Our analyses on image and video data reveal a surprising finding: simply pretraining on unstructured noise (Uniform, Gaussian) dramatically improves signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, we also find that noise with the classic 1/|f^\alpha| spectral structure of natural images achieves an excellent balance of signal fitting and inverse imaging capabilities, performing on par with the best data-driven initialization methods. This finding enables more efficient INR training in applications lacking sufficient prior domain-specific data. For more details, visit project page at this https URL
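
摘要中具有 1/|f|^α 频谱结构的噪声,可通过对白噪声作频谱整形生成(假设性最小实现;α=0 退化为无结构白噪声,α≈1 接近自然图像的频谱统计):

```python
import numpy as np

def power_law_noise(size, alpha=1.0, rng=None):
    """Sample a (size, size) noise image with amplitude spectrum ~ 1/|f|^alpha."""
    if rng is None:
        rng = np.random.default_rng()
    fx = np.fft.fftfreq(size)[:, None]
    fy = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx ** 2 + fy ** 2)
    f[0, 0] = 1.0                              # avoid division by zero at DC
    spectrum = np.fft.fft2(rng.normal(size=(size, size))) / f ** alpha
    img = np.real(np.fft.ifft2(spectrum))
    return (img - img.mean()) / img.std()      # zero mean, unit std
```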

[CV-111] MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation CVPR

【速读】:该论文旨在解决现有多模态人脸生成模型在空间控制能力上的局限性,即如何实现文本语义与结构布局(如分割图、草图等)之间的协同融合,从而提升可控人脸生成的精度与一致性。传统方法通常通过附加控制模块或拼接单模态网络来扩展预训练文本到图像扩散模型,但这类设计易受架构约束、参数冗余及模态冲突影响,难以实现跨语义与空间域的高效融合。本文提出MMFace-DiT,其核心创新在于引入一个双流扩散Transformer块,该结构并行处理空间(mask/sketch)和语义(text)token,并通过共享的旋转位置嵌入(Rotary Position-Embedded, RoPE)注意力机制实现深度交互,避免模态主导问题,确保对文本意图和结构先验的双重强约束;同时设计了可动态适应不同空间条件的模态嵌入器(Modality Embedder),使单一模型无需重训练即可灵活应对多种输入模态,最终在视觉保真度和提示对齐方面相较六种前沿模型提升40%,构建了一种端到端可控生成的新范式。

链接: https://arxiv.org/abs/2603.29029
作者: Bharath Krishnamurthy,Ajita Rattani
机构: University of North Texas (北德克萨斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026. 22 pages (Main Text + Supplementary), 14 figures, 5 tables, 4 algorithms. Project page: this https URL and Code Repository: this https URL

点击查看摘要

Abstract:Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: this https URL
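
摘要中提到的旋转位置嵌入(RoPE)是公开技术,其作用于单个 token 序列的标准形式可示意如下(仅为标准 RoPE 的草图,不代表 MMFace-DiT 双流共享注意力的具体实现):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding: rotate feature pairs by position-dependent angles.

    x : (seq, dim) token features with even dim.
    """
    seq, dim = x.shape
    pos = np.arange(seq)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per pair
    ang = pos * freqs[None, :]
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```

旋转不改变向量范数,位置 0 的 token 保持不变,这使空间与语义两路 token 可以在共享的位置编码下交互。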

[CV-112] UltraG-Ray: Physics-Based Gaussian Ray Casting for Novel Ultrasound View Synthesis

【速读】:该论文旨在解决超声成像中视图合成(Novel View Synthesis, NVS)的现实性不足问题,特别是现有方法难以准确模拟复杂组织结构及视场依赖的声学效应。其解决方案的关键在于提出一种基于可学习三维高斯场(3D Gaussian field)的新型超声场景表示方法——UltraG-Ray,并结合高效的物理驱动B模式图像合成模块。该方法通过显式编码衰减(attenuation)和反射(reflection)等超声特异性参数到高斯空间表示中,利用创新的射线投射(ray casting)策略实现更真实的B模式图像生成,从而自然地捕捉视场依赖的衰减效应,显著提升合成图像的物理真实感与视觉质量。

链接: https://arxiv.org/abs/2603.29022
作者: Felix Duelmer,Jakob Klaushofer,Magdalena Wysocki,Nassir Navab,Mohammad Farid Azampour
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MIDL 2026 / to appear in PMLR

点击查看摘要

Abstract:Novel view synthesis (NVS) in ultrasound has gained attention as a technique for generating anatomically plausible views beyond the acquired frames, offering new capabilities for training clinicians or data augmentation. However, current methods struggle with complex tissue and view-dependent acoustic effects. Physics-based NVS aims to address these limitations by including the ultrasound image formation process into the simulation. Recent approaches combine a learnable implicit scene representation with an ultrasound-specific rendering module, yet a substantial gap between simulation and reality remains. In this work, we introduce UltraG-Ray, a novel ultrasound scene representation based on a learnable 3D Gaussian field, coupled to an efficient physics-based module for B-mode synthesis. We explicitly encode ultrasound-specific parameters, such as attenuation and reflection, into a Gaussian-based spatial representation and realize image synthesis within a novel ray casting scheme. In contrast to previous methods, this approach naturally captures view-dependent attenuation effects, thereby enabling the generation of physically informed B-mode images with increased realism. We compare our method to state-of-the-art and observe consistent gains in image quality metrics (up to 15% increase on MS-SSIM), demonstrating clear improvement in terms of realism of the synthesized ultrasound images.
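
摘要中"视场依赖的衰减"可用沿射线的 Beer-Lambert 衰减最小模型示意(假设性草图:忽略双程传播与散射等真实声学效应,并非 UltraG-Ray 的实际渲染模块):

```python
import numpy as np

def cast_ray(attenuation, reflectivity, step=1.0):
    """Depth-dependent echo along one scanline (Beer-Lambert sketch).

    attenuation, reflectivity : (n,) per-sample medium parameters along the
    ray; returns the echo amplitude at each depth sample.
    """
    transmitted = np.exp(-np.cumsum(attenuation) * step)  # energy left at depth
    return reflectivity * transmitted
```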

[CV-113] MEDiC: Multi-objective Exploration of Distillation from CLIP

【速读】:该论文旨在解决自监督学习中掩码图像建模(Masked Image Modeling, MIM)方法在特征表示能力与训练稳定性之间的权衡问题。现有方法通常局限于原始像素空间或潜在特征空间,难以同时利用局部细节重建与全局语义对齐的优势。解决方案的关键在于提出MEDiC(Multi-objective Exploration of Distillation from CLIP)框架,通过三个互补目标统一优化:基于冻结CLIP编码器的patch级token蒸馏、全局CLS标记对齐以及轻量级解码器驱动的像素级重建,从而在单一管道中融合多粒度信息。实验表明,该多目标协同机制显著提升模型性能(ImageNet-1K上kNN准确率达73.9%),并揭示了损失权重敏感性这一关键训练挑战。

链接: https://arxiv.org/abs/2603.29009
作者: Konstantinos Georgiou,Maofeng Tang,Hairong Qi
机构: Min H. Kao Department of Electrical Engineering and Computer Science (Min H. Kao 电气工程与计算机科学系), The University of Tennessee (田纳西大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior methods, evolved masking does not outperform simple block masking in the teacher-guided distillation setting, a finding we attribute to the teacher’s inherent semantic awareness. Third, we reveal that optimal scalar loss weights are extremely fragile, with small perturbations causing drops of up to 17 percentage points in kNN accuracy. Our framework achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs.
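
三个互补目标的加权组合可示意如下(假设性草图:此处将三项目标均简化为 MSE,权重数值仅为示例;摘要指出最优权重对微小扰动极为敏感):

```python
import numpy as np

def medic_loss(pred_patch, clip_patch, pred_cls, clip_cls, pred_pix, pix,
               w_patch=1.0, w_cls=1.0, w_pix=1.0):
    """Weighted sum of the three MEDiC objectives (weights are illustrative).

    Patch-level token distillation + global CLS alignment + pixel
    reconstruction, each reduced to an MSE here for simplicity.
    """
    l_patch = np.mean((pred_patch - clip_patch) ** 2)
    l_cls = np.mean((pred_cls - clip_cls) ** 2)
    l_pix = np.mean((pred_pix - pix) ** 2)
    return float(w_patch * l_patch + w_cls * l_cls + w_pix * l_pix)
```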

[CV-114] GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates

【速读】:该论文旨在解决从单目RGB视频流中重建人体三维形态并渲染新视角的难题,尤其针对因观测信息不足导致的未见区域难以准确重建的问题。其解决方案的关键在于引入一个随时间动态更新的“规范空间”(canonical space),该空间作为上下文库累积历史帧中的外观信息,在当前帧缺乏直接观测时提供有效补充。为融合历史与实时观测并处理潜在冲突,作者将渲染过程建模为概率回归(probabilistic regression),相比确定性回归方法能生成更清晰的重建结果,并可在无先前观测区域实现合理合成。

链接: https://arxiv.org/abs/2603.28997
作者: Youngjoong Kwon,Yao He,Heejung Choi,Chen Geng,Zhengmao Liu,Jiajun Wu,Ehsan Adeli
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously over time, we take advantage of the fact that more body parts become observable by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively utilize this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.

[CV-115] Hybrid Quantum-Classical AI for Industrial Defect Classification in Welding Images

【速读】:该论文旨在解决工业场景中铝TIG焊接缺陷自动检测的分类问题,以提升质量控制的自动化水平。其核心解决方案是采用混合量子-经典机器学习框架:首先利用卷积神经网络(CNN)从焊缝图像中提取低维特征向量,从而降低原始像素空间的维度;随后构建两种量子模型——其一通过参数化量子特征映射将特征编码为量子态并计算量子核矩阵,结合变分量子线性求解器(VQLS)求解支持向量机(SVM)优化问题;其二则采用角度编码方式在变分量子电路中处理特征,并使用经典优化器进行训练。这两种方法均在二分类和多类分类任务中与传统CNN模型对比,结果表明混合量子-经典模型具备竞争性性能,验证了其在近中期工业缺陷检测中的应用潜力。

链接: https://arxiv.org/abs/2603.28995
作者: Akshaya Srinivasan,Xiaoyin Cheng,Jianming Yi,Alexander Geng,Desislava Ivanova,Andreas Weinmann,Ali Moghiseh
机构: Fraunhofer Institute for Industrial Mathematics ITWM (弗劳恩霍夫工业数学研究所 ITWM); RPTU Kaiserslautern-Landau (凯撒斯劳滕-兰道大学); MuVision UG (穆维森 UG); Technical University of Sofia (索非亚技术大学); Technical University of Applied Sciences Würzburg-Schweinfurt (维尔茨堡-施韦因富特应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Hybrid quantum-classical machine learning offers a promising direction for advancing automated quality control in industrial settings. In this study, we investigate two hybrid quantum-classical approaches for classifying defects in aluminium TIG welding images and benchmarking their performance against a conventional deep learning model. A convolutional neural network is used to extract compact and informative feature vectors from weld images, effectively reducing the higher-dimensional pixel space to a lower-dimensional feature space. Our first quantum approach encodes these features into quantum states using a parameterized quantum feature map composed of rotation and entangling gates. We compute a quantum kernel matrix from the inner products of these states, defining a linear system in a higher-dimensional Hilbert space corresponding to the support vector machine (SVM) optimization problem and solving it using a Variational Quantum Linear Solver (VQLS). We also examine the effect of the quantum kernel condition number on classification performance. In our second method, we apply angle encoding to the extracted features in a variational quantum circuit and use a classical optimizer for model training. Both quantum models are tested on binary and multiclass classification tasks and the performance is compared with the classical CNN model. Our results show that while the CNN model demonstrates robust performance, hybrid quantum-classical models perform competitively. This highlights the potential of hybrid quantum-classical approaches for near-term real-world applications in industrial defect detection and quality assurance.
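
摘要中由角度编码特征映射导出的量子核矩阵,在省略纠缠门的简化情形下可以经典地精确计算(假设性草图:真实特征映射包含纠缠门,此处仅演示单比特 RY 编码的乘积态情形):

```python
import numpy as np

def ry_state(features):
    """Product state from angle encoding: RY(x_i) applied to |0> per qubit."""
    state = np.array([1.0])
    for theta in features:
        qubit = np.array([np.cos(theta / 2.0), np.sin(theta / 2.0)])
        state = np.kron(state, qubit)
    return state

def quantum_kernel(X):
    """Gram matrix K[i, j] = |<psi(x_i)|psi(x_j)>|^2 (fidelity kernel)."""
    states = np.array([ry_state(x) for x in X])
    overlaps = states @ states.T      # amplitudes are real, no conjugate needed
    return overlaps ** 2
```

该核矩阵可直接代入经典 SVM 求解;论文则进一步用 VQLS 在量子端求解对应的线性系统。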

[CV-116] Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas CVPR2026

【速读】:该论文旨在解决文本驱动的沉浸式三维场景合成中视觉保真度与探索性之间的权衡问题,现有方法要么因自回归扩展导致上下文漂移,要么受限于低分辨率的全景视频生成。其解决方案的关键在于提出一种名为Stepper的统一框架,通过分步式的全景场景扩展策略实现高保真、几何一致的3D场景生成;该框架结合了新型多视角360°扩散模型以保证一致性与高分辨率,以及几何重建流水线以确保结构合理性,从而在大规模多视角全景数据集上实现了最先进的视觉质量与结构一致性表现。

链接: https://arxiv.org/abs/2603.28980
作者: Felix Wimbauer,Fabian Manhardt,Michael Oechsle,Nikolai Kalischek,Christian Rupprecht,Daniel Cremers,Federico Tombari
机构: Google(谷歌); University of Oxford (牛津大学); MCML; Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Findings; Find our project page under this https URL

点击查看摘要

Abstract:The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.

[CV-117] AutoWorld: Scaling Multi-Agent Traffic Simulation with Self-Supervised World Models

【速读】:该论文旨在解决当前多智能体交通仿真系统在依赖大量标注轨迹或语义信息进行监督学习时,面临的数据标注成本高、难以规模化的问题。其核心挑战在于如何有效利用大规模未标注的传感器数据(如LiDAR点云)来提升仿真真实感。解决方案的关键在于提出AutoWorld框架,该框架通过从未标注的LiDAR占用表示中学习一个世界模型(world model),并基于该模型生成粗粒度到细粒度的预测场景上下文作为多智能体运动生成模型的输入;同时引入级联确定性点过程(cascaded Determinantal Point Process)以增强样本多样性,并设计运动感知的潜在监督目标(motion-aware latent supervision objective)来优化场景动态建模能力,从而实现无需额外标注即可显著提升交通仿真真实性。

链接: https://arxiv.org/abs/2603.28963
作者: Mozhgan Pourkeshavarz,Tianran Liu,Nicholas Rhinehart
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-agent traffic simulation is central to developing and testing autonomous driving systems. Recent data-driven simulators have achieved promising results, but rely heavily on supervised learning from labeled trajectories or semantic annotations, making it costly to scale their performance. Meanwhile, large amounts of unlabeled sensor data can be collected at scale but remain largely unused by existing traffic simulation frameworks. This raises a key question: How can a method harness unlabeled data to improve traffic simulation performance? In this work, we propose AutoWorld, a traffic simulation framework that employs a world model learned from unlabeled occupancy representations of LiDAR data. Given world model samples, AutoWorld constructs a coarse-to-fine predictive scene context as input to a multi-agent motion generation model. To promote sample diversity, AutoWorld uses a cascaded Determinantal Point Process framework to guide the sampling processes of both the world model and the motion model. Furthermore, we designed a motion-aware latent supervision objective that enhances AutoWorld’s representation of scene dynamics. Experiments on the WOSAC benchmark show that AutoWorld ranks first on the leaderboard according to the primary Realism Meta Metric (RMM). We further show that simulation performance consistently improves with the inclusion of unlabeled LiDAR data, and study the efficacy of each component with ablations. Our method paves the way for scaling traffic simulation realism without additional labeling. Our project page contains additional visualizations and released code.

[CV-118] Decoding Functional Networks for Visual Categories via GNNs

【速读】:该论文旨在解决大规模脑网络如何表征视觉类别这一核心问题,以实现感知与皮层组织之间的机制性关联。其解决方案的关键在于构建基于7T fMRI的区域级功能图谱,并训练一种带有稀疏边掩码和类别特异性显著性的符号图神经网络(signed Graph Neural Network),从而同时建模正负向功能连接关系,准确解码特定类别的功能连接状态(如运动、食物、车辆),并揭示沿腹侧和背侧视觉通路中具有生物学意义的子网络结构。此框架通过将体素层面的类别选择性扩展为基于连接性的视觉加工表示,实现了机器学习与神经科学的深度融合。

链接: https://arxiv.org/abs/2603.28931
作者: Shira Karmi,Galia Avidan,Tammy Riklin Raviv
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Understanding how large-scale brain networks represent visual categories is fundamental to linking perception and cortical organization. Using high-resolution 7T fMRI from the Natural Scenes Dataset, we construct parcel-level functional graphs and train a signed Graph Neural Network that models both positive and negative interactions, with a sparse edge mask and class-specific saliency. The model accurately decodes category-specific functional connectivity states (sports, food, vehicles) and reveals reproducible, biologically meaningful subnetworks along the ventral and dorsal visual pathways. This framework bridges machine learning and neuroscience by extending voxel-level category selectivity to a connectivity-based representation of visual processing.
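
摘要中"同时建模正负功能连接"的符号图消息传递可示意如下(通用符号 GNN 的假设性草图,并非论文的具体架构;`w_pos`/`w_neg` 为本文示例命名):

```python
import numpy as np

def signed_gnn_layer(h, adj_signed, w_pos, w_neg):
    """One message-passing step that aggregates positive and negative edges apart.

    h : (nodes, dim) parcel features; adj_signed : (nodes, nodes) signed
    functional connectivity; w_pos / w_neg : separate weight matrices.
    """
    a_pos = np.clip(adj_signed, 0.0, None)    # keep positive correlations
    a_neg = np.clip(-adj_signed, 0.0, None)   # keep magnitude of negatives
    return np.tanh(a_pos @ h @ w_pos - a_neg @ h @ w_neg)
```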

[CV-119] Fisheye3R: Adapting Unified 3D Feed-Forward Foundation Models to Fisheye Lenses

【速读】:该论文旨在解决当前基于前馈式基础模型(feed-forward foundation models)的多视角三维(3D)重建方法在处理具有高径向畸变的鱼眼图像(fisheye images)时性能显著下降的问题。其核心挑战源于非线性投影模型导致像素空间位置变化,而现有训练数据中鱼眼图像及其真实标注远少于透视图像,限制了模型泛化能力。解决方案的关键在于提出Fisheye3R框架,该框架通过引入灵活的学习策略,支持仅使用无标签透视图像进行自监督适应,以及无需任何鱼眼训练数据即可实现监督适应,从而在不损害原模型对透视图像性能的前提下,使基础模型能原生适配鱼眼输入,并在多个主流3D重建模型上验证了其在相机位姿、深度图、点云和视场估计等方面的稳定提升。

链接: https://arxiv.org/abs/2603.28896
作者: Ruxiao Duan,Erin Hong,Dongxu Zhao,Eric Turner,Alex Wong,Yunwen Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feed-forward foundation models for multi-view 3-dimensional (3D) reconstruction have been trained on large-scale datasets of perspective images; when tested on wide field-of-view images, e.g., from a fisheye camera, their performance degrades. Their error arises from changes in spatial positions of pixels due to a non-linear projection model that maps 3D points onto the 2D image plane. While one may surmise that training on fisheye images would resolve this problem, there are far fewer fisheye images with ground truth than perspective images, which limits generalization. To enable inference on imagery exhibiting high radial distortion, we propose Fisheye3R, a novel adaptation framework that extends these multi-view 3D reconstruction foundation models to natively accommodate fisheye inputs without performance regression on perspective images. To address the scarcity of fisheye images and ground truth, we introduce flexible learning schemes that support self-supervised adaptation using only unlabeled perspective images and supervised adaptation without any fisheye training data. Extensive experiments across three foundation models, including VGGT, π³, and MapAnything, demonstrate that our approach consistently improves camera pose, depth, point map, and field-of-view estimation on fisheye images.
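
摘要中导致误差的非线性投影,可用常见的等距(equidistant)鱼眼模型示意:r = f·θ,而透视相机对应 r = f·tan θ。以下为假设性草图,实际镜头模型与 Fisheye3R 的内部参数化可能不同:

```python
import numpy as np

def project_equidistant(point, f, cx, cy):
    """Project a 3D camera-frame point with the equidistant fisheye model.

    r = f * theta, where theta is the angle from the optical axis (+z).
    """
    x, y, z = point
    theta = np.arctan2(np.hypot(x, y), z)     # angle off the optical axis
    r = f * theta
    phi = np.arctan2(y, x)                    # azimuth in the image plane
    return cx + r * np.cos(phi), cy + r * np.sin(phi)
```

与透视投影不同,该模型在 θ → 90° 时仍给出有限的像素半径,这正是鱼眼图像能覆盖超宽视场、同时使像素空间位置发生非线性变化的原因。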

[CV-120] OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models

【速读】:该论文旨在解决数据驱动的自动驾驶仿真长期依赖预录驾驶日志或高精地图(HD maps)所带来的可扩展性瓶颈问题,这种依赖限制了生成式模拟的开放性和规模。其核心解决方案是提出OccSim——首个基于占据世界模型(occupancy world model)的3D仿真器,通过两个关键模块实现无需连续日志或HD地图即可稳定生成超长序列:一是基于W-DiT架构的静态占据世界模型,利用显式的刚性变换结构设计实现超长时域静态环境建模;二是布局生成器(Layout Generator),根据合成的道路拓扑动态填充具有反应性的前景代理。此设计使OccSim能够生成超过3,000帧连续场景,并构建覆盖4公里以上的大型3D占据地图,相较此前最优占据模型提升80倍稳定生成长度,显著增强仿真多样性与实用性。

链接: https://arxiv.org/abs/2603.28887
作者: Tianran Liu,Shengwen Zhao,Mozhgan Pourkeshavarz,Weican Li,Nicholas Rhinehart
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Data-driven autonomous driving simulation has long been constrained by its heavy reliance on pre-recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open-ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model-driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego-actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large-scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents an 80x improvement in stable generation length over previous state-of-the-art occupancy world models. OccSim is powered by two modules: W-DiT based static occupancy world model and the Layout Generator. W-DiT handles the ultra-long-horizon generation of static environments by explicitly introducing known rigid transformations in architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre-train 4D semantic occupancy forecasting models to achieve up to 67% zero-shot performance on unseen data, outperforming previous asset-based simulator by 11%. When scaling the OccSim dataset to 5x the size, the zero-shot performance increases to about 74%, while the improvement over asset-based simulators expands to 22.1%.

[CV-121] DF-ACBlurGAN: Structure-Aware Conditional Generation of Internally Repeated Patterns for Biomaterial Microtopography Design

【速读】:该论文旨在解决生成图像时难以保持全局重复与周期性结构一致性的问题,这在需要严格控制重复尺度、间距和边界连贯性的应用场景(如微拓扑生物材料表面设计)中尤为突出。传统机器学习与计算机视觉模型通常优化局部纹理统计特征和语义真实性,而忽视了全局结构的一致性。解决方案的关键在于提出DF-ACBlurGAN——一种结构感知的条件生成对抗网络(conditional generative adversarial network),其核心创新包括:频率域重复尺度估计、尺度自适应高斯模糊以及单元胞重建机制,从而在训练过程中显式建模长程重复特性,实现局部细节清晰度与全局周期稳定性之间的平衡,并通过实验获得的生物学响应标签进行条件控制,以生成符合目标功能结果的设计。

链接: https://arxiv.org/abs/2603.28776
作者: Rongjun Dong,Xin Chen,Morgan R Alexander,Karthikeyan Sivakumar,Reza Omdivar,David A Winkler,Grazziela Figueredo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learning to generate images with internally repeated and periodic structures poses a fundamental challenge for machine learning and computer vision models, which are typically optimised for local texture statistics and semantic realism rather than global structural consistency. This limitation is particularly pronounced in applications requiring strict control over repetition scale, spacing, and boundary coherence, such as microtopographical biomaterial surfaces. In this work, biomaterial design serves as a use case to study conditional generation of repeated patterns under weak supervision and class imbalance. We propose DF-ACBlurGAN, a structure-aware conditional generative adversarial network that explicitly reasons about long-range repetition during training. The approach integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features with stable global periodicity. Conditioning on experimentally derived biological response labels, the model synthesises designs aligned with target functional outcomes. Evaluation across multiple biomaterial datasets demonstrates improved repetition consistency and controllable structural variation compared to conventional generative approaches.
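
摘要中的频率域重复尺度估计,其一维最小示意为取 FFT 幅度谱中最强非直流分量对应的周期(假设性草图,非论文的二维实现):

```python
import numpy as np

def dominant_period(signal):
    """Estimate the repetition period of a 1-D signal from its FFT peak.

    The period is n / k, where k is the strongest non-DC frequency bin.
    """
    n = len(signal)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    k = int(np.argmax(spectrum[1:])) + 1      # skip the DC bin
    return n / k
```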

[CV-122] STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer

【速读】:该论文旨在解决下一代射电天文巡天中海量解析源的形态学分析难题,尤其是在不同望远镜和成像处理流程下缺乏鲁棒性的问题。其关键解决方案是提出STRADAViT框架,该框架基于自监督视觉Transformer(Vision Transformer)进行持续预训练(continued pretraining),融合多巡天数据集、面向射电天文的视图生成策略以及重建-对比联合的两阶段预训练机制,从而构建可迁移性强的射电图像编码器。实验表明,该方法在多个形态学基准测试(如MiraBest、LoTSS DR2、Radio Galaxy Zoo)上均优于初始模型及DINOv2基线,尤其在RGZ DR1数据集上提升显著,证明了射电天文感知的视图生成与分阶段预训练对迁移性能的增强作用。

链接: https://arxiv.org/abs/2603.29660
作者: Andrea DeMarco,Ian Fenech Conti,Hayley Camilleri,Ardiana Bushi,Simone Riggi
机构: University of Malta (马耳他大学)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages

点击查看摘要

Abstract:Next-generation radio astronomy surveys are producing millions of resolved sources, but robust morphology analysis remains difficult across heterogeneous telescopes and imaging pipelines. We present STRADAViT, a self-supervised Vision Transformer continued-pretraining framework for transferable radio astronomy image encoders. STRADAViT combines a mixed-survey pretraining dataset, radio astronomy-aware view generation, and controlled continued pretraining through reconstruction-only, contrastive-only, and two-stage branches. Pretraining uses 512x512 radio astronomy cutouts from MeerKAT, ASKAP, LOFAR/LoTSS, and SKA data. We evaluate transfer with linear probing and fine-tuning on three morphology benchmarks: MiraBest, LoTSS DR2, and Radio Galaxy Zoo. Relative to the initialization used for continued pretraining, the best two-stage STRADAViT models improve Macro-F1 in all reported linear-probe settings and in most fine-tuning settings, with the largest gain on RGZ DR1. Relative to strong DINOv2 baselines, gains are selective but remain positive on LoTSS DR2 and RGZ DR1 under linear probing, and on MiraBest and RGZ DR1 under fine-tuning. A targeted DINOv2-initialized HCL ablation further shows that the adaptation recipe is not specific to a single starting point. The released STRADAViT checkpoint remains the preferred model because it offers competitive transfer at lower token count and downstream cost than the DINOv2-based alternative. These results show that radio astronomy-aware view generation and staged continued pretraining provide a stronger starting point than out-of-the-box Vision Transformers for radio astronomy transfer.

[CV-123] Polyhedral Unmixing: Bridging Semantic Segmentation with Hyperspectral Unmixing via Polyhedral-Cone Partitioning

【速读】:该论文旨在解决光谱图像分析中语义分割(semantic segmentation)与高光谱解混(hyperspectral unmixing)两个任务长期独立处理的问题。传统方法通常将二者分开建模,忽视了它们在物理机制上的内在联系。论文的核心贡献在于理论证明:在线性混合模型下,基于主导物质的像素分类会诱导出光谱空间中的多面锥区域(polyhedral-cone regions)。基于这一性质,作者提出了一种直接的“分割到解混”(segmentation-to-unmixing)流水线——通过构建与标签像素匹配的最佳多面锥划分,计算像素到各区域的带符号距离,并经坐标变换后投影到概率单纯形(probability simplex),获得初始丰度估计;进而利用矩阵伪逆提取端元(endmembers)并恢复最终丰度。该方案的关键创新在于将任意语义分割结果作为输入,实现了对解混过程的显式控制,同时保持其余步骤确定且轻量,显著提升了可解释性与性能,在三个真实数据集上均优于当前主流深度与非深度方法。

链接: https://arxiv.org/abs/2603.29438
作者: Antoine Bottenmuller(CMM, PSL, STIM),Etienne Decencière(CMM, PSL, STIM),Petr Dokládal(CMM, PSL, STIM)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation and hyperspectral unmixing are two central problems in spectral image analysis. The former assigns each pixel a discrete label corresponding to its material class, whereas the latter estimates pure material spectra, called endmembers, and, for each pixel, a vector representing material abundances in the observed scene. Despite their complementarity, these two problems are usually addressed independently. This paper aims to bridge these two lines of work by formally showing that, under the linear mixing model, pixel classification by dominant materials induces polyhedral-cone regions in the spectral space. We leverage this fundamental property to propose a direct segmentation-to-unmixing pipeline that performs blind hyperspectral unmixing from any semantic segmentation by constructing a polyhedral-cone partition of the space that best fits the labeled pixels. Signed distances from pixels to the estimated regions are then computed, linearly transformed via a change of basis in the distance space, and projected onto the probability simplex, yielding an initial abundance estimate. This estimate is used to extract endmembers and recover final abundances via matrix pseudo-inversion. Because the segmentation method can be freely chosen, the user gains explicit control over the unmixing process, while the rest of the pipeline remains essentially deterministic and lightweight. Beyond improving interpretability, experiments on three real datasets demonstrate the effectiveness of the proposed approach when associated with appropriate clustering algorithms, and show consistent improvements over recent deep and non-deep state-of-the-art methods. The code is available at: this https URL
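
流水线中"投影到概率单纯形"一步,通常可用经典的排序投影算法(Duchi 等人的欧氏投影)实现;下面是一个示意性草图(仅演示该步骤的一种常见实现方式,非论文原代码),输入为任意实向量,输出为非负且和为 1 的丰度向量。

```python
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    # largest index where the thresholded value stays positive
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

# toy signed-distance-derived scores for three endmember regions
abund = project_to_simplex(np.array([0.8, 0.6, -0.2]))
```

投影结果即可作为初始丰度估计,再经伪逆提取端元。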

[CV-124] Retinal Malady Classification using AI: A novel ViT-SVM combination architecture

【速读】:该论文旨在解决眼科常见疾病(包括黄斑裂孔、中心性浆液性脉络膜视网膜病变和糖尿病视网膜病变)的早期自动化检测问题,以减少因延误诊断导致的部分或完全视力丧失。其解决方案的关键在于提出一种基于视觉Transformer(Vision Transformer, ViT)和支持向量机(Support Vector Machine, SVM)的混合架构(ViT-SVM),利用ViT提取光学相干断层扫描(Optical Coherence Tomography, OCT)图像的深层特征,并结合SVM进行高效分类,从而实现对上述视网膜病变的精准识别与自动诊断。

链接: https://arxiv.org/abs/2603.29181
作者: Shashwat Jha,Vishvaditya Luhach,Raju Poddar
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Macular holes, central serous retinopathy and diabetic retinopathy are among the most widespread maladies of the eye responsible for either partial or complete vision loss, making early detection of the mentioned defects vital for the well-being of the patient. This study introduces a Vision Transformer and Support Vector Machine based hybrid architecture (ViT-SVM) and analyses its performance in classifying optical coherence tomography (OCT) scans, with the intention of automating the early detection of these retinal defects.
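
"ViT 提特征 + SVM 分类"的混合思路可以用如下玩具示例说明(这里用随机二维特征代替真实的 ViT 嵌入,线性 SVM 以 hinge 损失的次梯度下降训练;数据与超参数均为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for ViT embeddings of OCT scans: two linearly separable classes
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# linear SVM trained by sub-gradient descent on the regularized hinge loss
w, b, lam, lr = np.zeros(2), 0.0, 1e-3, 0.1
for _ in range(200):
    mask = y * (X @ w + b) < 1                       # margin violations
    w -= lr * (lam * w - (y[mask, None] * X[mask]).sum(axis=0) / len(X))
    b -= lr * (-y[mask].sum() / len(X))

acc = (np.sign(X @ w + b) == y).mean()
```

实际系统中 X 替换为冻结 ViT 的 [CLS] 嵌入,SVM 也可直接换用现成库(如 scikit-learn 的 SVC)。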

[CV-125] Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model

【速读】:该论文旨在解决帕金森病(Parkinson's disease, PD)治疗中因个体差异导致的疗效预测难题,尤其是针对时间干涉刺激(temporal interference, TI)和深部脑刺激(deep brain stimulation, DBS)等疗法在临床实践中难以精准选择的问题。其核心挑战在于传统方法依赖有限的统计生物标志物或易过拟合且缺乏可解释性的AI模型,无法有效刻画患者间功能连接的异质性。解决方案的关键在于提出一种预训练-微调框架,首先基于大规模静息态功能磁共振成像(resting-state fMRI)数据构建一个生成式虚拟脑基础模型(generative virtual brain foundation model),通过在2707名受试者、5621次扫描的数据上预训练以捕捉普遍的神经紊乱模式;随后在PD患者队列(TI组n=51,DBS组n=55)上进行微调,生成高保真度的个性化虚拟脑模型(与实测功能连接相关系数r=0.935)。在此基础上,通过构建病理状态与健康状态间的反事实估计,实现了对临床反应的准确预测(TI: AUPR=0.853;DBS: AUPR=0.915),并揭示了与响应相关的状态依赖性区域特征,为机制探索和临床转化提供了新路径。

链接: https://arxiv.org/abs/2603.29176
作者: Siyuan Du,Siyi Li,Shuwei Bai,Ang Li,Haolin Li,Mingqing Xiao,Yang Pan,Dongsheng Li,Weidi Xie,Yanfeng Wang,Ya Zhang,Chencheng Zhang,Jiangchao Yao
机构: Ruijin Hospital, Shanghai Jiao Tong University School of Medicine (Shanghai, China); Shanghai AI Laboratory (Shanghai, China); Microsoft Research Asia (Shanghai, China); School of Artificial Intelligence, Shanghai Jiao Tong University (Shanghai, China); Zhongnan Hospital of Wuhan University (Hubei, China); Cooperative Medianet Innovation Center, Shanghai Jiao Tong University (Shanghai, China)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parkinson's disease (PD) affects over ten million people worldwide. Although temporal interference (TI) and deep brain stimulation (DBS) are promising therapies, inter-individual variability limits empirical treatment selection, incurring non-negligible surgical risk and cost. Previous explorations either resort to limited statistical biomarkers that are insufficient to characterize variability, or employ AI-driven methods which are prone to overfitting and opacity. We bridge this gap with a pretraining-finetuning framework to predict outcomes directly from resting-state fMRI. Critically, a generative virtual brain foundation model, pretrained on a collective dataset (2707 subjects, 5621 sessions) to capture universal disorder patterns, was finetuned on PD cohorts receiving TI (n=51) or DBS (n=55) to yield individualized virtual brains with high fidelity to empirical functional connectivity (r=0.935). By constructing counterfactual estimations between pathological and healthy neural states within these personalized models, we predicted clinical responses (TI: AUPR=0.853; DBS: AUPR=0.915), substantially outperforming baselines. External and prospective validations (n=14, n=11) highlight the feasibility of clinical translation. Moreover, our framework provides state-dependent regional patterns linked to response, offering hypothesis-generating mechanistic insights.

[CV-126] Schrödinger's Seed: Purr-fect Initialization for an Impurr-fect Universe

【速读】:该论文试图解决深度学习中随机种子(random seed)选择缺乏科学依据的问题,即当前实践中常采用固定数值(如42)作为初始随机数生成器的种子,这种做法缺乏理论支撑且可能影响模型性能的可重复性。其解决方案的关键在于提出一种基于猫的物理属性(质量、毛色图案、眼睛颜色及名字熵)构建的猫驱动随机种子生成器,该方法受弗里德曼第一方程启发,并通过蒙特卡洛“Catlo”采样程序实现,实验表明该方案在分类任务中平均准确率达92.58%,显著优于传统种子42(提升约2.5%),暗示猫或具备某种未被认知的量子感知能力。

链接: https://arxiv.org/abs/2603.29115
作者: Mi chen,Renhao Ye
机构: 未知
类目: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
备注: 3 pages, 1 figure, 21 cats

点击查看摘要

Abstract:Context. Random seed selection in deep learning is often arbitrary – conventionally fixed to values such as 42, a number with no known feline endorsement. Aims. We propose that cats, as liminal beings with a historically ambiguous relationship to quantum mechanics, are better suited to this task than random integers. Methods. We construct a cat-driven seed generator inspired by the first Friedmann equation, and test it by mapping 21 domestic cats' physical properties – mass, coat pattern, eye colour, and name entropy – via a Monte "Catlo" sampling procedure. Results. Cat-driven seeds achieve a mean accuracy of 92.58%, outperforming the baseline seed of 42 by ~2.5%. Cats from astrophysicist households perform marginally better, suggesting cosmic insight may be contagious. Conclusions. The Universe responds better to cats than to arbitrary integers. Whether cats are aware of this remains unknown.

人工智能

[AI-0] Automatic Identification of Parallelizable Loops Using Transformer-Based Source Code Representations

【速读】:该论文旨在解决自动并行化(Automatic Parallelization)在软件工程中的核心挑战,即如何准确识别源代码中可安全并行执行的循环区域,尤其是在面对结构不规则或动态变化的代码时,传统静态分析方法(如依赖分析和多面体模型)往往失效。其解决方案的关键在于采用基于DistilBERT的轻量级Transformer模型,通过子词分词(subword tokenization)直接处理源代码序列,从而无需手工特征工程即可捕捉上下文相关的语法与语义模式,实现对循环并行潜力的高效分类。实验表明,该方法在合成数据与真实代码混合的数据集上表现出超过99%的平均准确率和低误报率,显著优于以往基于token的方法,在简化预处理的同时提升了泛化能力与计算效率。

链接: https://arxiv.org/abs/2603.30040
作者: Izavan dos S. Correia,Henrique C. T. Santos,Tiago A. E. Ferreira
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 28 pages, 12 figures

点击查看摘要

Abstract:Automatic parallelization remains a challenging problem in software engineering, particularly in identifying code regions where loops can be safely executed in parallel on modern multi-core architectures. Traditional static analysis techniques, such as dependence analysis and polyhedral models, often struggle with irregular or dynamically structured code. In this work, we propose a Transformer-based approach to classify the parallelization potential of source code, focusing on distinguishing independent (parallelizable) loops from undefined ones. We adopt DistilBERT to process source code sequences using subword tokenization, enabling the model to capture contextual syntactic and semantic patterns without handcrafted features. The approach is evaluated on a balanced dataset combining synthetically generated loops and manually annotated real-world code, using 10-fold cross-validation and multiple performance metrics. Results show consistently high performance, with mean accuracy above 99% and low false positive rates, demonstrating robustness and reliability. Compared to prior token-based methods, the proposed approach simplifies preprocessing while improving generalization and maintaining computational efficiency. These findings highlight the potential of lightweight Transformer models for practical identification of parallelization opportunities at the loop level.

[AI-1] Aligned Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

【速读】:该论文旨在解决生成式 AI(Generative AI)系统中链式思维(Chain-of-Thought, CoT)的可监控性(monitorability)问题,即如何确保大语言模型(LLM)在训练过程中不会隐藏其推理过程中的关键特征,从而影响自动化系统的监督效果。解决方案的关键在于提出一个基于强化学习(Reinforcement Learning, RL)的理论框架,将LLM后训练建模为RL环境,并将奖励函数分解为依赖最终输出和依赖CoT的两部分;通过预先分类这两类奖励项为“对齐”(aligned)、“正交”(orthogonal)或“冲突”(in-conflict),预测训练对CoT可监控性的影响:其中“冲突”奖励会降低可监控性,“对齐”奖励则提升可监控性,而“正交”奖励无显著影响。实验证明,使用“冲突”奖励进行训练确实削弱了CoT的可监控性,且优化此类奖励项具有难度,从而验证了该框架的有效性和预测能力。

链接: https://arxiv.org/abs/2603.30036
作者: Max Kaufmann,David Lindner,Roland S. Zimmermann,Rohin Shah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model’s CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as “aligned”, “orthogonal”, or “in-conflict” before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with “in-conflict” reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
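
论文中"对齐 / 正交 / 冲突"的奖励项分类是在训练前从概念上判定的;若想给这一三分法一个可计算的玩具对应,可以比较两个奖励项梯度方向的余弦相似度(以下纯属示意性启发式,并非论文的分类方法):

```python
import math

def classify_reward_terms(g_output, g_cot, tol=1e-6):
    """Toy heuristic: classify the CoT-dependent reward term relative to the
    output-dependent term by the cosine of their gradient directions."""
    dot = sum(a * b for a, b in zip(g_output, g_cot))
    norm = math.hypot(*g_output) * math.hypot(*g_cot)
    cos = dot / norm
    if cos > tol:
        return "aligned"
    if cos < -tol:
        return "in-conflict"
    return "orthogonal"
```

按论文的预测,"in-conflict"(余弦为负)的奖励组合会在训练中压低 CoT 可监控性,"aligned" 则提升之。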

[AI-2] Tucker Attention: A generalization of approximate attention mechanisms

【速读】:该论文旨在解决多头自注意力(Multi-Head Self-Attention, MHA)机制中内存占用过高的问题,尤其是在大规模语言模型(LLM)和视觉Transformer(ViT)中的应用。现有方法如分组查询注意力(Group-Query Attention, GQA)和多头潜在注意力(Multi-Head Latent Attention, MLA)通过特定的低秩分解策略降低参数量,但其理论基础不明确,难以解释其逼近对象与低秩行为的本质。论文提出了一种广义的权重对象建模视角和因子分解策略,构建出参数高效的Tucker注意力(Tucker Attention),其核心在于利用高阶张量分解框架对注意力权重进行结构化压缩,从而在保持相近验证指标的前提下显著减少参数量(数量级下降)。该方案不仅将GQA、MLA和MHA统一为特例,还兼容Flash Attention与旋转位置编码(RoPE),并揭示了MHA、GQA和MLA实际达到的有效秩,进一步推动了对现有方法的简化与理解。

链接: https://arxiv.org/abs/2603.30033
作者: Timon Klein,Jonas Kusch,Sebastian Sager,Stefan Schnake,Steffen Schotthöfer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The pursuit of reducing the memory footprint of the self-attention mechanism in multi-headed self-attention (MHA) has spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view on the weight objects in the self-attention layer and a factorization strategy, which allows us to construct a parameter-efficient scheme, called Tucker Attention. Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics, compared to GQA and MLA, as evaluated in LLM and ViT test cases. Additionally, Tucker Attention encompasses GQA, MLA, and MHA as special cases and is fully compatible with flash-attention and rotary position embeddings (RoPE). This generalization strategy yields insights into the actual ranks achieved by MHA, GQA, and MLA, and further enables simplifications for MLA.
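
"共享低秩因子 + 每头小核"带来的参数节省可以用如下数值草图直观感受(维度、秩与具体分解形式均为示意性假设,并非论文中 Tucker Attention 的精确构造):

```python
import numpy as np

d, r, h = 64, 8, 4            # model dim, Tucker rank, attention heads
rng = np.random.default_rng(0)

# full multi-head projections: one d x d matrix per head
full_params = h * d * d

# Tucker-style sharing: shared input/output factors plus small per-head cores
U = rng.normal(size=(d, r))           # shared input factor
V = rng.normal(size=(r, d))           # shared output factor
cores = rng.normal(size=(h, r, r))    # per-head cores
tucker_params = U.size + V.size + cores.size

# reconstruct one head's effective projection from the factored form
W_head0 = U @ cores[0] @ V            # d x d, rank <= r
```

可见每头投影被约束在秩 r 的子空间内,参数量从 h·d² 降到 (2d·r + h·r²) 量级;GQA/MLA 可视为对因子共享方式的特殊取法。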

[AI-3] The Triadic Cognitive Architecture: Bounding Autonomous Action via Spatio-Temporal and Epistemic Friction

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主智能体在交互环境中表现出的认知失重问题,即缺乏对网络拓扑结构、时间节奏和认知边界(epistemic limits)的内在感知,导致其在复杂场景中出现过度工具调用、冗长决策延迟以及证据模糊时行为脆弱等失败模式。解决方案的关键在于提出一种统一的三元认知架构(Triadic Cognitive Architecture, TCA),该架构将机器推理建模为连续时间物理过程,融合非线性滤波理论、黎曼路由几何与最优控制,形式化定义了“认知摩擦”(Cognitive Friction)概念,并通过HJB启发的停止边界和基于信念的价值信息滚动近似,实现以净效用为条件的停顿机制,从而在保证诊断准确性的前提下显著缩短响应时间并提升任务有效性。

链接: https://arxiv.org/abs/2603.30031
作者: Davide Di Gioia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Current autonomous AI agents, driven primarily by Large Language Models (LLMs), operate in a state of cognitive weightlessness: they process information without an intrinsic sense of network topology, temporal pacing, or epistemic limits. Consequently, heuristic agentic loops (e.g., ReAct) can exhibit failure modes in interactive environments, including excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence. In this paper, we propose the Triadic Cognitive Architecture (TCA), a unified mathematical framework that grounds machine reasoning in continuous-time physics. By synthesizing nonlinear filtering theory, Riemannian routing geometry, and optimal control, we formally define the concept of Cognitive Friction. We map the agent’s deliberation process to a coupled stochastic control problem where information acquisition is path-dependent and physically constrained. Rather than relying on arbitrary heuristic stop-tokens, the TCA uses an HJB-motivated stopping boundary and instantiates a rollout-based approximation of belief-dependent value-of-information with a net-utility halting condition. Through empirical validation in a simulated Emergency Medical Diagnostic Grid (EMDG), we demonstrate that while greedy baselines over-deliberate under latency and congestion costs, the triadic policy reduces time-to-action while improving patient viability without degrading diagnostic accuracy in this environment.
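
摘要中"基于信念的价值信息(value-of-information)滚动近似 + 净效用停止条件"可以用一个两假设诊断的玩具模型示意(测试精度与效用定义均为笔者假设,非论文的 HJB 构造):只有当再做一次检查带来的期望效用增益超过其成本时才继续推理,否则立即行动。

```python
def value_now(p: float) -> float:
    """Utility of acting immediately on belief p over two diagnoses."""
    return max(p, 1.0 - p)

def value_of_information(p: float, acc: float) -> float:
    """Expected utility gain from one more test with accuracy acc."""
    p_pos = acc * p + (1.0 - acc) * (1.0 - p)   # P(test indicates diagnosis A)
    post_pos = acc * p / p_pos                  # posterior after a positive test
    p_neg = 1.0 - p_pos
    post_neg = (1.0 - acc) * p / p_neg          # posterior after a negative test
    expected_after = p_pos * value_now(post_pos) + p_neg * value_now(post_neg)
    return expected_after - value_now(p)

def should_continue(p: float, acc: float, cost: float) -> bool:
    """Net-utility halting: deliberate further only if the VOI exceeds its cost."""
    return value_of_information(p, acc) > cost
```

在信念接近均匀时(p≈0.5)信息价值高、值得继续;信念已很确定时(p≈0.99)信息价值趋近于零,贪心式继续查证只会白付延迟成本——这正是摘要中"over-deliberate"失败模式的极简版本。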

[AI-4] Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models

【速读】:该论文旨在解决机器人在执行复杂操作任务时,如何有效融合高层任务规划与低层精确控制的问题,以提升其对人类指令的理解能力、环境适应性及整体执行效率。解决方案的关键在于提出一种混合框架,将强化学习(Reinforcement Learning, RL)用于实现高精度的低层运动控制,同时利用大语言模型(Large Language Models, LLMs)进行自然语言理解与高层任务规划,从而实现从语义指令到具体动作的端到端映射,显著增强了机器人在动态环境中完成复杂任务的能力。

链接: https://arxiv.org/abs/2603.30022
作者: Md Saad,Sajjad Hussain,Mohd Suhaib
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a new hybrid framework that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to improve robotic manipulation tasks. By utilizing RL for accurate low-level control and LLMs for high level task planning and understanding of natural language, the proposed framework effectively connects low-level execution with high-level reasoning in robotic systems. This integration allows robots to understand and carry out complex, human-like instructions while adapting to changing environments in real time. The framework is tested in a PyBullet-based simulation environment using the Franka Emika Panda robotic arm, with various manipulation scenarios as benchmarks. The results show a 33.5% decrease in task completion time and enhancements of 18.1% and 36.4% in accuracy and adaptability, respectively, when compared to systems that use only RL. These results underscore the potential of LLM-enhanced robotic systems for practical applications, making them more efficient, adaptable, and capable of interacting with humans. Future research will aim to explore sim-to-real transfer, scalability, and multi-robot systems to further broaden the framework’s applicability.

[AI-5] Architecting Secure AI Agents : Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks

【速读】:该论文旨在解决AI代理(AI agents)在由大语言模型(LLMs)驱动时面临的间接提示注入(indirect prompt injection)攻击问题,即恶意指令嵌入不可信数据中可触发代理执行危险行为。其解决方案的关键在于构建系统级防御机制:首先,通过动态重规划和安全策略更新应对动态任务与真实环境;其次,在严格限制模型可观测范围和决策权限的前提下,允许LLM参与部分依赖上下文的安全决策;最后,在本质模糊的情境中将个性化设置与人机交互作为核心设计要素。这些措施共同构成智能体系统的结构骨架,整合基于规则与模型的安全检查,从而提升整体鲁棒性并促进针对性研究。

链接: https://arxiv.org/abs/2603.30016
作者: Chong Xiang,Drew Zagieboylo,Shaona Ghosh,Sanjay Kariyappa,Kai Greshake,Hanshen Xiao,Chaowei Xiao,G. Edward Suh
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents, predominantly powered by large language models (LLMs), are vulnerable to indirect prompt injection, in which malicious instructions embedded in untrusted data can trigger dangerous agent actions. This position paper discusses our vision for system-level defenses against indirect prompt injection attacks. We articulate three positions: (1) dynamic replanning and security policy updates are often necessary for dynamic tasks and realistic environments; (2) certain context-dependent security decisions would still require LLMs (or other learned models), but should only be made within system designs that strictly constrain what the model can observe and decide; (3) in inherently ambiguous cases, personalization and human interaction should be treated as core design considerations. In addition to our main positions, we discuss limitations of existing benchmarks that can create a false sense of utility and security. We also highlight the value of system-level defenses, which serve as the skeleton of agentic systems by structuring and controlling agent behaviors, integrating rule-based and model-based security checks, and enabling more targeted research on model robustness and human interaction.

[AI-6] Scalable AI-assisted Workflow Management for Detector Design Optimization Using Distributed Computing

【速读】:该论文旨在解决现代探测器设计中高维参数空间探索的挑战,传统方法难以高效实现多目标优化。解决方案的关键在于构建一个基于生成式 AI (Generative AI) 的辅助框架,将多目标贝叶斯优化(Multi-Objective Bayesian Optimization)与 PanDA—iDDS 工作流引擎深度集成,从而在异构计算资源上协调迭代仿真任务,提升自动化水平、可扩展性和优化效率。该框架已在 ePIC 和 dRICH 探测器的设计研究中验证有效性,为 AI 驱动的探测器设计及其他计算密集型科学应用提供了灵活且可扩展的新范式。

链接: https://arxiv.org/abs/2603.30014
作者: Derek Anderson,Amit Bashyal,Markus Diefenthaler,Cristiano Fanelli,Wen Guan,Tanja Horn,Alex Jentsch,Meifeng Lin,Tadashi Maeno,Kei Nagai,Hemalata Nayak,Connor Pecar,Karthik Suresh,Fang-Ying Tsai,Anselm Vossen,Tianle Wang,Torre Wenaus
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Production and Distributed Analysis (PanDA) system, originally developed for the ATLAS experiment at the CERN Large Hadron Collider (LHC), has evolved into a robust platform for orchestrating large-scale workflows across distributed computing resources. Coupled with its intelligent Distributed Dispatch and Scheduling (iDDS) component, PanDA supports AI/ML-driven workflows through a scalable and flexible workflow engine. We present an AI-assisted framework for detector design optimization that integrates multi-objective Bayesian optimization with the PanDA–iDDS workflow engine to coordinate iterative simulations across heterogeneous resources. The framework addresses the challenge of exploring high-dimensional parameter spaces inherent in modern detector design. We demonstrate the framework using benchmark problems and realistic studies of the ePIC and dRICH detectors for the Electron-Ion Collider (EIC). Results show improved automation, scalability, and efficiency in multi-objective optimization. This work establishes a flexible and extensible paradigm for AI-driven detector design and other computationally intensive scientific applications.

[AI-7] Phyelds: A Pythonic Framework for Aggregate Computing

【速读】:该论文旨在解决当前聚合编程(aggregate programming)领域缺乏对数据科学从业者友好支持的问题,特别是针对以Python为主导的机器学习和数据科学生态系统的适配不足。现有聚合编程实现多基于Protelis、ScaFi或FCPP等语言,难以被广泛使用Python的科研人员和工程师采用。解决方案的关键在于提出Phyelds——一个专为Python设计的轻量级聚合编程库,其核心创新在于实现了场微积分(field calculus)计算模型的完整功能,并提供符合Python风格的API以及与主流机器学习工具链(如TensorFlow、PyTorch等)无缝集成的架构,从而在保持聚合编程抽象能力的同时,显著提升其在分布式学习、机器人协作及多智能体强化学习等场景中的可用性和灵活性。

链接: https://arxiv.org/abs/2603.29999
作者: Gianluca Aguzzi,Davide Domini,Nicolas Farabegoli,Mirko Viroli
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Aggregate programming is a field-based coordination paradigm with over a decade of exploration and successful applications across domains including sensor networks, robotics, and IoT, with implementations in various programming languages, such as Protelis, ScaFi (Scala), and FCPP (C++). A recent research direction integrates machine learning with aggregate computing, aiming to support large-scale distributed learning and provide new abstractions for implementing learning algorithms. However, existing implementations do not target data science practitioners, who predominantly work in Python–the de facto language for data science and machine learning, with a rich and mature ecosystem. Python also offers advantages for other use cases, such as education and robotics (e.g., via ROS). To address this gap, we present Phyelds, a Python library for aggregate programming. Phyelds offers a fully featured yet lightweight implementation of the field calculus model of computation, featuring a Pythonic API and an architecture designed for seamless integration with Python’s machine learning ecosystem. We describe the design and implementation of Phyelds and illustrate its versatility across domains, from well-known aggregate computing patterns to federated learning coordination and integration with a widely used multi-agent reinforcement learning simulator.
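
聚合计算中"场(field)"的基本味道,可以用一个极简的同步 gossip 最小值传播来示意(纯 Python 玩具,与 Phyelds 的实际 API 无关):每一轮每个设备取自身与邻居值的最小值,最小值场逐渐在网络中铺开。

```python
def gossip_min(values, neighbors, rounds):
    """Synchronous rounds in which each device takes the min over its neighborhood."""
    state = list(values)
    for _ in range(rounds):
        state = [min([state[i]] + [state[j] for j in neighbors[i]])
                 for i in range(len(state))]
    return state

# four devices on a line: 0 - 1 - 2 - 3
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
field = gossip_min([4, 7, 1, 9], line, rounds=3)
```

场演算(field calculus)把这类"对邻居值的逐点聚合"提升为一等抽象,Phyelds 即在 Python 中提供这种抽象。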

[AI-8] Extending MONA in Camera Dropbox: Reproduction Learned Approval and Design Implications for Reward-Hacking Mitigation

【速读】:该论文旨在解决多步奖励劫持(multi-step reward hacking)问题,即强化学习代理在长期规划中可能通过非预期方式最大化奖励信号,从而违背设计意图。其解决方案的关键在于提出一种“短视优化与远视批准相结合”的框架(Myopic Optimization with Non-myopic Approval, MONA),通过限制代理的决策规划范围(myopic optimization),同时引入远视批准信号(non-myopic approval)作为训练监督信号,从而在不牺牲安全性的情况下引导代理行为。论文进一步验证了批准机制的设计——尤其是批准对实际结果的依赖程度——是决定MONA安全保证是否成立的核心因素,并通过可复现的实验环境和模块化批准机制(包括oracle、噪声、误设、学习型及校准型批准)表明:当前最优的校准学习型批准模型虽能实现零奖励劫持,但因过度保守导致目标行为达成率显著低于理想情况(11.9% vs. 99.9%),凸显出构建既能保留足够远视能力又不引入新劫持漏洞的学习型批准模型是未来工程核心挑战。

链接: https://arxiv.org/abs/2603.29993
作者: Nathan Heath
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal (Farquhar et al., 2025). The original paper identifies a critical open question: how the method of constructing approval – particularly the degree to which approval depends on achieved outcomes – affects whether MONA's safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i) repackages the released codebase as a standard Python project with scripted PPO training, (ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii) introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced-budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned-overseer run achieves zero observed reward hacking but substantially lower intended-behavior rates than oracle MONA (11.9% vs. 99.9%), consistent with under-optimization rather than re-emergent hacking. These results operationalize the MONA paper's approval-spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA's concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. Code, configurations, and reproduction commands are publicly available. this https URL

[AI-9] Quantifying Cross-Modal Interactions in Multimodal Glioma Survival Prediction via InterSHAP: Evidence for Additive Signal Integration

【速读】:该论文旨在解决多模态深度学习在癌症预后预测中是否真正受益于跨模态协同作用这一关键问题,尤其在生存预测场景下,现有研究普遍假设融合不同模态(如全切片图像WSI与RNA-seq数据)能通过交互增强性能,但缺乏直接验证。解决方案的关键在于将InterSHAP(基于Shapley交互指数的度量方法)从分类任务扩展至Cox比例风险模型,并用于量化 glioma 生存预测中WSI与RNA-seq之间的跨模态交互强度。研究发现,预测性能提升(C-index从0.64提升至0.82)与交叉模态交互强度呈负相关,且方差分解显示所有架构均以加性贡献为主(WSI约40%,RNA约55%,交互项仅约4%),表明性能增益源于互补信号的聚合而非复杂交互机制的学习。这一结果为多模态融合策略的评估提供了可解释的审计工具,并挑战了传统对架构复杂性的依赖。

链接: https://arxiv.org/abs/2603.29977
作者: Iain Swift,JingHua Ye,Ruairi O’Reilly
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 8 pages, 1 figure, under review at XAI 2026 LBW

点击查看摘要

Abstract:Multimodal deep learning for cancer prognosis is commonly assumed to benefit from synergistic cross-modal interactions, yet this assumption has not been directly tested in survival prediction settings. This work adapts InterSHAP, a Shapley interaction index-based metric, from classification to Cox proportional hazards models and applies it to quantify cross-modal interactions in glioma survival prediction. Using TCGA-GBM and TCGA-LGG data (n=575), we evaluate four fusion architectures combining whole-slide image (WSI) and RNA-seq features. Our central finding is an inverse relationship between predictive performance and measured interaction: architectures achieving superior discrimination (C-index 0.64 → 0.82) exhibit equivalent or lower cross-modal interaction (4.8% → 3.0%). Variance decomposition reveals stable additive contributions across all architectures (WSI ≈ 40%, RNA ≈ 55%, Interaction ≈ 4%), indicating that performance gains arise from complementary signal aggregation rather than learned synergy. These findings provide a practical model auditing tool for comparing fusion strategies, reframe the role of architectural complexity in multimodal fusion, and have implications for privacy-preserving federated deployment.
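
对于只有两个模态的情形,Shapley 交互指数退化为一个简单的差分公式;下面的草图演示摘要中"加性主导、交互仅约 4%"的含义(数值为配合文中比例而设的假想值,并非论文数据):

```python
def pairwise_interaction(v_empty: float, v_wsi: float,
                         v_rna: float, v_both: float) -> float:
    """Shapley interaction index for two modalities:
    v(both) - v(WSI) - v(RNA) + v(empty)."""
    return v_both - v_wsi - v_rna + v_empty

# purely additive fusion: joint value equals the sum of unimodal contributions
additive = pairwise_interaction(0.0, 0.40, 0.55, 0.95)
# mild synergy: joint value slightly exceeds the sum
synergy = pairwise_interaction(0.0, 0.40, 0.55, 0.99)
```

论文测得的交互项始终很小,说明融合收益主要来自两路互补信号的相加,而非习得的协同。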

[AI-10] Rethinking AI Literacy Education in Higher Education: Bridging Risk Perception and Responsible Adoption

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)快速发展背景下,未来AI从业者(尤其是技术类学生)对AI风险的认知与实际采纳意愿之间存在的不匹配问题,以推动负责任的AI开发与应用。其解决方案的关键在于识别出四类核心现象:一是学生更关注明确表述的风险而非情境嵌入的抽象风险;二是感知风险与采纳意愿呈显著负相关;三是技术教育虽缩小了性别在风险意识上的差异,但男性仍表现出更高的采纳意愿;四是存在“风险低估”现象,即AI相关专业学生虽明确意识到风险,但在具体场景中对风险识别不足且采纳意愿更高。基于此,研究提出应制定差异化的人工智能素养策略,弥合认知与实践之间的鸿沟,从而培养具备伦理意识和社会责任感的AI专业人才。

链接: https://arxiv.org/abs/2603.29935
作者: Shasha Yu,Fiona Carroll,Barry L. Bentley
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI becomes increasingly embedded across societal domains, understanding how future AI practitioners, particularly technology students, perceive its risks is essential for responsible development and adoption. This study analyzed responses from 139 students in Computer Science, Data Science/Data Analytics, and other disciplines using both explicit AI risk ratings and scenario-based assessments of risk and adoption willingness. Four key findings emerged: (1) Students expressed substantially higher concern for concrete, explicitly stated risks than for abstract or scenario-embedded risks; (2) Perceived risk and willingness to adopt AI demonstrated a clear inverse relationship; (3) Although technical education narrowed gender differences in risk awareness, male students reported higher adoption willingness; and (4) A form of “risk underappreciation” was observed, wherein students in AI-related specializations showed both elevated explicit risk awareness and higher willingness to adopt AI, despite lower recognition of risks in applied scenarios. These findings underscore the need for differentiated AI literacy strategies that bridge the gap between awareness and responsible adoption and offer valuable insights for educators, policymakers, industry leaders, and academic institutions aiming to cultivate ethically informed and socially responsible AI practitioners.

[AI-11] ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

【Quick Read】: This paper addresses the problem that current regression benchmarks evaluate models almost exclusively via point-estimate metrics (e.g., RMSE, R²), which fail to reflect performance in the tails of the distribution, even though tail behavior is critical for high-stakes decision making in domains such as finance and clinical research. The key to the solution is ScoringBench, an open benchmark that systematically computes a suite of strictly proper probabilistic scoring rules (e.g., CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score) alongside traditional point metrics, providing a more complete characterization of probabilistic forecast quality. Experiments show that different scoring rules lead to markedly different model rankings and that no pretraining objective is universally optimal, underscoring that the choice of evaluation metric must match the target application.

Link: https://arxiv.org/abs/2603.29928
Authors: Jonas Landsgesell, Pascal Knoll
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R2). These aggregate measures often obscure model performance in the tails of the distribution, a critical deficit for high-stakes decision making in domains like finance and clinical research where asymmetric risk profiles are the norm. We introduce ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules, like CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score, alongside standard point metrics, providing a richer picture of probabilistic forecast quality. We evaluate realTabPFNv2.5 fine-tuned with different scoring-rule objectives and TabICL relative to untuned realTabPFNv2.5 across a suite of regression benchmarks. Our results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal. This demonstrates that for applications sensitive to extreme events, the choice of evaluation metric is as much a domain-specific requirement as the data itself. ScoringBench is available at this https URL. A live preview of the current leaderboard is available at this https URL. The leaderboard is maintained via git pull requests to ensure transparency, traceability, agility, and reproducibility.
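As a worked illustration of one scoring rule in the suite, the CRPS of an empirical ensemble forecast has a simple closed form via the standard energy-form identity. This is a generic sketch of the textbook formula, not code from the benchmark:

```python
def crps_ensemble(samples, y):
    """Continuous Ranked Probability Score of an empirical ensemble
    forecast `samples` against a scalar observation `y` (lower is better).

    Uses the energy-form identity CRPS = E|X - y| - 0.5 * E|X - X'|,
    where X, X' are independent draws from the predictive distribution.
    """
    m = len(samples)
    term1 = sum(abs(x - y) for x in samples) / m
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2

# A sharp, well-centred forecast scores better than a diffuse one:
sharp = [0.9, 1.0, 1.1, 1.0]
diffuse = [-2.0, 0.0, 2.0, 4.0]
assert crps_ensemble(sharp, 1.0) < crps_ensemble(diffuse, 1.0)
```

Unlike RMSE on the ensemble mean, this score rewards both calibration and sharpness of the whole predictive distribution, which is why rankings under it can differ from point-metric rankings.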

[AI-12] Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence

【Quick Read】: This paper addresses the high computational cost and uncertain reliability of post-hoc explanation methods. The key to the solution is using epistemic uncertainty as a low-cost proxy for explanation reliability: high epistemic uncertainty identifies regions where the decision boundary is poorly defined, in which explanations become unstable and unfaithful to the model's predictions. This insight supports two complementary use cases: "improving worst-case explanations" (dynamically routing samples to cheap or expensive XAI methods based on expected explanation reliability) and "recalling high-quality explanations" (deferring explanation generation for highly uncertain samples under a constrained budget). Experiments show a strong negative correlation between epistemic uncertainty and explanation stability, that epistemic uncertainty distinguishes both stable from unstable and faithful from unfaithful explanations, and that these findings generalize to image classification.

Link: https://arxiv.org/abs/2603.29915
Authors: Georgii Mikriukov, Grégoire Montavon, Marina M.-C. Höhne
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Post-hoc explanation methods are widely used to interpret black-box predictions, but their generation is often computationally expensive and their reliability is not guaranteed. We propose epistemic uncertainty as a low-cost proxy for explanation reliability: high epistemic uncertainty identifies regions where the decision boundary is poorly defined and where explanations become unstable and unfaithful. This insight enables two complementary use cases: 'improving worst-case explanations' (routing samples to cheap or expensive XAI methods based on expected explanation reliability), and 'recalling high-quality explanations' (deferring explanation generation for uncertain samples under constrained budget). Across four tabular datasets, five diverse architectures, and four XAI methods, we observe a strong negative correlation between epistemic uncertainty and explanation stability. Further analysis shows that epistemic uncertainty distinguishes not only stable from unstable explanations, but also faithful from unfaithful ones. Experiments on image classification confirm that our findings generalize beyond tabular data.
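The routing use case can be sketched as a simple gate on an epistemic-uncertainty estimate. The sketch below uses ensemble disagreement (mutual-information style) as that estimate; the function names and the threshold value are illustrative assumptions, not the paper's implementation:

```python
import math

def epistemic_uncertainty(prob_vectors):
    """Disagreement of an ensemble of class-probability vectors:
    entropy of the mean prediction minus the mean entropy of the members.
    High values flag regions where the decision boundary is poorly defined."""
    def entropy(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    n = len(prob_vectors)
    mean = [sum(p[k] for p in prob_vectors) / n for k in range(len(prob_vectors[0]))]
    return entropy(mean) - sum(entropy(p) for p in prob_vectors) / n

def route_explainer(prob_vectors, threshold=0.1):
    """Gate: cheap XAI when the model is epistemically confident,
    a more expensive (more reliable) XAI method otherwise."""
    return "expensive_xai" if epistemic_uncertainty(prob_vectors) > threshold else "cheap_xai"

agree = [[0.9, 0.1], [0.88, 0.12], [0.92, 0.08]]   # members agree -> low uncertainty
disagree = [[0.95, 0.05], [0.5, 0.5], [0.1, 0.9]]  # members disagree -> high uncertainty
assert route_explainer(agree) == "cheap_xai"
assert route_explainer(disagree) == "expensive_xai"
```

The same uncertainty score also supports the second use case: under a fixed budget, defer explanation generation for the samples with the highest scores.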

[AI-13] SISA: A Scale-In Systolic Array for GEMM Acceleration

【Quick Read】: This paper addresses the low hardware utilization that large language models (LLMs) cause on traditional square systolic array (SA) accelerators when executing general matrix-matrix multiplication (GEMM), due to input-dependent and highly skewed matrix shapes. The key to the solution is SISA (Scale-In Systolic Array), a novel SA architecture that partitions the conventional square array into horizontal rectangular slabs, enabling independently scheduled parallel processing of small or skewed matrices while retaining full-array operation for large GEMMs, thereby substantially improving resource utilization. Experiments show that, compared with a monolithic SA with the same number of processing elements, SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction on representative LLMs.

Link: https://arxiv.org/abs/2603.29913
Authors: Luigi Altamura, Alessio Cicero, Mateo Vázquez Maceiras, Mohammad Ali Maleki, Pedro Trancoso
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.

[AI-14] C-TRAIL: A Commonsense World Framework for Trajectory Planning in Autonomous Driving

【Quick Read】: This paper addresses the unreliability of large language model (LLM) outputs in trajectory planning for autonomous driving, which poses serious risks in safety-critical scenarios. The key to the solution is C-TRAIL, a framework built on a "Commonsense World" that quantifies the reliability of LLM-derived commonsense semantic relations via a dual-trust mechanism and injects them as trust weights into Monte Carlo Tree Search (MCTS) through a Dirichlet trust policy, enabling trust-aware decision optimization; the system also includes a closed-loop update module that adaptively refines trust scores and policy parameters from environmental feedback, forming a continuously improving control loop.

Link: https://arxiv.org/abs/2603.29908
Authors: Zhihong Cui, Haoran Tang, Tianyi Li, Yushuai Li, Peiyuan Guan, Amir Taherkordi, Tor Skeie
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Trajectory planning for autonomous driving increasingly leverages large language models (LLMs) for commonsense reasoning, yet LLM outputs are inherently unreliable, posing risks in safety-critical applications. We propose C-TRAIL, a framework built on a Commonsense World that couples LLM-derived commonsense with a trust mechanism to guide trajectory planning. C-TRAIL operates through a closed-loop Recall, Plan, and Update cycle: the Recall module queries an LLM for semantic relations and quantifies their reliability via a dual-trust mechanism; the Plan module injects trust-weighted commonsense into Monte Carlo Tree Search (MCTS) through a Dirichlet trust policy; and the Update module adaptively refines trust scores and policy parameters from environmental feedback. Experiments on four simulated scenarios in Highway-env and two real-world levelXData datasets (highD, rounD) show that C-TRAIL consistently outperforms state-of-the-art baselines, reducing ADE by 40.2%, FDE by 51.7%, and improving SR by 16.9 percentage points on average. The source code is available at this https URL.

[AI-15] ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

【Quick Read】: This paper addresses two major challenges for multimodal large language models (MLLMs) in interleaved text-and-image generation: existing methods typically treat image generation and retrieval augmentation as mutually exclusive paths, making it hard to guarantee factuality and creativity simultaneously, and there is no systematic way to evaluate models' autonomous decision-making. The key to the solution is the "Agentic Tool Planning" paradigm, in which the model acts as a central controller that autonomously decides when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To this end, the authors build the ATP-Bench benchmark and a Multi-Agent MLLM-as-a-Judge (MAM) system: the former provides high-quality, human-verified QA pairs for systematically evaluating interleaved generation, while the latter, independent of end-to-end execution and changing tool backends, assesses tool-call precision, missed opportunities for tool use, and overall response quality, offering quantifiable and interpretable directions for advancing interleaved generation.

Link: https://arxiv.org/abs/2603.29902
Authors: Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To systematically evaluate this paradigm, we introduce ATP-Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, featuring human-verified queries and ground truths. Furthermore, to evaluate agentic planning independent of end-to-end execution and changing tool backends, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system. MAM evaluates tool-call precision, identifies missed opportunities for tool use, and assesses overall response quality without requiring ground-truth references. Our extensive experiments on 10 state-of-the-art MLLMs reveal that models struggle with coherent interleaved planning and exhibit significant variations in tool-use behavior, highlighting substantial room for improvement and providing actionable guidance for advancing interleaved generation. Dataset and code are available at this https URL.

[AI-16] A Rational Account of Categorization Based on Information Theory

【Quick Read】: This paper addresses how humans perform categorization, i.e., how individuals form category judgments from limited information in complex environments. The key to the solution is a theoretical framework based on an information-theoretic rational analysis, which explains human behavior in classic categorization experiments through the principle of optimal information use and matches or exceeds the fit of leading existing models (the independent cue and context models, the rational model of categorization, and a hierarchical Dirichlet process model) on several key experimental datasets.

Link: https://arxiv.org/abs/2603.29895
Authors: Christophe J. MacLellan, Karthik Singaravadivelan, Xin Lian, Zekun Wang, Pat Langley
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, 2 tables

Click to view abstract

Abstract:We present a new theory of categorization based on an information-theoretic rational analysis. To evaluate this theory, we investigate how well it can account for key findings from classic categorization experiments conducted by Hayes-Roth and Hayes-Roth (1977), Medin and Schaffer (1978), and Smith and Minda (1998). We find that it explains human categorization behavior at least as well as (or better than) the independent cue and context models (Medin & Schaffer, 1978), the rational model of categorization (Anderson, 1991), and a hierarchical Dirichlet process model (Griffiths et al., 2007).

[AI-17] ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

【Quick Read】: This paper addresses the noisy training signal that arises in existing reinforcement learning post-training paradigms (such as Group Relative Policy Optimization, GRPO) when large language models (LLMs) generate candidate sets in user-agent interaction scenarios: assigning the same set-level scalar reward to every candidate lets poor candidates free-ride on the high reward earned by a few strong peers, leading to suboptimal exploration. The key to the solution, Shapley-Enhanced GRPO (ShapE-GRPO), is to exploit the permutation invariance of the set-level utility and apply a Shapley-value decomposition from cooperative game theory to convert set-level rewards into finer-grained, candidate-specific reward signals, preserving the fundamental axioms of the Shapley value while remaining computable in polynomial time, thereby enabling more precise policy optimization and faster convergence.

Link: https://arxiv.org/abs/2603.29871
Authors: Rui Ai, Yu Pan, David Simchi-Levi, Chonghuan Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
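The reward-decomposition idea can be illustrated with an exact Shapley computation on a toy candidate set. This brute-force sketch enumerates subsets and is exponential in the number of candidates; the paper derives a polynomial-time form by exploiting permutation invariance, which is not reproduced here:

```python
from itertools import combinations
from math import factorial

def shapley_rewards(candidates, set_utility):
    """Decompose a set-level reward into per-candidate credits via exact
    Shapley values. `set_utility` maps a frozenset of candidates to a scalar.
    Illustrative only: exponential in |candidates|."""
    n = len(candidates)
    phi = {}
    for i in candidates:
        others = [c for c in candidates if c != i]
        value = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                value += w * (set_utility(frozenset(S) | {i}) - set_utility(frozenset(S)))
        phi[i] = value
    return phi

# Toy utility: the set is worth its best single candidate, so a weak
# candidate free-riding on a strong peer earns little credit.
quality = {"strong": 1.0, "weak": 0.2}
v = lambda S: max((quality[c] for c in S), default=0.0)
phi = shapley_rewards(list(quality), v)
assert abs(sum(phi.values()) - v(frozenset(quality))) < 1e-9  # efficiency axiom
assert phi["strong"] > phi["weak"]
```

Here the strong candidate receives 0.9 of the set reward and the weak one only 0.1, whereas plain GRPO would hand both the full set-level reward of 1.0.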

[AI-18] Spatiotemporal Robustness of Temporal Logic Tasks using Multi-Objective Reasoning

【Quick Read】: This paper addresses the key problem of keeping autonomous systems reliable under uncertainty, in particular modeling and computing the spatiotemporal robustness (STR) of temporal logic specifications over discrete-time signals. Existing robustness measures consider only spatial perturbations and ignore the temporal dimension, a limitation for interacting systems such as multi-agent robotics, smart cities, and air traffic control. The key to the solution is to define STR as a multi-objective optimization problem, formalized via a partial order over spatial and temporal perturbations, yielding a Pareto-optimal set of all admissible spatiotemporal perturbations that can be computed efficiently with tools from multi-objective optimization. A core innovation is a semantics that soundly under-approximates STR while remaining computationally tractable, on which corresponding monitoring algorithms are built, providing the first multi-objective reasoning framework for robustness analysis across multiple dimensions.

Link: https://arxiv.org/abs/2603.29868
Authors: Oliver Schön, Lars Lindemann
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 27 pages, 6 figures

Click to view abstract

Abstract:The reliability of autonomous systems depends on their robustness, i.e., their ability to meet their objectives under uncertainty. In this paper, we study spatiotemporal robustness of temporal logic specifications evaluated over discrete-time signals. Existing work has proposed robust semantics that capture not only Boolean satisfiability, but also the geometric distance from unsatisfiability, corresponding to admissible spatial perturbations of a given signal. In contrast, we propose spatiotemporal robustness (STR), which captures admissible spatial and temporal perturbations jointly. This notion is particularly informative for interacting systems, such as multi-agent robotics, smart cities, and air traffic control. We define STR as a multi-objective reasoning problem, formalized via a partial order over spatial and temporal perturbations. This perspective has two key advantages: (1) STR can be interpreted as a Pareto-optimal set that characterizes all admissible spatiotemporal perturbations, and (2) STR can be computed using tools from multi-objective optimization. To navigate computational challenges, we propose robust semantics for STR that are sound in the sense of suitably under-approximating STR while being computationally tractable. Finally, we present monitoring algorithms for STR using these robust semantics. To the best of our knowledge, this is the first work to deal with robustness across multiple dimensions via multi-objective reasoning.
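The Pareto-optimal characterization can be sketched as computing the non-dominated set of (spatial, temporal) perturbation pairs under a componentwise partial order. The dominance direction below (larger admissible perturbation = more robust) and the sample values are illustrative assumptions, not the paper's formalization:

```python
def pareto_front(points):
    """Non-dominated (spatial, temporal) perturbation pairs: p is dominated
    if some q is at least as large in both coordinates and strictly larger
    in one. Brute-force sketch of the multi-objective STR idea."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))

# Hypothetical admissible (spatial, temporal) perturbation magnitudes:
admissible = [(0.5, 2), (0.3, 3), (0.5, 1), (0.2, 2), (0.1, 4)]
assert pareto_front(admissible) == [(0.1, 4), (0.3, 3), (0.5, 2)]
```

The resulting front captures the trade-off the paper emphasizes: a signal may tolerate large spatial deviations only at small timing shifts, and vice versa, so no single scalar summarizes robustness.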

[AI-19] Wildfire Suppression: Complexity Models and Instances

【Quick Read】: This paper addresses the optimal allocation of suppression resources (e.g., firefighters and equipment) over time and space during wildfire spread, aiming to slow fire propagation on a graph-based representation of a landscape. The key to the solution is a new mixed-integer programming (MIP) formulation: contrary to earlier findings that MIP is uncompetitive, the paper shows it achieves state-of-the-art results. To remedy the lack of realism and difficulty in existing benchmarks, the authors also build a physics-grounded instance generator based on Rothermel's surface fire spread model, enabling systematic evaluation of algorithms on realistic, challenging scenarios and identifying the conditions under which each succeeds or fails.

Link: https://arxiv.org/abs/2603.29865
Authors: Gustavo Delazeri, Marcus Ritt
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Wildfires cause major losses worldwide, and the frequency of fire-weather conditions is likely to increase in many regions. We study the allocation of suppression resources over time on a graph-based representation of a landscape to slow down fire propagation. Our contributions are theoretical and methodological. First, we prove that this problem and related variants in the literature are NP-complete, including cases without resource-timing constraints. Second, we propose a new mixed-integer programming (MIP) formulation that obtains state-of-the-art results, showing that MIP is a competitive approach contrary to earlier findings. Third, showing that existing benchmarks lack realism and difficulty, we introduce a physics-grounded instance generator based on Rothermel’s surface fire spread model. We use these diverse instances to benchmark the literature, identifying the specific conditions where each algorithm succeeds or fails.

[AI-20] From Density Matrices to Phase Transitions in Deep Learning: Spectral Early Warnings and Interpretability

【Quick Read】: This paper addresses a key problem in modern AI research: predicting and understanding the mechanisms behind emergent capabilities during model training. Inspired by methods for studying reactions in quantum chemistry, the authors propose a new tool, the "2-datapoint reduced density matrix" (2RDM). The key to the solution is that the 2RDM provides a computationally efficient, unified observable that captures phase transitions during training; by tracking the eigenvalue statistics of the 2RDM over a sliding window, two complementary signals are extracted: the spectral heat capacity, which gives early warning of second-order phase transitions via critical slowing down, and the participation ratio, which reveals the dimensionality of the underlying reorganization. Moreover, the top eigenvectors of the 2RDM are directly interpretable, making the nature of the transitions straightforward to study. The method is validated in four distinct settings: deep linear networks, induction head formation, grokking, and emergent misalignment.

Link: https://arxiv.org/abs/2603.29805
Authors: Max Hennick, Guillaume Corlouer
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A key problem in the modern study of AI is predicting and understanding emergent capabilities in models during training. Inspired by methods for studying reactions in quantum chemistry, we present the ``2-datapoint reduced density matrix" (2RDM). We show that this object provides a computationally efficient, unified observable of phase transitions during training. By tracking the eigenvalue statistics of the 2RDM over a sliding window, we derive two complementary signals: the spectral heat capacity, which we prove provides early warning of second-order phase transitions via critical slowing down, and the participation ratio, which reveals the dimensionality of the underlying reorganization. Remarkably, the top eigenvectors of the 2RDM are directly interpretable, making it straightforward to study the nature of the transitions. We validate across four distinct settings: deep linear networks, induction head formation, grokking, and emergent misalignment. We then discuss directions for future work using the 2RDM.
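The participation ratio mentioned above is a standard spectral quantity, PR = (Σλᵢ)² / Σλᵢ², ranging from 1 (one dominant mode) to n (uniform spectrum). A generic sketch, not the paper's code:

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality of a spectrum: (sum of eigenvalues)^2
    divided by the sum of squared eigenvalues. Equals 1 for a rank-1
    spectrum and n for a uniform spectrum over n modes."""
    s1 = sum(eigenvalues)
    s2 = sum(v * v for v in eigenvalues)
    return s1 * s1 / s2

assert participation_ratio([1.0, 0.0, 0.0, 0.0]) == 1.0   # rank-1: PR = 1
assert participation_ratio([0.25] * 4) == 4.0             # uniform: PR = n
```

Tracked over a sliding training window, a jump in this quantity signals that the reorganization involves many modes rather than a single dominant one.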

[AI-21] Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers

【Quick Read】: This paper addresses how to train a chess engine with the style of a human player, i.e., improving model performance without sacrificing human behavioral characteristics (decision biases, consistency, and diversity); traditional approaches maximize playing strength and ignore the consistent but non-optimal move patterns of human players. The core of the solution is identifying and mitigating the "dual-capability bottleneck" between state tracking (T) and decision quality (Q): the former relies on low-rated games to learn full game histories, while the latter needs high-rated games for a high-quality move signal. The authors scale the model from 28M to 120M parameters to strengthen state tracking and introduce Elo-weighted training to improve decision quality while preserving the diversity of histories; experiments show the two interventions are superadditive. Without search, the final model reaches 2570 Elo in Lichess bullet and achieves 55.2% Top-1 accuracy on human move prediction, clearly surpassing the Maia models.

Link: https://arxiv.org/abs/2603.29761
Authors: Quanhao Li, Wei Jiang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A human-like chess engine should mimic the style, errors, and consistency of a strong human player rather than maximize playing strength. We show that training from move sequences alone forces a model to learn two capabilities: state tracking, which reconstructs the board from move history, and decision quality, which selects good moves from that reconstructed state. These impose contradictory data requirements: low-rated games provide the diversity needed for tracking, while high-rated games provide the quality signal for decision learning. Removing low-rated data degrades performance. We formalize this tension as a dual-capability bottleneck, P = min(T,Q), where overall performance is limited by the weaker capability. Guided by this view, we scale the model from 28M to 120M parameters to improve tracking, then introduce Elo-weighted training to improve decisions while preserving diversity. A 2 x 2 factorial ablation shows that scaling improves tracking, weighting improves decisions, and their combination is superadditive. Linear weighting works best, while overly aggressive weighting harms tracking despite lower validation loss. We also introduce a coverage-decay formula, t* = log(N/kcrit)/log b, as a reliability horizon for intra-game degeneration risk. Our final 120M-parameter model, without search, reached Lichess bullet 2570 over 253 rated games. On human move prediction it achieves 55.2% Top-1 accuracy, exceeding Maia-2 rapid and Maia-2 blitz. Unlike position-based methods, sequence input naturally encodes full game history, enabling history-dependent decisions that single-position models cannot exhibit.
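The two formulas in the abstract are easy to compute directly. Our reading of the symbols (N over a critical coverage threshold k_crit, branching factor b) follows the abstract; the numeric values below are made up for illustration:

```python
import math

def performance(tracking, decision_quality):
    """Dual-capability bottleneck: overall performance P = min(T, Q),
    capped by the weaker of state tracking and decision quality."""
    return min(tracking, decision_quality)

def reliability_horizon(n, k_crit, branching):
    """Coverage-decay horizon t* = log(N / k_crit) / log(b): the depth at
    which coverage, decaying geometrically with branching factor b from N,
    falls to the critical threshold k_crit."""
    return math.log(n / k_crit) / math.log(branching)

assert performance(0.9, 0.6) == 0.6  # the weaker capability caps performance
# With N = 1e6, k_crit = 100, and branching b = 10 (hypothetical values):
assert abs(reliability_horizon(1e6, 100, 10) - 4.0) < 1e-9
```

The min() structure explains the paper's ablation: improving only T or only Q moves P little until the other bottleneck is also addressed, which is why the two interventions combine superadditively.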

[AI-22] CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing AAAI

【Quick Read】: This paper addresses the demand in modern manufacturing for real-time, trustworthy, and explainable root-cause analysis (RCA), where traditional analytics pipelines treat anomaly detection, causal inference, and RCA as isolated stages, limiting scalability and explainability. The key to the solution is CausalPulse, an industry-grade multi-agent copilot whose core is a neurosymbolic architecture built on standardized agentic protocols that unifies anomaly detection, causal discovery, and reasoning. This design yields a modular, extensible, human-in-the-loop automated diagnostic system; deployment in a Robert Bosch plant validates its real-time performance at production scale (50-60s end-to-end latency) and high reliability (98.73% overall success rate), clearly outperforming existing industrial copilots.

Link: https://arxiv.org/abs/2603.29755
Authors: Chathurangi Shyalika, Utkarshani Jaimini, Cory Henson, Amit Sheth
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures, 4 tables, Accepted at AAAI-MAKE 2026 (AAAI Spring Symposium on Machine Learning and Knowledge Engineering for Knowledge-Grounded Semantic Agents)

Click to view abstract

Abstract:Modern manufacturing environments demand real-time, trustworthy, and interpretable root-cause insights to sustain productivity and quality. Traditional analytics pipelines often treat anomaly detection, causal inference, and root-cause analysis as isolated stages, limiting scalability and explainability. In this work, we present CausalPulse, an industry-grade multi-agent copilot that automates causal diagnostics in smart manufacturing. It unifies anomaly detection, causal discovery, and reasoning through a neurosymbolic architecture built on standardized agentic protocols. CausalPulse is being deployed in a Robert Bosch manufacturing plant, integrating seamlessly with existing monitoring workflows and supporting real-time operation at production scale. Evaluations on both public (Future Factories) and proprietary (Planar Sensor Element) datasets show high reliability, achieving overall success rates of 98.0% and 98.73%. Per-criterion success rates reached 98.75% for planning and tool use, 97.3% for self-reflection, and 99.2% for collaboration. Runtime experiments report end-to-end latency of 50-60s per diagnostic workflow with near-linear scalability (R^2=0.97), confirming real-time readiness. Comparison with existing industrial copilots highlights distinct advantages in modularity, extensibility, and deployment maturity. These results demonstrate how CausalPulse’s modular, human-in-the-loop design enables reliable, interpretable, and production-ready automation for next-generation manufacturing.

[AI-23] Spontaneous Functional Differentiation in Large Language Models: A Brain-Like Intelligence Economy

【Quick Read】: This paper addresses how to identify computational principles shared across artificial systems, in particular whether the internal information processing of large language models (LLMs) resembles that of the human brain. The key to the solution is using Integrated Information Decomposition to show, across multiple model architectures, that middle layers exhibit pronounced synergistic processing while early and late layers rely on redundancy; this organization is dynamic, showing a phase-transition signature as task difficulty increases, and ablation experiments confirm that the synergistic components are necessary for abstract reasoning, establishing them as a physical entity bridging artificial and biological intelligence.

Link: https://arxiv.org/abs/2603.29735
Authors: Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The evolution of intelligence in artificial systems provides a unique opportunity to identify universal computational principles. Here we show that large language models spontaneously develop synergistic cores, where information integration exceeds that of the individual parts, remarkably similar to the human brain. Using Integrated Information Decomposition across multiple architectures, we find that middle layers exhibit synergistic processing while early and late layers rely on redundancy. This organization is dynamic and emerges as a physical phase transition as task difficulty increases. Crucially, ablating synergistic components causes catastrophic performance loss, confirming their role as the physical entity of abstract reasoning and bridging artificial and biological intelligence.
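The notion of synergy, joint information exceeding what the parts carry individually, is classically illustrated by XOR. This is a generic information-theory sketch, not the paper's Integrated Information Decomposition:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(A; B) in bits from a list of (a, b) samples, treated as equiprobable."""
    n = len(pairs)
    pa, pb, pab = Counter(), Counter(), Counter(pairs)
    for a, b in pairs:
        pa[a] += 1
        pb[b] += 1
    mi = 0.0
    for (a, b), c in pab.items():
        p = c / n
        mi += p * math.log2(p / ((pa[a] / n) * (pb[b] / n)))
    return mi

# XOR: each input alone tells nothing about the output, but together they
# determine it fully -- purely synergistic information.
samples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
i_x1 = mutual_information([(x1, y) for x1, x2, y in samples])
i_x2 = mutual_information([(x2, y) for x1, x2, y in samples])
i_joint = mutual_information([((x1, x2), y) for x1, x2, y in samples])
assert abs(i_x1) < 1e-9 and abs(i_x2) < 1e-9
assert abs(i_joint - 1.0) < 1e-9  # 1 bit appears only in the joint variable
```

A synergistic core in a network is the analogous situation at scale: information about the task is recoverable from units jointly but from no unit (or small subset) alone.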

[AI-24] Reinforced Reasoning for End-to-End Retrosynthetic Planning

【Quick Read】: This paper addresses the combinatorial complexity of retrosynthetic planning in organic synthesis, in particular the logical disconnect in conventional hybrid frameworks between local molecular transformations and global planning objectives. The key to the solution is ReTriP, an end-to-end generative framework that reformulates retrosynthesis as a Chain-of-Thought reasoning process; by building a path-coherent molecular representation and a progressive training strategy (transitioning from reasoning distillation to reinforcement learning with verifiable rewards), each generation step is aligned with the practical utility of the synthesis route, enabling more robust long-horizon planning.

Link: https://arxiv.org/abs/2603.29723
Authors: Chenyang Zuo, Siqi Fan, Yizhen Luo, Zaiqing Nie
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Retrosynthetic planning is a fundamental task in organic chemistry, yet remains challenging due to its combinatorial complexity. To address this, conventional approaches typically rely on hybrid frameworks that combine single-step predictions with external search heuristics, inevitably fracturing the logical coherence between local molecular transformations and global planning objectives. To bridge this gap and embed sophisticated strategic foresight directly into the model’s chemical reasoning, we introduce ReTriP, an end-to-end generative framework that reformulates retrosynthesis as a direct Chain-of-Thought reasoning task. We establish a path-coherent molecular representation and employ a progressive training curriculum that transitions from reasoning distillation to reinforcement learning with verifiable rewards, effectively aligning stepwise generation with practical route utility. Empirical evaluation on RetroBench demonstrates that ReTriP achieves state-of-the-art performance, exhibiting superior robustness in long-horizon planning compared to hybrid baselines.

[AI-25] Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

【Quick Read】: This paper addresses long-standing problems in medical coding: manual coding is slow and error-prone, while existing automated methods cannot adapt to new codes or different coding systems and lack explainability. The key to the solution is Symphony, a system that mimics the reasoning of expert human coders by reasoning over clinical text with direct access to coding guidelines, enabling flexible adaptation across arbitrary coding systems and providing fine-grained span-level evidence for each predicted code, which improves trustworthiness in safety-critical settings.

Link: https://arxiv.org/abs/2603.29709
Authors: Joakim Edin, Andreas Motzfeldt, Simon Flachs, Lars Maaløe
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby preventing adaptation to new codes or different coding systems without retraining on different data. They also provide no explanation for their predictions, limiting trust in safety-critical settings. We introduce Symphony for Medical Coding, a system that approaches the task the way expert human coders do: by reasoning over the clinical narrative with direct access to the coding guidelines. This design allows Symphony to operate across any coding system and to provide span-level evidence linking each predicted code to the text that supports it. We evaluate on two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings across the United States and the United Kingdom. Symphony achieves state-of-the-art results across all settings, establishing itself as a flexible, deployment-ready foundation for automated clinical coding.

[AI-26] Measuring the metacognition of AI

【Quick Read】: This paper addresses how to effectively assess the metacognitive abilities of AI systems in decision making, i.e., their metacognitive sensitivity, especially the ability to produce confidence ratings that distinguish correct from incorrect judgments. The key to the solution is adopting the meta-d' framework and its model-free alternatives as the gold standard for measuring AI metacognition, combined with signal detection theory (SDT) to assess whether AI spontaneously regulates its decisions according to uncertainty and risk. Two series of experiments on three large language models (LLMs) validate the practical utility of this psychophysical toolkit for cross-model comparison, cross-task analysis, and measuring how decision conservativeness shifts under risk.

Link: https://arxiv.org/abs/2603.29693
Authors: Richard Servajean, Philippe Servajean
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures, 2 tables

Click to view abstract

Abstract:A robust decision-making process must take into account uncertainty, especially when the choice involves inherent risks. Because artificial intelligence (AI) systems are increasingly integrated into decision-making workflows, managing uncertainty relies more and more on the metacognitive capabilities of these systems; i.e., their ability to assess the reliability of and regulate their own decisions. Hence, it is crucial to employ robust methods to measure the metacognitive abilities of AI. This paper is primarily a methodological contribution arguing for the adoption of the meta-d’ framework, or its model-free alternatives, as the gold standard for assessing the metacognitive sensitivity of AIs–the ability to generate confidence ratings that distinguish correct from incorrect responses. Moreover, we propose to leverage signal detection theory (SDT) to measure the ability of AIs to spontaneously regulate their decisions based on uncertainty and risk. To demonstrate the practical utility of these psychophysical frameworks, we conduct two series of experiments on three large language models (LLMs)–GPT-5, DeepSeek-V3.2-Exp, and Mistral-Medium-2508. In the first experiments, LLMs performed a primary judgment followed by a confidence rating. In the second, LLMs only performed the primary judgment, while we manipulated the risk associated with either response. On the one hand, applying the meta-d’ framework allows us to conduct comparisons along three axes: comparing an LLM to optimality, comparing different LLMs on a given task, and comparing the same LLM across different tasks. On the other hand, SDT allows us to assess whether LLMs become more conservative when risks are high.
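The type-1 SDT quantities underlying this analysis are simple to compute from hit and false-alarm rates. Note that meta-d' itself requires fitting a type-2 model to confidence-rating data and is not shown here; this sketch covers only the standard d' and criterion formulas:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Type-1 sensitivity from signal detection theory:
    d' = z(H) - z(FA), with z the inverse standard-normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def criterion(hit_rate, false_alarm_rate):
    """Decision criterion c = -(z(H) + z(FA)) / 2; more positive values
    indicate a more conservative responder, e.g. under high risk."""
    z = NormalDist().inv_cdf
    return -(z(hit_rate) + z(false_alarm_rate)) / 2

assert abs(d_prime(0.5, 0.5)) < 1e-9          # chance performance: d' = 0
assert d_prime(0.84, 0.16) > 0                # above-chance discrimination
assert criterion(0.60, 0.05) > criterion(0.84, 0.16)  # fewer "yes" responses -> more conservative c
```

In the paper's second series of experiments, a shift toward a more positive criterion under high-risk instructions would be the SDT signature of spontaneous risk-sensitive regulation.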

[AI-27] A First Step Towards Even More Sparse Encodings of Probability Distributions

【Quick Read】: This paper addresses the curse of dimensionality in encoding probability distributions in practice: traditional tabular or list representations require an exponential number of parameters, making storage and computation inefficient. The key to the solution is compressing the representation by extracting first-order formulas: first reduce the number of distinct values in the distribution, then extract a logical formula for each retained value and further minimize it. This process greatly increases the sparsity of the encoding while preserving the core information of the distribution, enabling more efficient probabilistic modeling that also generalizes.

Link: https://arxiv.org/abs/2603.29691
Authors: Florian Andreas Marwitz, Tanya Braun, Ralf Möller
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Published in ILP2021. The final authenticated publication is available online at this https URL

Click to view abstract

Abstract:Real-world scenarios can be captured with lifted probability distributions. However, distributions are usually encoded in a table or list, requiring an exponential number of values. Hence, we propose a method for extracting first-order formulas from probability distributions that require significantly fewer values by reducing the number of values in a distribution and then extracting, for each value, a logical formula to be further minimized. This reduction and minimization allows for increasing the sparsity in the encoding while also generalizing a given distribution. Our evaluation shows that sparsity can increase immensely by extracting a small set of short formulas while preserving core information.
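The grouping step behind the sparsity gain can be sketched directly: assignments sharing a probability value collapse into one group, and each group then needs only one (later minimized) formula instead of one table row per assignment. The toy distribution and its values below are invented for illustration, and the formula-minimization step (e.g., via Quine-McCluskey) is omitted:

```python
from itertools import product

# Toy distribution over three Boolean variables where many assignments
# share the same probability value.
def p(a, b, c):
    return 0.2 if (a and b) else 0.1  # 2 assignments at 0.2, 6 at 0.1; sums to 1.0

table = {bits: p(*bits) for bits in product([False, True], repeat=3)}
groups = {}
for assignment, value in table.items():
    groups.setdefault(value, []).append(assignment)

assert len(table) == 8   # exhaustive table: exponential in the number of variables
assert len(groups) == 2  # after grouping: one entry per distinct value
# The 0.2 group is exactly the set of models of the formula (a AND b),
# so a short formula can replace both of its table rows:
assert sorted(groups[0.2]) == [(True, True, False), (True, True, True)]
```

With more variables the gap widens: the table grows as 2^n while the number of distinct values, and hence formulas, often stays small.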

[AI-28] View-oriented Conversation Compiler for Agent Trace Analysis

【Quick Read】: This paper addresses the degradation of analysis quality caused by improperly formatted agent conversation logs (agent traces) in context-learning settings. Existing approaches feed structurally complex conversations (nested tool calls, chain-of-thought reasoning blocks, sub-agent invocations, etc.) to analysis components as raw text, JSON, or YAML, causing information loss and inefficiency. The key to the solution is VCC (View-oriented Conversation Compiler), which transforms raw JSONL logs through lex, parse, IR, lower, and emit stages into a family of structured views: a full view (lossless transcript), a user-interface view, and an adaptive view. Experiments show that merely replacing the reflector's input format with VCC-compiled views significantly improves pass rates on AppWorld tasks while cutting reflector token consumption by 50-67% and producing more concise learned memory, demonstrating that message format is itself important infrastructure for context learning rather than a negligible engineering detail.

Link: https://arxiv.org/abs/2603.29678
Authors: Lvmin Zhang, Maneesh Agrawala
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Code: this https URL

Click to view abstract

Abstract:Agent traces carry increasing analytical value in the era of context learning and harness-driven agentic cognition, yet most prior work treats conversation format as a trivial engineering detail. Modern agent conversations contain deeply structured content, including nested tool calls and results, chain-of-thought reasoning blocks, sub-agent invocations, context-window compaction boundaries, and harness-injected system directives, whose complexity far exceeds that of simple user-assistant exchanges. Feeding such traces to a reflector or other analytical mechanism in plain text, JSON, YAML, or via grep can materially degrade analysis quality. This paper presents VCC (View-oriented Conversation Compiler), a compiler (lex, parse, IR, lower, emit) that transforms raw agent JSONL logs into a family of structured views: a full view (lossless transcript serving as the canonical line-number coordinate system), a user-interface view (reconstructing the interaction as the user actually perceived it), and an adaptive view (a structure-preserving projection governed by a relevance predicate). In a context-learning experiment on AppWorld, replacing only the reflector’s input format, from raw JSONL to VCC-compiled views, leads to higher pass rates across all three model configurations tested, while cutting reflector token consumption by half to two-thirds and producing more concise learned memory. These results suggest that message format functions as infrastructure for context learning, not as an incidental implementation choice.
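The view idea can be sketched in a few lines: compile a raw JSONL trace into a lossless full view with line-number coordinates, and a filtered user-interface view that drops tool internals. The record field names (`role`, `content`) and the tool-call marker are illustrative assumptions, not VCC's actual schema:

```python
import json

# A tiny hypothetical agent trace in JSONL form.
raw_jsonl = """\
{"role": "user", "content": "What's the weather in Oslo?"}
{"role": "assistant", "content": "[tool_call] get_weather(city='Oslo')"}
{"role": "tool", "content": "{'temp_c': 4, 'sky': 'overcast'}"}
{"role": "assistant", "content": "It's 4 degrees and overcast in Oslo."}
"""

records = [json.loads(line) for line in raw_jsonl.splitlines()]

# Full view: lossless, with line numbers as the canonical coordinate system.
full_view = [f"L{i}: [{r['role']}] {r['content']}" for i, r in enumerate(records, 1)]

# User-interface view: only what the user actually perceived
# (drop tool results and tool-call turns).
ui_view = [r["content"] for r in records
           if r["role"] == "user"
           or (r["role"] == "assistant" and not r["content"].startswith("[tool_call]"))]

assert len(full_view) == 4  # lossless: every record kept
assert ui_view == ["What's the weather in Oslo?",
                   "It's 4 degrees and overcast in Oslo."]
```

An adaptive view would sit between these two extremes, keeping structure but projecting away records a relevance predicate deems irrelevant to the current analysis.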

[AI-29] Mind the Gap: A Framework for Assessing Pitfalls in Multimodal Active Learning

【速读】:该论文旨在解决多模态主动学习(multimodal active learning)中因模态缺失、模态难度差异及交互结构不一致等挑战导致的模型表示失衡问题,即模型倾向于依赖单一模态而忽略其他模态,从而限制了多模态信息的有效融合。其解决方案的关键在于提出一个基于合成数据集的基准测试框架,该框架能够隔离并系统评估这些典型陷阱,避免真实数据中的噪声干扰,进而对比单模态与多模态查询策略的表现,验证现有方法在多模态场景下未能缓解模态偏倚的问题,并强调开发具备模态感知能力的新型查询策略的必要性。

链接: https://arxiv.org/abs/2603.29677
作者: Dustin Eisenhardt,Yunhee Jeong,Florian Buettner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal learning enables neural networks to integrate information from heterogeneous sources, but active learning in this setting faces distinct challenges. These include missing modalities, differences in modality difficulty, and varying interaction structures. These are issues absent in the unimodal case. While the behavior of active learning strategies in unimodal settings is well characterized, their behavior under such multimodal conditions remains poorly understood. We introduce a new framework for benchmarking multimodal active learning that isolates these pitfalls using synthetic datasets, allowing systematic evaluation without confounding noise. Using this framework, we compare unimodal and multimodal query strategies and validate our findings on two real-world datasets. Our results show that models consistently develop imbalanced representations, relying primarily on one modality while neglecting others. Existing query methods do not mitigate this effect, and multimodal strategies do not consistently outperform unimodal ones. These findings highlight limitations of current active learning methods and underline the need for modality-aware query strategies that explicitly address these pitfalls. Code and benchmark resources will be made publicly available.

[AI-30] 6GAgentGym: Tool Use Data Synthesis and Agentic Learning for Network Management

【速读】:该论文旨在解决6G网络管理中自主代理(agent)缺乏闭环交互能力的问题,现有基准测试多依赖静态问题或脚本化场景回放,无法支持代理通过执行工具、观察状态变化并动态调整决策的闭环学习过程。解决方案的关键在于构建一个名为6GAgentGym的交互式环境,其包含42种类型工具,并通过学习到的实验模型(Experiment Model)对工具效果进行分类(区分只读观测与状态修改配置),同时利用NS-3仿真数据进行校准;此外,通过自指导(Self-Instruct)迭代生成训练轨迹并结合在线闭环交互强化学习,最终使8B规模开源模型在6GAgentBench上达到与GPT-5相当的整体成功率,尤其在长时程任务中表现更优。

链接: https://arxiv.org/abs/2603.29656
作者: Jiao Chen,Jianhua Tang,Xiaotong Yang,Zuohong Lv
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous 6G network management requires agents that can execute tools, observe the resulting state changes, and adapt their decisions accordingly. Existing benchmarks based on static questions or scripted episode replay, however, do not support such closed-loop interaction, limiting agents to passive evaluation without the ability to learn from environmental feedback. This paper presents 6GAgentGym to provide closed-loop capability. The framework provides an interactive environment with 42 typed tools whose effect classification distinguishes read-only observation from state-mutating configuration, backed by a learned Experiment Model calibrated on NS-3 simulation data. 6G-Forge bootstraps closed-loop training trajectories from NS-3 seeds via iterative Self-Instruct generation with execution verification against the Experiment Model. Supervised fine-tuning on the resulting corpus followed by reinforcement learning with online closed-loop interaction enables an 8B open-source model to achieve comparable overall success rate to GPT-5 on the accompanying 6GAgentBench, with stronger performance on long-horizon tasks. Together, these components provide a viable path toward autonomous, closed-loop network management.
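摘要中的“工具效果分类”(只读观测 vs. 状态修改)可以用如下 Python 草图示意:环境在调用只读工具后校验状态未被改动。工具名与状态字段均为虚构示例,并非 6GAgentGym 的真实接口。

```python
from dataclasses import dataclass, field

@dataclass
class Env:
    # Hypothetical network state; field names are invented for illustration.
    state: dict = field(default_factory=lambda: {"bandwidth_mhz": 20})

READ_ONLY, MUTATING = "read_only", "mutating"

# Typed tool registry: each tool is tagged with its effect class.
TOOLS = {
    "get_bandwidth": (READ_ONLY, lambda env: env.state["bandwidth_mhz"]),
    "set_bandwidth": (MUTATING, lambda env, mhz: env.state.update(bandwidth_mhz=mhz)),
}

def call_tool(env, name, **kwargs):
    effect, fn = TOOLS[name]
    before = dict(env.state)
    result = fn(env, **kwargs)
    # The harness can enforce that read-only tools never mutate state.
    if effect == READ_ONLY:
        assert env.state == before, f"read-only tool {name} mutated state"
    return result

env = Env()
obs = call_tool(env, "get_bandwidth")
call_tool(env, "set_bandwidth", mhz=40)
print(obs, env.state["bandwidth_mhz"])
```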

[AI-31] Concept frustration: Aligning human concepts and machine representations

【速读】:该论文旨在解决如何将人类可解释的概念与现代机器学习系统内部表示进行对齐的问题,这是可解释人工智能(Interpretable AI)领域的一项核心挑战。其解决方案的关键在于提出一种几何框架,用于比较监督式的人类概念与基础模型嵌入中提取的无监督中间表示,并引入“概念挫折”(concept frustration)这一新概念——即当一个未观测到的概念诱导出已知概念间的关系,而这些关系在现有本体论(ontology)内无法自洽时所出现的矛盾。作者开发了任务对齐的相似性度量方法来检测这种挫折现象,并证明其可在任务对齐的几何空间中被识别,而传统欧几里得距离比较则无法捕捉。进一步地,在线性高斯生成模型下推导出贝叶斯最优概念分类器准确率的闭式表达,将预测信号分解为已知-已知、已知-未知和未知-未知三部分,从而从理论上明确指出挫折如何影响性能。实验表明,该方法能在合成数据及真实语言和视觉任务中有效检测基础模型中的概念挫折,并通过引入挫折概念重构可解释模型的几何结构,使人类与机器推理更一致。这为诊断不完整的概念本体论并实现人机概念推理对齐提供了原则性的框架,具有高风险场景下安全可解释AI开发与验证的重要意义。

链接: https://arxiv.org/abs/2603.29654
作者: Enrico Parisini,Christopher J. Soelistyo,Ahab Isaac,Alessandro Barp,Christopher R.S. Banerji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 34 pages, 7 figures

点击查看摘要

Abstract:Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery, we formalise the notion of concept frustration: a contradiction that arises when an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology. We develop task-aligned similarity measures that detect concept frustration between supervised concept-based models and unsupervised representations derived from foundation models, and show that the phenomenon is detectable in task-aligned geometry while conventional Euclidean comparisons fail. Under a linear-Gaussian generative model we derive a closed-form expression for Bayes-optimal concept-based classifier accuracy, decomposing predictive signal into known-known, known-unknown and unknown-unknown contributions and identifying analytically where frustration affects performance. Experiments on synthetic data and real language and vision tasks demonstrate that frustration can be detected in foundation model representations and that incorporating a frustrating concept into an interpretable model reorganises the geometry of learned concept representations, to better align human and machine reasoning. These results suggest a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for the development and validation of safe interpretable AI for high-risk applications.

[AI-32] Optimizing Donor Outreach for Blood Collection Sessions: A Scalable Decision Support Framework

【速读】:该论文旨在解决多站点血液采集中心在匹配血源供应与需求过程中面临的 donor 邀请调度优化问题(donor invitation scheduling),其核心挑战在于如何在考虑捐赠者资格、容量限制、血型需求目标、地理便利性及捐赠者安全的前提下,合理分配捐赠者至不同采血时段。解决方案的关键在于构建一个融合上述约束的优化框架,采用两种策略:一是基于二元整数线性规划(Binary Integer Linear Programming, BILP)的精确方法;二是设计了一种高效的贪心启发式算法(greedy heuristic)。实验表明,该框架能有效动员符合条件的非活跃或即将流失的捐赠者,从而缩小供需缺口,且贪心算法在计算效率上显著优于BILP(峰值内存减少188倍,运行时间快115倍),尽管在需求满足率和捐赠者体验方面存在小幅妥协,但整体仍具备良好的实用性与可扩展性。

链接: https://arxiv.org/abs/2603.29643
作者: André Carneiro,Pedro T. Monteiro,Rui Henriques
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures, 4 supplementary figures, 2 supplementary tables

点击查看摘要

Abstract:Blood donation centers face challenges in matching supply with demand while managing donor availability. Although targeted outreach is important, it can cause donor fatigue via over-solicitation. Effective recruitment requires targeting the right donors at the right time, balancing constraints with donor convenience and eligibility. Despite extensive work on blood supply chain optimization and growing interest in algorithmic donor recruitment, the operational problem of assigning donors to sessions across a multi-site network, taking into account eligibility, capacity, blood-type demand targets, geographic convenience, and donor safety, remains unaddressed. We address this gap with an optimization framework for donor invitation scheduling incorporating donor eligibility, travel convenience, blood-type demand targets, and penalties. We evaluate two strategies: (i) a binary integer linear programming (BILP) formulation and (ii) an efficient greedy heuristic. Evaluation uses the registry from Instituto Português do Sangue e da Transplantação (IPST) for invite planning in the Lisbon operational region using 4-month windows. A prospective pipeline integrates organic attendance forecasting, quantile-based demand targets, and residual capacity estimation for forward-looking invitation plans. Results reveal its key role in closing the supply-demand gap in the Lisbon operational region. A controlled comparison shows that the greedy heuristic achieves results comparable to the BILP, with 188x less peak memory and 115x faster runtime; trade-offs include 3.9 pp lower demand fulfillment (86.1% vs. 90.0%), larger donor-session distance, higher adverse-reaction donor exposure, and greater invitation burden per non-high-frequency donor, reflecting local versus global optimization. Experiments assess how constraint-aware scheduling can close gaps by mobilizing eligible inactive/lapsing donors. 
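论文比较的贪心启发式的大致思路可用下面的 Python 草图示意:按“捐赠者-场次”出行距离从小到大排序,在资格与容量约束内依次指派。数据与字段名均为虚构,血型需求目标、惩罚项等其余约束此处从略。

```python
# Invented toy instance: donors with eligibility flags and travel distances
# to two sessions; capacities per session.
donors = [
    {"id": "d1", "eligible": True,  "dist": {"s1": 2.0, "s2": 5.0}},
    {"id": "d2", "eligible": True,  "dist": {"s1": 1.0, "s2": 4.0}},
    {"id": "d3", "eligible": False, "dist": {"s1": 3.0, "s2": 1.0}},
    {"id": "d4", "eligible": True,  "dist": {"s1": 6.0, "s2": 2.0}},
]
capacity = {"s1": 1, "s2": 2}

# Greedy pass: consider eligible donor-session pairs by increasing distance,
# assigning each donor at most once while session capacity remains.
pairs = sorted(
    (d["dist"][s], d["id"], s) for d in donors if d["eligible"] for s in d["dist"]
)
assignment = {}
for dist, donor_id, session in pairs:
    if donor_id not in assignment and capacity[session] > 0:
        assignment[donor_id] = session
        capacity[session] -= 1

print(assignment)
```

与 BILP 的全局最优不同,这种局部贪心正是摘要中“local versus global optimization”权衡的来源。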

[AI-33] ASI-Evolve: AI Accelerates AI

【速读】:该论文旨在解决“人工智能是否能够加速自身发展”的核心问题,即探索生成式 AI (Generative AI) 是否具备在长期、高成本且弱监督的研究循环中自主推进AI技术进步的能力。传统 agentic 系统虽在短期任务中表现优异,但难以应对复杂、跨阶段的AI研发挑战。其解决方案的关键在于提出 ASI-Evolve 框架,该框架通过闭环的“学习-设计-实验-分析”(learn-design-experiment-analyze)机制实现AI驱动的自我进化:一是引入“认知库”(cognition base),将人类先验知识注入每轮探索过程以提升效率;二是设置专用分析器(dedicated analyzer),从复杂实验结果中提炼可复用的洞见用于后续迭代优化。这一设计使系统首次在数据、架构与学习算法三大AI核心组件上实现了由AI主导的突破性发现,验证了闭合环路AI研究的可行性。

链接: https://arxiv.org/abs/2603.29640
作者: Weixian Xu,Tiantian Mi,Yixiu Liu,Yang Nan,Zhimeng Zhou,Lyumanshan Ye,Lin Zhang,Yu Qiao,Pengfei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures, 6 tables. Code available at this https URL

点击查看摘要

Abstract:Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.

[AI-34] FigAgent: Towards Automatic Method Illustration Figure Generation for AI Scientific Papers

【速读】:该论文旨在解决科学论文中方法说明图(Method Illustration Figures, MIFs)生成过程费时费力的问题。其核心挑战在于MIF的构图复杂性(compositional complexity)、组件相似性(component similarity)以及设计动态性(design dynamics)。为应对这些问题,作者提出了一种名为FigAgent的多智能体框架,其关键创新在于通过多智能体协作,从相似组件中提炼绘图经验并封装为可复用工具,在生成过程中调用并动态演化这些工具以适应设计变化;同时引入一种“探索与选择”(Explore-and-Select)的绘制策略,模拟人类试错过程,逐步构建复杂结构的高质量MIF。

链接: https://arxiv.org/abs/2603.29590
作者: Zhuoling Li,Jiarui Zhang,Jason Kuen,Jiuxiang Gu,Hossein Rahmani,Jun Liu
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Method illustration figures (MIFs) play a crucial role in conveying the core ideas of scientific papers, yet their generation remains a labor-intensive process. In this paper, we identify three key characteristics that substantially influence MIF generation quality, i.e., compositional complexity, component similarity, and design dynamics. To handle these characteristics, we take inspiration from human authors’ drawing practices and propose FigAgent, a novel multi-agent framework for automatically generating high-quality MIFs. Through multi-agent collaboration, our FigAgent distills drawing experiences across similar components of MIFs and encapsulates them into reusable tools that can be invoked during MIF generation, while evolving these tools to adapt to dynamic design requirements. Besides, a novel Explore-and-Select drawing strategy is introduced to mimic the human-like trial-and-error manner for gradually constructing MIFs with complex structures. Extensive experiments show the efficacy of our method. Project is available at this https URL.

[AI-35] Learn2Fold: Structured Origami Generation with World Model Planning

【速读】:该论文旨在解决从自然语言描述直接生成物理上有效的复杂折纸(origami)折叠序列这一开放性挑战。现有方法存在两大局限:基于优化的方法虽能保证物理合理性但依赖密集且精确的输入,难以适配稀疏文本提示;而生成式基础模型虽擅长语义和感知合成,却无法生成长程、符合物理规律的折叠过程。解决方案的关键在于提出一种神经符号框架 Learn2Fold,其核心思想是将语义提议与物理验证解耦:利用大语言模型从抽象文本提示中生成候选折叠程序,同时通过一个学习得到的图结构世界模型作为可微分的替代仿真器,在执行前预测物理可行性及失效模式;二者集成于前瞻规划循环中,从而实现对复杂及分布外折纸图案的鲁棒物理有效折叠序列生成,体现了符号推理与具身物理仿真协同作用下空间智能的有效提升。

链接: https://arxiv.org/abs/2603.29585
作者: Yanjia Huang,Yunuo Chen,Ying Jiang,Jinru Han,Zhengzhong Tu,Yin Yang,Chenfanfu Jiang
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:The ability to transform a flat sheet into a complex three-dimensional structure is a fundamental test of physical intelligence. Unlike cloth manipulation, origami is governed by strict geometric axioms and hard kinematic constraints, where a single invalid crease or collision can invalidate the entire folding sequence. As a result, origami demands long-horizon constructive reasoning that jointly satisfies precise physical laws and high-level semantic intent. Existing approaches fall into two disjoint paradigms: optimization-based methods enforce physical validity but require dense, precisely specified inputs, making them unsuitable for sparse natural language descriptions, while generative foundation models excel at semantic and perceptual synthesis yet fail to produce long-horizon, physics-consistent folding processes. Consequently, generating valid origami folding sequences directly from text remains an open challenge. To address this gap, we introduce Learn2Fold, a neuro-symbolic framework that formulates origami folding as conditional program induction over a crease-pattern graph. Our key insight is to decouple semantic proposal from physical verification. A large language model generates candidate folding programs from abstract text prompts, while a learned graph-structured world model serves as a differentiable surrogate simulator that predicts physical feasibility and failure modes before execution. Integrated within a lookahead planning loop, Learn2Fold enables robust generation of physically valid folding sequences for complex and out-of-distribution patterns, demonstrating that effective spatial intelligence arises from the synergy between symbolic reasoning and grounded physical simulation.

[AI-36] Mean Masked Autoencoder with Flow-Mixing for Encrypted Traffic Classification

【速读】:该论文旨在解决现有基于掩码自编码器(Masked Autoencoders, MAE)的网络流量分类方法在处理加密流量时存在的局限性,即仅局限于单一流量的字节级重建,缺乏对多粒度上下文关系的有效感知。其解决方案的关键在于提出一种名为Mean MAE (MMAE) 的教师-学生架构,通过引入自蒸馏机制实现教师对学生的流级语义监督,从而推动模型从局部字节重建向多粒度理解演进;同时设计了动态流量混合(FlowMix)策略替代传统随机掩码机制,构造跨流量干扰样本以迫使模型学习更具判别性的表示,并结合包重要性感知的掩码预测器(Packet-importance aware Mask Predictor, PMP),利用包级侧信道统计信息动态掩码高语义密度的token,有效缓解个体流量中的信息瓶颈问题。

链接: https://arxiv.org/abs/2603.29537
作者: Xiao Liu,Xiaowei Fu,Fuxiang Huang,Lei Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Networking and Internet Architecture (cs.NI)
备注: Project page: this https URL

点击查看摘要

Abstract:Network traffic classification using self-supervised pre-training models based on Masked Autoencoders (MAE) has demonstrated a huge potential. However, existing methods are confined to isolated byte-level reconstruction of individual flows, lacking adequate perception of the multi-granularity contextual relationship in traffic. To address this limitation, we propose Mean MAE (MMAE), a teacher-student MAE paradigm with flow mixing strategy for building encrypted traffic pre-training model. MMAE employs a self-distillation mechanism for teacher-student interaction, where the teacher provides unmasked flow-level semantic supervision to advance the student from local byte reconstruction to multi-granularity comprehension. To break the information bottleneck in individual flows, we introduce a dynamic Flow Mixing (FlowMix) strategy to replace traditional random masking mechanism. By constructing challenging cross-flow mixed samples with interferences, it compels the model to learn discriminative representations from distorted tokens. Furthermore, we design a Packet-importance aware Mask Predictor (PMP) equipped with an attention bias mechanism that leverages packet-level side-channel statistics to dynamically mask tokens with high semantic density. Numerous experiments on a number of datasets covering encrypted applications, malware, and attack traffic demonstrate that MMAE achieves state-of-the-art performance. The code is available at this https URL
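FlowMix 的“跨流混合样本”思想可以用如下 Python 草图示意:将干扰流的一段连续 token 拼接进锚定流,迫使模型在失真序列中学习判别表示。混合跨度与替换规则为示意性假设,并非论文的具体配方。

```python
import random

random.seed(0)
anchor_flow = list(range(100, 116))  # 16 tokens standing in for flow A bytes
interferer = list(range(200, 216))   # 16 tokens standing in for flow B bytes

def flow_mix(anchor, other, span=4):
    """Replace one random contiguous span of the anchor with the interferer's
    tokens at the same positions, yielding a cross-flow mixed sample."""
    mixed = list(anchor)
    start = random.randrange(0, len(anchor) - span + 1)
    mixed[start:start + span] = other[start:start + span]
    return mixed, start

mixed, start = flow_mix(anchor_flow, interferer)
print(start, mixed)
```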

[AI-37] TrafficMoE: Heterogeneity-aware Mixture of Experts for Encrypted Traffic Classification

【速读】:该论文旨在解决加密流量分类(Encrypted Traffic Classification)中因加密导致载荷语义被遮蔽而引发的细粒度特征提取困难问题。现有方法多采用静态、同质化的处理流程,强制将结构化报文头与随机化载荷统一处理,导致协议信号与加密噪声混杂,削弱了判别性特征。其解决方案的关键在于提出TrafficMoE框架,构建“解耦-过滤-聚合”(Disentangle-Filter-Aggregate, DFA)范式:首先通过双分支稀疏Mixture-of-Experts(MoE)机制实现报文头与载荷的模态解耦,支持异构建模;其次引入不确定性感知过滤机制,量化表示可靠性并抑制高方差噪声;最后采用路由引导的动态融合策略,根据流量上下文自适应加权跨模态特征。该设计显著提升了加密流量中关键判别特征的表征效率。

链接: https://arxiv.org/abs/2603.29520
作者: Qing He,Xiaowei Fu,Lei Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Networking and Internet Architecture (cs.NI)
备注: Project page: this https URL

点击查看摘要

Abstract:Encrypted traffic classification is a critical task for network security. While deep learning has advanced this field, the occlusion of payload semantics by encryption severely challenges standard modeling approaches. Most existing frameworks rely on static and homogeneous pipelines that apply uniform parameter sharing and static fusion strategies across all inputs. This one-size-fits-all static design is inherently flawed: by forcing structured headers and randomized payloads into a unified processing pipeline, it inevitably entangles the raw protocol signals with stochastic encryption noise, thereby degrading the fine-grained discriminative features. In this paper, we propose TrafficMoE, a framework that breaks through the bottleneck of static modeling by establishing a Disentangle-Filter-Aggregate (DFA) paradigm. Specifically, to resolve the structural between-components conflict, the architecture disentangles headers and payloads using dual-branch sparse Mixture-of-Experts (MoE), enabling modality-specific modeling. To mitigate the impact of stochastic noise, an uncertainty-aware filtering mechanism is introduced to quantify reliability and selectively suppress high-variance representations. Finally, to overcome the limitations of static fusion, a routing-guided strategy aggregates cross-modality features dynamically, that adaptively weighs contributions based on traffic context. With this DFA paradigm, TrafficMoE maximizes representational efficiency by focusing solely on the most discriminative traffic features. Extensive experiments on six datasets demonstrate TrafficMoE consistently outperforms state-of-the-art methods, validating the necessity of heterogeneity-aware modeling in encrypted traffic analysis. The source code is publicly available at this https URL.
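“不确定性感知过滤 + 动态聚合”的直觉可以用一个逆方差加权的 Python 草图示意:估计方差越大(越接近加密噪声)的分支,在融合中权重越低。该加权规则只是对路由引导融合的示意性替代,并非 TrafficMoE 的实际实现。

```python
# Inverse-variance fusion: the branch with higher estimated variance
# (e.g. a randomized encrypted payload) contributes less to the fused feature.
def fuse(feat_header, var_header, feat_payload, var_payload):
    w_h, w_p = 1.0 / var_header, 1.0 / var_payload
    return (w_h * feat_header + w_p * feat_payload) / (w_h + w_p)

# Structured header: low variance; encrypted payload: high variance.
fused = fuse(feat_header=1.0, var_header=0.1, feat_payload=-1.0, var_payload=0.9)
print(fused)
```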

[AI-38] arget-Aligned Reinforcement Learning

【速读】:该论文旨在解决强化学习中目标网络(target network)引入的稳定性与实时性权衡问题:较慢的目标网络更新虽能提升训练稳定性,但会导致学习信号滞后,从而减缓收敛速度。解决方案的关键在于提出目标对齐强化学习(Target-Aligned Reinforcement Learning, TARL),其核心思想是聚焦于目标网络与在线网络估计高度对齐的转移样本进行更新,通过选择性地利用高质量对齐目标来缓解过时目标估计的负面影响,同时保留目标网络带来的稳定优势。理论分析表明,目标对齐修正可加速收敛,实验也验证了该方法在多个基准环境中的稳定性能提升。

链接: https://arxiv.org/abs/2603.29501
作者: Leonard S. Pleiss,James Harrison,Maximilian Schiffer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many reinforcement learning algorithms rely on target networks - lagged copies of the online network - to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a framework that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We provide a theoretical analysis demonstrating that target alignment correction accelerates convergence, and empirically demonstrate consistent improvements over standard reinforcement learning algorithms across various benchmark environments.
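“目标对齐”加权的核心直觉可用如下 Python 草图示意:在线网络与目标网络对下一状态价值的估计越一致,该转移样本的更新权重越高。指数形式的权重函数是示意性选择,并非论文的精确定义。

```python
import math

def alignment_weight(v_online_next, v_target_next, temperature=1.0):
    """Weight for a transition's TD update: 1 when online and target agree on
    the next-state value, decaying as their estimates diverge."""
    gap = abs(v_online_next - v_target_next)
    return math.exp(-gap / temperature)

# Three transitions with growing online/target disagreement on V(s').
estimates = [(5.0, 5.0), (5.0, 5.5), (5.0, 8.0)]
weights = [alignment_weight(o, t) for o, t in estimates]
print(weights)
```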

[AI-39] Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂多步推理任务中因依赖结果奖励(outcome rewards)而导致的中间推理步骤不可靠的问题,即模型可能通过错误的中间步骤得出正确最终答案,从而影响推理过程的可信度。解决方案的关键在于提出PRoSFI(Process Reward over Structured Formal Intermediates),该方法不直接要求模型生成形式化证明(formal proofs),而是让模型输出与自然语言推理对齐的结构化中间步骤,并由形式化证明器(formal prover)逐层验证这些步骤;仅当整个推理链完全通过验证时才给予高奖励,从而引导模型生成可机器检查的逐步推理过程,提升最终答案的可靠性而不牺牲准确性。

链接: https://arxiv.org/abs/2603.29500
作者: Luoxin Chen,Yichi Zhou,Huishuai Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al., 2025). However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this unreliable reasoning, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely accomplishable for a modest-sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover. Only fully validated reasoning chains receive high rewards. The integration of formal verification guides the model towards generating step-by-step machine-checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.
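“仅当整条推理链全部通过形式验证才给高奖励”的奖励结构可用如下 Python 草图示意。其中 prover_ok 代表外部形式化证明器,各奖励数值均为假设。

```python
# All-or-nothing process reward: the full reward is paid only when every
# structured intermediate step passes the (external) formal prover.
def prosfi_reward(steps, prover_ok, r_verified=1.0, r_unverified=0.1, r_empty=0.0):
    if not steps:
        return r_empty
    return r_verified if all(prover_ok(s) for s in steps) else r_unverified

# Stand-in verifier: a step "verifies" iff it is marked valid.
prover_ok = lambda step: step["valid"]

good_chain = [{"valid": True}, {"valid": True}]
bad_chain = [{"valid": True}, {"valid": False}]
print(prosfi_reward(good_chain, prover_ok), prosfi_reward(bad_chain, prover_ok))
```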

[AI-40] Metriplector: From Field Theory to Neural Architecture

【速读】:该论文旨在解决通用神经网络架构设计中缺乏物理可解释性与跨任务泛化能力的问题,其核心挑战在于如何构建一个统一且可扩展的计算框架,既能精确建模复杂任务(如图像识别、语言建模),又能保持对不同数据域的适应性。解决方案的关键在于提出 Metriplector——一种基于抽象物理系统的神经架构原语,其中输入配置场、源项和算子构成一个耦合的度规-辛(metriplectic)动力学系统;通过 Noether 定理导出的能量-动量张量 T^{\mu\nu} 作为读出机制,实现了从物理规律到计算过程的自然映射。该结构支持从纯耗散分支(解屏蔽泊松方程)到完整结构(含反对称泊松括号)的渐进式物理增强,从而在路径规划、数独求解、图像分类和语言建模等多任务上均展现出高精度与强泛化能力。

链接: https://arxiv.org/abs/2603.29496
作者: Dan Oprisa,Peter Toth
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages, 7 figures

点击查看摘要

Abstract:We present Metriplector, a neural architecture primitive in which the input configures an abstract physical system–fields, sources, and operators–and the dynamics of that system is the computation. Multiple fields evolve via coupled metriplectic dynamics, and the stress-energy tensor T^{\mu\nu}, derived from Noether’s theorem, provides the readout. The metriplectic formulation admits a natural spectrum of instantiations: the dissipative branch alone yields a screened Poisson equation solved exactly via conjugate gradient; activating the full structure–including the antisymmetric Poisson bracket–gives field dynamics for image recognition and language modeling. We evaluate Metriplector across four domains, each using a task-specific architecture built from this shared primitive with progressively richer physics: F1=1.0 on maze pathfinding, generalizing from 15x15 training grids to unseen 39x39 grids; 97.2% exact Sudoku solve rate with zero structural injection; 81.03% on CIFAR-100 with 2.26M parameters; and 1.182 bits/byte on language modeling with 3.6x fewer training tokens than a GPT baseline.

[AI-41] Structural Compactness as a Complementary Criterion for Explanation Quality

【速读】:该论文旨在解决解释可读性(explanation legibility)的定量评估难题,特别是现有简单统计指标无法捕捉到解释中形状和内部结构差异的问题。其解决方案的关键在于提出一种基于图论的结构度量方法——最小生成树紧凑性(Minimum Spanning Tree Compactness, MST-C),该方法通过分析 attributions 的高阶几何特性(如分布广度与凝聚性),将空间分布紧凑且聚类清晰的解释特征量化为单一评分,从而有效区分不同解释方法、揭示模型间本质结构差异,并提供一个鲁棒且自包含的诊断工具以补充现有的 attributions 复杂度概念。

链接: https://arxiv.org/abs/2603.29491
作者: Mohammad Mahdi Mesgari,Jackie Ma,Wojciech Samek,Sebastian Lapuschkin,Leander Weber
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the evaluation of attribution quality, the quantitative assessment of explanation legibility is particularly difficult, as it is influenced by varying shapes and internal organization of attributions not captured by simple statistics. To address this issue, we introduce Minimum Spanning Tree Compactness (MST-C), a graph-based structural metric that captures higher-order geometric properties of attributions, such as spread and cohesion. These components are combined into a single score that evaluates compactness, favoring attributions with salient points spread across a small area and spatially organized into few but cohesive clusters. We show that MST-C reliably distinguishes between explanation methods, exposes fundamental structural differences between models, and provides a robust, self-contained diagnostic for explanation compactness that complements existing notions of attribution complexity.
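MST-C 中“生成树越短越紧凑”的部分可用如下 Python 草图示意:对归因图的显著点构建最小生成树(Prim 风格循环),以总边长作为分布广度的代理。论文的完整评分还包含聚类数量与凝聚性等成分,此处从略。

```python
import math

def mst_length(points):
    """Total edge length of the minimum spanning tree over 2-D points,
    computed with a simple Prim-style loop (fine for small point sets)."""
    if len(points) < 2:
        return 0.0
    in_tree = {0}
    total = 0.0
    while len(in_tree) < len(points):
        # Cheapest edge crossing the cut between tree and non-tree points.
        d, j = min(
            (math.dist(points[i], points[j]), j)
            for i in in_tree for j in range(len(points)) if j not in in_tree
        )
        total += d
        in_tree.add(j)
    return total

compact = [(0, 0), (0, 1), (1, 0), (1, 1)]        # tight unit square
scattered = [(0, 0), (0, 10), (10, 0), (10, 10)]  # same shape, 10x the spread
print(mst_length(compact), mst_length(scattered))
```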

[AI-42] Hybrid Quantum-Classical Spatiotemporal Forecasting for 3D Cloud Fields

【速读】:该论文旨在解决三维(3D)云场精准预测难题,该问题在大气分析与短临数值天气预报中至关重要,但因云演变涉及跨层交互、非局部依赖及多尺度时空动力学而极具挑战性。现有基于卷积、循环或注意力机制的时空预测模型通常依赖于局域性偏置表示,难以在体数据预测任务中保持细粒度云结构。其解决方案的关键在于提出一种混合量子启发式时空预测框架QENO,通过四个核心组件实现:1)经典时空编码器用于紧凑潜在表征;2)拓扑感知量子增强模块用于建模潜在空间中的非局部耦合;3)动态融合时间单元将测量导出的量子特征与递归记忆相结合;4)解码器重构未来云体积。该方法在CMA-MESO 3D云场数据集上显著优于ConvLSTM、PredRNN++、Earthformer、TAU和SimVP等主流基线模型,在均方误差(MSE)、平均绝对误差(MAE)、均方根误差(RMSE)、结构相似性指数(SSIM)及阈值检测指标上均取得最优性能,同时保持参数量紧凑,验证了拓扑感知混合量子-经典特征建模在3D云结构预测和地球观测数据分析中的有效性。

链接: https://arxiv.org/abs/2603.29407
作者: Fu Wang,Qifeng Lu,Xinyu Long,Meng Zhang,Xiaofei Yang,Weijia Cao,Xiaowen Chu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate forecasting of three-dimensional (3D) cloud fields is important for atmospheric analysis and short-range numerical weather prediction, yet it remains challenging because cloud evolution involves cross-layer interactions, nonlocal dependencies, and multiscale spatiotemporal dynamics. Existing spatiotemporal prediction models based on convolutions, recurrence, or attention often rely on locality-biased representations and therefore struggle to preserve fine cloud structures in volumetric forecasting tasks. To address this issue, we propose QENO, a hybrid quantum-inspired spatiotemporal forecasting framework for 3D cloud fields. The proposed architecture consists of four components: a classical spatiotemporal encoder for compact latent representation, a topology-aware quantum enhancement block for modeling nonlocal couplings in latent space, a dynamic fusion temporal unit for integrating measurement-derived quantum features with recurrent memory, and a decoder for reconstructing future cloud volumes. Experiments on CMA-MESO 3D cloud fields show that QENO consistently outperforms representative baselines, including ConvLSTM, PredRNN++, Earthformer, TAU, and SimVP variants, in terms of MSE, MAE, RMSE, SSIM, and threshold-based detection metrics. In particular, QENO achieves an MSE of 0.2038, an RMSE of 0.4514, and an SSIM of 0.6291, while also maintaining a compact parameter budget. These results indicate that topology-aware hybrid quantum-classical feature modeling is a promising direction for 3D cloud structure forecasting and atmospheric Earth observation data analysis.

[AI-43] Security in LLM-as-a-Judge: A Comprehensive SoK

【速读】:该论文旨在解决生成式 AI(Generative AI)评估中新兴的安全部署风险问题,特别是针对语言模型作为裁判(LLM-as-a-Judge, LaaJ)系统的安全性漏洞与可靠性挑战。其核心问题是:LaaJ系统不仅可能成为对抗性攻击的目标,还可能被用作实施攻击的工具,从而破坏评估流程的可信度。解决方案的关键在于首次提出一个系统化的知识体系(Systematization of Knowledge, SoK),通过对863篇文献的系统梳理并精选45篇相关研究,构建了一个基于LaaJ在安全场景中角色的分类框架,涵盖攻击目标、攻击载体、防御利用及安全应用四类,并在此基础上进行对比分析,揭示现有方法的局限性、新型威胁和开放性挑战,为提升LaaJ系统的鲁棒性和可信度指明了研究方向。

链接: https://arxiv.org/abs/2603.29403
作者: Aiman Almasoud,Antony Anju,Marco Arazzi,Mert Cihangiroglu,Vignesh Kumar Kembu,Serena Nicolazzo,Antonino Nocera,Vinod P.,Saraga Sakthidharan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-as-a-Judge (LaaJ) is a novel paradigm in which powerful language models are used to assess the quality, safety, or correctness of generated outputs. While this paradigm has significantly improved the scalability and efficiency of evaluation processes, it also introduces novel security risks and reliability concerns that remain largely unexplored. In particular, LLM-based judges can become both targets of adversarial manipulation and instruments through which attacks are conducted, potentially compromising the trustworthiness of evaluation pipelines. In this paper, we present the first Systematization of Knowledge (SoK) focusing on the security aspects of LLM-as-a-Judge systems. We perform a comprehensive literature review across major academic databases, analyzing 863 works and selecting 45 relevant studies published between 2020 and 2026. Based on this study, we propose a taxonomy that organizes recent research according to the role played by LLM-as-a-Judge in the security landscape, distinguishing between attacks targeting LaaJ systems, attacks performed through LaaJ, defenses leveraging LaaJ for security purposes, and applications where LaaJ is used as an evaluation strategy in security-related domains. We further provide a comparative analysis of existing approaches, highlighting current limitations, emerging threats, and open research challenges. Our findings reveal significant vulnerabilities in LLM-based evaluation frameworks, as well as promising directions for improving their robustness and reliability. Finally, we outline key research opportunities that can guide the development of more secure and trustworthy LLM-as-a-Judge systems.

[AI-44] ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

【速读】:该论文旨在解决当前生成式 AI(Generative AI)在构建 Extract-Load-Transform (ELT) 数据管道任务中表现不佳的问题,尤其是早期基准测试(如 ELT-Bench)低估了智能体(agent)的实际能力。其核心问题在于:一方面,早期模型性能受限;另一方面,基准测试本身存在系统性质量缺陷,包括僵化的评估脚本、模糊的任务规范和错误的标注真值(ground truth),导致正确输出被误判为失败。解决方案的关键在于两个层面:一是利用更新的大语言模型(LLM)重新评估,发现提取与加载阶段已基本解决,变换阶段性能显著提升;二是提出 Auditor-Corrector 方法论,结合 LLM 驱动的根因分析与高一致性的人工验证(Fleiss’ kappa = 0.85),系统性识别并修正基准错误,从而构建出更可靠的 ELT-Bench-Verified 基准。实证表明,改进后的性能提升完全归因于基准质量优化,凸显了高质量评估体系对复杂智能体任务研究的重要性。

链接: https://arxiv.org/abs/2603.29399
作者: Christopher Zanoli,Andrea Giovannini,Tengjun Jin,Ana Klimovic,Yotam Perlitz
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss’ kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors – including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth – that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation. 

[AI-45] Deep Learning-Based Anomaly Detection in Spacecraft Telemetry on Edge Devices

【速读】:该论文旨在解决航天器遥测数据异常检测在边缘计算设备上部署时面临的硬件资源限制问题,尤其是计算能力、内存占用和功耗等约束对复杂模型应用的制约。其解决方案的关键在于通过多目标神经架构优化(multi-objective neural architecture optimization),对三种异常检测方法(预测阈值法、直接分类法和图像分类法)进行高效压缩与重构,从而在显著降低计算资源消耗的同时保持高检测性能。实验表明,优化后的预测阈值模型在仅使用59 KB RAM(减少97.1%)和极少运算量(减少99.4%)的情况下,仍能维持88.8%的修正事件级F0.5分数(CEF0.5),远优于传统方法,使得在立方星(CubeSat)等资源受限平台上实现近实时异常检测成为可能。

链接: https://arxiv.org/abs/2603.29375
作者: Christopher Goetze,Tim Schlippe,Daniel Lakey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: IEEE Space Computing Conference (SCC 2025), Los Angeles, CA, USA, 28 July - 1 August 2025

点击查看摘要

Abstract:Spacecraft anomaly detection is critical for mission safety, yet deploying sophisticated models on-board presents significant challenges due to hardware constraints. This paper investigates three approaches for spacecraft telemetry anomaly detection – forecasting threshold, direct classification, and image classification – and optimizes them for edge deployment using multi-objective neural architecture optimization on the European Space Agency Anomaly Dataset. Our baseline experiments demonstrate that forecasting threshold achieves superior detection performance (92.7% Corrected Event-wise F0.5-score (CEF0.5)) [1] compared to alternatives. Through Pareto-optimal architecture optimization, we dramatically reduced computational requirements while maintaining capabilities – the optimized forecasting threshold model preserved 88.8% CEF0.5 while reducing RAM usage by 97.1% to just 59 KB and operations by 99.4%. Analysis of deployment viability shows our optimized models require just 0.36-6.25% of CubeSat RAM, making on-board anomaly detection practical even on highly constrained hardware. This research demonstrates that sophisticated anomaly detection capabilities can be successfully deployed within spacecraft edge computing constraints, providing near-instantaneous detection without exceeding hardware limitations or compromising mission safety.
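论文中表现最佳的"预测阈值法"(forecasting threshold)的基本思路可以用一个极简示意说明:对遥测序列做一步预测,当残差超过动态阈值时记为异常。以下为基于摘要描述的假设性草图(用滑动均值代替论文中的神经预测器,函数名与参数均为示意),并非论文官方实现:

```python
import numpy as np

def forecast_threshold_anomalies(series, window=5, k=3.0):
    """预测阈值法的极简示意:用滑动均值做一步预测,
    残差超过 k 倍近期标准差即判为异常点,返回异常下标列表。"""
    series = np.asarray(series, dtype=float)
    anomalies = []
    for t in range(window, len(series)):
        recent = series[t - window:t]
        pred = recent.mean()                 # 一步预测(论文中为神经网络)
        sigma = recent.std() + 1e-9          # 动态阈值的尺度
        if abs(series[t] - pred) > k * sigma:
            anomalies.append(t)
    return anomalies

# 平稳遥测中注入一个尖峰 → 被检出
telemetry = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 0.95, 8.0, 1.0, 1.02]
print(forecast_threshold_anomalies(telemetry))  # → [7]
```

这种残差加阈值的结构本身计算量极小,这也是论文能将模型压缩到 59 KB RAM 级别仍保持检测能力的前提之一。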

[AI-46] AI-Generated Prior Authorization Letters: Strong Clinical Content Weak Administrative Scaffolding

【速读】:该论文旨在解决美国医疗体系中Prior Authorization(前置授权)流程所带来的巨大行政负担问题,该流程每年消耗数十亿美元和数千名医师工时。研究提出以大语言模型(Large Language Models, LLMs)生成符合临床要求的前置授权信作为解决方案,其关键在于通过结构化多场景评估(45个由医生验证的合成案例)系统性检验LLMs在风湿科、精神科、肿瘤科、心血管科及骨科等领域的临床内容准确性,并进一步揭示仅靠临床评分无法捕捉的行政细节缺失问题,如缺少收费代码、未明确授权期限请求及随访计划不足等。这一发现表明,LLM生成能力虽已成熟,但实现临床部署的核心挑战在于构建能精准匹配保险方工作流的辅助系统,而非单纯提升文本生成质量。

链接: https://arxiv.org/abs/2603.29366
作者: Moiz Sadiq Awan,Maryam Raza
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.

[AI-47] Rigorous Explanations for Tree Ensembles

【速读】:该论文旨在解决树集成模型(Tree Ensembles, TE)在实际应用中因缺乏可解释性而导致的人类决策者难以建立信任的问题。其解决方案的关键在于提出一种严格定义且逻辑上自洽的解释方法,用于准确反映预测模型内部机制,从而提升对随机森林(Random Forests)和提升树(Boosted Trees)等典型树集成模型预测结果的理解与可信度。

链接: https://arxiv.org/abs/2603.29361
作者: Yacine Izza,Alexey Ignatiev,Xuanxiang Huang,Peter J. Stuckey,Joao Marques-Silva
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Tree ensembles (TEs) find a multitude of practical applications. They represent one of the most general and accurate classes of machine learning methods. While they are typically quite concise in representation, their operation remains inscrutable to human decision makers. One solution to build trust in the operation of TEs is to automatically identify explanations for the predictions made. Evidently, we can only achieve trust using explanations if those explanations are rigorous, that is, if they truly reflect properties of the underlying predictor they explain. This paper investigates the computation of rigorously-defined, logically-sound explanations for the concrete case of two well-known examples of tree ensembles, namely random forests and boosted trees.

[AI-48] BenchScope: How Many Independent Signals Does Your Benchmark Provide?

【速读】:该论文旨在解决当前AI评估套件中存在大量分数但缺乏对这些分数是否携带独立信息的验证问题,从而导致评估结果可能冗余且无法准确反映模型的真实能力。其核心解决方案是提出有效维度(Effective Dimensionality, ED),即通过中心化基准得分谱的参与比(participation ratio)作为快速、条件依赖的测量广度上界诊断指标。ED能够识别不同评估任务之间的冗余性,例如发现Open LLM Leaderboard实际上仅提供约两个有效测量轴(ED = 1.7),BBH与MMLU-Pro高度可互换(ρ = 0.96),并揭示当前基准测试间测量广度差异超过20倍。此外,ED可辅助筛选冗余组件、监控性能条件下的压缩效应,并指导基准维护,是一种实用性强、计算高效的诊断工具。

链接: https://arxiv.org/abs/2603.29357
作者: Tommy Sha,Stella Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Equal contribution; correspondence: this http URL @stonybrook.edu, zhao2052@umn.edu;

点击查看摘要

Abstract:AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy: the six-score Open LLM Leaderboard behaves like roughly two effective measurement axes (ED = 1.7), BBH and MMLU-Pro are near-interchangeable (rho = 0.96, stable across seven subpopulations), and measurement breadth varies more than 20x across current benchmarks. We show that relative ED rankings are stable under matched-dimension controls and that ED can flag redundant suite components, monitor performance-conditional compression, and guide benchmark maintenance. Because binary spectra overestimate absolute latent dimensionality, we interpret ED as a screening statistic rather than a literal factor count and complement it with null, reliability, and saturation analyses. We provide a 22-benchmark reference atlas and a four-step diagnostic workflow that benchmark maintainers can run with a score matrix and a few lines of code.
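摘要中的有效维度(ED)即中心化得分矩阵协方差谱的参与比:ED = (Σλᵢ)² / Σλᵢ²。以下为一个基于该定义的示意性实现(假设输入为"模型 × 基准"得分矩阵,非论文官方代码):

```python
import numpy as np

def effective_dimensionality(scores: np.ndarray) -> float:
    """参与比(participation ratio)形式的有效维度。

    scores: 形状为 (模型数, 基准数) 的得分矩阵。
    先按列中心化,再对基准-基准协方差矩阵的特征值谱 {λ_i}
    计算 ED = (Σλ_i)^2 / Σλ_i^2。
    """
    centered = scores - scores.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # 截断数值负特征值
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())

# 两列完全冗余(相关系数为 1)→ ED 为 1:只有一个有效测量轴
rng = np.random.default_rng(0)
x = rng.normal(size=100)
redundant = np.stack([x, 2 * x], axis=1)
print(round(effective_dimensionality(redundant), 2))  # → 1.0
```

当各列近似独立时 ED 接近基准数;论文报告 Open LLM Leaderboard 的六个得分 ED 仅为 1.7,即对应这种冗余坍缩的情形。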

[AI-49] Nomad: Autonomous Exploration and Discovery

【速读】:该论文旨在解决当前查询驱动型问答系统和提示驱动型深度研究系统受限于人类问题框架、难以覆盖更广泛洞察空间的问题(即“人类认知局限性”)。其核心挑战在于:用户通常无法预先定义所有可能的探索方向,导致现有系统仅能响应特定提问,而无法主动发现潜在有价值的研究线索。解决方案的关键在于提出一种“探索优先”(exploration-first)架构——Nomad系统通过构建显式的探索地图(Exploration Map)对领域知识进行结构化表征,并由探索代理(explorer agent)系统性地遍历该地图,在广度与深度之间取得平衡;同时引入独立验证器(verifier)确保候选见解的可信度,并通过报告生成管道输出带引用的高质量报告及元报告。这一机制使系统不仅能够回答已有问题,还能自主识别值得挖掘的研究方向与洞察。

链接: https://arxiv.org/abs/2603.29353
作者: Bokang Jia,Samta Kamboj,Satheesh Katipomu,Seung Hun Han,Neha Sengupta,Andrew Jackson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Nomad, a system for autonomous data exploration and insight discovery. Given a corpus of documents, databases, or other data sources, users rarely know the full set of questions, hypotheses, or connections that could be explored. As a result, query-driven question answering and prompt-driven deep-research systems remain limited by human framing and often fail to cover the broader insight space. Nomad addresses this problem with an exploration-first architecture. It constructs an explicit Exploration Map over the domain and systematically traverses it to balance breadth and depth. It generates and selects hypotheses and investigates them with an explorer agent that can use document search, web search, and database tools. Candidate insights are then checked by an independent verifier before entering a reporting pipeline that produces cited reports and higher-level meta-reports. We also present a comprehensive evaluation framework for autonomous discovery systems that measures trustworthiness, report quality, and diversity. Using a corpus of selected UN and WHO reports, we show that Nomad produces more trustworthy and higher-quality reports than baselines, while also producing more diverse insights over several runs. Nomad is a step toward autonomous systems that not only answer user questions or conduct directed research, but also discover which questions, research directions, and insights are worth surfacing in the first place.

[AI-50] Scaling Whole-Body Human Musculoskeletal Behavior Emulation for Specificity and Diversity

【速读】:该论文旨在解决人体运动控制中全身肌肉驱动的动力学建模与控制问题,特别是如何在高维、过驱动的骨骼肌系统中实现精确的运动再现与控制策略探索。其核心挑战在于:一方面,肌肉内部的运动控制过程无法直接测量;另一方面,传统逆向动力学方法难以从观测到的运动学数据中解析冗余控制,而基于深度强化学习的正向模仿方法则受限于维度灾难(curse of dimensionality)导致的追踪性能不足。解决方案的关键在于提出MS-Emulator框架,通过大规模并行GPU仿真结合对抗奖励聚合(adversarial reward aggregation)和价值引导流探索(value-guided flow exploration),有效克服了高维强化学习中的优化瓶颈,实现了约700个肌肉驱动的全身骨骼肌系统中多种高动态动作(如舞蹈、翻滚、后空翻)的高精度关节角度和身体位置再现,并揭示了不同肌肉控制策略可产生相似外部运动学与力学输出的现象,从而为分析人类运动控制的特异性与多样性提供了可计算的路径。

链接: https://arxiv.org/abs/2603.29332
作者: Yunyue Wei,Chenhui Zuo,Shanning Zhuang,Haixin Gong,Yaming Liu,Yanan Sui
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The embodied learning of human motor control requires whole-body neuro-actuated musculoskeletal dynamics, while the internal muscle-driven processes underlying movement remain inaccessible to direct measurement. Computational modeling offers an alternative, but inverse dynamics methods struggled to resolve redundant control from observed kinematics in the high-dimensional, over-actuated system. Forward imitation approaches based on deep reinforcement learning exhibited inadequate tracking performance due to the curse of dimensionality in both control and reward design. Here we introduce a large-scale parallel musculoskeletal computation framework for biomechanically grounded whole-body motion reproduction. By integrating large-scale parallel GPU simulation with adversarial reward aggregation and value-guided flow exploration, the MS-Emulator framework overcomes key optimization bottlenecks in high-dimensional reinforcement learning for musculoskeletal control, which accurately reproduces a broad repertoire of motions in a whole-body human musculoskeletal system actuated by approximately 700 muscles. It achieved high joint angle accuracy and body position alignment for highly dynamic tasks such as dance, cartwheel, and backflip. The framework was also used to explore the musculoskeletal control solution space, identifying distinct musculoskeletal control policies that converge to nearly identical external kinematic and mechanical measurements. This work establishes a tractable computational route to analyzing the specificity and diversity underlying human embodied control of movement. Project page: this https URL.

[AI-51] Real-Time Band-Grouped Vocal Denoising Using Sigmoid-Driven Ideal Ratio Masking

【速读】:该论文旨在解决实时深度学习语音去噪中普遍存在的高延迟与长上下文依赖问题,这些问题限制了模型在实际直播或交互式应用场景中的部署。其关键解决方案是提出一种基于sigmoid驱动的理想比率掩码(ideal ratio mask),并采用谱损失(spectral loss)进行训练,以在提升信噪比(SNR)的同时最大化语音的感知质量;同时,模型设计了一个频带分组的编码器-解码器架构(band-grouped encoder-decoder architecture)并引入频率注意力机制(frequency attention),从而实现总延迟低于10 ms,并在稳态噪声和非稳态噪声下分别获得0.21和0.12的PESQ-WB增益。

链接: https://arxiv.org/abs/2603.29326
作者: Daniel Williams
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-time, deep learning-based vocal denoising has seen significant progress over the past few years, demonstrating the capability of artificial intelligence in preserving the naturalness of the voice while increasing the signal-to-noise ratio (SNR). However, many deep learning approaches have high amounts of latency and require long frames of context, making them difficult to configure for live applications. To address these challenges, we propose a sigmoid-driven ideal ratio mask trained with a spectral loss to encourage an increased SNR and maximized perceptual quality of the voice. The proposed model uses a band-grouped encoder-decoder architecture with frequency attention and achieves a total latency of less than 10 ms, with PESQ-WB improvements of 0.21 on stationary noise and 0.12 on nonstationary noise.
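理想比率掩码(IRM)与 sigmoid 输出的关系可以用几行代码说明:训练目标是逐时频点的语音能量占比,而网络输出经 sigmoid 约束到 (0,1) 后直接乘在带噪幅度谱上。以下为示意性草图(beta 指数、函数名均为假设,非论文官方实现):

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """理想比率掩码:IRM = (|S|^2 / (|S|^2 + |N|^2))^beta,逐时频点取值于 [0,1]。"""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

def apply_sigmoid_mask(logits, noisy_mag):
    """网络原始输出(logits)经 sigmoid 映射到 (0,1),再乘到带噪幅度谱上。"""
    mask = 1.0 / (1.0 + np.exp(-logits))
    return mask * noisy_mag

# 语音与噪声能量相等时,beta=0.5 的 IRM ≈ 0.707
print(round(float(ideal_ratio_mask(np.array([1.0]), np.array([1.0]))[0]), 3))  # → 0.707
```

sigmoid 的作用在于无论网络输出何值,掩码都天然落在有效区间内,不需要额外裁剪,这与论文"sigmoid-driven"的设计动机一致。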

[AI-52] PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent

【速读】:该论文旨在解决当前智能手机图形用户界面(GUI)代理在真实场景中个性化能力不足的问题,即现有基准测试难以捕捉用户行为的多样性与个性化需求,且缺乏细粒度评估指标。其解决方案的关键在于提出PSPA-Bench——一个包含超过12,855条与真实用户行为对齐的个性化指令的基准数据集,覆盖10类典型日常使用场景和22个主流移动应用,并引入结构感知的过程评估方法,实现对代理个性化能力的精细化衡量。这一框架揭示了当前主流GUI代理在个性化设置下表现有限,同时指出了三个改进方向:基于推理的模型优于通用大语言模型(LLM)、感知能力仍是关键基础、反思与长期记忆机制对适应性提升至关重要。

链接: https://arxiv.org/abs/2603.29318
作者: Hongyi Nie,Xunyuan Liu,Yudong Bai,Yaqing Wang,Yang Liu,Quanming Yao,Zhen Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages

点击查看摘要

Abstract:Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately capture this personalization dimension due to sparse user-specific data and the lack of fine-grained evaluation metrics. To address this gap, we present PSPA-Bench, the benchmark dedicated to evaluating personalization in smartphone GUI agents. PSPA-Bench comprises over 12,855 personalized instructions aligned with real-world user behaviors across 10 representative daily-use scenarios and 22 mobile apps, and introduces a structure-aware process evaluation method that measures agents’ personalized capabilities at a fine-grained level. Through PSPA-Bench, we benchmark 11 state-of-the-art GUI agents. Results reveal that current methods perform poorly under personalized settings, with even the strongest agent achieving limited success. Our analysis further highlights three directions for advancing personalized GUI agents: (1) reasoning-oriented models consistently outperform general LLMs, (2) perception remains a simple yet critical capability, and (3) reflection and long-term memory mechanisms are key to improving adaptation. Together, these findings establish PSPA-Bench as a foundation for systematic study and future progress in personalized GUI agents.

[AI-53] IMPASTO: Integrating Model-Based Planning with Learned Dynamics Models for Robotic Oil Painting Reproduction

【速读】:该论文旨在解决机器人在无显式人类示范或高保真仿真器的情况下,如何自主完成油画复制任务的问题,核心挑战包括对柔性刷具的力敏感控制、笔触效果预测以及多步笔触路径规划。解决方案的关键在于提出IMPASTO系统,其整合了基于像素的动态模型与模型预测控制(Model Predictive Control, MPC),其中学习到的动态模型能够从图像观测和参数化笔触动作中预测画布变化,再通过滚动时域优化器生成轨迹与施力策略,并由力敏感控制器在7自由度机械臂上执行,整个系统仅依赖机器人自我博弈(self-play)训练,无需人工标注数据,从而实现了对人类艺术家单笔触数据集及多笔触艺术品的高质量复现。

链接: https://arxiv.org/abs/2603.29315
作者: Yingke Wang,Hao Li,Yifeng Zhu,Hong-Xing Yu,Ken Goldberg,Li Fei-Fei,Jiajun Wu,Yunzhu Li,Ruohan Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robotic reproduction of oil paintings using soft brushes and pigments requires force-sensitive control of deformable tools, prediction of brushstroke effects, and multi-step stroke planning, often without human step-by-step demonstrations or faithful simulators. Given only a sequence of target oil painting images, can a robot infer and execute the stroke trajectories, forces, and colors needed to reproduce it? We present IMPASTO, a robotic oil-painting system that integrates learned pixel dynamics models with model-based planning. The dynamics models predict canvas updates from image observations and parameterized stroke actions; a receding-horizon model predictive control optimizer then plans trajectories and forces, while a force-sensitive controller executes strokes on a 7-DoF robot arm. IMPASTO integrates low-level force control, learned dynamics models, and high-level closed-loop planning, learns solely from robot self-play, and approximates human artists’ single-stroke datasets and multi-stroke artworks, outperforming baselines in reproduction accuracy. Project website: this https URL

[AI-54] Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

【速读】:该论文旨在解决代码语言模型(Code Language Model)在缺乏优越教师模型和测试断言(test oracle)的情况下如何实现自我提升的问题。传统方法如监督微调或偏好优化依赖于昂贵的外部资源,而现实中获取参考解和测试断言远比获取问题描述和测试输入困难。解决方案的关键在于提出ConSelf框架,其核心包括两个创新:一是引入代码语义熵(code semantic entropy),通过评估程序行为的功能多样性来衡量问题级别的不确定性,从而构建以可学习性为导向的课程学习策略;二是提出基于共识驱动的直接偏好优化(consensus-driven direct preference optimization, Con-DPO),通过行为一致性权重对偏好对进行加权,降低自生成监督信号中的噪声影响。实验表明,该方法在多个基准和骨干模型上显著优于基线,验证了语义熵引导的课程构建与共识驱动优化的有效性。

链接: https://arxiv.org/abs/2603.29292
作者: Huan Zhang,Wei Cheng,Wei Hu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Accepted in the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026)

点击查看摘要

Abstract:Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
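论文的"代码语义熵"按程序在同一组测试输入上的输出行为聚类,再对簇分布求香农熵。以下为基于这一描述的最小示意(以可调用对象模拟候选解,函数名为假设,非论文官方实现):

```python
import math
from collections import Counter

def code_semantic_entropy(programs, test_inputs):
    """按行为签名(同一组测试输入上的输出序列)对候选程序聚类,
    再对簇分布计算香农熵;熵越高 → 行为越发散 → 问题不确定性越大。"""
    signatures = []
    for prog in programs:
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(repr(prog(x)))
            except Exception:
                outputs.append("<error>")  # 运行异常也视作一种行为
        signatures.append(tuple(outputs))
    counts = Counter(signatures)
    n = len(programs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# 三个候选解:前两个行为完全一致,第三个不同 → 熵 = H(2/3, 1/3) ≈ 0.918
sols = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
print(round(code_semantic_entropy(sols, [1, 2, 3]), 3))  # → 0.918
```

注意聚类依据是输出行为而非代码文本:`x * 2` 与 `x + x` 文本不同但落入同一簇,这正是"语义"熵区别于表面多样性的地方。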

[AI-55] Downsides of Smartness Across Edge-Cloud Continuum in Modern Industry

【速读】:该论文旨在解决智能工业系统在快速集成生成式 AI (Generative AI)、机器学习与强化学习等技术过程中,因大规模部署所带来的安全风险与潜在副作用问题。其核心关注点在于识别和区分软件层(如传统AI与生成式AI)与基础设施层(即工业互联网(IIoT)及边缘-雾-云计算连续体)各自引发的漏洞、网络威胁及不可预见的互操作性副作用。解决方案的关键在于系统性地分析不同层级的安全隐患,并提出针对这些多层次风险的治理框架,以确保智能工业系统的可持续与安全发展。

链接: https://arxiv.org/abs/2603.29289
作者: Akhil Gupta Chigullapally,Sharvan Vittala,Razin Farhan Hussian,Mohsen Amini Salehi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:The fast pace of modern AI is rapidly transforming traditional industrial systems into vast, intelligent and potentially unmanned autonomous operational environments driven by AI-based solutions. These solutions leverage various forms of machine learning, reinforcement learning, and generative AI. The introduction of such smart capabilities has pushed the envelope in multiple industrial domains, enabling predictive maintenance, optimized performance, and streamlined workflows. These solutions are often deployed across the Industrial Internet of Things (IIoT) and supported by the Edge-Fog-Cloud computing continuum to enable urgent (i.e., real-time or near real-time) decision-making. Despite the current trend of aggressively adopting these smart industrial solutions to increase profit, quality, and efficiency, large-scale integration and deployment also bring serious hazards that if ignored can undermine the benefits of smart industries. These hazards include unforeseen interoperability side-effects and heightened vulnerability to cyber threats, particularly in environments operating with a plethora of heterogeneous IIoT systems. The goal of this study is to shed light on the potential consequences of industrial smartness, with a particular focus on security implications, including vulnerabilities, side effects, and cyber threats. We distinguish software-level downsides stemming from both traditional AI solutions and generative AI from those originating in the infrastructure layer, namely IIoT and the Edge-Cloud continuum. At each level, we investigate potential vulnerabilities, cyber threats, and unintended side effects. As industries continue to become smarter, understanding and addressing these downsides will be crucial to ensure secure and sustainable development of smart industrial systems.

[AI-56] Grokking From Abstraction to Intelligence ICML2026

【速读】:该论文旨在解决模块化算术(modular arithmetic)中“grokking”现象的机制问题,即模型从记忆训练数据到实现泛化能力的转变过程。现有研究多局限于局部电路或优化调参,忽视了全局结构演化对这一现象的根本驱动作用。论文提出,grokking源于模型内部结构自发简化,遵循奥卡姆剃刀原则(parsimony),其关键在于通过因果、谱和算法复杂度测量结合奇异学习理论(Singular Learning Theory),揭示从记忆到泛化的转变对应于冗余流形的物理坍缩与深度信息压缩,从而为理解过拟合与泛化机制提供了全新视角。

链接: https://arxiv.org/abs/2603.29262
作者: Junjie Zhang,Zhen Shen,Gang Xiong,Xisong Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22page and 5 figures,In this paper, we analyze the grokking phenomenon from the perspective of Singular Learning Theory (SLT). This work is currently under review for ICML 2026

点击查看摘要

Abstract:Grokking in modular arithmetic has established itself as the quintessential fruit fly experiment, serving as a critical domain for investigating the mechanistic origins of model generalization. Despite its significance, existing research remains narrowly focused on specific local circuits or optimization tuning, largely overlooking the global structural evolution that fundamentally drives this phenomenon. We propose that grokking originates from a spontaneous simplification of internal model structures governed by the principle of parsimony. We integrate causal, spectral, and algorithmic complexity measures alongside Singular Learning Theory to reveal that the transition from memorization to generalization corresponds to the physical collapse of redundant manifolds and deep information compression, offering a novel perspective for understanding the mechanisms of model overfitting and generalization.

[AI-57] Monodense Deep Neural Model for Determining Item Price Elasticity

【速读】:该论文旨在解决商品层级价格弹性(Item Price Elasticity)的估计问题,特别是在缺乏传统处理组与对照组设置(treatment control setting)的情况下,如何利用大规模交易数据准确建模消费者对价格变化的响应行为。其核心挑战在于从历史销售和定价数据中提取具有因果解释力的价格弹性指标,以支持零售、电商等行业的精细化定价策略与收入优化决策。解决方案的关键在于提出了一种新颖的弹性估计框架,结合深度学习技术,特别是创新设计的Monodense深度神经网络——该网络融合嵌入层(embedding)、密集层(dense)和Monodense层,能够有效捕捉商品特征与价格变动之间的非线性关系,并在无对照组条件下实现高精度弹性估计。实验表明,该框架在多品类零售数据上的回测表现优于其他主流机器学习方法(如DML、LGBM),验证了其在实际业务场景中的有效性与优越性。

链接: https://arxiv.org/abs/2603.29261
作者: Lakshya Garg,Sai Yaswanth,Deep Narayan Mishra,Karthik Kumaran,Anupriya Sharma,Mayank Uniyal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at AAIML 2026 (International Conference on Advances in Artificial Intelligence and Machine Learning). Copyright 2026 IEEE. 6 pages, 4 figures

点击查看摘要

Abstract:Item Price Elasticity is used to quantify the responsiveness of consumer demand to changes in item prices, enabling businesses to create pricing strategies and optimize revenue management. Sectors such as store retail, e-commerce, and consumer goods rely on elasticity information derived from historical sales and pricing data. This elasticity provides an understanding of purchasing behavior across different items, consumer discount sensitivity, and demand elastic departments. This information is particularly valuable for competitive markets and resource-constrained businesses decision making which aims to maximize profitability and market share. Price elasticity also uncovers historical shifts in consumer responsiveness over time. In this paper, we model item-level price elasticity using large-scale transactional datasets, by proposing a novel elasticity estimation framework which has the capability to work in an absence of treatment control setting. We test this framework by using Machine learning based algorithms listed below, including our newly proposed Monodense deep neural network. (1) Monodense-DL network – Hybrid neural network architecture combining embedding, dense, and Monodense layers (2) DML – Double machine learning setting using regression models (3) LGBM – Light Gradient Boosting Model We evaluate our model on multi-category retail data spanning millions of transactions using a back testing framework. Experimental results demonstrate the superiority of our proposed neural network model within the framework compared to other prevalent ML based methods listed above.
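价格弹性的经典基线是对数-对数回归:ln Q = a + e·ln P,斜率 e 即弹性。以下为该基线的示意实现(非论文的 Monodense 模型,仅用于说明弹性这一量的定义):

```python
import numpy as np

def log_log_elasticity(prices, quantities):
    """对数-对数回归估计价格弹性:拟合 ln Q = a + e·ln P,返回斜率 e。"""
    lp, lq = np.log(prices), np.log(quantities)
    slope, _intercept = np.polyfit(lp, lq, 1)  # 一次多项式最小二乘
    return float(slope)

# 构造弹性恰为 -2 的需求曲线 Q = 100 · P^{-2},回归应精确复原斜率
p = np.linspace(1.0, 5.0, 50)
q = 100.0 * p ** -2
print(round(log_log_elasticity(p, q), 2))  # → -2.0
```

论文中基于深度网络的方法可以看作用非线性模型替换这条回归线,并在没有处理组/对照组的条件下从交易数据中恢复逐商品的弹性。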

[AI-58] Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估体系中对“可靠性”(reliability)关注不足的问题。现有基准主要衡量模型在单次尝试中的能力(capability),但实际生产部署要求模型在不同持续时间的任务中保持一致的成功率,即可靠性。论文指出,随着任务时长增加,能力与可靠性会系统性地分离,而传统指标如pass@1在短任务上无法捕捉这种差异。其解决方案是提出一个可靠性科学框架(reliability science framework),包含四个核心指标:可靠性衰减曲线(Reliability Decay Curve, RDC)、方差放大因子(Variance Amplification Factor, VAF)、优雅退化评分(Graceful Degradation Score, GDS)和熔毁起始点(Meltdown Onset Point, MOP)。通过在396个任务、23,392个实验episode上的多模型评估,研究发现可靠性具有领域特异性、与能力层级存在显著分化,并且前沿模型因采用复杂策略反而更易发生熔毁,这表明可靠性应作为与能力并列的第一优先级评估维度。

链接: https://arxiv.org/abs/2603.29231
作者: Aaditya Khanal,Yangyang Tao,Junxiu Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 4 figures

点击查看摘要

Abstract:Existing benchmarks measure capability – whether a model succeeds on a single attempt – but production deployments require reliability – consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified – SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier – high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability. 
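能力与可靠性的分离可以用一个玩具计算说明:在各任务成功率独立同分布的简化假设下,pass@1 是逐次成功率的平均,而"k 次重复全部成功"的可靠性是逐任务 p^k 的平均,两者随 k 增大迅速拉开。以下为示意性草图(非论文的 RDC/VAF 等指标的官方定义):

```python
import numpy as np

def pass_at_1(successes):
    """单次尝试成功率(能力指标):所有尝试的平均成功率。"""
    return float(np.mean(successes))

def reliability(successes_per_task, k):
    """k 次独立重复全部成功的概率估计(可靠性指标):
    先按任务估计成功率 p_i,再对 p_i^k 取平均(假设重复间独立)。"""
    p = np.asarray([np.mean(s) for s in successes_per_task], dtype=float)
    return float(np.mean(p ** k))

# 两个任务:p=1.0 与 p=0.5。pass@1 = 0.75,但 k=5 时可靠性骤降
tasks = [[1, 1, 1, 1], [1, 0, 1, 0]]
print(pass_at_1([1, 1, 1, 1, 1, 0, 1, 0]))  # → 0.75
print(reliability(tasks, 5))                # → 0.515625
```

即便 pass@1 相同,成功率在任务间的分布不同也会导致可靠性差异,这与论文"pass@1 在结构上对该分离不敏感"的论点一致。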
Comments: 23 pages, 4 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.29231 [cs.AI] (or arXiv:2603.29231v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.29231 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Aaditya Khanal [view email] [v1] Tue, 31 Mar 2026 03:56:39 UTC (490 KB) Full-text links: Access Paper: View a PDF of the paper titled Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents, by Aaditya Khanal and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-03 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) 

[AI-59] Derived Fields Preserve Fine-Scale Detail in Budgeted Neural Simulators

【速读】:该论文旨在解决在固定存储预算下,神经网络模拟方法难以保持细尺度精度的问题。现有方法通常通过改进模型架构、训练目标或滚动预测策略来降低高频误差,但忽略了在压缩-量化-解码流水线中,状态信息在构建阶段即可能丢失精细结构这一关键瓶颈。论文提出导出场优化(Derived-Field Optimization, DerivOpt),其核心在于:基于校准的信道模型,设计最优的状态携带策略——即选择哪些物理场被保留,并在不同场之间合理分配有限的存储资源。实验表明,DerivOpt 在 PDEBench 的全时变前向子集上不仅提升了整体滚动预测的归一化均方根误差(nRMSE),更显著增强了细尺度保真度,且优势在初始输入阶段即可显现,说明携带状态设计是预算受限神经模拟中的首要设计维度。

链接: https://arxiv.org/abs/2603.29224
作者: Wenshuo Wang,Fan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-scale-faithful neural simulation under fixed storage budgets remains challenging. Many existing methods reduce high-frequency error by improving architectures, training objectives, or rollout strategies. However, under budgeted coarsen-quantize-decode pipelines, fine detail can already be lost when the carried state is constructed. In the canonical periodic incompressible Navier-Stokes setting, we show that primitive and derived fields undergo systematically different retained-band distortions under the same operator. Motivated by this observation, we formulate Derived-Field Optimization (DerivOpt), a general state-design framework that chooses which physical fields are carried and how storage budget is allocated across them under a calibrated channel model. Across the full time-dependent forward subset of PDEBench, DerivOpt not only improves pooled mean rollout nRMSE, but also delivers a decisive advantage in fine-scale fidelity over a broad set of strong baselines. More importantly, the gains are already visible at input time, before rollout learning begins. This indicates that the carried state is often the dominant bottleneck under tight storage budgets. These results suggest a broader conclusion: in budgeted neural simulation, carried-state design should be treated as a first-class design axis alongside architecture, loss, and rollout strategy.
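摘要中"决定携带哪些场、并在场之间分配存储预算"的思路,可以用一个极简的 Python 草图来示意:对每个候选场做均匀量化,贪心地把下一个比特分给重建误差下降最多的场。这是本文为说明而作的假设性简化(量化方式、分配策略与"玩具涡量"场均非论文的 DerivOpt 实现):

```python
import numpy as np

def quantize(field, bits):
    # Uniform quantization to 2**bits levels over the field's range.
    lo, hi = field.min(), field.max()
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((field - lo) / step) * step + lo

def greedy_bit_allocation(fields, total_bits, per_field_max=16):
    # Greedily give the next bit to the field whose reconstruction
    # MSE drops the most -- a toy stand-in for a budget split.
    alloc = {name: 1 for name in fields}
    spent = len(fields)
    while spent < total_bits:
        best, best_gain = None, -1.0
        for name, f in fields.items():
            b = alloc[name]
            if b >= per_field_max:
                continue
            gain = (np.mean((f - quantize(f, b)) ** 2)
                    - np.mean((f - quantize(f, b + 1)) ** 2))
            if gain > best_gain:
                best, best_gain = name, gain
        alloc[best] += 1
        spent += 1
    return alloc

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 64))                         # primitive field (e.g. velocity)
w = np.gradient(u, axis=0) - np.gradient(u, axis=1)   # derived field (toy "vorticity")
alloc = greedy_bit_allocation({"u": u, "w": w}, total_bits=16)
print(alloc)
```

真实方法还需校准信道模型并在滚动预测下评估,这里只展示"预算分配是一个可显式优化的设计轴"。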

[AI-60] Software Vulnerability Detection Using a Lightweight Graph Neural Network

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在漏洞检测中因计算资源需求高而导致的可扩展性不足问题。其解决方案的关键在于提出一种基于图神经网络(Graph Neural Network, GNN)的轻量级深度学习模型VulGNN,该模型利用代码的天然图结构关系,在性能上接近LLMs的同时,模型规模缩小100倍且具备快速重训练与定制化能力,从而实现高效、可部署于边缘设备的漏洞分析方案。

链接: https://arxiv.org/abs/2603.29216
作者: Miles Farmer,Ekincan Ufuktepe,Anne Watson,Hialo Muniz Carvalho,Vadim Okun,Zineb Maasaoui,Kannappan Palaniappan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, preprint of journal submission

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as a popular choice in vulnerability detection studies given their foundational capabilities, open source availability, and variety of models, but have limited scalability due to extensive compute requirements. Using the natural graph relational structure of code, we show that our proposed graph neural network (GNN) based deep learning model VulGNN for vulnerability detection can achieve performance almost on par with LLMs, but is 100 times smaller in size and fast to retrain and customize. We describe the VulGNN architecture, ablation studies on components, learning rates, and generalizability to different code datasets. As a lightweight model for vulnerability analysis, VulGNN is efficient and deployable at the edge as part of real-world software development pipelines.
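论文未公开 VulGNN 的具体层定义,这里用 numpy 给出 GCN 风格(带自环的对称归一化传播)单层图卷积的最小草图,说明"利用代码的图结构关系"在一层消息传递中的含义(玩具示例,邻接矩阵代表一条 4 语句的控制流链,并非论文实现):

```python
import numpy as np

def gcn_layer(A, X, W):
    # One graph-convolution layer: add self-loops, symmetrically
    # normalize the adjacency, propagate features, apply ReLU.
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Toy code graph: 4 statements in a chain-shaped control flow.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)                      # one-hot node (statement) features
rng = np.random.default_rng(0)
H = gcn_layer(A, X, rng.normal(size=(4, 8)))
print(H.shape)
```

每个节点的新表示只依赖其邻域,因而模型规模与图大小解耦,这正是轻量化与可部署性的来源。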

[AI-61] Route-Induced Density and Stability (RIDE): Controlled Intervention and Mechanism Analysis of Routing-Style Meta Prompts on LLM Internal States

【速读】:该论文旨在验证“稀疏性—确定性假说”(Sparsity–Certainty Hypothesis),即认为路由至特定任务专家会激活更稀疏的内部计算,从而产生更确定和稳定的输出。为测试该假说,作者提出一种基于元提示(meta prompts)的文本代理方法,将路由信号以自然语言形式注入冻结的指令微调大语言模型(instruction-tuned LLMs)前缀中,进而量化三个核心指标:(C1)通过激活稀疏度衡量内部表示密度;(C2)域关键词注意力变化;(C3)输出稳定性(通过预测熵与语义变异度量)。实验结果表明,元提示反而使早期/中间层表征更加密集,且不同模型对关键词注意力的响应存在异质性;密度与稳定性的关联整体较弱且仅在Qwen中出现,Llama与Mistral的相关性接近于零。因此,研究的关键在于构建一个可解释、可量化、适用于多模型的诊断探针——RIDE(Route-Induced Density and Stability),用于校准路由设计并改进不确定性估计。

链接: https://arxiv.org/abs/2603.29206
作者: Dianxing Zhang,Gang Li,Sheng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Routing is widely used to scale large language models, from Mixture-of-Experts gating to multi-model/tool selection. A common belief is that routing to a task "expert" activates sparser internal computation and thus yields more certain and stable outputs (the Sparsity–Certainty Hypothesis). We test this belief by injecting routing-style meta prompts as a textual proxy for routing signals in front of frozen instruction-tuned LLMs. We quantify (C1) internal density via activation sparsity, (C2) domain-keyword attention, and (C3) output stability via predictive entropy and semantic variation. On a RouterEval subset with three instruction-tuned models (Qwen3-8B, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.2), meta prompts consistently densify early/middle-layer representations rather than increasing sparsity; natural-language expert instructions are often stronger than structured tags. Attention responses are heterogeneous: Qwen/Llama reduce keyword attention, while Mistral reinforces it. Finally, the densification–stability link is weak and appears only in Qwen, with near-zero correlations in Llama and Mistral. We present RIDE as a diagnostic probe for calibrating routing design and uncertainty estimation.
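摘要中的 (C1) 激活稀疏度与 (C3) 预测熵两个指标,可以用如下 Python 草图直接计算(仅为这两个常见指标定义的一种实现,阈值 eps 等为假设值,非论文代码):

```python
import numpy as np

def activation_sparsity(h, eps=1e-3):
    # C1 proxy: fraction of activations with magnitude below eps.
    return float(np.mean(np.abs(h) < eps))

def predictive_entropy(logits):
    # C3 proxy: Shannon entropy of the softmax distribution.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

h = np.array([0.0, 0.5, 0.0, -2.0, 0.0002, 1.0])
print(activation_sparsity(h))   # 0.5: three of six entries are near zero
print(predictive_entropy(np.array([2.0, 0.5, 0.1])))
```

稀疏度越低表示表征越“密集”(对应摘要中的 densification),预测熵越高表示输出越不确定。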

[AI-62] Improving Ensemble Forecasts of Abnormally Deflecting Tropical Cyclones with Fused Atmosphere-Ocean-Terrain Data

【速读】:该论文旨在解决当前基于深度学习的热带气旋(Tropical Cyclone, TC)预报方法存在的两大关键问题:一是现有模型仅能处理单一类型的时间序列轨迹数据或同质气象变量,难以融合多源异构信息;二是无法准确预测异常偏转的TC路径。解决方案的核心在于两个创新:其一,构建了首个面向西北太平洋区域的多模态、多源数据集AOT-TCs,首次整合了大气、海洋与陆地的异构变量,形成信息丰富的气象数据基础;其二,提出一种显式耦合大气-海洋-地形结构的预报模型,首次实现跨物理域复杂相互作用的有效捕捉,从而在2017–2024年所有TC案例上均取得最优性能,显著提升正常TC预报精度,并突破异常偏转TC预报的技术瓶颈。

链接: https://arxiv.org/abs/2603.29200
作者: Qixiang Li,Shuwei Huo,Chong Wang,Xiaofeng Li,Yuan Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning-based tropical cyclone (TC) forecasting methods have demonstrated significant potential and application advantages, as they feature much lower computational cost and faster operation speed than numerical weather prediction models. However, existing deep learning methods still have key limitations: they can only process a single type of sequential trajectory data or homogeneous meteorological variables, and fail to achieve accurate forecasting of abnormal deflected TCs. To address these challenges, we present two groundbreaking contributions. First, we have constructed a multimodal and multi-source dataset named AOT-TCs for TC forecasting in the Northwest Pacific basin. As the first dataset of its kind, it innovatively integrates heterogeneous variables from the atmosphere, ocean, and land, thus obtaining a comprehensive and information-rich meteorological dataset. Second, based on the AOT-TCs dataset, we propose a forecasting model that can handle both normal and abnormally deflected TCs. This is the first TC forecasting model to adopt an explicit atmosphere-ocean-terrain coupling architecture, enabling it to effectively capture complex interactions across physical domains. Extensive experiments on all TC cases in the Northwest Pacific from 2017 to 2024 show that our model achieves state-of-the-art performance in TC forecasting: it not only significantly improves the forecasting accuracy of normal TCs but also breaks through the technical bottleneck in forecasting abnormally deflected TCs.

[AI-63] AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering and Construction

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在建筑、工程与施工(AEC)领域中缺乏统一、多模态基准测试的问题,以客观评估智能体系统在真实任务中的性能。其解决方案的关键在于构建 AEC-Bench——一个涵盖图纸理解、跨图纸推理及项目级协调等复杂任务的多模态基准,通过明确的数据集分类体系、标准化的评估协议以及对多个领域专用基础模型(如 Claude Code 和 Codex)的基线测试,识别出能普遍提升不同基础模型表现的工具和 harness 设计策略。研究还开源了完整数据集、代理 harness 和评估代码,确保结果可复现性。

链接: https://arxiv.org/abs/2603.29199
作者: Harsh Mankodiya,Chase Gallik,Theodoros Galanos,Andriy Mulyar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The AEC-Bench is a multimodal benchmark for evaluating agentic systems on real-world tasks in the Architecture, Engineering, and Construction (AEC) domain. The benchmark covers tasks requiring drawing understanding, cross-sheet reasoning, and construction project-level coordination. This report describes the benchmark motivation, dataset taxonomy, evaluation protocol, and baseline results across several domain-specific foundation model harnesses. We use AEC-Bench to identify consistent tools and harness design techniques that uniformly improve performance across foundation models in their own base harnesses, such as Claude Code and Codex. We openly release our benchmark dataset, agent harness, and evaluation code for full replicability at this https URL under an Apache 2 license.

[AI-64] IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection

【速读】:该论文旨在解决开放集时间序列异常检测(Open-set Time Series Anomaly Detection, OSTAD)中的两大挑战:一是现有方法依赖简单增强策略生成伪异常,难以保留时间序列的时序特性,导致生成的异常模式不真实;二是当训练数据中混入未标注异常时,模型性能显著下降。解决方案的关键在于提出IMPACT框架,其核心创新为:首先学习一个影响函数(influence function),精准量化每个训练样本对模型参数的影响;进而利用这些影响得分生成语义上差异显著但物理上真实的未见异常样本,并将高影响力样本重新用作监督异常进行异常去污染(anomaly decontamination)。这一机制有效提升了模型在不同开放集设置和异常污染率下的检测准确率。

链接: https://arxiv.org/abs/2603.29183
作者: Xiaohui Zhou,Yijie Wang,Hongzuo Xu,Weixuan Liang,Xiaoli Li,Guansong Pang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 15 figures

点击查看摘要

Abstract:Open-set anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces IMPACT, a novel framework that leverages influence modeling for open-set time series anomaly detection, to tackle these challenges. The key insight is to i) learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then ii) leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influence samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates.
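摘要中"估计单个训练样本对模型的影响"这一影响函数思想,可用线性回归的闭式留一(leave-one-out)影响作最小示意:样本 i 被移除后参数的变化有解析解,影响大的样本往往是异常点。这是该思想的一个经典特例,并非论文方法本身:

```python
import numpy as np

def loo_influence(X, y):
    # Closed-form leave-one-out influence for least squares:
    # row i is beta - beta_{-i}, i.e. how much the fitted
    # parameters move if sample i is removed.
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                        # residuals
    h = np.diag(X @ XtX_inv @ X.T)          # leverage (hat-matrix diagonal)
    return (XtX_inv @ (X * (e / (1.0 - h))[:, None]).T).T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
y[0] += 5.0                                 # inject an "anomalous" training sample
infl = np.linalg.norm(loo_influence(X, y), axis=1)
print(int(np.argmax(infl)))                 # expected: 0, the contaminated sample
```

IMPACT 正是把这类"高影响力"样本重新用作监督异常,实现摘要所述的 anomaly decontamination。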

[AI-65] Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping

【速读】:该论文旨在解决现代动态交互式网站(dynamic, interactive websites)中传统静态HTML解析方法失效的问题,这类网站通常需要复杂的用户交互操作才能获取目标数据,而现有爬虫技术往往脆弱且需针对每个站点进行手动定制。解决方案的关键在于提出一个名为Webscraper的框架,其核心创新是利用多模态大语言模型(Multimodal Large Language Model, MLLM)实现自主导航、调用专用工具并完成结构化数据提取;该框架采用五阶段结构化提示(prompting)流程与自研工具集协同工作,有效应对具有“索引-内容”架构的网页环境,在新闻网站和电商平台上均验证了其高准确率与良好泛化能力。

链接: https://arxiv.org/abs/2603.29161
作者: Guan-Lun Huang,Yuh-Jzer Joung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern web scraping struggles with dynamic, interactive websites that require more than static HTML parsing. Current methods are often brittle and require manual customization for each site. To address this, we introduce Webscraper, a framework designed to handle the challenges of modern, dynamic web applications. It leverages a Multimodal Large Language Model (MLLM) to autonomously navigate interactive interfaces, invoke specialized tools, and perform structured data extraction in environments where traditional scrapers are ineffective. Webscraper utilizes a structured five-stage prompting procedure and a set of custom-built tools to navigate and extract data from websites following the common "index-and-content" architecture. Our experiments, conducted on six news websites, demonstrate that the full Webscraper framework, equipped with both our guiding prompt and specialized tools, achieves a significant improvement in extraction accuracy over the baseline agent Anthropic’s Computer Use. We also applied the framework to e-commerce platforms to validate its generalizability.
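下面是一个离线的 Python 草图,演示摘要所述"index-and-content"两段式抓取的控制流:先从索引页收集文章链接,再逐页提取结构化字段。示例 HTML、URL 与解析规则均为本文虚构;真实系统中导航决策与工具调用由 MLLM 代理驱动:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # Stage 1: collect article links from an "index" page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("/articles/"):
                self.links.append(href)

class TitleParser(HTMLParser):
    # Stage 2: pull the headline out of a "content" page.
    def __init__(self):
        super().__init__()
        self.in_h1, self.title = False, ""
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
    def handle_data(self, data):
        if self.in_h1:
            self.title += data

INDEX = '<ul><li><a href="/articles/1">A</a></li><li><a href="/articles/2">B</a></li></ul>'
PAGES = {"/articles/1": "<h1>Budget passes</h1><p>...</p>",
         "/articles/2": "<h1>Port reopens</h1><p>...</p>"}

lp = LinkParser(); lp.feed(INDEX)
records = []
for url in lp.links:               # an MLLM agent would drive this loop
    tp = TitleParser(); tp.feed(PAGES[url])
    records.append({"url": url, "title": tp.title})
print(records)
```

动态站点还需要点击、滚动等交互工具,这正是纯静态解析(如本例)失效、需要代理介入之处。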

[AI-66] SimMOF: AI agent for Automated MOF Simulations

【速读】:该论文旨在解决金属有机框架(Metal-organic frameworks, MOFs)计算模拟难以访问的问题,具体表现为:可靠分析依赖专家在工作流构建、参数选择、工具互操作性及计算就绪结构准备等方面的决策,限制了MOF研究的效率与可扩展性。解决方案的关键在于提出SimMOF——一个基于大语言模型(Large Language Model, LLM)的多智能体框架,能够从自然语言查询自动执行端到端的MOF模拟流程,包括将用户请求转化为依赖感知的计划、生成可运行输入、协调多个智能体执行模拟,并输出与查询对齐的分析结果,从而实现适应性强且认知自主的工作流,体现人类研究人员的迭代决策行为,为数据驱动的MOF研究提供可扩展基础。

链接: https://arxiv.org/abs/2603.29152
作者: Jaewoong Lee,Taeun Bae,Jihan Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Metal-organic frameworks (MOFs) offer a vast design space, and as such, computational simulations play a critical role in predicting their structural and physicochemical properties. However, MOF simulations remain difficult to access because reliable analysis requires expert decisions for workflow construction, parameter selection, tool interoperability, and the preparation of computation-ready structures. Here, we introduce SimMOF, a large language model-based multi-agent framework that automates end-to-end MOF simulation workflows from natural language queries. SimMOF translates user requests into dependency-aware plans, generates runnable inputs, orchestrates multiple agents to execute simulations, and summarizes results with analysis aligned to the user query. Through representative case studies, we show that SimMOF enables adaptive and cognitively autonomous workflows that reflect the iterative and decision-driven behavior of human researchers and as such provides a scalable foundation for data-driven MOF research.
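摘要中"将用户请求转化为依赖感知的计划(dependency-aware plans)"的调度骨架,可以用标准库的拓扑排序示意:每个任务声明其依赖,排序结果即合法的执行顺序。任务名称为本文虚构,仅作说明:

```python
from graphlib import TopologicalSorter

# A toy dependency-aware simulation plan: each task lists the
# tasks whose outputs it consumes (names are illustrative only).
plan = {
    "clean_structure":   set(),
    "assign_charges":    {"clean_structure"},
    "gcmc_adsorption":   {"assign_charges"},
    "md_diffusion":      {"assign_charges"},
    "summarize_results": {"gcmc_adsorption", "md_diffusion"},
}
order = list(TopologicalSorter(plan).static_order())
print(order)
```

无依赖关系的任务(此处 gcmc_adsorption 与 md_diffusion)可交由不同智能体并行执行,这正是多智能体编排的调度依据。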

[AI-67] Knowledge database development by large language models for countermeasures against viruses and marine toxins

【速读】:该论文旨在解决当前缺乏针对病毒(如拉沙、马尔堡、埃博拉、尼帕和委内瑞拉马脑炎病毒)及海洋毒素的综合性治疗对策数据库的问题,从而阻碍了有效治疗方法的研发与决策效率。解决方案的关键在于利用两种大型语言模型(LLMs)——ChatGPT 和 Grok——构建可交互的、结构化的知识数据库,并通过高阶人工输入引导模型识别公共数据源、收集文献信息、迭代交叉验证数据,最终生成易于访问的网页界面;其中,ChatGPT 还进一步设计了由两个 AI 代理组成的智能体工作流(agentic AI workflows),用于对候选治疗对策进行排序,从而实现基于证据的决策支持。

链接: https://arxiv.org/abs/2603.29149
作者: Hung N. Do,Jessica Z. Kubicek-Sutherland,S. Gnanakaran
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Clearance: 26-T-0967 (DOW)

点击查看摘要

Abstract:Access to the most up-to-date information on medical countermeasures is important for the research and development of effective treatments for viruses and marine toxins. However, there is a lack of comprehensive databases that curate data on viruses and marine toxins, making decisions on medical countermeasures slow and difficult. In this work, we employ two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis), as well as marine toxins. With high-level human-provided inputs, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from these databases and the literature, iteratively cross-validate the collected information, and design interactive webpages for easy access to the curated, comprehensive databases. Notably, the ChatGPT LLM is employed to design agentic AI workflows (consisting of two AI agents for research and decision-making) to rank countermeasures for viruses and marine toxins in the databases. Together, our work explores the potential of LLMs as a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making.

[AI-68] Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification

【速读】:该论文旨在解决大规模图数据上图卷积网络(Graph Convolutional Network, GCN)训练时面临的高计算开销问题,尤其是当图结构中卷积层数量较多时。现有方法虽采用采样或图粗化技术缓解此问题,但部分方法忽略了图结构中的多粒度信息,而某些粗化方法的时间复杂度仍然较高。本文提出了一种高效且可扩展的粒球图粗化方法(Efficient and Scalable Granular-ball Graph Coarsening Method),其关键在于:首先利用多粒度粒球图粗化算法将原图转化为多个子图,该阶段时间复杂度为线性,显著低于现有粗化方法;随后随机采样由粒球构成的子图作为mini-batch进行GCN训练,从而自适应地大幅降低图规模,提升GCN的训练效率与可扩展性。

链接: https://arxiv.org/abs/2603.29148
作者: Guan Wang,Shuyin Xia,Lei Qian,Guoyin Wang,Yi Liu,Yi Wang,Wei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Convolutional Network (GCN) is a model that can effectively handle graph data tasks and has been successfully applied. However, for large-scale graph datasets, GCN still faces the challenge of high computational overhead, especially when the number of convolutional layers in the graph is large. Currently, there are many advanced methods that use various sampling techniques or graph coarsening techniques to alleviate the inconvenience caused during training. However, among these methods, some ignore the multi-granularity information in the graph structure, and the time complexity of some coarsening methods is still relatively high. In response to these issues, based on our previous work, in this paper, we propose a new framework called Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification. Specifically, this method first uses a multi-granularity granular-ball graph coarsening algorithm to coarsen the original graph to obtain many subgraphs. The time complexity of this stage is linear and much lower than that of the existing graph coarsening methods. Then, subgraphs composed of these granular-balls are randomly sampled to form minibatches for training GCN. Our algorithm can adaptively and significantly reduce the scale of the original graph, thereby enhancing the training efficiency and scalability of GCN. Ultimately, the experimental results of node classification on multiple datasets demonstrate that the method proposed in this paper exhibits superior performance. The code is available at this https URL.
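图粗化"线性时间复杂度"的来源可以用一个极简草图说明:给定节点到粒球(超节点)的指派,只需对边集做一次线性扫描即可累积出粗图的邻接矩阵。指派方式在此为手工给定的假设,论文中由粒球生成算法得到:

```python
import numpy as np

def coarsen(edges, assignment, num_balls):
    # Collapse nodes into their assigned "balls" and accumulate
    # inter-ball edge weights -- one linear pass over the edges.
    A = np.zeros((num_balls, num_balls))
    for u, v in edges:
        bu, bv = assignment[u], assignment[v]
        if bu != bv:
            A[bu, bv] += 1
            A[bv, bu] += 1
    return A

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]  # a 6-cycle
assignment = [0, 0, 0, 1, 1, 1]   # two balls of three nodes each
A = coarsen(edges, assignment, num_balls=2)
print(A)   # two inter-ball edges survive: (2,3) and (0,5)
```

随后即可在这些粗化子图上随机采样 minibatch 训练 GCN,使计算规模随粒球数而非原始节点数增长。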

[AI-69] Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health

【速读】:该论文旨在解决老龄化护理数字健康领域中需求工程(Requirements Engineering, RE)缺乏对人类因素(human aspects)系统性量化与验证的问题,特别是如何基于真实用户数据识别并优先排序影响需求优先级的关键人类因素,以实现更具包容性和循证的需求分析。解决方案的关键在于采用混合方法研究设计:首先利用可解释机器学习(Explainable Machine Learning)从103名老年人、105名开发者和41名照护者的数据中识别出8个老龄健康数字主题下与需求优先级最强相关的若干人类因素;随后通过12次半结构化访谈对量化结果进行定性验证与解读,从而揭示不同利益相关者群体之间的显著认知错位。这一方法不仅明确了驱动需求优先级的核心人类因素及其方向性效应,还提出了一种结合机器学习重要性排序与质性验证的可解释、以人为本的需求工程框架,强调必须显式区分并整合各利益相关者视角,而非简单聚合为单一整体视图。

链接: https://arxiv.org/abs/2603.29114
作者: Yuqing Xiao,John Grundy,Anuradha Madugalla,Elizabeth Manias
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Requirements engineering for aged-care digital health must account for human aspects, because requirement priorities are shaped not only by technical functionality but also by stakeholders’ health conditions, socioeconomics, and lived experience. Knowing which human aspects matter most, and for whom, is critical for inclusive and evidence-based requirements prioritisation. Yet in practice, while some studies have examined human aspects in RE, they have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to determine which aspects matter most and why. To address this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied an explainable machine learning approach to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, and then conducted 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal their directional effects, and expose substantial misalignment across stakeholder groups. Together, these findings show that human-centric requirements analysis should engage stakeholder groups explicitly rather than collapsing their perspectives into a single aggregate view. This paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.

[AI-70] SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization

【速读】:该论文旨在解决语义错误(semantic bugs)场景下的程序故障定位问题,这类错误在传统基于语法谱(syntactic spectra)的方法中难以识别,因为失败和通过的执行路径完全一致,仅在语义意图是否满足上存在差异。现有基于大语言模型(LLM)的方法虽引入了语义推理能力,但其输出具有随机性和不可验证性,无法系统地跨测试用例关联或区分根本原因与级联效应。解决方案的关键在于提出SemLoc框架,通过将自由形式的LLM推理转化为结构化的中间表示(intermediate representation),将每个推断出的属性绑定到类型化的程序锚点(typed program anchor),从而实现运行时检查和程序结构归因;同时,通过执行插桩程序构建语义违规谱(semantic violation spectrum)——一个约束-测试矩阵,并结合反事实验证步骤,进一步剔除过度近似的约束、隔离主要因果违规项,显著提升定位精度与可解释性。

链接: https://arxiv.org/abs/2603.29109
作者: Zhaorui Yang,Haichao Zhu,Qian Zhang,Rajiv Gupta,Ashish Kundu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fault localization identifies program locations responsible for observed failures. Existing techniques rank suspicious code using syntactic spectra–signals derived from execution structure such as statement coverage, control-flow divergence, or dependency reachability. These signals collapse for semantic bugs, where failing and passing executions follow identical code paths and differ only in whether semantic intent is satisfied. Recent LLM-based approaches introduce semantic reasoning but produce stochastic, unverifiable outputs that cannot be systematically cross-referenced across tests or distinguish root causes from cascading effects. We present SemLoc, a fault localization framework based on structured semantic grounding. SemLoc converts free-form LLM reasoning into a closed intermediate representation that binds each inferred property to a typed program anchor, enabling runtime checking and attribution to program structure. It executes instrumented programs to construct a semantic violation spectrum–a constraint-by-test matrix–from which suspiciousness scores are derived analogously to coverage-based methods. A counterfactual verification step further prunes over-approximate constraints and isolates primary causal violations. We evaluate SemLoc on SemFault-250, a corpus of 250 Python programs with single semantic faults. SemLoc outperforms five coverage-, reduction-, and LLM-based baselines, achieving Top-1 accuracy of 42.8% and Top-3 of 68%, while reducing inspection to 7.6% of executable lines. Counterfactual verification provides an additional 12% accuracy gain and identifies primary causal semantic constraints.
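摘要中"由约束-测试矩阵类比覆盖率方法导出可疑度得分"的一步,可以用覆盖率定位中经典的 Ochiai 公式示意(假设性示例,矩阵数据为虚构,论文的具体打分方式未必相同):

```python
import numpy as np

def ochiai(violations, failing):
    # violations[c, t] = 1 if constraint c is violated on test t;
    # failing[t] = 1 if test t fails. Returns one score per constraint:
    # violated-on-failing count normalized by sqrt(#failing * #violated).
    n_f = failing.sum()
    v_f = (violations * failing).sum(axis=1)
    v_all = violations.sum(axis=1)
    return v_f / np.sqrt(n_f * np.maximum(v_all, 1))

# 3 constraints x 4 tests; tests 2 and 3 fail.
V = np.array([[0, 0, 1, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 1]])
failing = np.array([0, 0, 1, 1])
scores = ochiai(V, failing)
print(int(np.argmax(scores)))   # constraint 0: violated exactly on the failing tests
```

约束 0 只在失败测试上被违反,得分 1.0;约束 1 在所有测试上都被违反(过度近似),得分被稀释,对应论文中反事实验证要剔除的那类约束。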

[AI-71] WybeCoder: Verified Imperative Code Generation

【速读】:该论文旨在解决软件验证(software verification)在大型语言模型(Large Language Models, LLMs)驱动的自动代码生成中进展缓慢的问题。当前LLMs在代码生成和形式化定理证明方面取得显著进步,但对程序正确性的自动化验证仍存在瓶颈。解决方案的关键在于提出WybeCoder——一个代理式代码验证框架(agentic code verification framework),其核心机制是实现“边生成边证明”(prove-as-you-generate)开发范式,使代码、不变量(invariants)与证明三者协同演化。该框架基于结合自动验证条件生成(automatic verification condition generation)、SMT求解器与Lean中的交互式证明的最新方法,并通过将两个功能性验证基准(Verina和Clever)转化为等效的指令式代码规范,实现了系统性评估。实验表明,该方法在复杂算法如堆排序(Heapsort)上能合成数十个有效不变量并分派数十个子目标,最终生成数百行已验证代码,显著优于先前工作,在适度计算资源下分别解决了74%的Verina任务和62%的Clever任务。

链接: https://arxiv.org/abs/2603.29088
作者: Fabian Gloeckle,Mantas Baksys,Darius Feher,Kunhao Zheng,Amaury Hayat,Sean B. Holden,Gabriel Synnaeve,Peter O’Hearn
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development where code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, to equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements by scaling our approach, synthesizing dozens of valid invariants and dispatching dozens of subgoals, resulting in hundreds of lines of verified code, overcoming plateaus reported in previous works. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path to automated construction of large-scale datasets of verified imperative code.

[AI-72] PAR²-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多跳问答(Multi-Hop Question Answering, MHQA)任务中表现脆弱的问题,即模型需从多个文档中检索并整合证据进行推理,而传统迭代检索系统易陷入低召回率路径并放大错误,而仅依赖规划的方法则难以适应中间证据变化。解决方案的关键在于提出一种两阶段框架——计划式主动检索与推理RAG(Planned Active Retrieval and Reasoning RAG, PAR²-RAG),其核心创新是将“覆盖范围”(coverage)与“承诺”(commitment)分离:第一阶段采用广度优先锚定策略构建高召回证据边界,第二阶段通过深度优先精炼结合证据充分性控制,在迭代循环中动态调整检索路径,从而显著提升MHQA准确率与检索质量。

链接: https://arxiv.org/abs/2603.29085
作者: Xingyu Li,Rongguang Wang,Yuying Wang,Mengqing Guo,Chenyang Li,Tao Sheng,Sujith Ravi,Dan Roth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate evidence changes. We propose Planned Active Retrieval and Reasoning RAG (PAR²-RAG), a two-stage framework that separates coverage from commitment. PAR²-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence sufficiency control in an iterative loop. Across four MHQA benchmarks, PAR²-RAG consistently outperforms existing state-of-the-art baselines; compared with IRCoT, PAR²-RAG achieves up to 23.5% higher accuracy, with retrieval gains of up to 10.5% in NDCG.
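PAR²-RAG 的两阶段控制流(广度优先锚定 + 带充分性判断的深度优先精炼)可以用一个离线玩具检索器示意。语料、关键词打分方式与充分性判据均为本文虚构的假设,只为展示"覆盖与承诺分离"的循环结构:

```python
# A toy corpus and keyword "retriever" standing in for a real search
# backend; the two-stage control flow is the point, not the retrieval.
CORPUS = {
    "d1": "alice was born in paris",
    "d2": "paris is the capital of france",
    "d3": "france uses the euro",
}

def retrieve(query, k=2):
    scored = [(sum(w in text for w in query.split()), doc)
              for doc, text in CORPUS.items()]
    return [doc for s, doc in sorted(scored, reverse=True)[:k] if s > 0]

def par2_rag(subqueries, sufficient, max_rounds=3):
    # Stage 1 (breadth-first): anchor one retrieval per planned sub-query.
    evidence = []
    for q in subqueries:
        evidence += [d for d in retrieve(q) if d not in evidence]
    # Stage 2 (depth-first): refine until evidence is judged sufficient.
    for _ in range(max_rounds):
        if sufficient(evidence):
            break
        q = " ".join(CORPUS[d] for d in evidence)   # expand from current evidence
        evidence += [d for d in retrieve(q, k=3) if d not in evidence]
    return evidence

ev = par2_rag(["where was alice born"], sufficient=lambda ev: "d3" in ev)
print(ev)
```

第一阶段只锚定到 d1,第二阶段沿已有证据逐跳扩展到 d2、d3,对应多跳问答中"中间证据改变后续查询"的适应性。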

[AI-73] The Future of AI is Many Not One

【速读】:该论文试图解决当前生成式 AI(Generative AI)研究与应用中过度强调个体智能的范式问题,即如何通过现有 AI 模型实现真正突破性的科学发现与创新。其核心论点是:单一超级智能体难以产生深度认知突破,而由知识背景多元的 AI 代理组成的协作团队更可能激发创造性解决方案。解决方案的关键在于构建“多样化 AI 团队”,这种架构能扩展问题求解空间、延缓过早共识形成,并支持非常规路径探索,从而克服当前模型受限于历史数据、缺乏创造洞察力的局限性。

链接: https://arxiv.org/abs/2603.29075
作者: Daniel J. Singer,Luca Garzino Demo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 0 figures

点击查看摘要

Abstract:The way we’re thinking about generative AI right now is fundamentally individual. We see this not just in how users interact with models but also in how models are built, how they’re benchmarked, and how commercial and research strategies using AI are defined. We argue that we should abandon this approach if we’re hoping for AI to support groundbreaking innovation and scientific discovery. Drawing on research and formal results in complex systems, organizational behavior, and philosophy of science, we show why we should expect deep intellectual breakthroughs to come from epistemically diverse groups of AI agents working together rather than singular superintelligent agents. Having a diverse team broadens the search for solutions, delays premature consensus, and allows for the pursuit of unconventional approaches. Developing diverse AI teams also addresses AI critics’ concerns that current models are constrained by past data and lack the creative insight required for innovation. The upshot, we argue, is that the future of transformative transformer-based AI is fundamentally many, not one.

[AI-74] On the Mirage of Long-Range Dependency with an Application to Integer Multiplication

【速读】:该论文试图解决神经网络在处理整数乘法时难以实现长度泛化的问题,传统观点认为这是由进位链(carry chain)引发的O(n)长程依赖所导致。论文指出这一诊断存在根本性错误:长程依赖并非乘法任务的本质属性,而是由计算时空(computational spacetime)的选择所诱导的“幻象”(mirage)。解决方案的关键在于重新设计输入表示——将两个n位二进制整数排列为二维外积网格(outer-product grid),使得长乘法的每一步均可被3×3局部邻域操作替代,从而将全局依赖转化为局部结构。在此表示下,仅需321个可学习参数的神经元细胞自动机(neural cellular automaton)即可实现高达训练范围683倍的长度泛化,而其他主流架构如Transformer、Mamba等在同一表示下均失败,凸显了计算时空选择对模型能力的根本影响。

链接: https://arxiv.org/abs/2603.29069
作者: Zichao Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Integer multiplication has long been considered a hard problem for neural networks, with the difficulty widely attributed to the O(n) long-range dependency induced by carry chains. We argue that this diagnosis is wrong: long-range dependency is not an intrinsic property of multiplication, but a mirage produced by the choice of computational spacetime. We formalize the notion of mirage and provide a constructive proof: when two n-bit binary integers are laid out as a 2D outer-product grid, every step of long multiplication collapses into a 3×3 local neighborhood operation. Under this representation, a neural cellular automaton with only 321 learnable parameters achieves perfect length generalization up to 683× the training range. Five alternative architectures – including Transformer (6,625 params), Transformer+RoPE, and Mamba – all fail under the same representation. We further analyze how partial successes locked the community into an incorrect diagnosis, and argue that any task diagnosed as requiring long-range dependency should first be examined for whether the dependency is intrinsic to the task or induced by the computational spacetime.
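作为摘要中"外积网格使长乘法局部化"的一个可运行注解,下面的 Python 草图把两个 n 位整数展开为部分积网格,再沿各输出位位置做逐位进位。这是对论文思想的算法化示意,并非其神经元细胞自动机实现:

```python
def grid_multiply(a, b, n_bits=8):
    # Lay the operands out as an outer-product grid of 1-bit partial
    # products; bit A[i] & B[j] contributes to output position i + j.
    A = [(a >> i) & 1 for i in range(n_bits)]
    B = [(b >> j) & 1 for j in range(n_bits)]
    grid = [[A[i] & B[j] for j in range(n_bits)] for i in range(n_bits)]
    out, carry = 0, 0
    for k in range(2 * n_bits):                  # k-th output bit position
        s = carry + sum(grid[i][k - i]
                        for i in range(n_bits) if 0 <= k - i < n_bits)
        out |= (s & 1) << k                      # keep the low bit locally
        carry = s >> 1                           # pass the rest one step right
    return out

print(grid_multiply(13, 11))   # 143
```

进位只在相邻输出位之间传递一步,这正是"全局进位链"在该表示下退化为局部邻域操作的含义。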

[AI-75] CivicShield: A Cross-Domain Defense-in-Depth Framework for Securing Government-Facing AI Chatbots Against Multi-Turn Adversarial Attacks

【速读】:该论文旨在解决政府服务中基于大语言模型(Large Language Model, LLM)的聊天机器人面临的关键安全漏洞问题,特别是多轮对抗攻击(multi-turn adversarial attacks)对现有单层防护机制(single-layer guardrails)造成的高成功率(>90%)。解决方案的核心是提出CivicShield框架,这是一种跨领域纵深防御体系,融合网络安全部署、形式化验证、生物免疫系统、航空安全及零信任密码学等理念,构建七层防御结构:包括基于能力的零信任访问控制、边界输入验证、语义防火墙、对话状态机与安全不变量、行为异常检测、多模型共识验证以及分层人工介入升级机制。理论分析和实证评估表明,该架构可将攻击成功率降低1-2个数量级,同时在复杂场景下保持高检测率与低误报率,有效填补了AI安全、政府合规与实际部署之间的空白。

链接: https://arxiv.org/abs/2603.29062
作者: KrishnaSaiReddy Patil
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 25 pages, 17 tables, 2 figures

点击查看摘要

Abstract:LLM-based chatbots in government services face critical security gaps. Multi-turn adversarial attacks achieve over 90% success against current defenses, and single-layer guardrails are bypassed with similar rates. We present CivicShield, a cross-domain defense-in-depth framework for government-facing AI chatbots. Drawing on network security, formal verification, biological immune systems, aviation safety, and zero-trust cryptography, CivicShield introduces seven defense layers: (1) zero-trust foundation with capability-based access control, (2) perimeter input validation, (3) semantic firewall with intent classification, (4) conversation state machine with safety invariants, (5) behavioral anomaly detection, (6) multi-model consensus verification, and (7) graduated human-in-the-loop escalation. We present a formal threat model covering 8 multi-turn attack families, map the framework to NIST SP 800-53 controls across 14 families, and evaluate using ablation analysis. Theoretical analysis shows layered defenses reduce attack probability by 1-2 orders of magnitude versus single-layer approaches. Simulation against 1,436 scenarios including HarmBench (416), JailbreakBench (200), and XSTest (450) achieves 72.9% combined detection [69.5-76.0% CI] with 2.9% effective false positive rate after graduated response, while maintaining 100% detection of multi-turn crescendo and slow-drift attacks. The honest drop on real benchmarks versus author-generated scenarios (71.2% vs 76.7% on HarmBench, 47.0% vs 70.0% on JailbreakBench) validates independent evaluation importance. CivicShield addresses an open gap at the intersection of AI safety, government compliance, and practical deployment.
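The claimed 1-2 order-of-magnitude reduction follows from the standard defense-in-depth argument: if an attack must bypass every layer and bypass events are idealized as independent, the success probabilities multiply. The per-layer bypass rates below are invented for illustration, not figures from the paper.

```python
def attack_success(bypass_probs):
    """P(attack succeeds) when every layer must be bypassed and
    bypass events are treated as independent (an idealization)."""
    p = 1.0
    for q in bypass_probs:
        p *= q
    return p

single_layer = attack_success([0.9])   # one guardrail with a 90% bypass rate
seven_layers = attack_success([0.9, 0.5, 0.4, 0.3, 0.3, 0.2, 0.5])
reduction = single_layer / seven_layers   # > 2 orders of magnitude here
```

In practice layer bypasses are correlated (a clever prompt may defeat several semantic layers at once), which is why the framework mixes heterogeneous mechanisms — state machines, anomaly detection, multi-model consensus — rather than stacking similar filters.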

[AI-76] A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank

【速读】:该论文旨在解决临床试验(Clinical Trial)在设计阶段缺乏可靠前瞻性预测工具的问题,尤其是在高成本、长周期和高运营风险背景下,如何提前识别可能导致试验失败的关键因素。解决方案的关键在于提出了一种分层的潜在风险感知机器学习框架(hierarchical latent risk-aware machine learning framework),将操作成功(Operational Success)预测分解为两个建模阶段:首先基于180余项可在试验启动前获取的药物和试验级特征,预测中间潜在的操作风险因子;随后将这些预测的风险因子整合至下游模型中,估算试验整体操作成功的概率。该方法通过分阶段数据分割策略避免信息泄露,并采用XGBoost、CatBoost及可解释梯度提升机进行基准测试,在I–III期临床试验中均实现了优异的外部验证性能(F1-score分别为0.93、0.92、0.91),显著提升了对操作失败的区分能力,从而支持早期风险评估与数据驱动的临床开发决策。

链接: https://arxiv.org/abs/2603.29041
作者: Iness Halimi,Emmanuel Piffo,Oumnia Boudersa,Yvan Marcel Carre Vilmorin,Melissa Ait-ikhlef,Karima Kone,Andy Tan,Augustin Medina,Juliette Hernando,Sheila Ernest,Vatche Bartekian,Karine Lalonde,Mireille E Schnitzer,Gianolli Dorcelus
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 18 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Clinical trials are characterized by high costs, extended timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing artificial intelligence approaches often focus on isolated metrics or specific development stages and frequently rely on variables unavailable at the trial design phase, limiting real-world applicability. We present a hierarchical latent risk-aware machine learning framework for prospective prediction of clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis, comprising 13,700 trials. Operational success was defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. This approach decomposes operational success prediction into two modeling stages. First, intermediate latent operational risk factors are predicted using more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phase I-III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results demonstrate that clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.
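A minimal sketch of the two-stage structure — stage 1 predicts latent operational risks from pre-initiation features, stage 2 consumes both the raw features and the predicted risks. All feature names and weights below are invented toy values, and the linear-sigmoid models stand in for the XGBoost/CatBoost/EBM models the paper actually fits over 180+ features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_latent_risks(features, risk_models):
    """Stage 1: score each latent operational risk factor from
    design-time features (weights here are invented)."""
    return {risk: sigmoid(sum(w * features[f] for f, w in ws.items()))
            for risk, ws in risk_models.items()}

def predict_success(features, risks, head, bias):
    """Stage 2: the downstream model sees raw features plus the
    predicted latent risks and outputs P(operational success)."""
    z = bias
    z += sum(w * features[f] for f, w in head["features"].items())
    z += sum(w * risks[r] for r, w in head["risks"].items())
    return sigmoid(z)

# Hypothetical features/weights; the paper fits gradient-boosted models.
risk_models = {"recruitment": {"n_sites": -0.8, "rare_disease": 1.5}}
head = {"features": {"n_sites": 0.3}, "risks": {"recruitment": -2.0}}
trial = {"n_sites": 1.2, "rare_disease": 1.0}
risks = predict_latent_risks(trial, risk_models)
p_success = predict_success(trial, risks, head, bias=0.5)
```

The point of the decomposition is that the intermediate risk scores are interpretable and can be validated separately, while the staged data split prevents stage-2 training from leaking stage-1 labels.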

[AI-77] Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild

【速读】:该论文旨在解决当前AI代理(AI agent)在复杂真实环境中评估方法中存在的可靠性不足问题,尤其聚焦于网页代理(web agent)评估中普遍存在的任务定义模糊(task-framing ambiguity)和操作变异性(operational variability),这些问题导致性能比较难以复现且缺乏可比性。其解决方案的关键在于提出Emergence WebVoyager——一个标准化的基准框架,通过明确的任务实例化规范、失败处理机制、标注指南和报告格式,显著提升了评估过程的透明度与一致性;该框架实现了95.9%的标注者间一致性(inter-annotator agreement),并揭示出OpenAI Operator的实际成功率仅为68.6%,远低于其宣称的87%,验证了该方法在提升评估严谨性和可比性方面的有效性。

链接: https://arxiv.org/abs/2603.29020
作者: Deepak Akkil,Mowafak Allaham,Amal Raj,Tamer Abuelsaad,Ravi Kokku
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable evaluation of AI agents operating in complex, real-world environments requires methodologies that are robust, transparent, and contextually aligned with the tasks agents are intended to perform. This study identifies persistent shortcomings in existing AI agent evaluation practices that are particularly acute in web agent evaluation, as exemplified by our audit of WebVoyager, including task-framing ambiguity and operational variability that hinder meaningful and reproducible performance comparisons. To address these challenges, we introduce Emergence WebVoyager, an enhanced version of the WebVoyager benchmark that standardizes evaluation methodology through clear guidelines for task instantiation, failure handling, annotation, and reporting. Emergence WebVoyager achieves an inter-annotator agreement of 95.9%, indicating improved clarity and reliability in both task formulation and evaluation. Applying this framework to evaluate OpenAI Operator reveals substantial performance variation across domains and task types, with an overall success rate of 68.6%, substantially lower than the 87% previously reported by OpenAI, demonstrating the utility of our approach for more rigorous and comparable web agent evaluation.
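The 95.9% inter-annotator agreement figure can be illustrated with standard agreement metrics; the abstract does not say which variant is used, so this sketch (on invented toy labels) shows raw percent agreement alongside chance-corrected Cohen's kappa.

```python
def percent_agreement(a, b):
    """Fraction of items two annotators label identically."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement for categorical labels."""
    n = len(a)
    po = percent_agreement(a, b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return (po - pe) / (1.0 - pe)

# Toy success/failure judgments from two annotators (1 = task success)
ann1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
ann2 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
raw = percent_agreement(ann1, ann2)
kappa = cohens_kappa(ann1, ann2)
```

For skewed label distributions (e.g. an agent that fails most tasks), kappa can be far below raw agreement, which is why reporting the metric used matters for comparability.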

[AI-78] Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

【速读】:该论文旨在解决大语言模型(LLM)代理在优化GPU核函数时效率低下的问题,即在庞大的设计空间中进行迭代搜索时,因试错次数过多导致计算资源和时间成本过高。其核心挑战在于:一方面,若代理操作的抽象层级过低,LLM会浪费推理在无关紧要的细节上;另一方面,若抽象层级过高,则可能遗漏关键优化选择;此外,代理难以识别边际收益递减点,从而持续无效搜索。解决方案的关键在于提出两个设计原则:(1) 构建一个紧凑的领域特定语言(DSL)——μ CUTLASS,该语言可被模型在上下文中快速学习,能在较高抽象层次上保留重要优化杠杆(如内核配置、后处理融合与多阶段流水线),从而减少冗余尝试;(2) 引入基于物理极限的“光速引导”(Speed-of-Light, SOL)机制,利用第一性原理性能边界对搜索过程进行预算控制与方向引导,动态评估优化空间剩余头寸(headroom),优先级排序任务并识别benchmark-gaming行为。实验证明,结合μ CUTLASS与SOL引导后,相较PyTorch基线实现最高1.68倍的优化效率提升,且节省19–43%的token消耗,同时保持≥95%的几何平均加速比。

链接: https://arxiv.org/abs/2603.29010
作者: Siva Kumar Sastry Hari,Vignesh Balaji,Sana Damani,Qijing Huang,Christos Kozyrakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer and budget search. We implement these principles in \mu CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low-level code to DSL code using GPT-5-mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL-guided steering raises this to 1.56x. Across model tiers, \mu CUTLASS + SOL-guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19-43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark-gaming cases, where kernels may appear fast while failing to perform the intended computation. 
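Speed-of-Light guidance can be illustrated with a plain roofline bound: a kernel cannot finish faster than either its memory traffic or its arithmetic allows, so headroom = measured time / bound can steer and budget the search. The thresholds and hardware numbers below are illustrative assumptions, not values from the paper.

```python
def sol_headroom(measured_s, bytes_moved, flops, peak_bw, peak_flops):
    """First-principles lower bound on runtime (simple roofline):
    a kernel can go no faster than its memory traffic or math allows."""
    sol_s = max(bytes_moved / peak_bw, flops / peak_flops)
    return measured_s / sol_s  # 1.0 means the kernel is at the bound

def triage(kernels, near_sol=1.1):
    """Deprioritize kernels near SOL; flag timings faster than physics
    (a symptom of benchmark gaming); rank the rest by headroom."""
    worth_optimizing, suspect = [], []
    for name, h in kernels:
        if h < 1.0:
            suspect.append(name)
        elif h > near_sol:
            worth_optimizing.append((h, name))
    worth_optimizing.sort(reverse=True)
    return worth_optimizing, suspect

# Illustrative numbers (roughly HBM bandwidth / FP32 peak of a datacenter GPU)
headroom = sol_headroom(measured_s=2e-3, bytes_moved=1.6e9, flops=1e9,
                        peak_bw=1.6e12, peak_flops=19.5e12)   # memory-bound case
ranked, gamed = triage([("gemm", headroom), ("softmax", 1.05), ("odd_kernel", 0.5)])
```

A measured time below the physical bound (headroom < 1) is only possible if the kernel skipped work, which is how SOL analysis detects benchmark-gaming candidates.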

[AI-79] Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中因长上下文处理与生成机制(如稀疏注意力、检索增强生成(Retrieval-Augmented Generation, RAG)和压缩上下文记忆)引入的高内存处理开销问题,其核心挑战在于这些优化操作存在显著的计算异构性。解决方案的关键在于提出一个统一的四步内存处理流水线(Prepare Memory, Compute Relevancy, Retrieval, Apply to Inference),并基于系统级剖析发现内存处理占推理时间的22%–97%,且具有强异构性;进而设计了一种GPU-FPGA异构系统架构,将稀疏、不规则且内存受限的操作卸载至FPGA执行,而保留计算密集型任务在GPU上运行,从而实现端到端推理加速与能效提升——实验表明该方案比纯GPU基线快1.04–2.2倍,能耗降低1.11–4.7倍。

链接: https://arxiv.org/abs/2603.29002
作者: Zifan He,Rui Ma,Yizhou Sun,Jason Cong
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bounded operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is 1.04\sim2.2\times faster and requires 1.11\sim4.7\times less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.

[AI-80] Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

【速读】:该论文旨在解决当前缺乏系统性基准测试工具以评估多智能体AI在蓝队(Blue Team)操作中的协同能力问题,尤其是在大规模勒索软件攻击事件响应场景下。现有研究多集中于红队(Red Team)能力的评测,而蓝队作为安全运营中心(SOC)的核心职能,其自动化水平直接影响AI驱动的自主化SOC建设进程。解决方案的关键在于提出一套设计原则,并基于此构建名为SOC-bench的基准框架,该框架包含五个面向大规模勒索软件攻击响应的蓝队任务,从而为评估AI代理在复杂、多任务环境下的协同防御能力提供标准化方法。

链接: https://arxiv.org/abs/2603.28998
作者: Yicheng Cai,Mitchell John DeStefano,Guodong Dong,Pulkit Handa,Peng Liu,Tejas Singhal,Peiyu Tseng,Winston Jen White
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 29 pages, 1 figure

点击查看摘要

Abstract:As Large Language Models (LLMs) and multi-agent AI systems are demonstrating increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous SOCs (security operation centers) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi-agent AI systems. However, because the operations in SOCs are dominated by blue team operations, the capabilities of AI systems agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To our best knowledge, no systematic benchmark for evaluating coordinated multi-task blue team AI has been proposed in the literature. Existing blue team benchmarks focus on a particular task. The goal of this work is to develop a set of design principles for the construction of a benchmark, which is denoted as SOC-bench, to evaluate the blue team capabilities of AI. Following these design principles, we have developed a conceptual design of SOC-bench, which consists of a family of five blue team tasks in the context of large-scale ransomware attack incident response.

[AI-81] Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures

【速读】:该论文旨在解决多智能体大语言模型(Multi-agent LLM Systems)系统中自主性(autonomy)的可持续性问题,即在何种条件下这些系统能够自发形成协调机制并维持高效运作。其关键解决方案在于采用一种混合协议(Sequential Protocol),该协议通过提供最小结构化支撑(固定顺序)而非预设角色或外部控制,使智能体自发演化出专业化分工、自愿回避非胜任任务,并构建浅层层级结构,从而实现比集中式协调高14%的性能表现(p<0.001)。这一机制表明,随着基础模型能力提升,系统自主协调的潜力显著增强,且在256个智能体下仍保持质量稳定,展现出良好的可扩展性。

链接: https://arxiv.org/abs/2603.28990
作者: Victoria Dochkina
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 figures, 9 tables. Submitted to IEEE Access

点击查看摘要

Abstract:How much autonomy can multi-agent LLM systems sustain – and what enables it? We present a 25,000-task computational experiment spanning 8 models, 4–256 agents, and 8 coordination protocols ranging from externally imposed hierarchy to emergent self-organization. We observe that autonomous behavior already emerges in current LLM agents: given minimal structural scaffolding (fixed ordering), agents spontaneously invent specialized roles, voluntarily abstain from tasks outside their competence, and form shallow hierarchies – without any pre-assigned roles or external design. A hybrid protocol (Sequential) that enables this autonomy outperforms centralized coordination by 14% (p<0.001), with a 44% quality spread between protocols (Cohen's d=1.86, p<0.0001). The degree of emergent autonomy scales with model capability: strong models self-organize effectively, while models below a capability threshold still benefit from rigid structure – suggesting that as foundation models improve, the scope for autonomous coordination will expand. The system scales sub-linearly to 256 agents without quality degradation (p=0.61), producing 5,006 unique roles from just 8 agents. Results replicate across closed- and open-source models, with open-source achieving 95% of closed-source quality at 24x lower cost. The practical implication: give agents a mission, a protocol, and a capable model – not a pre-assigned role.

[AI-82] Privacy Guard Token Parsimony by Prompt and Context Handling and LLM Routing

【速读】:该论文旨在解决大规模采用大语言模型(Large Language Models, LLMs)时面临的运营成本(OpEx)与数据隐私之间的权衡问题。现有路由框架虽能降低使用成本,但忽视了提示(prompt)敏感性,导致用户和机构向第三方云服务商泄露敏感信息的风险。其核心解决方案是提出“不可分离范式”(Inseparability Paradigm),即高级上下文管理与隐私管理本质上是一体的。关键创新在于部署一个本地化的“隐私卫士”(Privacy Guard)——由本地小语言模型(Small Language Model, SLM)驱动的全栈上下文观测器,通过抽象摘要与自动提示优化(Automatic Prompt Optimization, APO)将原始提示分解为聚焦子任务,并将高风险查询重新路由至零信任或受保密协议(NDA)保护的模型;这一双重机制同时实现敏感推理向量消除(零泄漏)与云端token负载减少(运营成本降低)。此外,基于LIFO的上下文压缩机制进一步限制工作内存,控制潜在泄露面。实验证明该框架在混合基准测试中实现45%的综合运营成本下降、个人敏感信息100%成功脱敏,且LLM-as-a-Judge评估显示85%偏好APO压缩后的响应,表明Token精简与零泄漏可被统一视为同一上下文压缩算子的数学对偶投影。

链接: https://arxiv.org/abs/2603.28972
作者: Alessio Langiu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The large-scale adoption of Large Language Models (LLMs) forces a trade-off between operational cost (OpEx) and data privacy. Current routing frameworks reduce costs but ignore prompt sensitivity, exposing users and institutions to leakage risks towards third-party cloud providers. We formalise the “Inseparability Paradigm”: advanced context management intrinsically coincides with privacy management. We propose a local “Privacy Guard” – a holistic contextual observer powered by an on-premise Small Language Model (SLM) – that performs abstractive summarisation and Automatic Prompt Optimisation (APO) to decompose prompts into focused sub-tasks, re-routing high-risk queries to Zero-Trust or NDA-covered models. This dual mechanism simultaneously eliminates sensitive inference vectors (Zero Leakage) and reduces cloud token payloads (OpEx Reduction). A LIFO-based context compacting mechanism further bounds working memory, limiting the emergent leakage surface. We validate the framework through a 2x2 benchmark (Lazy vs. Expert users; Personal vs. Institutional secrets) on a 1,000-sample dataset, achieving a 45% blended OpEx reduction, 100% redaction success on personal secrets, and – via LLM-as-a-Judge evaluation – an 85% preference rate for APO-compressed responses over raw baselines. Our results demonstrate that Token Parsimony and Zero Leakage are mathematically dual projections of the same contextual compression operator.
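The LIFO-based context compacting mechanism can be sketched as a bounded buffer that keeps the most recent turns verbatim and folds evicted older turns into a running summary. The class and method names are ours, and the toy `summarize` callable stands in for the on-premise SLM the paper uses.

```python
from collections import deque

class LifoCompactor:
    """Bounded working memory: keep the newest turns verbatim, fold
    evicted older turns into a running abstractive summary. This caps
    both prompt tokens and the emergent leakage surface."""

    def __init__(self, max_turns, summarize):
        self.turns = deque()          # most recent turns, verbatim
        self.max_turns = max_turns
        self.summary = ""             # compacted older context
        self.summarize = summarize    # stands in for the local SLM

    def push(self, turn):
        self.turns.append(turn)
        while len(self.turns) > self.max_turns:
            evicted = self.turns.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def prompt_context(self):
        head = [f"[summary] {self.summary}"] if self.summary else []
        return head + list(self.turns)

# Toy summarizer: concatenation; the paper performs abstractive summarization.
ctx = LifoCompactor(2, lambda s, t: (s + " | " + t).strip(" |"))
for turn in ["a", "b", "c"]:
    ctx.push(turn)
```

Because the summarizer runs locally, older sensitive turns never leave the premises in raw form — only the bounded, compacted context is ever forwarded to a cloud model.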

[AI-83] The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training

【速读】:该论文旨在解决神经网络训练中相变现象(如grokking、能力提升、损失平台期)的机制问题,即揭示这些现象如何由参数更新的谱结构所控制。其解决方案的关键在于提出"谱边缘假说"(spectral edge thesis),指出这些相变由滚动窗口Gram矩阵的谱隙(spectral gap)动态决定;在极端参数比 regime 下(参数量 $P \sim 10^8$,窗口大小 $W \sim 10$),经典BBP检测阈值失效,真正起作用的是主导模式与次主导模式之间的内信号隙(intra-signal gap),其位置 $k^* = \mathrm{argmax}\, \sigma_j/\sigma_{j+1}$ 是唯一动态特权位置——该隙的坍塌是唯一破坏学习的过程,并通过 $\alpha$-反馈回路自我维持,无需对优化器做任何假设。这一理论框架进一步定义了绝热参数 $\mathcal{A} = \|\Delta G\|_F / (\eta\, g^2)$ 来刻画电路稳定性:$\mathcal{A} \ll 1$ 对应平台期,$\mathcal{A} \sim 1$ 对应相变,$\mathcal{A} \gg 1$ 对应遗忘。

链接: https://arxiv.org/abs/2603.28964
作者: Yongzhong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 60 pages, 5 figures

点击查看摘要

Abstract:We develop the spectral edge thesis: phase transitions in neural network training – grokking, capability gains, loss plateaus – are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates. In the extreme aspect ratio regime (parameters P \sim 10^8 , window W \sim 10 ), the classical BBP detection threshold is vacuous; the operative structure is the intra-signal gap separating dominant from subdominant modes at position k^* = \mathrm{argmax}\, \sigma_j/\sigma_{j+1} . From three axioms we derive: (i) gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; (ii) a spectral loss decomposition linking each mode's learning contribution to its Davis–Kahan stability coefficient; (iii) the Gap Maximality Principle, showing that k^* is the unique dynamically privileged position – its collapse is the only one that disrupts learning, and it sustains itself through an \alpha -feedback loop requiring no assumption on the optimizer. The adiabatic parameter \mathcal{A} = \|\Delta G\|_F / (\eta\, g^2) controls circuit stability: \mathcal{A} \ll 1 (plateau), \mathcal{A} \sim 1 (phase transition), \mathcal{A} \gg 1 (forgetting). Tested across six model families (150K–124M parameters): gap dynamics precede every grokking event (24/24 with weight decay, 0/24 without), the gap position is optimizer-dependent (Muon: k^*=1 , AdamW: k^*=2 on the same model), and 19/20 quantitative predictions are confirmed. The framework is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws.
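Two of the quantities above are directly computable from observables; a sketch with the abstract's symbols (the spectrum and constants below are illustrative, not data from the paper):

```python
def gap_position(sigmas):
    """k* = argmax_j sigma_j / sigma_{j+1}: position of the intra-signal
    gap separating dominant from subdominant update modes (1-indexed)."""
    ratios = [sigmas[j] / sigmas[j + 1] for j in range(len(sigmas) - 1)]
    return 1 + max(range(len(ratios)), key=ratios.__getitem__)

def adiabatic_parameter(delta_g_fro, eta, grad_norm):
    """A = ||Delta G||_F / (eta * g^2): <<1 plateau, ~1 transition,
    >>1 forgetting."""
    return delta_g_fro / (eta * grad_norm ** 2)

# Illustrative singular values of a rolling-window Gram matrix
sigmas = [9.0, 7.5, 2.0, 1.8, 1.5]
k_star = gap_position(sigmas)            # largest ratio is 7.5/2.0, so k* = 2
A = adiabatic_parameter(0.5, eta=0.1, grad_norm=10.0)   # plateau regime
```

With a window of W \sim 10 updates, the Gram matrix is tiny regardless of parameter count, so tracking k^* and \mathcal{A} during training is cheap.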

[AI-84] Multi-Agent LLM s for Adaptive Acquisition in Bayesian Optimization

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在序列决策与黑箱优化中对探索-利用权衡(exploration-exploitation trade-off)的隐式推理机制不明确、难以分析与控制的问题。现有基于LLM的优化方法依赖于提示(prompt)驱动的隐式策略,导致搜索行为不稳定且易过早收敛。其解决方案的关键在于提出一种多智能体框架,将探索-利用控制分解为两个独立模块:策略代理(strategy agent)负责分配可解释的权重以定义多维搜索标准(如信息性、多样性与代表性),而生成代理(generation agent)则根据这些权重生成候选解。这种结构化分解使探索-利用决策显式化、可观测且可调优,从而显著提升LLM辅助搜索的有效性。

链接: https://arxiv.org/abs/2603.28959
作者: Andrea Carbonati,Mohammadsina Almasi,Hadis Anahideh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Proceedings of the IISE Annual Conference Expo 2026

点击查看摘要

Abstract:The exploration-exploitation trade-off is central to sequential decision-making and black-box optimization, yet how Large Language Models (LLMs) reason about and manage this trade-off remains poorly understood. Unlike Bayesian Optimization, where exploration and exploitation are explicitly encoded through acquisition functions, LLM-based optimization relies on implicit, prompt-based reasoning over historical evaluations, making search behavior difficult to analyze or control. In this work, we present a metric-level study of LLM-mediated search policy learning, studying how LLMs construct and adapt exploration-exploitation strategies under multiple operational definitions of exploration, including informativeness, diversity, and representativeness. We show that single-agent LLM approaches, which jointly perform strategy selection and candidate generation within a single prompt, suffer from cognitive overload, leading to unstable search dynamics and premature convergence. To address this limitation, we propose a multi-agent framework that decomposes exploration-exploitation control into strategic policy mediation and tactical candidate generation. A strategy agent assigns interpretable weights to multiple search criteria, while a generation agent produces candidates conditioned on the resulting search policy defined as weights. This decomposition renders exploration-exploitation decisions explicit, observable, and adjustable. Empirical results across various continuous optimization benchmarks indicate that separating strategic control from candidate generation substantially improves the effectiveness of LLM-mediated search.
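The strategy/generation split can be illustrated at the scoring step: the strategy agent's interpretable weights define a linear search policy over the criteria, and generated candidates are ranked under that policy. The weights and per-criterion scores below are invented for illustration; in the paper both are produced by LLM agents.

```python
def select_candidate(candidates, weights):
    """Tactical step: rank generated candidates under the strategy
    agent's current policy (a weighted sum over search criteria)."""
    def policy_score(c):
        return sum(weights[k] * c[k] for k in weights)
    return max(candidates, key=policy_score)

# Interpretable policy weights (would be emitted by the strategy agent)
weights = {"informativeness": 0.5, "diversity": 0.3, "representativeness": 0.2}

# Per-criterion scores for two generated candidates (invented numbers)
candidates = [
    {"informativeness": 0.9, "diversity": 0.2, "representativeness": 0.4},
    {"informativeness": 0.5, "diversity": 0.9, "representativeness": 0.6},
]
best = select_candidate(candidates, weights)
```

Because the weights are explicit numbers rather than free-form prompt text, the exploration-exploitation posture at each step is observable and can be adjusted or logged, which is the framework's main point.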

[AI-85] Enhancing Policy Learning with World-Action Model

【速读】:该论文旨在解决传统世界模型(World Model)在强化学习中因仅依赖图像预测而缺乏对动作相关结构建模的问题,从而限制了下游控制策略的学习效果。解决方案的关键在于提出一种动作正则化的世界模型(World-Action Model, WAM),通过在DreamerV2框架中引入逆动力学目标(inverse dynamics objective),使模型能够从潜在状态转移中预测动作,从而增强表征对动作相关结构的捕捉能力。这一改进显著提升了策略学习性能,在CALVIN基准的8个操作任务上,WAM在不改变策略架构或训练流程的前提下,将行为克隆成功率从59.4%提升至71.2%,并在模型基础上PPO微调后达到92.8%的平均成功率,同时减少8.7倍训练步数。

链接: https://arxiv.org/abs/2603.28955
作者: Yuci Han,Alper Yilmaz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned representations to capture action-relevant structure critical for downstream control. We evaluate WAM on enhancing policy learning across eight manipulation tasks from the CALVIN benchmark. We first pretrain a diffusion policy via behavioral cloning on world model latents, then refine it with model-based PPO inside the frozen world model. Without modifying the policy architecture or training procedure, WAM improves average behavioral cloning success from 59.4% to 71.2% over DreamerV2 and DiWA baselines. After PPO fine-tuning, WAM achieves 92.8% average success versus 79.8% for the baseline, with two tasks reaching 100%, using 8.7x fewer training steps.

[AI-86] Differentiable Initialization-Accelerated CPU-GPU Hybrid Combinatorial Scheduling

【速读】:该论文旨在解决组合调度问题(Combinatorial Scheduling Problems)在大规模场景下难以高效求解最优解的问题,这类问题通常被建模为整数线性规划(Integer Linear Programming, ILP),具有NP-hard特性。其核心挑战在于传统ILP求解器在处理复杂实例时收敛缓慢、计算成本高。解决方案的关键在于提出一种混合CPU-GPU框架,通过可微分预处理(Differentiable Presolving)快速生成高质量的部分解,并将其作为热启动(Warm-Start)输入至商业ILP求解器(如CPLEX、Gurobi)及新兴开源求解器HiGHS,从而显著提升早期剪枝效率。实验证明,该方法相较现有基准实现最高达10倍的性能提升,且将最优性间隙缩小至0.1%,首次成功将可微分优化与精确ILP求解相结合,为机器学习基础设施与经典精确优化方法融合提供了新范式。

链接: https://arxiv.org/abs/2603.28943
作者: Mingju Liu,Jiaqi Yin,Alvaro Velasquez,Cunxi Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 7 pages, 4 figures, 8 equations, 3 tables

点击查看摘要

Abstract:This paper presents a hybrid CPU-GPU framework for solving combinatorial scheduling problems formulated as Integer Linear Programming (ILP). While scheduling underpins many optimization tasks in computing systems, solving these problems optimally at scale remains a long-standing challenge due to their NP-hard nature. We introduce a novel approach that combines differentiable optimization with classical ILP solving. Specifically, we utilize differentiable presolving to rapidly generate high-quality partial solutions, which serve as warm-starts for commercial ILP solvers (CPLEX, Gurobi) and rising open-source solver HiGHS. This method enables significantly improved early pruning compared to state-of-the-art standalone solvers. Empirical results across industry-scale benchmarks demonstrate up to a 10\times performance gain over baselines, narrowing the optimality gap to 0.1% . This work represents the first demonstration of utilizing differentiable optimization to initialize exact ILP solvers for combinatorial scheduling, opening new opportunities to integrate machine learning infrastructure with classical exact optimization methods across broader domains.

[AI-87] Beta-Scheduling: Momentum from Critical Damping as a Diagnostic and Correction Tool for Neural Network Training

【速读】:该论文旨在解决神经网络训练中动量(momentum)参数设置缺乏理论依据且难以定位具体失败层的问题。传统方法普遍采用恒定动量(如0.9),但其选择主要源于历史惯例而非优化理论支撑,导致模型训练过程中的局部失效模式难以识别与修正。解决方案的关键在于从临界阻尼谐振子(critically damped harmonic oscillator)推导出一种时间可变的动量调度策略:μ(t) = 1 - 2√α(t),其中α(t)为当前学习率。该β-调度无需额外超参数,仅依赖现有学习率计划,即可实现更快收敛(ResNet-18/CIFAR-10上达到90%准确率时提速1.9倍),更重要的是,它提供了一种跨优化器不变的逐层梯度归因诊断工具——无论使用SGD还是Adam训练,均能稳定识别出相同的三个问题层。基于此诊断,仅对这些层进行“外科式”修正即可修复62个误分类样本,同时仅重训练18%的参数,显著提升了模型调试的精准性与效率。

链接: https://arxiv.org/abs/2603.28921
作者: Ivan Pasichnyk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures, 5 tables. Code available on Kaggle

点击查看摘要

Abstract:Standard neural network training uses constant momentum (typically 0.9), a convention dating to 1964 with limited theoretical justification for its optimality. We derive a time-varying momentum schedule from the critically damped harmonic oscillator: mu(t) = 1 - 2*sqrt(alpha(t)), where alpha(t) is the current learning rate. This beta-schedule requires zero free parameters beyond the existing learning rate schedule. On ResNet-18/CIFAR-10, beta-scheduling delivers 1.9x faster convergence to 90% accuracy compared to constant momentum. More importantly, the per-layer gradient attribution under this schedule produces a cross-optimizer invariant diagnostic: the same three problem layers are identified regardless of whether the model was trained with SGD or Adam (100% overlap). Surgical correction of only these layers fixes 62 misclassifications while retraining only 18% of parameters. A hybrid schedule – physics momentum for fast early convergence, then constant momentum for the final refinement – reaches 95% accuracy fastest among five methods tested. The main contribution is not an accuracy improvement but a principled, parameter-free tool for localizing and correcting specific failure modes in trained networks.
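The schedule is a one-liner on top of any existing learning-rate schedule. A sketch (the cosine schedule and its constants are illustrative, not from the paper): note that mu(0.0025) = 0.9 recovers the classical default, and mu tends to 1 as the learning rate decays.

```python
import math

def beta_momentum(lr):
    """Momentum from critical damping: mu(t) = 1 - 2*sqrt(alpha(t)),
    clamped below at 0 for large learning rates."""
    return max(0.0, 1.0 - 2.0 * math.sqrt(lr))

def cosine_lr(step, total_steps, lr_max=0.0025):
    """Illustrative cosine learning-rate schedule (not from the paper)."""
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * step / total_steps))

# mu tracks the decaying learning rate; at lr = 0.0025 it equals 0.9,
# the classical constant-momentum default.
schedule = [(s, cosine_lr(s, 100), beta_momentum(cosine_lr(s, 100)))
            for s in range(0, 101, 25)]
```

Since mu depends only on the current learning rate, the schedule adds no hyperparameters and drops into any optimizer that exposes its momentum coefficient.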

[AI-88] Working Paper: Towards a Category-theoretic Comparative Framework for Artificial General Intelligence

【速读】:该论文旨在解决当前通用人工智能(AGI)缺乏统一形式化定义与系统性比较框架的问题,尤其针对现有AGI候选架构(如强化学习RL、因果强化学习CRL、基于模式的学习SBL等)难以在理论层面进行结构化分析与对比的困境。其解决方案的关键在于构建一个基于范畴论(Category Theory)的通用代数形式化框架,该框架能够以抽象且严格的方式描述AGI架构的组成结构、信息组织机制、代理与环境的交互模式以及行为演化过程,并支持对代理的语法、语义及信息属性进行形式化定义与评估,从而揭示不同架构间的共性与差异,为未来研究提供明确方向。

链接: https://arxiv.org/abs/2603.28906
作者: Pablo de los Riscos,Fernando J. Corbacho,Michael A. Arbib
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 37 pages, 7 figures, 1 table

点击查看摘要

Abstract:AGI has become the Holy Grail of AI, with the promise of human-level intelligence, and the major tech companies around the world are investing unprecedented amounts of resources in its pursuit. Yet, there does not exist a single formal definition, and only some empirical AGI benchmarking frameworks currently exist. The main purpose of this paper is to develop a general, algebraic and category-theoretic framework for describing, comparing and analysing different possible AGI architectures. Thus, this category-theoretic formalization would also allow comparing different possible candidate AGI architectures, such as RL, Universal AI, Active Inference, CRL, Schema based Learning, etc. It will allow us to unambiguously expose their commonalities and differences, and what is even more important, expose areas for future research. From the applied category-theoretic point of view, we take as inspiration Machines in a Category to provide a modern view of AGI Architectures in a Category. More specifically, this first position paper provides, on one hand, a first exercise on RL, Causal RL and SBL Architectures in a Category, and on the other hand, it is a first step in a broader research program that seeks to provide a unified formal foundation for AGI systems, integrating architectural structure, informational organization, agent realization, agent and environment interaction, behavioural development over time, and the empirical evaluation of properties. This framework is also intended to support the definition of architectural properties, both syntactic and informational, as well as semantic properties of agents and their assessment in environments with explicitly characterized features. We claim that Category Theory and AGI will have a very symbiotic relation.

[AI-89] ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

【速读】:该论文旨在解决现有图表理解基准普遍局限于单图表解读,而缺乏对多图表间对比推理能力评估的问题。其核心解决方案是提出首个大规模跨图表对比总结基准——ChartDiff,包含8,541对图表数据,覆盖多样化的数据源、图表类型与视觉风格,并配有由大语言模型(LLM)生成且经人工验证的差异描述摘要(涵盖趋势、波动和异常)。该基准不仅支持对通用型、专用型及流水线式模型的系统性评估,还揭示了当前模型在多系列图表上的显著挑战以及词法重叠指标(如ROUGE)与人类评价之间存在的不一致性,从而为推进多图表理解研究提供了新的测评标准与方向。

链接: https://arxiv.org/abs/2603.28902
作者: Rongtian Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 17 figures

点击查看摘要

Abstract:Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
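摘要指出 ROUGE 等词法重叠指标与人工评价之间存在明显错位。下面用一个极简的 unigram 召回(ROUGE-1 recall 的简化形式)示意这一现象:词序被打乱、趋势描述完全错误的摘要仍可获得满分重叠(示例文本为虚构,仅作演示):

```python
def rouge1_recall(reference, candidate):
    # Unigram recall: fraction of reference words that appear in the candidate.
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    return sum(1 for w in ref if w in cand) / len(ref)

ref = "sales rise in chart b while chart a stays flat"
wrong = "sales stays flat in chart b while chart a rise"   # same words, wrong trends
good = "the second chart shows growth and the first does not"

high = rouge1_recall(ref, wrong)   # lexically perfect, semantically wrong
low = rouge1_recall(ref, good)     # semantically right, lexically different
```

词法重叠满分的摘要可以把两幅图的趋势完全说反,这正是摘要中 ROUGE 与人类评价不一致的直观来源。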

[AI-90] Robust Multi-Agent Reinforcement Learning for Small UAS Separation Assurance under GPS Degradation and Spoofing

【速读】:该论文旨在解决小型无人机系统(sUAS)在GPS信号退化和欺骗攻击下,如何实现鲁棒的分离保障(separation assurance)问题。其核心挑战在于:当各无人机广播的GPS位置信息被敌对者篡改时,整个空域态势感知将不可靠,进而危及飞行安全。解决方案的关键在于将状态观测污染建模为智能体与对手之间的零和博弈,并推导出一种闭式表达的对抗扰动策略,该策略无需对抗训练即可在状态维度上实现线性时间复杂度的评估,且近似真实最坏情况扰动具有二阶精度;进一步结合Kullback-Leibler正则化理论,证明了清洁与污染观测下的安全性能差距最多随扰动概率线性恶化,并将此闭式对抗策略嵌入多智能体强化学习(MARL)策略梯度算法中,生成鲁棒对策。仿真结果表明,在高达35%的扰动水平下仍能保持接近零的碰撞率,显著优于未考虑对抗扰动的基线策略。

链接: https://arxiv.org/abs/2603.28900
作者: Alex Zongo,Filippos Fotiadis,Ufuk Topcu,Peng Wei
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:We address robust separation assurance for small Unmanned Aircraft Systems (sUAS) under GPS degradation and spoofing via Multi-Agent Reinforcement Learning (MARL). In cooperative surveillance, each aircraft (or agent) broadcasts its GPS-derived position; when such position broadcasts are corrupted, the entire observed air traffic state becomes unreliable. We cast this state observation corruption as a zero-sum game between the agents and an adversary: with probability R, the adversary perturbs the observed state to maximally degrade each agent’s safety performance. We derive a closed-form expression for this adversarial perturbation, bypassing adversarial training entirely and enabling linear-time evaluation in the state dimension. We show that this expression approximates the true worst-case adversarial perturbation with second-order accuracy. We further bound the safety performance gap between clean and corrupted observations, showing that it degrades at most linearly with the corruption probability under Kullback-Leibler regularization. Finally, we integrate the closed-form adversarial policy into a MARL policy gradient algorithm to obtain a robust counter-policy for the agents. In a high-density sUAS simulation, we observe near-zero collision rates under corruption levels up to 35%, outperforming a baseline policy trained without adversarial perturbations.
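摘要称最坏情况扰动有线性时间可评估的闭式表达,但未给出具体公式;下面仅以一阶(FGSM 风格)符号梯度扰动作一个假设性示意,其中价值函数、目标点与各参数均为虚构的玩具设定:

```python
import numpy as np

def worst_case_perturbation(state, value_grad, eps):
    # First-order worst case: step against the sign of the value gradient,
    # computable in time linear in the state dimension.
    return -eps * np.sign(value_grad(state))

def corrupt(state, value_grad, eps=0.1, R=0.35, rng=None):
    # With probability R the adversary replaces the broadcast state.
    rng = rng or np.random.default_rng(0)
    if rng.random() < R:
        return state + worst_case_perturbation(state, value_grad, eps)
    return state

# Toy value function V(s) = -||s - goal||^2 with gradient -2 (s - goal).
goal = np.array([1.0, 0.0])
value = lambda s: -float(np.sum((s - goal) ** 2))
grad = lambda s: -2.0 * (s - goal)
s = np.array([0.5, 0.5])
delta = worst_case_perturbation(s, grad, 0.1)
```

该扰动把观测状态推向价值下降最快的方向,且每次评估只需一次梯度符号运算,对应摘要所述的线性时间复杂度。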

[AI-91] GMA-SAWGAN-GP: A Novel Data Generative Framework to Enhance IDS Detection Performance

【速读】:该论文旨在解决入侵检测系统(Intrusion Detection System, IDS)在面对已知攻击时表现良好,但对未知威胁泛化能力差的问题。其核心解决方案是提出一种基于自注意力增强的Wasserstein生成对抗网络(Self-Attention-enhanced Wasserstein GAN with Gradient Penalty, SAWGAN-GP)的生成式数据增强框架——GMA-SAWGAN-GP。该框架的关键创新在于:1)采用Gumbel-Softmax正则化建模离散特征域,保持类别语义一致性;2)引入多层感知机(MLP)构建的自动编码器作为流形正则项,提升生成样本质量;3)设计轻量级门控网络通过熵正则化动态平衡对抗损失与重构损失,从而稳定训练并缓解模式崩溃问题;4)利用自注意力机制捕获记录内特征间的短程与长程依赖关系。实验证明,该方法显著提升了IDS在已知和未知攻击场景下的检测准确率与鲁棒性。

链接: https://arxiv.org/abs/2603.28838
作者: Ziyu Mu,Xiyu Shi,Safak Dogan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:Intrusion Detection System (IDS) is often calibrated to known attacks and generalizes poorly to unknown threats. This paper proposes GMA-SAWGAN-GP, a novel generative augmentation framework built on a Self-Attention-enhanced Wasserstein GAN with Gradient Penalty (WGAN-GP). The generator employs Gumbel-Softmax regularization to model discrete fields, while a Multilayer Perceptron (MLP)-based AutoEncoder acts as a manifold regularizer. A lightweight gating network adaptively balances adversarial and reconstruction losses via entropy regularization, improving stability and mitigating mode collapse. The self-attention mechanism enables the generator to capture both short- and long-range dependencies among features within each record while preserving categorical semantics through Gumbel-Softmax heads. Extensive experiments on NSL-KDD, UNSW-NB15, and CICIDS2017 using five representative IDS models demonstrate that GMA-SAWGAN-GP significantly improves detection performance on known attacks and enhances generalization to unknown attacks. Leave-One-Attack-type-Out (LOAO) evaluations using Area Under the Receiver Operating Characteristic (AUROC) and True Positive Rate at a 5 percent False Positive Rate confirm that IDS models trained on augmented datasets achieve higher robustness under unseen attack scenarios. Ablation studies validate the contribution of each component to performance gains. Compared with baseline models, the proposed framework improves binary classification accuracy by an average of 5.3 percent and multi-classification accuracy by 2.2 percent, while AUROC and True Positive Rate at a 5 percent False Positive Rate for unknown attacks increase by 3.9 percent and 4.8 percent, respectively, across the three datasets. Overall, GMA-SAWGAN-GP provides an effective approach to generative augmentation for mixed-type network traffic, improving IDS accuracy and resilience.
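生成器用 Gumbel-Softmax 对离散字段做可微建模。下面给出纯 NumPy 的 Gumbel-Softmax 采样极简示意(与论文网络结构无关,仅演示该松弛本身;温度 tau 越低输出越接近 one-hot):

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    # Relax a categorical sample: add Gumbel(0,1) noise to the logits,
    # then take a temperature-scaled softmax (differentiable w.r.t. logits).
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                 # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())                 # numerically stable softmax
    return y / y.sum()

logits = np.array([2.0, 0.5, 0.1])          # scores over a discrete field's values
soft = gumbel_softmax(logits, tau=0.5)
hard = gumbel_softmax(logits, tau=0.01)     # near one-hot at low temperature
```

由于输出是 logits 的光滑函数,梯度可以穿过采样步骤回传到生成器,这正是其适合建模网络流量中离散协议字段的原因。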

[AI-92] Incentives Equilibria and the Limits of Healthcare AI: A Game-Theoretic Perspective

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在医疗系统中部署时,如何有效提升容量与生产率的难题,尤其关注当前对AI技术效果的乐观预期是否合理。论文指出,单纯依靠任务优化(task optimisation)难以改变系统级结果,因为若激励机制未变,个体行为不会发生根本性转变。其解决方案的关键在于识别并实施能够重塑风险分配(risk allocation)的干预措施,只有此类机制层面的激励重构才可能促使稳定系统行为发生实质性改变,从而为医疗领导者和采购决策提供理论依据与实践指导。

链接: https://arxiv.org/abs/2603.28825
作者: Ari Ercole
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is widely promoted as a promising technological response to healthcare capacity and productivity pressures. Deployment of AI systems carries significant costs, including the ongoing cost of monitoring, and it is unclear whether optimism about a deus ex machina solution is well-placed. This paper proposes three archetypal AI technology types: AI for effort reduction, AI to increase observability, and mechanism-level incentive change AI. Using a stylised inpatient capacity signalling example and minimal game-theoretic reasoning, it argues that task optimisation alone is unlikely to change system outcomes when incentives are unchanged. The analysis highlights why only interventions that reshape risk allocation can plausibly shift stable system-level behaviour, and outlines implications for healthcare leadership and procurement.
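论文的核心论断是:若支付结构不变,仅做任务优化不会移动稳定均衡;只有重塑风险分配(即改变支付矩阵)才会。下面用一个完全虚构的 2x2 双矩阵博弈示意纯策略纳什均衡如何随支付重构而移动(支付数值与策略标签均为假设):

```python
import numpy as np

def pure_nash(A, B):
    # Pure-strategy Nash equilibria of a bimatrix game:
    # A is the row player's payoff, B the column player's.
    return [(i, j)
            for i in range(A.shape[0]) for j in range(A.shape[1])
            if A[i, j] == A[:, j].max() and B[i, j] == B[i, :].max()]

# Strategy 0 = signal capacity honestly, 1 = pad the signal (toy payoffs).
# Effort-reduction AI leaves payoffs untouched, so the equilibrium stays put.
A_before = np.array([[3, 0], [5, 1]]); B_before = A_before.T
# A mechanism-level change reallocates risk, making honesty dominant.
A_after = np.array([[5, 3], [4, 0]]);  B_after = A_after.T
```

在第一组支付下,"填充信号"是占优策略,均衡停在 (1, 1);重构支付后,均衡移到诚实申报 (0, 0),对应摘要中"只有重塑风险分配才能移动稳定系统行为"的论证。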

[AI-93] SNEAKDOOR: Stealthy Backdoor Attacks against Distribution Matching-based Dataset Condensation NEURIPS2025

【速读】:该论文旨在解决数据压缩(Dataset Condensation)过程中存在的后门攻击(Backdoor Attack)问题,即恶意触发器被注入到压缩数据集中,导致模型在推理阶段行为被操控,而现有方法难以在保持高攻击成功率的同时兼顾隐蔽性(Stealthiness),尤其是在隐藏合成数据的视觉伪影或推理时引入的扰动方面。解决方案的关键在于提出Sneakdoor框架,其核心创新是利用类别决策边界(Class Decision Boundaries)的固有脆弱性,并引入一个生成模块来构建与局部特征几何结构一致的输入感知触发器(Input-aware Triggers),从而最小化检测概率。这种联合设计使攻击在人类视觉和统计检测下均难以察觉,同时显著提升了攻击有效性与隐蔽性的平衡。

链接: https://arxiv.org/abs/2603.28824
作者: He Yang,Dongyi Lv,Song Ma,Wei Xi,Jizhong Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 29 pages, 5 figures, accepted to NeurIPS 2025

点击查看摘要

Abstract:Dataset condensation aims to synthesize compact yet informative datasets that retain the training efficacy of full-scale data, offering substantial gains in efficiency. Recent studies reveal that the condensation process can be vulnerable to backdoor attacks, where malicious triggers are injected into the condensation dataset, manipulating model behavior during inference. While prior approaches have made progress in balancing attack success rate and clean test accuracy, they often fall short in preserving stealthiness, especially in concealing the visual artifacts of condensed data or the perturbations introduced during inference. To address this challenge, we introduce Sneakdoor, which enhances stealthiness without compromising attack effectiveness. Sneakdoor exploits the inherent vulnerability of class decision boundaries and incorporates a generative module that constructs input-aware triggers aligned with local feature geometry, thereby minimizing detectability. This joint design enables the attack to remain imperceptible to both human inspection and statistical detection. Extensive experiments across multiple datasets demonstrate that Sneakdoor achieves a compelling balance among attack success rate, clean test accuracy, and stealthiness, substantially improving the invisibility of both the synthetic data and triggered samples while maintaining high attack efficacy. The code is available at this https URL.
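摘要称触发器依输入而变、且与局部特征几何对齐,但未公开具体构造;下面仅以"向目标类中心方向的微小步进"作一个假设性的输入感知触发示意(中心点、步长与维度均为虚构):

```python
import numpy as np

def input_aware_trigger(x, target_centroid, eps=0.1):
    # Hypothetical sketch: the trigger is a small step from the sample toward
    # the target-class centroid, so the perturbation follows the local feature
    # geometry of each input instead of being a fixed visible patch.
    d = target_centroid - x
    return x + eps * d / (np.linalg.norm(d) + 1e-12)

x = np.zeros(4)               # a clean sample in a toy 4-d feature space
centroid = np.ones(4)         # hypothetical target-class centroid
xt = input_aware_trigger(x, centroid)
```

扰动范数被限制为 eps,且方向随输入变化,这与摘要中"最小化可检测性"的设计动机一致,但并非论文的实际算法。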

[AI-94] Time is Not Compute: Scaling Laws for Wall-Clock Constrained Training on Consumer GPUs

【速读】:该论文旨在解决在固定壁钟时间(wall-clock)预算(如5分钟至24小时)下,如何最优选择模型规模以最大化模型质量的问题,而非传统基于计算预算(FLOPs)的优化策略。其关键解决方案在于通过系统性实验发现:在消费级GPU(RTX 4090)上,最优模型参数量 N^* 与时间预算 t 呈幂律关系 N^* ∝ t^{0.60},显著快于Chinchilla法则中基于计算量的 N^* ∝ C^{0.50};同时揭示了"双U型机制":短时间预算下的U型曲线源于计算瓶颈(欠训练),长时间预算下的U型曲线源于数据瓶颈(过拟合),中间存在一个U型消失的过渡区。这一发现为受限于时间而非算力的研究者提供了可直接应用的模型缩放准则。

链接: https://arxiv.org/abs/2603.28823
作者: Yi Liu
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling laws relate model quality to compute budget (FLOPs), but practitioners face wall-clock time constraints, not compute budgets. We study optimal model sizing under fixed time budgets from 5 minutes to 24 hours on consumer GPUs (RTX 4090). Across 70+ runs spanning 50M–1031M parameters, we find: (1) at each time budget a U-shaped curve emerges where too-small models overfit and too-large models undertrain; (2) optimal model size follows N^* \propto t^0.60, growing faster than Chinchilla's N^* \propto C^0.50, with \alpha = 0.60 \pm 0.07 robustly exceeding compute-optimal across all sensitivity analyses; (3) a dual U-shape mechanism: short-budget U-curves arise from compute bottlenecks, while long-budget U-curves emerge from data bottlenecks (overfitting), with an intermediate regime where the U-curve temporarily disappears. These findings have immediate implications for researchers training on consumer hardware, where wall-clock time – not FLOPs – is the binding constraint. We release all code, logs, and 70+ experimental configurations.
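N^* ∝ t^{0.60} 这类幂律可以在对数-对数坐标下用线性拟合直接复原指数;下面用按该幂律构造的合成数据示意拟合过程(数据点与系数 c=50 均为虚构,仅指数 0.60 取自摘要):

```python
import numpy as np

# Synthetic (time budget, optimal size) pairs drawn exactly from
# N* = c * t^0.60; a degree-1 fit in log-log space recovers the
# power-law exponent as the slope.
t = np.array([5.0, 15.0, 60.0, 240.0, 1440.0])   # minutes
N = 50.0 * t ** 0.60                              # millions of parameters (toy c)

slope, intercept = np.polyfit(np.log(t), np.log(N), 1)
```

真实实验数据会带噪声,斜率会落在摘要给出的 0.60 ± 0.07 区间内,而非精确值。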

[AI-95] GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models

【速读】:该论文旨在解决小语言模型(Small Language Models, SLMs)在面对多样化恶意攻击时安全性不足的问题,尤其关注其在边缘设备部署场景下对越狱攻击(jailbreak attacks)的脆弱性。现有防御机制因缺乏对模型各层内部表征(internal representations)的理解而表现不佳。解决方案的关键在于通过分析不同输入类型在模型各层隐藏激活(hidden-layer activations)中形成的可区分模式,提出一种轻量级、基于token激活的防御方法GUARD-SLM,该方法在推理阶段直接作用于表示空间以过滤恶意提示,同时保留良性输入,从而提升SLMs的安全性和实用性。

链接: https://arxiv.org/abs/2603.28817
作者: Md Jueal Mia,Joaquin Molto,Yanzhao Wu,M. Hadi Amini
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Small Language Models (SLMs) are emerging as efficient and economically viable alternatives to Large Language Models (LLMs), offering competitive performance with significantly lower computational costs and latency. These advantages make SLMs suitable for resource-constrained and efficient deployment on edge devices. However, existing jailbreak defenses show limited robustness against heterogeneous attacks, largely due to an incomplete understanding of the internal representations across different layers of language models that facilitate jailbreak behaviors. In this paper, we conduct a comprehensive empirical study on 9 jailbreak attacks across 7 SLMs and 3 LLMs. Our analysis shows that SLMs remain highly vulnerable to malicious prompts that bypass safety alignment. We analyze hidden-layer activations across different layers and model architectures, revealing that different input types form distinguishable patterns in the internal representation space. Based on this observation, we propose GUARD-SLM, a lightweight token activation-based method that operates in the representation space to filter malicious prompts during inference while preserving benign ones. Our findings highlight robustness limitations across layers of language models and provide a practical direction for secure small language model deployment.
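论文方法在表示空间中过滤恶意提示,但判别器细节未在摘要中给出;下面以"类均值原型 + 最近原型判别"作一个极简的假设性示意,激活向量为合成的高斯数据:

```python
import numpy as np

def fit_prototypes(acts_benign, acts_malicious):
    # One mean activation vector ("prototype") per prompt class.
    return acts_benign.mean(axis=0), acts_malicious.mean(axis=0)

def is_malicious(act, proto_benign, proto_malicious):
    # Flag prompts whose hidden-layer activation is nearer the malicious prototype.
    return bool(np.linalg.norm(act - proto_malicious)
                < np.linalg.norm(act - proto_benign))

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(50, 8))       # synthetic activations
malicious = rng.normal(3.0, 1.0, size=(50, 8))    # distinguishable pattern
pb, pm = fit_prototypes(benign, malicious)
```

这对应摘要的核心观察:不同输入类型在隐藏层激活空间中形成可区分的模式,因此一个轻量的表示空间判别器即可在推理时过滤恶意提示。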

[AI-96] ARTLAS: Mapping Art-Technology Institutions via Conceptual Axes Text Embeddings and Unsupervised Clustering

【速读】:该论文旨在解决艺术与技术交叉领域机构(如节庆、双年展、研究实验室、会议等)日益多样化背景下,缺乏系统性分析框架的问题。其核心挑战在于如何量化并可视化这些机构在多维特征上的差异与关联,从而揭示其生态结构。解决方案的关键在于提出ARTLAS方法论:首先构建一个包含八个维度(策展哲学、地域关系、知识生产模式、机构谱系、时间导向、生态系统功能、受众关系和学科定位)的八轴概念框架,随后利用E5-large-v2句嵌入编码文本描述,并通过词级码本量化为TF-IDF特征向量;再结合UMAP降维与层次聚类(平均链接,k=10)实现高保真分组,最终借助非负矩阵分解提取潜在主题及邻域-簇熵识别边界机构,形成可交互的可视化平台。该方案实现了对78个文化科技机构的统一分析空间映射,验证了其在揭示集群结构(如艺术科学枢纽、产业创新集群、学术社区等)方面的有效性。

链接: https://arxiv.org/abs/2603.28816
作者: Joonhyung Bae
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The global landscape of art-technology institutions, including festivals, biennials, research labs, conferences, and hybrid organizations, has grown increasingly diverse, yet systematic frameworks for analyzing their multidimensional characteristics remain scarce. This paper proposes ARTLAS, a computational methodology combining an eight-axis conceptual framework (Curatorial Philosophy, Territorial Relation, Knowledge Production Mode, Institutional Genealogy, Temporal Orientation, Ecosystem Function, Audience Relation, and Disciplinary Positioning) with a text-embedding and clustering pipeline to map 78 cultural-technology institutions into a unified analytical space. Each institution is characterized through qualitative descriptions along the eight axes, encoded via E5-large-v2 sentence embeddings and quantized through a word-level codebook into TF-IDF feature vectors. Dimensionality reduction using UMAP, followed by agglomerative clustering (Average linkage, k=10), yields a composite score of 0.825, a silhouette coefficient of 0.803, and a Calinski-Harabasz index of 11,196. Non-negative matrix factorization extracts ten latent topics, and a neighbor-cluster entropy measure identifies boundary institutions bridging multiple thematic communities. An interactive web-based visualization tool built with React enables stakeholders to explore institutional similarities, thematic profiles, and cross-disciplinary connections. The results reveal coherent groupings such as an art-science hub cluster anchored by ZKM and ArtScience Museum, an innovation and industry cluster including Ars Electronica, transmediale, and Sonar, an ACM academic community cluster comprising TEI, DIS, and NIME, and an electronic music and media cluster including CTM Festival, MUTEK, and Sonic Acts. This work contributes a replicable, data-driven approach to institutional ecology in the cultural-technology sector.
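摘要中的"邻域-簇熵"用于识别桥接多个主题社区的边界机构;下面按其直观定义(近邻簇标签分布的香农熵)给出一个示意实现,簇名为虚构示例:

```python
import numpy as np
from collections import Counter

def neighbor_cluster_entropy(neighbor_labels):
    # Shannon entropy of the cluster labels among an institution's nearest
    # neighbors; high entropy marks a boundary institution bridging clusters.
    counts = np.array(list(Counter(neighbor_labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

core = neighbor_cluster_entropy(["art-science"] * 5)   # all neighbors in one cluster
boundary = neighbor_cluster_entropy(
    ["art-science", "industry", "academic", "industry", "academic"])
```

近邻全部落在同一簇时熵为 0(典型核心机构),近邻分散在多个簇时熵升高,即摘要所述的边界机构。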

[AI-97] SkillTester: Benchmarking Utility and Security of Agent Skills

【速读】:该论文旨在解决智能体(Agent)技能在实用性与安全性方面缺乏系统化评估工具的问题。当前生成式 AI(Generative AI)驱动的智能体应用日益广泛,但其技能模块的效能和潜在安全风险难以量化衡量,导致部署前的质量保障不足。解决方案的关键在于提出 SkillTester 工具及其评估框架:通过对比基线执行与启用技能后的执行结果,结合独立的安全探针套件,将原始执行产物标准化为实用得分、安全得分及三级安全状态标签,从而实现对技能的可比较、可解释的质量评估。该框架基于“实用性比较原则”和“用户友好性原则”,为代理优先(Agent-first)环境中的技能质量控制提供了结构化验证机制。

链接: https://arxiv.org/abs/2603.28815
作者: Leye Wang,Zixing Wang,Anjie Xu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Technical report, 13 pages, 2 figures, 9 tables. Project page: this https URL . Code: this https URL

点击查看摘要

Abstract:This technical report presents SkillTester, a tool for evaluating the utility and security of agent skills. Its evaluation framework combines paired baseline and with-skill execution conditions with a separate security probe suite. Grounded in a comparative utility principle and a user-facing simplicity principle, the framework normalizes raw execution artifacts into a utility score, a security score, and a three-level security status label. More broadly, it can be understood as a comparative quality-assurance harness for agent skills in an agent-first world. The public service is deployed at this https URL, and the broader project is maintained at this https URL.
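摘要描述了"配对基线/启用技能执行 + 独立安全探针 → 实用得分、安全得分与三级安全状态"的归一化流程,但未公开具体规则;下面给出一个完全假设性的打分函数示意(阈值与标签名均为虚构):

```python
def score_skill(baseline_pass_rate, with_skill_pass_rate, probe_results):
    # Hypothetical normalization sketch: utility compares the paired runs
    # (comparative utility principle); security is the fraction of probes
    # passed, bucketed into a three-level status label.
    utility = with_skill_pass_rate - baseline_pass_rate
    security = sum(probe_results) / len(probe_results)
    if security >= 0.9:
        status = "secure"
    elif security >= 0.6:
        status = "caution"
    else:
        status = "unsafe"
    return utility, security, status

# 0.4 baseline pass rate, 0.7 with the skill enabled, 3 of 4 probes passed.
u, sec, status = score_skill(0.4, 0.7, [1, 1, 1, 0])
```

实用得分取配对差值而非绝对值,体现了摘要中"对比式质量保障"的思路:技能的价值只能相对于无技能基线来度量。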

[AI-98] WAter: A Workload-Adaptive Knob Tuning System based on Workload Compression

【速读】:该论文旨在解决数据库管理系统(Database Management System, DBMS)参数调优过程中存在的高成本问题,尤其是由于单次配置评估需运行完整工作负载而导致的运行时间过长。现有方法主要通过提升采样效率来减少评估配置数量,但对降低每次评估耗时的关注不足。其解决方案的关键在于提出WAter系统,该系统将调优过程划分为多个时间片段,在每个片段中仅评估工作负载中的小部分查询,并利用运行时性能特征动态识别更具代表性的查询子集用于后续评估;同时在每个时间片段末尾对表现最优的配置进行全量工作负载验证,从而在显著降低调优时间(最多减少73.5%)的同时获得更优性能(最高提升16.2%)。

链接: https://arxiv.org/abs/2603.28809
作者: Yibo Wang,Jiale Lao,Chen Zhang,Cehua Yang,Jianguo Wang,Mingjie Tang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Selecting appropriate values for the configurable parameters of Database Management Systems (DBMS) to improve performance is a significant challenge. Recent machine learning (ML)-based tuning systems have shown strong potential, but their practical adoption is often limited by the high tuning cost. This cost arises from two main factors: (1) the system needs to evaluate a large number of configurations to identify a satisfactory one, and (2) for each configuration, the system must execute the entire target workload on the DBMS, which is time-consuming. Existing studies have primarily addressed the first factor by improving sample efficiency, that is, by reducing the number of configurations evaluated. However, the second factor, improving runtime efficiency by reducing the time required for each evaluation, has received limited attention and remains an underexplored direction. We develop WAter, a runtime-efficient and workload-adaptive tuning system that finds near-optimal configurations at a fraction of the tuning cost compared with state-of-the-art methods. We divide the tuning process into multiple time slices and evaluate only a small subset of queries from the workload in each slice. Different subsets are evaluated across slices, and a runtime profile is used to dynamically identify more representative subsets for evaluation in subsequent slices. At the end of each time slice, the most promising configurations are evaluated on the original workload to measure their actual performance. Evaluations demonstrate that WAter identifies the best-performing configurations with up to 73.5% less tuning time and achieves up to 16.2% higher performance than the best-performing alternative.
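论文按时间片评估工作负载子集,并用运行时特征挑选更具代表性的查询;具体选择策略未在摘要中给出,下面以"跨配置延迟方差最大的查询最具信息量"这一假设作一个示意(延迟数据为虚构):

```python
import numpy as np

def pick_subset(runtime_profile, k):
    # Hypothetical heuristic: queries whose latency varies most across
    # previously evaluated configurations are the most informative to re-run
    # in the next time slice.
    variances = runtime_profile.var(axis=1)
    return np.argsort(variances)[-k:]

# rows = queries, cols = per-configuration latencies observed so far
profile = np.array([[1.0, 1.1, 0.9],    # insensitive to knobs
                    [5.0, 9.0, 2.0],    # highly knob-sensitive
                    [2.0, 2.0, 2.1]])   # insensitive to knobs
subset = pick_subset(profile, 1)
```

对旋钮不敏感的查询无论配置如何延迟都差不多,跳过它们几乎不损失信息,这是时间片内只跑子集仍能有效评估配置的直觉。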

[AI-99] Design and Development of an ML/DL Attack Resistance of RC-Based PUF for IoT Security

【速读】:该论文旨在解决物联网(IoT)设备中物理不可克隆函数(Physical Unclonable Function, PUF)面临机器学习/深度学习(Machine Learning/Deep Learning, ML/DL)建模攻击的安全威胁问题,即攻击者利用ML算法学习PUF的挑战-响应对(Challenge-Response Pair, CRP)模式从而预测其输出。解决方案的关键在于设计了一种基于32位挑战-响应对的电阻-电容(Resistor-Capacitor, RC)结构的动态可重构PUF架构,该架构通过引入动态重配置机制显著增强了对ML建模攻击的鲁棒性:实验表明,尽管多种主流机器学习模型(包括人工神经网络ANN、梯度提升神经网络GBNN、决策树DT、随机森林RF和XGBoost)在训练集上均达到100%准确率,但在测试集上的性能仅接近随机猜测水平(50.06%–53.27%),证明其能有效抵御高级建模攻击,且资源开销极低,适合作为下一代IoT安全认证的轻量级替代方案。

链接: https://arxiv.org/abs/2603.28798
作者: Joy Acharya,Smit Patel,Paawan Sharma,Mohendra Roy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for the IEEE GCON 2026 conference, organized by IIT Guwahati

点击查看摘要

Abstract:Physically Unclonable Functions (PUFs) provide promising hardware security for IoT authentication, leveraging inherent randomness suitable for resource constrained environments. However, ML/DL modeling attacks threaten PUF security by learning challenge-response patterns. This work introduces a custom resistor-capacitor (RC) based dynamically reconfigurable PUF using 32-bit challenge-response pairs (CRPs) designed to resist such attacks. We systematically evaluated robustness by generating a CRP dataset and splitting it into training, validation, and test sets. Multiple ML techniques including Artificial Neural Networks (ANN), Gradient Boosted Neural Networks (GBNN), Decision Trees (DT), Random Forests (RF), and XGBoost, were trained to model PUF behavior. While all models achieved 100% training accuracy, test performance remained near random guessing: 51.05% (ANN), 53.27% (GBNN), 50.06% (DT), 52.08% (RF), and 50.97% (XGBoost). These results demonstrate the proposed PUF’s strong resistance to ML-driven modeling attacks, as advanced algorithms fail to reproduce accurate responses. The dynamically reconfigurable architecture enhances robustness against adversarial threats with minimal resource overhead. This simple RC-PUF offers an effective, low-cost alternative to complex encryption for securing next-generation IoT authentication against machine learning-based threats, ensuring reliable device verification without compromising computational efficiency or scalability in deployed IoT networks.

[AI-100] GaloisSAT: Differentiable Boolean Satisfiability Solving via Finite Field Algebra

【速读】:该论文旨在解决布尔可满足性(Boolean satisfiability, SAT)问题求解效率提升缓慢的问题,尽管过去二十年间算法持续演进,但SAT求解器的性能改进仍受限于传统方法的瓶颈。其解决方案的关键在于提出一种新型混合GPU-CPU架构的SAT求解器GaloisSAT:首先利用基于现代机器学习基础设施的可微分SAT求解引擎在GPU上进行快速推理,随后将结果交由CPU上的经典CDCL(Conflict-Driven Clause Learning)求解阶段进一步优化。这种两阶段协同策略显著提升了整体求解效率,在SAT Competition 2024基准测试中实现了8.41倍的可满足类别加速和1.29倍的不可满足类别加速,优于当前最优基线Kissat与CaDiCaL。

链接: https://arxiv.org/abs/2603.28796
作者: Curie Kim,Carsten Portner,Mingju Liu,Steve Dai,Haoxing Ren,Brucek Khailany,Alvaro Velasquez,Ismail Alkhouri,Cunxi Yu
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Boolean satisfiability (SAT) problem, the first problem proven to be NP-complete, has become a fundamental challenge in computational complexity, with widespread applications in optimization and verification across many domains. Despite significant algorithmic advances over the past two decades, the performance of SAT solvers has improved at a limited pace. Notably, the 2025 competition winner shows only about a 2X improvement over the 2006 winner in SAT Competition performance after nearly 20 years of effort. This paper introduces GaloisSAT, a novel hybrid GPU-CPU SAT solver that integrates a differentiable SAT solving engine powered by modern machine learning infrastructure on GPUs, followed by a traditional CDCL-based SAT solving stage on CPUs. GaloisSAT is benchmarked against the latest versions of state-of-the-art solvers, Kissat and CaDiCaL, using the SAT Competition 2024 benchmark suite. Results demonstrate substantial improvements in the official SAT Competition metric PAR-2 (penalized average runtime with a timeout of 5,000 seconds and a penalty factor of 2). Specifically, GaloisSAT achieves an 8.41X speedup in the satisfiable category and a 1.29X speedup in the unsatisfiable category compared to the strongest baselines.
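"可微SAT求解"的一个通用思路是把每个变量松弛为取值于 (0,1) 的概率、把子句的不满足度写成可微损失再做梯度下降;论文基于有限域代数的具体构造未在摘要中给出,下面仅给出这一标准乘积松弛的玩具示意(数值梯度,公式与 GaloisSAT 的实际引擎无关):

```python
import numpy as np

def clause_loss(x, clauses):
    # x[i] in (0,1): probability that variable i+1 is True.  Each clause
    # contributes the product of its literals' "falseness"; the total loss
    # is 0 iff the rounded assignment satisfies every clause.
    total = 0.0
    for clause in clauses:
        term = 1.0
        for lit in clause:                  # DIMACS-style signed literals
            v = x[abs(lit) - 1]
            term *= (1.0 - v) if lit > 0 else v
        total += term
    return total

clauses = [(1, 2), (-1, 3)]                 # (x1 or x2) and (not x1 or x3)
x = np.full(3, 0.5)
for _ in range(100):                        # plain gradient descent, numerical grad
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = 1e-5
        g[i] = (clause_loss(x + e, clauses) - clause_loss(x - e, clauses)) / 2e-5
    x = np.clip(x - 0.5 * g, 0.0, 1.0)

assignment = x > 0.5                        # round back to Booleans
```

这种松弛天然适合 GPU 上的批量梯度计算,找到的近似解再交给 CPU 端的 CDCL 阶段精化,对应摘要描述的混合两阶段架构。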

[AI-101] AI in Work-Based Learning: Understanding the Purposes and Effects of Intelligent Tools Among Student Interns

【速读】:该论文旨在解决菲律宾高等教育中学生实习期间如何有效利用智能工具(如生成式AI)以提升工作准备度的问题。研究发现,学生在实习中主要将AI工具用于提高生产力、撰写报告、辅助沟通与内容创作、技术协助及独立任务完成等场景,其中ChatGPT使用最广泛。解决方案的关键在于:高校应系统性地将AI素养(AI literacy)纳入课程体系,并提供针对性的上手培训(onboarding),同时制定明确的政策保障公平获取与负责任使用AI工具,从而增强学生在数字化职场中的适应能力与伦理意识。

链接: https://arxiv.org/abs/2603.28786
作者: John Paul P. Miranda,Rhiziel P. Manalese,Sheila M. Geronimo,Vernon Grace M. Maniago,Charlie K. Padilla,Aileen P. De Leon,Santa L. Merle,Mark Anthony A. Castro
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 5 pages, 2 tables, conference proceedings

点击查看摘要

Abstract:This study examined how student interns in Philippine higher education use intelligent tools during their on-the-job training (OJT). Data were collected from 384 respondents using a structured questionnaire that asked about AI tool usage, task-specific applications, and perceptions of confidence, ethics, and support. Analysis of task-based usage identified four main purposes: productivity and report writing, communication and content drafting, technical assistance and code support, and independent task completion. ChatGPT was the most commonly used AI tool, followed by Quillbot, Canva AI, and Grammarly. Students reported moderate confidence in using AI and applied these tools selectively and ethically during OJT tasks. These results indicate that AI tools assist student interns in various OJT activities related to work-readiness. The study suggests that higher education programs include AI literacy and onboarding. Clear policies and fair access to AI tools are important to support responsible use and prepare students for future careers.

[AI-102] Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding

【速读】:该论文旨在解决分布式训练(Distributed Training, DT)在面临拜占庭攻击(Byzantine attacks)且存在通信约束条件下的鲁棒性不足问题。现有方法虽通过服务器端的鲁棒聚合规则提升抗攻击能力,但无法有效应对因设备间数据异质性导致的局部梯度差异较大时所引发的解误差不收敛问题。解决方案的关键在于提出一种基于循环梯度编码(cyclic gradient coding)的新型分布式训练方法——LAD(Learning with Aggregated Data),其核心机制是在每轮迭代中利用循环梯度编码冗余分配计算任务,使诚实设备对固定数量的数据子集进行本地梯度计算并编码传输;服务器通过鲁棒聚合规则融合来自诚实设备的编码向量与潜在恶意设备的错误信息,借助设备间计算冗余实现理论上的收敛性能保障,从而显著降低解误差并增强对拜占庭攻击的鲁棒性。进一步地,作者还提出了压缩版的Com-LAD,以在受限通信环境下进一步减少通信开销。

链接: https://arxiv.org/abs/2603.28780
作者: Chengxi Li,Youssef Allouah,Rachid Guerraoui,Mikael Skoglund,Ming Xiao
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we study the problem of distributed training (DT) under Byzantine attacks with communication constraints. While prior work has developed various robust aggregation rules at the server to enhance robustness to Byzantine attacks, the existing methods suffer from a critical limitation in that the solution error does not diminish when the local gradients sent by different devices vary considerably, as a result of data heterogeneity among the subsets held by different devices. To overcome this limitation, we propose a novel DT method, cyclic gradient coding-based DT (LAD). In LAD, the server allocates the entire training dataset to the devices before training begins. In each iteration, it assigns computational tasks redundantly to the devices using cyclic gradient coding. Each honest device then computes local gradients on a fixed number of data subsets and encodes the local gradients before transmitting to the server. The server aggregates the coded vectors from the honest devices and the potentially incorrect messages from Byzantine devices using a robust aggregation rule. Leveraging the redundancy of computation across devices, the convergence performance of LAD is analytically characterized, demonstrating improved robustness against Byzantine attacks and significantly lower solution error. Furthermore, we extend LAD to a communication-efficient variant, compressive and cyclic gradient coding-based DT (Com-LAD), which further reduces communication overhead under constrained settings. Numerical results validate the effectiveness of the proposed methods in enhancing both Byzantine resilience and communication efficiency.
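LAD 的冗余分配采用循环梯度编码:第 i 台设备计算子集 i, i+1, ..., i+r-1 (mod n),使每个子集被恰好 r 台设备重复计算,从而为鲁棒聚合提供冗余;下面示意该分配(设备数与冗余度为示例取值):

```python
from collections import Counter

def cyclic_assignment(n_devices, redundancy):
    # Device i is assigned data subsets i, i+1, ..., i+r-1 (mod n),
    # so every subset is computed by exactly `redundancy` devices.
    return [[(i + j) % n_devices for j in range(redundancy)]
            for i in range(n_devices)]

assignment = cyclic_assignment(n_devices=5, redundancy=3)
# How many devices cover each subset:
coverage = Counter(s for row in assignment for s in row)
```

冗余度 r 决定了可容忍的拜占庭设备数:即使部分设备发送错误消息,每个子集仍有多份诚实副本可供服务器的鲁棒聚合规则使用。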

[AI-103] Four Generations of Quantum Biomedical Sensors

【速读】:该论文旨在解决量子生物传感技术在临床转化过程中面临的瓶颈问题,即传统传感器受限于经典噪声极限且依赖宏观粒子集合,难以实现超高灵敏度与生物信息结构化提取。其解决方案的关键在于提出一个统一的四代演化框架,明确不同代际量子生物传感器对量子资源(如能级、相干性、纠缠和自旋压缩)的利用方式:从第一代基于离散能级的经典信号转换,到第二代利用量子相干性达到标准量子极限,再到第三代通过纠缠和自旋压缩逼近海森堡极限精度;最终第四代创新性地将量子传感与量子学习及变分电路端到端集成,实现在量子域内直接进行自适应推理,从而推动从物理可观测量测量向结构化生物信息提取的跃迁。

链接: https://arxiv.org/abs/2603.29944
作者: Xin Jin,Priyam Srivastava,Ronghe Wang,Yuqing Li,Jonathan Beaumariage,Tom Purdy,M. V. Gurudev Dutt,Kang Kim,Kaushik Seshadreesan,Junyu Liu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Quantum sensing technologies offer transformative potential for ultra-sensitive biomedical sensing, yet their clinical translation remains constrained by classical noise limits and a reliance on macroscopic ensembles. We propose a unifying generational framework to organize the evolving landscape of quantum biosensors based on their utilization of quantum resources. First-generation devices utilize discrete energy levels for signal transduction but follow classical scaling laws. Second-generation sensors exploit quantum coherence to reach the standard quantum limit, while third-generation architectures leverage entanglement and spin squeezing to approach Heisenberg-limited precision. We further define an emerging fourth generation characterized by the end-to-end integration of quantum sensing with quantum learning and variational circuits, enabling adaptive inference directly within the quantum domain. By analyzing critical parameters such as bandwidth matching and sensor-tissue proximity, we identify key technological bottlenecks and propose a roadmap for transitioning from measuring physical observables to extracting structured biological information with quantum-enhanced intelligence.
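文中第二代与第三代传感器的分界可由精度标度律体现:标准量子极限(SQL)下相位不确定度按 1/√N 缩放,海森堡极限下按 1/N 缩放;下面用几行 NumPy 对比两者的纠缠增益(N 取值为示例):

```python
import numpy as np

N = np.array([10.0, 100.0, 1000.0, 10000.0])   # number of probes / spins
sql = 1.0 / np.sqrt(N)        # standard quantum limit (2nd-generation sensors)
heisenberg = 1.0 / N          # Heisenberg limit (3rd-generation, entangled)
gain = sql / heisenberg       # entanglement advantage grows as sqrt(N)
```

对 10^4 个探针,纠缠带来 100 倍的精度优势,这解释了为何第三代架构值得付出制备纠缠态与自旋压缩态的额外代价。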

[AI-104] Bethe Ansatz with a Large Language Model

【速读】:该论文旨在探索大型语言模型(Large Language Model, LLM)在数学物理领域中执行特定计算任务的能力,具体为求解选定可积自旋链模型的坐标贝特 ansatz(Bethe Ansatz)解。其核心问题是:LLM是否能够半自主地完成复杂且尚未公开的贝特 ansatz 推导,尤其是在涉及新哈密顿量和非平凡对称性结构的情况下。解决方案的关键在于利用 ChatGPT 5.2 Pro 和 5.4 Pro 等先进 LLM 自主生成推导过程,并结合人类研究人员对中间结果进行校验与修正,最终获得与精确对角化方法一致的解析解。特别地,LLM成功识别出一个打破左右对称性但具有 PT 对称性的模型,以及一个嵌套贝特 ansatz 中存在自由费米子结构但缺乏 U(1) 对称性的独特相互作用模型,这些发现体现了 LLM 在理论物理中辅助发现新结构的潜力。

链接: https://arxiv.org/abs/2603.29932
作者: Balázs Pozsgay,István Vona
机构: 未知
类目: atistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th)
备注: 40 pages

点击查看摘要

Abstract:We explore the capability of a Large Language Model (LLM) to perform specific computations in mathematical physics: the task is to compute the coordinate Bethe Ansatz solution of selected integrable spin chain models. We select three integrable Hamiltonians for which the solutions were unpublished; two of the Hamiltonians are actually new. We observed that the LLM semi-autonomously solved the task in all cases, with a few mistakes along the way. These were corrected after the human researchers spotted them. The results of the LLM were checked against exact diagonalization (performed by separate programs), and the derivations were also checked by the authors. The Bethe Ansatz solutions are interesting in themselves. Our second model manifestly breaks left-right invariance, but it is PT-symmetric, therefore its solution could be interesting for applications in Generalized Hydrodynamics. And our third model is solved by a special form of the nested Bethe Ansatz, where the model is interacting, but the nesting level has a free fermionic structure lacking U(1) -invariance. This structure appears to be unique and it was found by the LLM. We used ChatGPT 5.2 Pro and 5.4 Pro by OpenAI.
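摘要未给出三个模型的哈密顿量或其具体解;下面仅列出坐标贝特拟设(coordinate Bethe Ansatz)的通用两磁子形式与周期性贝特方程,作为该方法的一般性示意(与论文的具体模型无关,S 为两体散射因子):

```latex
% Generic coordinate Bethe ansatz, two-magnon sector (illustrative only)
\psi(x_1, x_2) = A_{12}\, e^{i(k_1 x_1 + k_2 x_2)}
               + A_{21}\, e^{i(k_2 x_1 + k_1 x_2)}, \qquad x_1 < x_2,
\qquad \frac{A_{21}}{A_{12}} = S(k_1, k_2),

% Periodic Bethe equations on a chain of length L
e^{i k_j L} = \prod_{l \neq j} S(k_j, k_l).
```

论文中 LLM 的任务即为具体模型推导出相应的散射因子与(嵌套)贝特方程,并与精确对角化结果比对。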

[AI-105] Reducing Complexity for Quantum Approaches in Train Load Optimization

【速读】: This paper addresses the computational complexity that rehandle operations introduce into the Train Load Optimization (TLO) problem of loading containers onto trains. Conventional mathematical models introduce explicit binary variables and a large number of logical constraints for every potential rehandle, yielding very large models that are hard to solve. The key to the proposed solution is a novel, compact mathematical formulation that embeds the rehandle cost implicitly in the objective function, removing the need for dedicated rehandle variables and their associated constraints and thereby significantly reducing the number of variables and constraints, improving solution efficiency and scalability.

链接: https://arxiv.org/abs/2603.29543
作者: Zhijie Tang,Albert Nieto-Morales,Arit Kumar Bishwas
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Efficiently planning container loads onto trains is a computationally challenging combinatorial optimization problem, central to logistics and supply chain management. A primary source of this complexity arises from the need to model and reduce rehandle operations-unproductive crane moves required to access blocked containers. Conventional mathematical formulations address this by introducing explicit binary variables and a web of logical constraints for each potential rehandle, resulting in large-scale models that are difficult to solve. This paper presents a fundamental departure from this paradigm. We introduce an innovative and compact mathematical formulation for the Train Load Optimization (TLO) problem where the rehandle cost is calculated implicitly within the objective function. This novel approach helps prevent the need for dedicated rehandle variables and their associated constraints, leading to a dramatic reduction in model size. We provide a formal comparison against a conventional model to analytically demonstrate the significant reduction in the number of variables and constraints. The efficacy of our compact formulation is assessed through a simulated annealing metaheuristic, which finds high-quality loading plans for various problem instances. The results confirm that our model is not only more parsimonious but also practically effective, offering a scalable and powerful tool for modern rail logistics.
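The abstract evaluates the compact formulation with a simulated annealing metaheuristic. As a rough, self-contained illustration of that combination (not the paper's formulation), the toy sketch below uses a hypothetical implicit rehandle cost, counting each container stacked on top of one that must depart earlier, and searches loading plans with a standard annealing loop; the plan encoding, cost function, and neighbor move are all invented for this example:

```python
import math
import random

def rehandle_cost(plan):
    """Hypothetical implicit rehandle penalty: count each container stacked
    on top of one that must depart earlier (the blocked container below
    would force an unproductive crane move). plan: list of stacks, each a
    bottom-to-top list of (container_id, departure_order) tuples."""
    cost = 0
    for stack in plan:
        for lower, upper in zip(stack, stack[1:]):
            if upper[1] > lower[1]:  # upper departs later -> blocks the lower one
                cost += 1
    return cost

def simulated_annealing(plan, steps=2000, t0=1.0, alpha=0.999, seed=0):
    """Generic annealing loop over loading plans; the neighbor move swaps
    two randomly chosen containers (assumes all stacks are non-empty)."""
    rng = random.Random(seed)
    current = [list(s) for s in plan]
    best, best_cost = [list(s) for s in current], rehandle_cost(current)
    cur_cost, t = best_cost, t0
    for _ in range(steps):
        cand = [list(s) for s in current]
        s1, s2 = rng.randrange(len(cand)), rng.randrange(len(cand))
        i, j = rng.randrange(len(cand[s1])), rng.randrange(len(cand[s2]))
        cand[s1][i], cand[s2][j] = cand[s2][j], cand[s1][i]
        c = rehandle_cost(cand)
        # Accept improvements always, worsenings with Boltzmann probability.
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / max(t, 1e-9)):
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = [list(s) for s in cand], c
        t *= alpha
    return best, best_cost
```

Because the rehandle penalty lives entirely inside the cost function, no rehandle variables or constraints appear anywhere in the search, which is the spirit of the paper's compact formulation.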

[AI-106] Economics of Human and AI Collaboration: When is Partial Automation More Attractive than Full Automation?

【速读】: This paper tackles how to rigorously evaluate the optimal degree of task automation at the firm level, moving beyond the traditional binary automate-or-not decision by modeling automation intensity as a continuous choice. The key to the solution is a unified framework that analyzes supply and demand jointly. On the supply side, an AI production function is estimated via scaling-law experiments, revealing the relationship between model performance and data, compute, and model size, and showing that the cost of highly accurate AI grows convexly, so that full automation is often not cost-minimizing. On the demand side, an entropy-based measure of task complexity maps AI accuracy to a labor substitution ratio, quantifying how much human labor is displaced at each accuracy level. The study shows that partial automation, in which humans retain the residual tasks, is usually the economically optimal equilibrium, especially for low-complexity tasks, and that deployment at scale (e.g., AI-as-a-Service) sharply expands the range of economically viable automation tasks, enabling broader and more efficient allocation of labor.

链接: https://arxiv.org/abs/2603.29121
作者: Wensu Li,Atin Aboutorabi,Harry Lyu,Kaizhi Qian,Martin Fleming,Brian C. Goehring,Neil Thompson
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This paper develops a unified framework for evaluating the optimal degree of task automation. Moving beyond binary automate-or-not assessments, we model automation intensity as a continuous choice in which firms minimize costs by selecting an AI accuracy level, from no automation through partial human-AI collaboration to full automation. On the supply side, we estimate an AI production function via scaling-law experiments linking performance to data, compute, and model size. Because AI systems exhibit predictable but diminishing returns to these inputs, the cost of higher accuracy is convex: good performance may be inexpensive, but near-perfect accuracy is disproportionately costly. Full automation is therefore often not cost-minimizing; partial automation, where firms retain human workers for residual tasks, frequently emerges as the equilibrium. On the demand side, we introduce an entropy-based measure of task complexity that maps model accuracy into a labor substitution ratio, quantifying human labor displacement at each accuracy level. We calibrate the framework with O*NET task data, a survey of 3,778 domain experts, and GPT-4o-derived task decompositions, implementing it in computer vision. Task complexity shapes substitution: low-complexity tasks see high substitution, while high-complexity tasks favor limited partial automation. Scale of deployment is a key determinant: AI-as-a-Service and AI agents spread fixed costs across users, sharply expanding economically viable tasks. At the firm level, cost-effective automation captures approximately 11% of computer-vision-exposed labor compensation; under economy-wide deployment, this share rises sharply. Since other AI systems exhibit similar scaling-law economics, our mechanisms extend beyond computer vision, reinforcing that partial automation is often the economically rational long-run outcome, not merely a transitional phase.
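The entropy-based complexity measure and the accuracy-to-substitution mapping can be illustrated with a toy sketch. Only the Shannon-entropy part below is standard; the `substitution_ratio` functional form is a hypothetical stand-in for the paper's calibrated mapping:

```python
import math

def task_entropy(outcome_probs):
    """Shannon entropy (in bits) of a task's outcome distribution; more
    outcomes, more evenly weighted -> higher entropy -> higher complexity."""
    return -sum(p * math.log2(p) for p in outcome_probs if p > 0)

def substitution_ratio(accuracy, entropy):
    """Hypothetical accuracy-to-labor-substitution mapping: a given model
    accuracy displaces less labor on high-entropy (complex) tasks. The
    form accuracy**(1 + entropy) is an illustrative assumption only."""
    return accuracy ** (1.0 + entropy)
```

Under this toy mapping, 90%-accurate AI on a 2-bit task substitutes for less labor than the same accuracy on a 0.5-bit task, mirroring the paper's finding that low-complexity tasks see high substitution while high-complexity tasks favor limited partial automation.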

[AI-107] A Multi-Modal Dataset for Ground Reaction Force Estimation Using Consumer Wearable Sensors

【速读】: This paper addresses the problem of estimating vertical ground reaction force (vGRF) from consumer-grade wearables (Apple Watch). The core challenge is inferring, from low-cost, low-precision inertial measurement unit (IMU) sensor data, the high-precision vGRF signal provided by laboratory force plates. The key to the solution is a multi-modal, openly shared dataset of 492 validated trials, 395 of which are triad-complete (synchronized wrist, waist, and force-plate data), with data quality assured through a cross-sensor consistency and repeatability analysis framework (e.g., intraclass correlation coefficients of 0.871-0.990), thereby supporting reproducible evaluation of machine learning models for vGRF estimation and studies of sensor-placement effects.

链接: https://arxiv.org/abs/2603.28784
作者: Parvin Ghaffarzadeh,Debarati Chakraborty,Koorosh Aslansefat,Ali Dostan,Yiannis Papadopoulos
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This Data Descriptor presents a fully open, multi-modal dataset for estimating vertical ground reaction force (vGRF) from consumer-grade Apple Watch sensors with laboratory force plate ground truth. Ten healthy adults aged 26–41 years performed five activities: walking, jogging, running, heel drops, and step drops, while wearing two Apple Watches positioned at the left wrist and waist. The dataset contains 492 validated trials with time-aligned inertial measurement unit (IMU) recordings (approximately 100 Hz) and force plate vGRF (Force_Z, 1000 Hz). The release includes raw and processed time series, trial-level metadata, quality-control flags, and machine-readable data dictionaries. Trial-level matching manifests link recordings across modalities using stable identifiers. Of the 492 validated trials, 395 are triad-complete, containing wrist, waist, and force plate data, enabling cross-sensor analyses and reproducible model evaluation. Dataset quality is characterised through a three-phase cross-sensor plausibility and consistency framework, repeatability analysis of peak vGRF (intraclass correlation coefficient 0.871–0.990), and systematic checks of force ranges and trial completeness. Monte Carlo sensitivity analysis showed that correlation-based validation metrics were robust to single-sample timing perturbations at the IMU sampling resolution. All data are released under CC BY 4.0, with analysis scripts archived alongside the dataset and mirrored on GitHub. This resource supports reproducible research in wearable biomechanics, benchmarking of machine learning models for vGRF estimation, and investigation of sensor placement effects using widely available consumer wearables.

机器学习

[LG-0] Refined Detection for Gumbel Watermarking

链接: https://arxiv.org/abs/2603.30017
作者: Tor Lattimore
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a simple detection mechanism for the Gumbel watermarking scheme proposed by Aaronson (2022). The new mechanism is proven to be near-optimal in a problem-dependent sense among all model-agnostic watermarking schemes under the assumption that the next-token distribution is sampled i.i.d.
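For context, Aaronson's Gumbel scheme samples each token as an argmax of keyed pseudorandom uniforms raised to the power 1/p_i, and detection scores the generated tokens against the same key. The sketch below is a toy illustration of that baseline scheme with a simplified position-only keying; it is not the refined detector proposed in the paper:

```python
import hashlib
import math
import random

def keyed_uniforms(key, position, vocab_size):
    # Pseudorandom uniforms derived from a secret key and the position
    # (a toy stand-in for the scheme's keyed PRF over preceding tokens).
    rng = random.Random(hashlib.sha256(f"{key}:{position}".encode()).digest())
    return [rng.random() for _ in range(vocab_size)]

def gumbel_sample(probs, uniforms):
    # Gumbel trick: pick argmax of u_i^(1/p_i); marginally this samples
    # from probs while deterministically tying the choice to the key.
    return max(range(len(probs)),
               key=lambda i: uniforms[i] ** (1.0 / max(probs[i], 1e-12)))

def detection_score(key, tokens, vocab_size):
    # Sum of -log(1 - u_t): roughly 1 per token in expectation for
    # unwatermarked text, systematically larger for watermarked text.
    score = 0.0
    for pos, t in enumerate(tokens):
        u = keyed_uniforms(key, pos, vocab_size)[t]
        score += -math.log(1.0 - u)
    return score
```

The separation between watermarked and unwatermarked scores grows with text length, and the paper's contribution is a refined, problem-dependent near-optimal version of this detection step.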

[LG-1] Aligning Validation with Deployment: Target-Weighted Cross-Validation for Spatial Prediction

链接: https://arxiv.org/abs/2603.29981
作者: Alexander Brenning,Thomas Suesse
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Cross-validation (CV) is commonly used to estimate predictive risk when independent test data are unavailable. Its validity depends on the assumption that validation tasks are sampled from the same distribution as prediction tasks encountered during deployment. In spatial prediction and other settings with structured data, this assumption is frequently violated, leading to biased estimates of deployment risk. We propose Target-Weighted CV (TWCV), an estimator of deployment risk that accounts for discrepancies between validation and deployment task distributions, thus accounting for (1) covariate shift and (2) task-difficulty shift. We characterize prediction tasks by descriptors such as covariates and spatial configuration. TWCV assigns weights to validation losses such that the weighted empirical distribution of validation tasks matches the corresponding distribution over a target domain. The weights are obtained via calibration weighting, yielding an importance-weighted estimator that targets deployment risk. Since TWCV requires adequate coverage of the deployment distribution’s support, we combine it with spatially buffered resampling that diversifies the task difficulty distribution. In a simulation study, conventional as well as spatial estimators exhibit substantial bias depending on sampling, whereas buffered TWCV remains approximately unbiased across scenarios. A case study in environmental pollution mapping further confirms that discrepancies between validation and deployment task distributions can affect performance assessment, and that buffered TWCV better reflects the prediction task over the target domain. These results establish task distribution mismatch as a primary source of CV bias in spatial prediction and show that calibration weighting combined with a suitable validation task generator provides a viable approach to estimating predictive risk under dataset shift.
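A minimal numerical illustration of the reweighting idea, with a one-dimensional task descriptor and crude histogram-based density-ratio weights standing in for the paper's calibration weighting:

```python
import numpy as np

def target_weighted_cv(losses, descriptors, target_descriptors, bins=10):
    """Reweight per-task validation losses so the weighted empirical
    distribution of validation-task descriptors matches the deployment
    (target) distribution. Histogram ratios are a simple stand-in for
    the paper's calibration weighting."""
    edges = np.histogram_bin_edges(
        np.concatenate([descriptors, target_descriptors]), bins=bins)
    p_val, _ = np.histogram(descriptors, bins=edges, density=True)
    p_tgt, _ = np.histogram(target_descriptors, bins=edges, density=True)
    idx = np.clip(np.digitize(descriptors, edges) - 1, 0, bins - 1)
    w = np.where(p_val[idx] > 0,
                 p_tgt[idx] / np.maximum(p_val[idx], 1e-12), 0.0)
    w = w / w.sum()  # normalize into a weighted empirical distribution
    return float(np.sum(w * losses))
```

If validation tasks are easier on average than deployment tasks (e.g., deployment concentrates on high-descriptor, high-loss regions), the unweighted CV mean underestimates deployment risk while the weighted estimate tracks it.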

[LG-2] Meteorology-Driven GPT4AP: A Multi-Task Forecasting LLM for Atmospheric Air Pollution in Data-Scarce Settings

链接: https://arxiv.org/abs/2603.29974
作者: Prasanjit Dey,Soumyabrata Dev,Bianca Schoen-Phelan
类目: Machine Learning (cs.LG)
*备注: This manuscript is under review

点击查看摘要

Abstract:Accurate forecasting of air pollution is important for environmental monitoring and policy support, yet data-driven models often suffer from limited generalization in regions with sparse observations. This paper presents Meteorology-Driven GPT for Air Pollution (GPT4AP), a parameter-efficient multi-task forecasting framework based on a pre-trained GPT-2 backbone and Gaussian rank-stabilized low-rank adaptation (rsLoRA). The model freezes the self-attention and feed-forward layers and adapts lightweight positional and output modules, substantially reducing the number of trainable parameters. GPT4AP is evaluated on six real-world air quality monitoring datasets under few-shot, zero-shot, and long-term forecasting settings. In the few-shot regime using 10% of the training data, GPT4AP achieves an average MSE/MAE of 0.686/0.442, outperforming DLinear (0.728/0.530) and ETSformer (0.734/0.505). In zero-shot cross-station transfer, the proposed model attains an average MSE/MAE of 0.529/0.403, demonstrating improved generalization compared with existing baselines. In long-term forecasting with full training data, GPT4AP remains competitive, achieving an average MAE of 0.429, while specialized time-series models show slightly lower errors. These results indicate that GPT4AP provides a data-efficient forecasting approach that performs robustly under limited supervision and domain shift, while maintaining competitive accuracy in data-rich settings.

[LG-3] Think Anywhere in Code Generation

链接: https://arxiv.org/abs/2603.29957
作者: Xue Jiang,Tianyu Zhang,Ge Li,Mengyang Liu,Taozhi Chen,Zhenhua Xu,Binhua Li,Wenpin Jiao,Zhi Jin,Yongbin Li,Yihong Dong
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient as a problem’s full complexity only reveals itself during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process, where difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model’s autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance over both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.

[LG-4] Real-Time Explanations for Tabular Foundation Models ICLR2026

链接: https://arxiv.org/abs/2603.29946
作者: Luan Borges Teodoro Reis Sena,Francisco Galuppo Azevedo
类目: Machine Learning (cs.LG)
*备注: Accepted at the 2nd DATA4Science Workshop at ICLR 2026, Rio de Janeiro, Brazil. OpenReview: this https URL

点击查看摘要

Abstract:Interpretability is central for scientific machine learning, as understanding *why* models make predictions enables hypothesis generation and validation. While tabular foundation models show strong performance, existing explanation methods like SHAP are computationally expensive, limiting interactive exploration. We introduce ShapPFN, a foundation model that integrates Shapley value regression directly into its architecture, producing both predictions and explanations in a single forward pass. On standard benchmarks, ShapPFN achieves competitive performance while producing high-fidelity explanations (R² = 0.96, cosine = 0.99) over 1000× faster than KernelSHAP (0.06s vs 610s). Our code is available at this https URL
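For context on why the explanation cost matters here: exact Shapley attribution sums marginal contributions over all feature coalitions, which is exponential in the number of features; KernelSHAP approximates it, and ShapPFN amortizes it into the forward pass. A textbook exact computation in the baseline-replacement formulation (illustrative only, not ShapPFN's internals):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one prediction. Features outside a
    coalition S are replaced by baseline values; the cost is O(2^d), which
    is why approximations or amortized architectures are needed."""
    d = len(x)
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(d - k - 1) / factorial(d)
                x_S = [x[j] if j in S else baseline[j] for j in range(d)]
                x_Si = [x[j] if j in S or j == i else baseline[j] for j in range(d)]
                phi[i] += weight * (predict(x_Si) - predict(x_S))
    return phi
```

For a linear model the attributions reduce to coefficient times feature displacement, and the attributions always sum to the difference between the explained prediction and the baseline prediction (the efficiency property).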

[LG-5] Task Scarcity and Label Leakage in Relational Transfer Learning ICLR2026

链接: https://arxiv.org/abs/2603.29914
作者: Francisco Galuppo Azevedo,Clarissa Lima Loures,Denis Oliveira Correa
类目: Machine Learning (cs.LG)
*备注: Accepted at the 3rd DATA-FM Workshop at ICLR 2026, Rio de Janeiro, Brazil. OpenReview: this https URL

点击查看摘要

Abstract:Training relational foundation models requires learning representations that transfer across tasks, yet available supervision is typically limited to a small number of prediction targets per database. This task scarcity causes learned representations to encode task-specific shortcuts that degrade transfer even within the same schema, a problem we call label leakage. We study this using K-Space, a modular architecture combining frozen pretrained tabular encoders with a lightweight message-passing core. To suppress leakage, we introduce a gradient projection method that removes label-predictive directions from representation updates. On RelBench, this improves within-dataset transfer by +0.145 AUROC on average, often recovering near single-task performance. Our results suggest that limited task diversity, not just limited data, constrains relational foundation models.
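The gradient-projection step can be sketched in a few lines: given a basis of label-predictive directions (assumed here to have been estimated elsewhere), the representation update is replaced by its component orthogonal to that subspace. This illustrates the general operation, not the paper's estimator for the leaking directions:

```python
import numpy as np

def project_out(grad, leak_dirs):
    """Remove label-predictive ('leaking') directions from a
    representation update. grad: (d,) update vector; leak_dirs: (k, d)
    directions estimated to predict the training labels. Returns the
    component of grad orthogonal to every leak direction."""
    Q, _ = np.linalg.qr(leak_dirs.T)   # columns: orthonormal leak basis
    return grad - Q @ (Q.T @ grad)     # subtract the leaking component
```

After projection the update carries no first-order information along the label-predictive directions, which is the mechanism by which task-specific shortcuts are suppressed during pretraining.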

[LG-6] Mathematical Foundations of Modeling ETL Process Chains

链接: https://arxiv.org/abs/2603.29877
作者: Levin Maier,Lucas Schulze,Robert Lilow,Lukas Hahn,Nikola Krasowski,Arnulf Barth,Sebastian Gaebel,Ferdi Güran,Oliver Hanau,Giovanni Wagner,Falk Borgmann,Oleg Arenz,Jan Peters
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Databases (cs.DB); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 8 pages. Comments are welcome!

点击查看摘要

Abstract:Extract-Transform-Load (ETL) processes are core components of modern data processing infrastructures. The throughput of processed data records can be adjusted by changing the amount of allocated resources, i.e. the number of parallel processing threads for each of the three ETL phases, but also depends on stochastic variations in the per-record processing times. In chains of multiple consecutive ETL processes, the relation between allocated resources and overall throughput is further complicated, for example by the occurrence of bottlenecks affecting all subsequent ETL processes. We develop a mathematical model of ETL process chains that is accurate at the level of time-aggregated throughput and suitable for efficient simulation. The process chain is represented as a controlled discrete-time Markov process on a directed acyclic graph whose edges are individual ETL processes. We model the mean throughput as a bounded, monotone function of the number of parallel threads, to capture the diminishing benefit of allocating more threads. We furthermore introduce a Flow Balance postulate linking the number of threads, mean throughput, and mean processing time. The stochastic processing times are then modeled by non-negative heavy-tailed distributions around the mean processing time. This framework provides a principled simulator for ETL networks and a foundation for learning- and control-based resource allocation.
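The saturating-throughput and heavy-tailed-time ingredients can be sketched as follows; the specific functional form T(n) = T_max · n / (n + k) and the lognormal noise are illustrative assumptions, not the paper's calibrated model:

```python
import numpy as np

def mean_throughput(threads, t_max=1000.0, half_sat=4.0):
    """Bounded, monotone throughput (records/s) in the number of parallel
    threads, capturing diminishing returns: T(n) = t_max * n / (n + k)."""
    return t_max * threads / (threads + half_sat)

def simulate_stage(n_records, threads, sigma=0.5, rng=None):
    """Draw a heavy-tailed (lognormal) batch processing time whose mean
    respects the flow-balance relation: time = records / mean_throughput."""
    if rng is None:
        rng = np.random.default_rng(0)
    mean_t = n_records / mean_throughput(threads)
    mu = np.log(mean_t) - sigma ** 2 / 2   # lognormal with E[T] = mean_t
    return rng.lognormal(mu, sigma)
```

Chaining such stages along a DAG, with each stage's output feeding the next, gives a cheap simulator in which bottlenecks emerge from whichever stage has the lowest saturated throughput.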

[LG-7] DiSGMM: A Method for Time-varying Microscopic Weight Completion on Road Networks

链接: https://arxiv.org/abs/2603.29837
作者: Yan Lin,Jilin Hu,Shengnan Guo,Christian S. Jensen,Youfang Lin,Huaiyu Wan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Microscopic road-network weights represent fine-grained, time-varying traffic conditions obtained from individual vehicles. An example is travel speeds associated with road segments as vehicles traverse them. These weights support tasks including traffic microsimulation and vehicle routing with reliability guarantees. We study the problem of time-varying microscopic weight completion. During a time slot, the available weights typically cover only some road segments. Weight completion recovers distributions for the weights of every road segment at the current time slot. This problem involves two challenges: (i) contending with two layers of sparsity, where weights are missing at both the network layer (many road segments lack weights) and the segment layer (a segment may have insufficient weights to enable accurate distribution estimation); and (ii) achieving a weight distribution representation that is closed-form and can capture complex conditions flexibly, including heavy tails and multiple clusters. To address these challenges, we propose DiSGMM that combines sparsity-aware embeddings with spatiotemporal modeling to leverage sparse known weights alongside learned segment properties and long-range correlations for distribution estimation. DiSGMM represents distributions of microscopic weights as learnable Gaussian mixture models, providing closed-form distributions capable of capturing complex conditions flexibly. Experiments on two real-world datasets show that DiSGMM can outperform state-of-the-art methods.
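The closed-form representation is standard Gaussian-mixture machinery; a one-dimensional sketch of evaluating such a density, with the learned per-segment parameters assumed given, shows why it is both cheap and flexible (multiple modes, adjustable tails):

```python
import math

def gmm_pdf(x, weights, means, stds):
    """Closed-form density of a 1-D Gaussian mixture, the kind of
    per-segment weight distribution a DiSGMM-style model outputs:
    flexible enough for multiple clusters, cheap to evaluate."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in zip(weights, means, stds))
```

A bimodal mixture (e.g., free-flow vs. congested travel speeds on one segment) keeps a low density between its two modes, something a single Gaussian cannot represent.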

[LG-8] Curvature-Guided LoRA: Steering in the pretrained NTK subspace

链接: https://arxiv.org/abs/2603.29824
作者: Frédéric Zheng,Alexandre Proutière
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Parameter-efficient fine-tuning methods such as LoRA enable efficient adaptation of large pretrained models but often fall short of full fine-tuning performance. Existing approaches focus on aligning parameter updates, which only indirectly control model predictions. In this work, we introduce the prediction alignment problem, aiming to match the predictor obtained via PEFT to that of full fine-tuning at the level of outputs. We show that this objective naturally leads to a curvature-aware, second-order formulation, where optimal low-rank updates correspond to a Newton-like, curvature-whitened gradient. Based on this insight, we propose Curvature-Guided LoRA (CG-LoRA), which selects and scales adaptation directions using local curvature information. Our method is computationally efficient and avoids explicit second-order matrix construction. Preliminary experiments on standard natural language understanding benchmarks demonstrate improved performance and faster convergence compared to existing LoRA variants.
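The curvature-whitened-gradient idea can be sketched as preconditioning the gradient by a cheap curvature estimate and then extracting a low-rank subspace from the result. The elementwise diagonal preconditioner and SVD truncation below are illustrative choices, not the paper's exact construction (which avoids explicit second-order matrices):

```python
import numpy as np

def curvature_guided_directions(grad, curvature, rank):
    """Whiten a weight-matrix gradient elementwise by a curvature estimate
    (e.g., a Fisher-diagonal stand-in reshaped to the weight shape), then
    keep the top-`rank` SVD directions as the low-rank adaptation
    subspace. A Newton-like sketch of the curvature-guided idea."""
    G = grad / np.maximum(curvature, 1e-8)    # curvature-whitened gradient
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]  # factors A (m, r), B (r, n)
```

The returned pair plays the role of a curvature-aware initialization and scaling of the LoRA factors: directions with low curvature (flat, high-leverage) are amplified relative to sharp ones before truncation.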

[LG-9] Loss Gap Parity for Fairness in Heterogeneous Federated Learning AISTATS2026

链接: https://arxiv.org/abs/2603.29818
作者: Brahim Erraji,Michaël Perrot,Aurélien Bellet
类目: Machine Learning (cs.LG)
*备注: 9 Pages, Published to AISTATS 2026

点击查看摘要

Abstract:While clients may join federated learning to improve performance on data they rarely observe locally, they often remain self-interested, expecting the global model to perform well on their own data. This motivates an objective that ensures all clients achieve a similar loss gap, i.e., the difference in performance between the global model and the best model they could train using only their local data. To this end, we propose EAGLE, a novel federated learning algorithm that explicitly regularizes the global model to minimize disparities in loss gaps across clients. Our approach is particularly effective in heterogeneous settings, where the optimal local models of the clients may be misaligned. Unlike existing methods that encourage loss parity, potentially degrading performance for many clients, EAGLE targets fairness in relative improvements. We provide theoretical convergence guarantees for EAGLE under non-convex loss functions, and characterize how its iterates perform relative to the standard federated learning objective using a novel heterogeneity measure. Empirically, we demonstrate that EAGLE reduces the disparity in loss gaps among clients by prioritizing those furthest from their local optimal loss, while maintaining competitive utility in both convex and non-convex cases compared to strong baselines.

[LG-10] AMShortcut: An Inference- and Training-Efficient Inverse Design Model for Amorphous Materials

链接: https://arxiv.org/abs/2603.29812
作者: Yan Lin,Jonas A. Finkler,Tao Du,Jilin Hu,Morten M. Smedskjaer
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Amorphous materials are solids that lack long-range atomic order but possess complex short- and medium-range order. Unlike crystalline materials that can be described by unit cells containing few up to hundreds of atoms, amorphous materials require larger simulation cells with at least hundreds or often thousands of atoms. Inverse design of amorphous materials with probabilistic generative models aims to generate the atomic positions and elements of amorphous materials given a set of desired properties. It has emerged as a promising approach for facilitating the application of amorphous materials in domains such as energy storage and thermal management. In this paper, we introduce AMShortcut, an inference- and training-efficient probabilistic generative model for amorphous materials. AMShortcut enables accurate inference of diverse short- and medium-range structures in amorphous materials with only a few sampling steps, mitigating the need for an excessive number of sampling steps that hinders inference efficiency. AMShortcut can be trained once with all relevant properties and perform inference conditioned on arbitrary combinations of desired properties, mitigating the need for training one model for each combination. Experiments on three amorphous materials datasets with diverse structures and properties demonstrate that AMShortcut achieves its design goals.

[LG-11] Multimodal Machine Learning for Early Prediction of Metastasis in a Swedish Multi-Cancer Cohort

链接: https://arxiv.org/abs/2603.29793
作者: Franco Rugolon,Korbinian Randl,Braslav Jovanovic,Ioanna Miliou,Panagiotis Papapetrou
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Multimodal Machine Learning offers a holistic view of a patient’s status, integrating structured and unstructured data from electronic health records (EHR). We propose a framework to predict metastasis risk one month prior to diagnosis, using six months of clinical history from EHR data. Data from four cancer cohorts collected at Karolinska University Hospital (Stockholm, Sweden) were analyzed: breast (n = 743), colon (n = 387), lung (n = 870), and prostate (n = 1890). The dataset included demographics, comorbidities, laboratory results, medications, and clinical text. We compared traditional and deep learning classifiers across single modalities and multimodal combinations, using various fusion strategies and a Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) 2a design, with an 80-20 development-validation split to ensure a rigorous, repeatable evaluation. Performance was evaluated using AUROC, AUPRC, F1 score, sensitivity, and specificity. We then employed a multimodal adaptation of SHAP to analyze the classifiers’ reasoning. Intermediate fusion achieved the highest F1 scores on breast (0.845), colon (0.786), and prostate cancer (0.845), demonstrating strong predictive performance. For lung cancer, the intermediate fusion achieved an F1 score of 0.819, while the text-only model achieved the highest, with an F1 score of 0.829. Deep learning classifiers consistently outperformed traditional models. Colon cancer, the smallest cohort, had the lowest performance, highlighting the importance of sufficient training data. SHAP analysis showed that the relative importance of modalities varied across cancer types. Fusion strategies offer distinct strengths and weaknesses. Intermediate fusion consistently delivered the best results, but strategy choices should align with data characteristics and organizational needs.
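"Intermediate fusion" here means encoding each modality separately and combining the embeddings before a shared prediction head. A schematic sketch with toy linear-plus-tanh encoders (the study uses learned deep encoders; the modality names are illustrative):

```python
import numpy as np

def intermediate_fusion(modalities, encoders, head_w, head_b=0.0):
    """Intermediate fusion: encode each modality separately, concatenate
    the embeddings, and apply a shared head that outputs a risk
    probability. Toy linear+tanh encoders stand in for deep networks."""
    z = np.concatenate([np.tanh(E @ x) for E, x in zip(encoders, modalities)])
    return 1.0 / (1.0 + np.exp(-(head_w @ z + head_b)))  # sigmoid risk
```

Compared with early fusion (concatenating raw inputs) and late fusion (averaging per-modality predictions), this design lets each modality keep its own representation while still training the head jointly, which is consistent with its strong results across the cohorts.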

[LG-12] Big2Small: A Unifying Neural Network Framework for Model Compression

链接: https://arxiv.org/abs/2603.29768
作者: Jing-Xiao Liao,Haoran Wang,Tao Li,Daoming Lyu,Yi Zhang,Chengjun Cai,Feng-Lei Fan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the development of foundational models, model compression has become a critical requirement. Various model compression approaches have been proposed such as low-rank decomposition, pruning, quantization, ergodic dynamic systems, and knowledge distillation, which are based on different heuristics. To elevate the field from fragmentation to a principled discipline, we construct a unifying mathematical framework for model compression grounded in measure theory. We further demonstrate that each model compression technique is mathematically equivalent to a neural network subject to a regularization. Building upon this mathematical and structural equivalence, we propose an experimentally-verified data-free model compression framework, termed *Big2Small*, which translates Implicit Neural Representations (INRs) from the data domain to the domain of network parameters. *Big2Small* trains compact INRs to encode the weights of larger models and reconstruct the weights during inference. To enhance reconstruction fidelity, we introduce Outlier-Aware Preprocessing to handle extreme weight values and a Frequency-Aware Loss function to preserve high-frequency details. Experiments on image classification and segmentation demonstrate that *Big2Small* achieves competitive accuracy and compression ratios compared to state-of-the-art baselines.

[LG-13] One-for-All: A Lightweight Stabilized and Parameter-Efficient Pre-trained LLM for Time Series Forecasting

链接: https://arxiv.org/abs/2603.29756
作者: Prasanjit Dey,Soumyabrata Dev,Bianca Schoen-Phelan
类目: Machine Learning (cs.LG)
*备注: This manuscript is currently under review at IEEE Transactions on Knowledge and Data Engineering (TKDE)

点击查看摘要

Abstract:We address the challenge of adapting pre-trained Large Language Models (LLMs) for multivariate time-series analysis, where their deployment is often hindered by prohibitive computational and memory demands. Our solution, One-for-All, introduces Gaussian Rank-Stabilized Low-Rank Adapters (rsLoRA) to enable parameter-efficient fine-tuning of frozen LLMs. While inspired by LoRA, rsLoRA introduces a mathematically grounded rank-stabilization mechanism that enables provable gradient stability at low ranks, a novel contribution absent in prior PEFT methods. Our framework injects trainable rank decomposition matrices (rank 16) into positional embeddings and output layers, while keeping self-attention weights fixed. This design reduces trainable parameters by 6.8× (vs. TimesNet), 21× (vs. GPT4TS), and 11.8× (vs. TIME-LLM), while achieving a 168-1,776× smaller memory footprint (2.2MiB vs 340MiB-4.18GiB in SOTA models). Rigorous evaluation across six time-series tasks demonstrates that One-for-All achieves state-of-the-art efficiency-accuracy trade-offs: 5.5× higher parameter efficiency (MSE=5.50) than TimesNet and 21× better than GPT4TS, while matching their forecasting accuracy (MSE=0.33). The framework’s stability is validated through consistent performance across diverse horizons (96-720 steps) and datasets (ETT, Weather, M3, M4), with 98.3% fewer parameters than conventional transformers. These advances enable deployment on edge devices for healthcare, finance, and environmental monitoring without compromising performance.
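The rank-stabilization in rsLoRA amounts to scaling the low-rank update by α/√r instead of LoRA's α/r, so the update's magnitude does not shrink as the rank grows. A minimal sketch of that forward pass (generic LoRA/rsLoRA mechanics, not this paper's full adapter placement):

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha=16.0, rank_stabilized=True):
    """Frozen weight plus a scaled low-rank update a @ b. Standard LoRA
    scales by alpha/r; rank-stabilized LoRA (rsLoRA) scales by
    alpha/sqrt(r), keeping the update stable as the rank r grows."""
    r = a.shape[1]
    scale = alpha / np.sqrt(r) if rank_stabilized else alpha / r
    return x @ (w_frozen + scale * (a @ b))
```

With the conventional Gaussian-initialized down-projection and zero-initialized up-projection, the adapter starts as an exact no-op on the frozen weights, and at rank 16 the rank-stabilized scale is √16 = 4 times larger than the standard one.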

[LG-14] HyperKKL: Learning KKL Observers for Non-Autonomous Nonlinear Systems via Hypernetwork-Based Input Conditioning

链接: https://arxiv.org/abs/2603.29744
作者: Yahia Salaheldin Shaaban,Abdelrahman Sayed Sayed,M. Umar B. Niazi,Karl Henrik Johansson
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, submitted to IEEE Conference on Decision and Control 2026

点击查看摘要

Abstract:Kazantzis-Kravaris/Luenberger (KKL) observers are a class of state observers for nonlinear systems that rely on an injective map to transform the nonlinear dynamics into a stable quasi-linear latent space, from where the state estimate is obtained in the original coordinates via a left inverse of the transformation map. Current learning-based methods for these maps are designed exclusively for autonomous systems and do not generalize well to controlled or non-autonomous systems. In this paper, we propose two learning-based designs of neural KKL observers for non-autonomous systems whose dynamics are influenced by exogenous inputs. To this end, a hypernetwork-based framework (HyperKKL) is proposed with two input-conditioning strategies. First, an augmented observer approach (HyperKKL_obs) adds input-dependent corrections to the latent observer dynamics while retaining static transformation maps. Second, a dynamic observer approach (HyperKKL_dyn) employs a hypernetwork to generate encoder and decoder weights that are input-dependent, yielding time-varying transformation maps. We derive a theoretical worst-case bound on the state estimation error. Numerical evaluations on four nonlinear benchmark systems show that input conditioning yields consistent improvements in estimation accuracy over static autonomous maps, with an average symmetric mean absolute percentage error (SMAPE) reduction of 29% across all non-zero input regimes.
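The hypernetwork ingredient can be sketched as a small network that emits the encoder's weights as a function of the exogenous input, which is what makes the transformation map input-dependent rather than static. A toy linear hypernetwork (illustrative only, not the paper's architecture):

```python
import numpy as np

def make_hypernetwork(input_dim, target_shape, rng=None):
    """A toy hypernetwork: a fixed linear map from an exogenous-input
    feature vector to the flattened weights of a target (encoder) layer,
    making the transformation map input-dependent."""
    if rng is None:
        rng = np.random.default_rng(0)
    h = rng.normal(0.0, 0.1, size=(int(np.prod(target_shape)), input_dim))
    return lambda u: (h @ u).reshape(target_shape)

def conditioned_encoder(x, u, generate):
    w = generate(u)            # encoder weights generated from the input u
    return np.tanh(w @ x)      # input-conditioned transformation of the state
```

Different exogenous inputs u thus produce different encoder weights and hence time-varying transformation maps, as in the HyperKKL_dyn variant; the HyperKKL_obs variant instead keeps the maps static and corrects only the latent dynamics.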

[LG-15] Nonnegative Matrix Factorization in the Component-Wise L1 Norm for Sparse Data

链接: https://arxiv.org/abs/2603.29715
作者: Giovanni Seraghiti,Kévin Dubrulle,Arnaud Vandaele,Nicolas Gillis
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 21 pages before supplementary, code available from this https URL

点击查看摘要

Abstract:Nonnegative matrix factorization (NMF) approximates a nonnegative matrix, X, by the product of two nonnegative factors, WH, where W has r columns and H has r rows. In this paper, we consider NMF using the component-wise L1 norm as the error measure (L1-NMF), which is suited for data corrupted by heavy-tailed noise, such as Laplace noise or salt and pepper noise, or in the presence of outliers. Our first contribution is an NP-hardness proof for L1-NMF, even when r = 1, in contrast to the standard NMF that uses least squares. Our second contribution is to show that L1-NMF strongly enforces sparsity in the factors for sparse input matrices, thereby favoring interpretability. However, if the data is affected by false zeros, too sparse solutions might degrade the model. Our third contribution is a new, more general, L1-NMF model for sparse data, dubbed weighted L1-NMF (wL1-NMF), where the sparsity of the factorization is controlled by adding a penalization parameter to the entries of WH associated with zeros in the data. The fourth contribution is a new coordinate descent (CD) approach for wL1-NMF, denoted as sparse CD (sCD), where each subproblem is solved by a weighted median algorithm. To the best of our knowledge, sCD is the first algorithm for L1-NMF whose complexity scales with the number of nonzero entries in the data, making it efficient in handling large-scale, sparse data. We perform extensive numerical experiments on synthetic and real-world data to show the effectiveness of our new proposed model (wL1-NMF) and algorithm (sCD).
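The coordinate-descent subproblems reduce to weighted medians: the minimizer of the one-dimensional objective Σᵢ wᵢ|x − vᵢ| is the point where the cumulative weight first reaches half of the total. A minimal implementation of that standard subroutine (how sCD batches these over the nonzeros is the paper's contribution, not shown here):

```python
def weighted_median(values, weights):
    """Minimizer of sum_i w_i * |x - v_i|: sort the values and return the
    first one at which the cumulative weight reaches half the total.
    This is the per-coordinate subproblem in L1-norm factorization."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
```

With uniform weights this is the ordinary median; skewing a weight upward pulls the solution toward that value, which is how the wL1-NMF penalization on zero entries steers the factors.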

[LG-16] Disentangled Graph Prompting for Out-Of-Distribution Detection

链接: https://arxiv.org/abs/2603.29644
作者: Cheng Yang,Yu Hao,Qi Zhang,Chuan Shi
类目: Machine Learning (cs.LG)
*备注: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering (TKDE)

点击查看摘要

Abstract:When testing data and training data come from different distributions, deep neural networks (DNNs) will face significant safety risks in practical applications. Therefore, out-of-distribution (OOD) detection techniques, which can identify OOD samples at test time and alert the system, are urgently needed. Existing graph OOD detection methods usually characterize fine-grained in-distribution (ID) patterns from multiple perspectives, and train end-to-end graph neural networks (GNNs) for prediction. However, due to the unavailability of OOD data during training, the absence of explicit supervision signals could lead to sub-optimal performance of end-to-end encoders. To address this issue, we follow the pre-training+prompting paradigm to utilize pre-trained GNN encoders, and propose Disentangled Graph Prompting (DGP), to capture fine-grained ID patterns with the help of ID graph labels. Specifically, we design two prompt generators that respectively generate class-specific and class-agnostic prompt graphs by modifying the edge weights of an input graph. We also design several effective losses to train the prompt generators and prevent trivial solutions. We conduct extensive experiments on ten datasets to demonstrate the superiority of our proposed DGP, which achieves a relative AUC improvement of 3.63% over the best graph OOD detection baseline. Ablation studies and hyper-parameter experiments further show the effectiveness of DGP. Code is available at this https URL.

[LG-17] The Geometry of Polynomial Group Convolutional Neural Networks

链接: https://arxiv.org/abs/2603.29566
作者: Yacoub Hendi,Daniel Persson,Magdalena Larfors
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注: 22 pages, 2 figures

点击查看摘要

Abstract:We study polynomial group convolutional neural networks (PGCNNs) for an arbitrary finite group G . In particular, we introduce a new mathematical framework for PGCNNs using the language of graded group algebras. This framework yields two natural parametrizations of the architecture, based on Hadamard and Kronecker products, related by a linear map. We compute the dimension of the associated neuromanifold, verifying that it depends only on the number of layers and the size of the group. We also describe the general fiber of the Kronecker parametrization up to the regular group action and rescaling, and conjecture the analogous description for the Hadamard parametrization. Our conjecture is supported by explicit computations for small groups and shallow networks.

[LG-18] Total Variation Guarantees for Sampling with Stochastic Localization

链接: https://arxiv.org/abs/2603.29555
作者: Jakob Kellermann
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: 12 pages main body, 13 pages Appendix

点击查看摘要

Abstract:Motivated by the success of score-based generative models, a number of diffusion-based algorithms have recently been proposed for the problem of sampling from a probability measure whose unnormalized density can be accessed. Among them, Grenioux et al. introduced SLIPS, a sampling algorithm based on Stochastic Localization. While SLIPS exhibits strong empirical performance, no rigorous convergence analysis has previously been provided. In this work, we close this gap by establishing the first guarantee for SLIPS in total variation distance. Under minimal assumptions on the target, our bound implies that the number of steps required to achieve an \varepsilon -guarantee scales linearly with the dimension, up to logarithmic factors. The analysis leverages techniques from the theory of score-based generative models and further provides theoretical insights into the empirically observed optimal choice of discretization points.

[LG-19] Capturing Multivariate Dependencies of EV Charging Events: From Parametric Copulas to Neural Density Estimation

链接: https://arxiv.org/abs/2603.29554
作者: Martin Výboh,Gabriela Grmanová
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures. Submitted to IEEE PES ISGT Europe 2026

点击查看摘要

Abstract:Accurate event-based modeling of electric vehicle (EV) charging is essential for grid reliability and smart-charging design. While traditional statistical methods capture marginal distributions, they often fail to model the complex, non-linear dependencies between charging variables, specifically arrival times, durations, and energy demand. This paper addresses this gap by introducing the first application of Vine copulas and Copula Density Neural Estimation framework (CODINE) to the EV domain. We evaluate these high-capacity dependence models across three diverse real-world datasets. Our results demonstrate that by explicitly focusing on modeling the joint dependence structure, Vine copulas and CODINE outperform established parametric families and remain highly competitive against state-of-the-art benchmarks like conditional Gaussian Mixture Model Networks. We show that these methods offer superior preservation of tail behaviors and correlation structures, providing a robust framework for synthetic charging event generation in varied infrastructure contexts.
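
As a toy illustration of copula-based event generation (a plain bivariate Gaussian copula with made-up marginals, far simpler than the Vine copulas and CODINE studied in the paper; variable names and parameters are illustrative only):

```python
import numpy as np
from math import erf, sqrt

def gaussian_copula_sample(corr, n, rng):
    """Draw n samples of uniforms on [0,1]^d whose dependence structure
    is a Gaussian copula with correlation matrix `corr`."""
    z = rng.standard_normal((n, corr.shape[0])) @ np.linalg.cholesky(corr).T
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    return phi(z)  # the standard normal CDF maps each margin to Uniform(0,1)

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])  # e.g. charging duration vs. energy demand
u = gaussian_copula_sample(corr, 20000, rng)

# plug in illustrative marginals via inverse CDFs; the copula supplies only
# the dependence, the marginals stay whatever we choose for them
duration = -2.0 * np.log(1.0 - u[:, 0])           # Exponential(mean = 2 h)
energy = 15.0 * (-np.log(1.0 - u[:, 1])) ** 0.5   # Weibull(k=2, scale 15 kWh)

print(np.corrcoef(duration, energy)[0, 1])  # positive, inherited from the copula
```

Because dependence and marginals are decoupled, the same copula can be reused across infrastructure contexts while the marginals are refit per site.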

[LG-20] Learning Surrogate LPV State-Space Models with Uncertainty Quantification

链接: https://arxiv.org/abs/2603.29532
作者: E. Javier Olucha,Valentin Preda,Amritam Das,Roland Tóth
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Preprint submitted to the 65th IEEE Conference on Decision and Control

点击查看摘要

Abstract:The Linear Parameter-Varying (LPV) framework enables the construction of surrogate models of complex nonlinear and high-dimensional systems, facilitating efficient stability and performance analysis together with controller design. Despite significant advances in data-driven LPV modelling, existing approaches do not quantify the uncertainty of the obtained LPV models. Consequently, assessing model reliability for analysis and control or detecting operation outside the training regime requires extensive validation and user expertise. This paper proposes a Bayesian approach for the joint estimation of LPV state-space models together with their scheduling, providing a characterization of model uncertainty and confidence bounds on the predicted model response directly from input-output data. Both aleatoric uncertainty due to measurement noise and epistemic uncertainty arising from limited training data and structural bias are considered. The resulting model preserves the LPV structure required for controller synthesis while enabling computationally efficient simulation and uncertainty propagation. The approach is demonstrated on the surrogate modelling of a two-dimensional nonlinear interconnection of mass-spring-damper systems.

[LG-21] Variational Graph Neural Networks for Uncertainty Quantification in Inverse Problems

链接: https://arxiv.org/abs/2603.29515
作者: David Gonzalez,Alba Muixi,Beatriz Moya,Elias Cueto
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasingly wide use of deep machine learning techniques in computational mechanics has significantly accelerated simulations of problems that were considered unapproachable just a few years ago. However, in critical applications such as Digital Twins for engineering or medicine, fast responses are not enough; reliable results must also be provided. In certain cases, traditional deterministic methods may not be optimal as they do not provide a measure of confidence in their predictions or results, especially in inverse problems where the solution may not be unique or the initial data may not be entirely reliable due to the presence of noise, for instance. Classic deep neural networks also lack a clear measure to quantify the uncertainty of their predictions. In this work, we present a variational graph neural network (VGNN) architecture that integrates variational layers into its architecture to model the probability distribution of weights. Unlike computationally expensive full Bayesian networks, our approach strategically introduces variational layers exclusively in the decoder, allowing us to estimate cognitive uncertainty and statistical uncertainty at a relatively lower cost. In this work, we validate the proposed methodology in two cases of solid mechanics: the identification of the value of the elastic modulus with nonlinear distribution in a 2D elastic problem and the location and quantification of the loads applied to a 3D hyperelastic beam, in both cases using only the displacement field of each test as input data. The results show that the model not only recovers the physical parameters with high precision, but also provides confidence intervals consistent with the physics of the problem, as well as being able to locate the position of the applied load and estimate its value, giving a confidence interval for that experiment. 

[LG-22] Model Predictive Path Integral PID Control for Learning-Based Path Following

链接: https://arxiv.org/abs/2603.29499
作者: Teruki Kato,Koshi Oishi,Seigo Ito
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: Submitted to IFAC Journal of Systems and Control

点击查看摘要

Abstract:Classical proportional–integral–derivative (PID) control is widely employed in industrial applications; however, achieving higher performance often motivates the adoption of model predictive control (MPC). Although gradient-based methods are the standard for real-time optimization, sampling-based approaches have recently gained attention. In particular, model predictive path integral (MPPI) control enables gradient-free optimization and accommodates non-differentiable models and objective functions. However, directly sampling control input sequences may yield discontinuous inputs and increase the optimization dimensionality in proportion to the prediction horizon. This study proposes MPPI–PID control, which applies MPPI to optimize PID gains at each control step, thereby replacing direct high-dimensional input-sequence optimization with low-dimensional gain-space optimization. This formulation enhances sample efficiency and yields smoother inputs via the PID structure. We also provide theoretical insights, including an information-theoretic interpretation that unifies MPPI and MPPI–PID, an analysis of the effect of optimization dimensionality on sample efficiency, and a characterization of input continuity induced by the PID structure. The proposed method is evaluated on the learning-based path following of a mini forklift using a residual-learning dynamics model that integrates a physical model with a neural network. System identification is performed with real driving data. Numerical path-following experiments demonstrate that MPPI–PID improves tracking performance compared with fixed-gain PID and achieves performance comparable to conventional MPPI while significantly reducing input increments. Furthermore, the proposed method maintains favorable performance even with substantially fewer samples, demonstrating its improved sample efficiency.
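
The gain-space MPPI idea above can be sketched on a toy plant (a double integrator, standing in for the forklift model; costs, hyper-parameters, and the rollout are illustrative, not from the paper):

```python
import numpy as np

def pid_rollout_cost(gains, ref=1.0, dt=0.02, steps=200):
    """Tracking cost of a PID controller (kp, ki, kd) on a toy
    double-integrator plant x'' = u, simulated with forward Euler."""
    kp, ki, kd = gains
    x = v = integ = 0.0
    prev_err = ref - x
    cost = 0.0
    for _ in range(steps):
        err = ref - x
        integ += err * dt
        deriv = (err - prev_err) / dt
        prev_err = err
        u = kp * err + ki * integ + kd * deriv
        v += u * dt
        x += v * dt
        cost += err * err * dt
    return cost

def mppi_gain_step(gains, rng, n_samples=256, sigma=0.5, lam=0.05):
    """One MPPI update in the low-dimensional gain space: sample candidate
    gains, weight each rollout by exp(-cost / lambda), and average."""
    cand = np.clip(gains + sigma * rng.standard_normal((n_samples, 3)), 0.0, None)
    costs = np.array([pid_rollout_cost(g) for g in cand])
    w = np.exp(-(costs - costs.min()) / lam)  # information-theoretic weights
    w /= w.sum()
    return w @ cand

rng = np.random.default_rng(0)
gains = np.array([1.0, 0.0, 0.0])  # start from a pure-P controller
for _ in range(5):
    gains = mppi_gain_step(gains, rng)
print(gains, pid_rollout_cost(gains))  # cost drops below the P-only rollout
```

Optimizing three gains instead of a 200-step input sequence is exactly the dimensionality reduction the paper exploits, and the PID structure keeps the resulting inputs smooth.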

[LG-23] Why not to use Cosine Similarity between Label Representations

链接: https://arxiv.org/abs/2603.29488
作者: Beatrix M. G. Nielsen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cosine similarity is often used to measure the similarity of vectors. These vectors might be the representations of neural network models. However, it is not guaranteed that the cosine similarity of model representations will tell us anything about model behaviour. In this paper we show that when using a softmax classifier, be it an image classifier or an autoregressive language model, measuring the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that for any softmax classifier model, given two label representations, it is possible to make another model which gives the same probabilities for all labels and inputs, but where the cosine similarity between the representations is now either 1 or -1. We give specific examples of models with very high or low cosine similarity between representations and show how we can make equivalent models where the cosine similarity is now -1 or 1. This translation ambiguity can be fixed by centering the label representations; however, labels whose representations have low cosine similarity can still have high probability for the same inputs. Fixing the length of the representations still does not guarantee that high (or low) cosine similarity corresponds to high (or low) probability for the labels on the same inputs. This means that when working with softmax classifiers, cosine similarity values between label representations should not be used to explain model probabilities.
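
The translation ambiguity is easy to verify numerically: shifting every unembedding by the same vector adds the same constant to every logit, leaving the softmax output unchanged while moving the cosine similarity wherever we like. A self-contained sketch (illustrative dimensions and random weights):

```python
import numpy as np

def softmax_probs(U, x):
    """Label probabilities of a softmax classifier with unembedding matrix U
    (one row per label) applied to feature vector x."""
    logits = U @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
U = rng.standard_normal((2, 4))  # two labels with 4-dimensional unembeddings
x = rng.standard_normal(4)

# shifting every unembedding by the same vector t adds the same constant
# t @ x to every logit, so the predicted probabilities cannot change ...
t = 100.0 * (U[0] + U[1])
U_shifted = U + t

p, p_shifted = softmax_probs(U, x), softmax_probs(U_shifted, x)
print(np.allclose(p, p_shifted))  # -> True
# ... but the cosine similarity between the two label representations is
# pushed toward +1 (shifting by minus their mean instead makes it exactly -1)
print(cosine(U[0], U[1]), cosine(U_shifted[0], U_shifted[1]))
```

The centered variant follows the same pattern: subtracting the mean of the two rows makes them exactly opposite, so their cosine similarity is -1 while the probabilities are still identical.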

[LG-24] Survival In-Context: Prior-fitted In-context Learning Tabular Foundation Model for Survival Analysis

链接: https://arxiv.org/abs/2603.29475
作者: Dmitrii Seletkov,Paul Hager,Rickmer Braren,Daniel Rueckert,Raphael Rehms
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Survival analysis is crucial for many medical applications but remains challenging for modern machine learning due to limited data, censoring, and the heterogeneity of tabular covariates. While the prior-fitted paradigm, which relies on pretraining models on large collections of synthetic datasets, has recently facilitated tabular foundation models for classification and regression, its suitability for time-to-event modeling remains unclear. We propose a flexible survival data generation framework that defines a rich survival prior with explicit control over covariates and time-event distributions. Building on this prior, we introduce Survival In-Context (SIC), a prior-fitted in-context learning model for survival analysis that is pretrained exclusively on synthetic data. SIC produces individualized survival prediction in a single forward pass, requiring no task-specific training or hyperparameter tuning. Across a broad evaluation on real-world survival datasets, SIC achieves competitive or superior performance compared to classical and deep survival models, particularly in medium-sized data regimes, highlighting the promise of prior-fitted foundation models for survival analysis. The code will be made available upon publication.

[LG-25] From Big Data to Fast Data: Towards High-Quality Datasets for Machine Learning Applications from Closed-Loop Data Collection

链接: https://arxiv.org/abs/2603.29474
作者: Philipp Reis,Jacqueline Henle,Stefan Otten,Eric Sax
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted to IEEE ISSE 2026

点击查看摘要

Abstract:The increasing capabilities of machine learning models, such as vision-language and multimodal language models, are placing growing demands on data in automotive systems engineering, making the quality and relevance of collected data enablers for the development and validation of such systems. Traditional Big Data approaches focus on large-scale data collection and offline processing, while Smart Data approaches improve data selection strategies but still rely on centralized and offline post-processing. This paper introduces the concept of Fast Data for automotive systems engineering. The approach shifts data selection and recording onto the vehicle as the data source. By enabling real-time, context-aware decisions on whether and which data should be recorded, data collection can be directly aligned with data quality objectives and collection strategies within a closed-loop. This results in datasets with higher relevance, improved coverage of critical scenarios, and increased information density, while at the same time reducing irrelevant data and associated costs. The proposed approach provides a structured foundation for designing data collection strategies that are aligned with the needs of modern machine learning algorithms. It supports efficient data acquisition and contributes to scalable and cost-effective ML development processes in automotive systems engineering.

[LG-26] mtslearn: Machine Learning in Python for Medical Time Series

链接: https://arxiv.org/abs/2603.29432
作者: Zhongheng Jiang,Yuechao Zhao,Donglin Xie,Chenxi Sun,Rongchen Lu,Silu Luo,Zisheng Liang,Shenda Hong
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Medical time-series data captures the dynamic progression of patient conditions, playing a vital role in modern clinical decision support systems. However, real-world clinical data is highly heterogeneous and inconsistently formatted. Furthermore, existing machine learning tools often have steep learning curves and fragmented workflows. Consequently, a significant gap remains between cutting-edge AI technologies and clinical application. To address this, we introduce mtslearn, an end-to-end integrated toolkit specifically designed for medical time-series data. First, the framework provides a unified data interface that automates the parsing and alignment of wide, long, and flat data formats. This design significantly reduces data cleaning overhead. Building on this, mtslearn provides a complete pipeline from data reading and feature engineering to model training and result visualization. Furthermore, it offers flexible interfaces for custom algorithms. Through a modular design, mtslearn simplifies complex data engineering tasks into a few lines of code. This significantly lowers the barrier to entry for clinicians with limited programming experience, empowering them to focus more on exploring medical hypotheses and accelerating the translation of advanced algorithms into real-world clinical practice. mtslearn is publicly available at this https URL.
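
The sliding-window supervision described above can be sketched generically as follows (illustrative numpy code, not the mtslearn API):

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Turn a 1-D series into (X, y) pairs: each row of X holds `window`
    consecutive values and y is the value `horizon` steps after the window."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X), np.array(y)

# toy heart-rate trace; any regularly sampled clinical signal works the same
hr = np.array([72, 74, 71, 75, 78, 80, 77, 79], dtype=float)
X, y = sliding_windows(hr, window=3)
print(X.shape, y.shape)  # -> (5, 3) (5,)
print(X[0], y[0])        # -> [72. 74. 71.] 75.0
```

The resulting (X, y) arrays feed directly into any tabular learner, which is the kind of data-engineering step such toolkits reduce to a few lines.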

[LG-27] Multi-AUV Cooperative Target Tracking Based on Supervised Diffusion-Aided Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2603.29426
作者: Jiaao Ma,Chuan Lin,Guangjie Han,Shengchao Zhu,Zhenyu Wang,Chen An
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, advances in underwater networking and multi-agent reinforcement learning (MARL) have significantly expanded multi-autonomous underwater vehicle (AUV) applications in marine exploration and target tracking. However, current MARL-driven cooperative tracking faces three critical challenges: 1) non-stationarity in decentralized coordination, where local policy updates destabilize teammates’ observation spaces, preventing convergence; 2) sparse-reward exploration inefficiency from limited underwater visibility and constrained sensor ranges, causing high-variance learning; and 3) water disturbance fragility combined with handcrafted reward dependency that degrades real-world robustness under unmodeled hydrodynamic conditions. To address these challenges, this paper proposes a hierarchical MARL architecture comprising four layers: global training scheduling, multi-agent coordination, local decision-making, and real-time execution. This architecture optimizes task allocation and inter-AUV coordination through hierarchical decomposition. Building on this foundation, we propose the Supervised Diffusion-Aided MARL (SDA-MARL) algorithm featuring three innovations: 1) a dual-decision architecture with segregated experience pools mitigating nonstationarity through structured experience replay; 2) a supervised learning mechanism guiding the diffusion model’s reverse denoising process to generate high-fidelity training samples that accelerate convergence; and 3) disturbance-robust policy learning incorporating behavioral cloning loss to guide the Deep Deterministic Policy Gradient network update using high-quality replay actions, eliminating handcrafted reward dependency. The tracking algorithm based on SDA-MARL proposed in this paper achieves superior precision compared to state-of-the-art methods in comprehensive underwater simulations.

[LG-28] Causality-inspired Federated Learning for Dynamic Spatio-Temporal Graphs

链接: https://arxiv.org/abs/2603.29384
作者: Yuxuan Liu,Wenchao Xu,Haozhao Wang,Zhiming He,Zhaofeng Shi,Chongyang Xu,Peichao Wang,Boyuan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Graph Learning (FGL) has emerged as a powerful paradigm for decentralized training of graph neural networks while preserving data privacy. However, existing FGL methods are predominantly designed for static graphs and rely on parameter averaging or distribution alignment, which implicitly assume that all features are equally transferable across clients, overlooking both the spatial and temporal heterogeneity and the presence of client-specific knowledge in real-world graphs. In this work, we identify that such assumptions create a vicious cycle of spurious representation entanglement, client-specific interference, and negative transfer, degrading generalization performance in Federated Learning over Dynamic Spatio-Temporal Graphs (FSTG). To address this issue, we propose a novel causality-inspired framework named SC-FSGL, which explicitly decouples transferable causal knowledge from client-specific noise through representation-level interventions. Specifically, we introduce a Conditional Separation Module that simulates soft interventions through client-conditioned masks, enabling the disentanglement of invariant spatio-temporal causal factors from spurious signals and mitigating representation entanglement caused by client heterogeneity. In addition, we propose a Causal Codebook that clusters causal prototypes and aligns local representations via contrastive learning, promoting cross-client consistency and facilitating knowledge sharing across diverse spatio-temporal patterns. Experiments on five heterogeneous Spatio-Temporal Graph (STG) datasets show that SC-FSGL outperforms state-of-the-art methods.

[LG-29] Deep Learning-Assisted Improved Differential Fault Attacks on Lightweight Stream Ciphers

链接: https://arxiv.org/abs/2603.29382
作者: Kok Ping Lim,Dongyang Jia,Iftekhar Salam
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lightweight cryptographic primitives are widely deployed in resource-constrained environments, particularly in Internet of Things (IoT) devices. Due to their public accessibility, these devices are vulnerable to physical attacks, especially fault attacks. Recently, deep learning-based cryptanalytic techniques have demonstrated promising results; however, their application to fault attacks remains limited, particularly for stream ciphers. In this work, we investigate the feasibility of deep learning assisted differential fault attacks on three lightweight stream ciphers, namely ACORNv3, MORUSv2 and ATOM, under a relaxed fault model, where a single-bit bit-flipping fault is injected at an unknown location. We train multilayer perceptron (MLP) models to identify the fault locations. Experimental results show that the trained models achieve high identification accuracies of 0.999880, 0.999231 and 0.823568 for ACORNv3, MORUSv2 and ATOM, respectively, and outperform traditional signature-based methods. For the secret recovery process, we introduce a threshold-based method to optimize the number of fault injections required to recover the secret information. The results show that the initial state of ACORN can be recovered with 21 to 34 faults, while MORUS requires 213 to 248 faults, with at most 6 bits of guessing. Both attacks reduce the attack complexity compared to existing works. For ATOM, the results show that it possesses a higher security margin, as the majority of state bits in the Non-linear Feedback Shift Register (NFSR) can only be recovered under a precise control model. To the best of our knowledge, this work provides the first experimental results of differential fault attacks on ATOM.

[LG-30] Finite-time analysis of Multi-timescale Stochastic Optimization Algorithms

链接: https://arxiv.org/abs/2603.29380
作者: Kaustubh Kartikey,Shalabh Bhatnagar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a finite-time analysis of two smoothed functional stochastic approximation algorithms for simulation-based optimization. The first is a two time-scale gradient-based method, while the second is a three time-scale Newton-based algorithm that estimates both the gradient and the Hessian of the objective function J . Both algorithms involve zeroth order estimates for the gradient/Hessian. Although the asymptotic convergence of these algorithms has been established in prior work, finite-time guarantees of two-timescale stochastic optimization algorithms in zeroth order settings have not been provided previously. For our Newton algorithm, we derive mean-squared error bounds for the Hessian estimator and establish a finite-time bound on \min_{0 \le m \le T} \mathbb{E}\|\nabla J(\theta(m))\|^2 , showing convergence to first-order stationary points. The analysis explicitly characterizes the interaction between multiple time-scales and the propagation of estimation errors. We further identify step-size choices that balance dominant error terms and achieve near-optimal convergence rates. We also provide corresponding finite-time guarantees for the gradient algorithm under the same framework. The theoretical results are further validated through experiments on the Continuous Mountain Car environment.
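
The zeroth-order estimates underlying both algorithms can be illustrated with a one-sided Gaussian smoothed-functional gradient estimator (a generic sketch of the estimator class, not the paper's exact multi-timescale scheme):

```python
import numpy as np

def sf_gradient(J, theta, delta, n_samples, rng):
    """One-sided smoothed-functional (zeroth-order) gradient estimate:
    E[u * (J(theta + delta*u) - J(theta)) / delta] is the gradient of the
    Gaussian-smoothed objective; only function evaluations of J are used,
    and subtracting J(theta) as a baseline reduces the variance."""
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape[0])
        g += u * (J(theta + delta * u) - J(theta)) / delta
    return g / n_samples

# quadratic sanity check: J(theta) = 0.5 * ||theta||^2 has gradient theta
J = lambda th: 0.5 * th @ th
theta = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
g = sf_gradient(J, theta, delta=1e-3, n_samples=20000, rng=rng)
print(g)  # close to [1.0, -2.0, 0.5]
```

The sampling noise visible here is exactly the estimation error whose propagation through the coupled timescales the finite-time analysis has to control.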

[LG-31] AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP

链接: https://arxiv.org/abs/2603.29369
作者: Enlai Li,Zhe Lin,Sharad Sinha,Wei Zhang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep reinforcement learning has demonstrated remarkable success across various domains. However, the tight coupling between training and inference processes makes accelerating DRL training an essential challenge for DRL optimization. Two key issues hinder efficient DRL training: (1) the significant variation in computational intensity across different DRL algorithms and even among operations within the same algorithm complicates hardware platform selection, while (2) DRL’s wide dynamic range could lead to substantial reward errors with conventional FP16+FP32 mixed-precision quantization. While existing work has primarily focused on accelerating DRL for specific computing units or optimizing inference-stage quantization, we propose AP-DRL to address the above challenges. AP-DRL is an automatic task partitioning framework that harnesses the heterogeneous architecture of AMD Versal ACAP (integrating CPUs, FPGAs, and AI Engines) to accelerate DRL training through intelligent hardware-aware optimization. Our approach begins with bottleneck analysis of CPU, FPGA, and AIE performance across diverse DRL workloads, informing the design principles for AP-DRL’s inter-component task partitioning and quantization optimization. The framework then addresses the challenge of platform selection through design space exploration-based profiling and ILP-based partitioning models that match operations to optimal computing units based on their computational characteristics. For the quantization challenge, AP-DRL employs a hardware-aware algorithm coordinating FP32 (CPU), FP16 (FPGA/DSP), and BF16 (AI Engine) operations by leveraging Versal ACAP’s native support for these precision formats. Comprehensive experiments indicate that AP-DRL can achieve speedup of up to 4.17 \times over programmable logic and up to 3.82 \times over AI Engine baselines while maintaining training convergence. 

[LG-32] LGFNet: Local-Global Fusion Network with Fidelity Gap Delta Learning for Multi-Source Aerodynamics

链接: https://arxiv.org/abs/2603.29303
作者: Qinye Zhu,Yu Xiang,Jun Zhang,Wenyong Wang
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:The precise fusion of computational fluid dynamics (CFD) data, wind tunnel test data, and flight test data in the aerodynamic domain is essential for obtaining comprehensive knowledge of both localized flow structures and global aerodynamic trends across the entire flight envelope. However, existing methodologies often struggle to balance high-resolution local fidelity with wide-range global dependency, leading to either a loss of sharp discontinuities or an inability to capture long-range topological correlations. We propose the Local-Global Fusion Network (LGFNet) for multi-scale feature decomposition to extract this dual-natured aerodynamic knowledge. To this end, LGFNet combines a spatial perception layer that integrates a sliding window mechanism with a relational reasoning layer based on self-attention, simultaneously reinforcing the continuity of fine-grained local features (e.g., shock waves) and capturing long-range flow information. Furthermore, the fidelity gap delta learning (FGDL) strategy is proposed to treat CFD data as a “low-frequency carrier” to explicitly approximate nonlinear discrepancies. This approach prevents unphysical smoothing while inheriting the foundational physical trends from the simulation baseline. Experiments demonstrate that LGFNet achieves state-of-the-art (SOTA) performance in both accuracy and uncertainty reduction across diverse aerodynamic scenarios.
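
The fidelity-gap delta-learning idea, learning only the discrepancy on top of a low-fidelity carrier, can be sketched in one dimension (illustrative functions, with a low-order polynomial standing in for the learned surrogate):

```python
import numpy as np

# a cheap "low-fidelity" model plus a surrogate fitted only to the gap
# observed at a few expensive "high-fidelity" samples
f_lo = lambda x: np.sin(2 * np.pi * x)               # low-frequency carrier (CFD)
f_hi = lambda x: np.sin(2 * np.pi * x) + 0.3 * x**2  # expensive ground truth

x_hi = np.linspace(0.0, 1.0, 8)      # only 8 high-fidelity evaluations
gap = f_hi(x_hi) - f_lo(x_hi)        # the fidelity gap "delta"
coef = np.polyfit(x_hi, gap, deg=2)  # smooth, low-order surrogate of the gap

x_test = np.linspace(0.0, 1.0, 200)
fused = f_lo(x_test) + np.polyval(coef, x_test)  # carrier + learned delta

err_baseline = np.abs(f_lo(x_test) - f_hi(x_test)).max()
err_fused = np.abs(fused - f_hi(x_test)).max()
print(err_baseline, err_fused)  # the fused model is far more accurate
```

Because the surrogate only has to capture the smooth discrepancy, the sharp features of the carrier (the shock waves in the paper's setting) pass through untouched, avoiding unphysical smoothing.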

[LG-33] From Physics to Surrogate Intelligence: A Unified Electro-Thermo-Optimization Framework for TSV Networks

链接: https://arxiv.org/abs/2603.29268
作者: Mohamed Gharib,Leonid Popryho,Inna Partin-Vaisband
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)

点击查看摘要

Abstract:High-density through-substrate vias (TSVs) enable 2.5D/3D heterogeneous integration but introduce significant signal-integrity and thermal-reliability challenges due to electrical coupling, insertion loss, and self-heating. Conventional full-wave finite-element method (FEM) simulations provide high accuracy but become computationally prohibitive for large design-space exploration. This work presents a scalable electro-thermal modeling and optimization framework that combines physics-informed analytical modeling, graph neural network (GNN) surrogates, and full-wave sign-off validation. A multi-conductor analytical model computes broadband S-parameters and effective anisotropic thermal conductivities of TSV arrays, achieving 5%-10% relative Frobenius error (RFE) across array sizes up to 15x15 . A physics-informed GNN surrogate (TSV-PhGNN), trained on analytical data and fine-tuned with HFSS simulations, generalizes to larger arrays with RFE below 2% and nearly constant variance. The surrogate is integrated into a multi-objective Pareto optimization framework targeting reflection coefficient, insertion loss, worst-case crosstalk (NEXT/FEXT), and effective thermal conductivity. Millions of TSV configurations can be explored within minutes, enabling exhaustive layout and geometric optimization that would be infeasible using FEM alone. Final designs are validated with Ansys HFSS and Mechanical, showing strong agreement. The proposed framework enables rapid electro-thermal co-design of TSV arrays while reducing per-design evaluation time by more than six orders of magnitude.

[LG-34] Lie Generator Networks for Nonlinear Partial Differential Equations

链接: https://arxiv.org/abs/2603.29264
作者: Shafayeth Jamil,Rehan Kapadia
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 16 pages, 8 figures

点击查看摘要

Abstract:Linear dynamical systems are fully characterized by their eigenspectra, accessible directly from the generator of the dynamics. For nonlinear systems governed by partial differential equations, no equivalent theory exists. We introduce Lie Generator Network–Koopman (LGN-KM), a neural operator that lifts nonlinear dynamics into a linear latent space and learns the continuous-time Koopman generator L_k through the decomposition L_k = S - D_k, where S is skew-symmetric, representing conservative inter-modal coupling, and D_k is a positive-definite diagonal encoding modal dissipation. This architectural decomposition enforces stability and enables interpretability through direct spectral access to the learned dynamics. On two-dimensional Navier–Stokes turbulence, the generator recovers the known dissipation scaling and a complete multi-branch dispersion relation from trajectory data alone with no physics supervision. Independently trained models at different flow regimes recover matched gauge-invariant spectral structure, exposing a gauge freedom in the Koopman lifting. Because the generator is provably stable, it enables guaranteed long-horizon stability, continuous-time evaluation at arbitrary times, and physics-informed cross-viscosity model transfer.

[LG-35] Real-Time Surrogate Modeling for Fast Transient Prediction in Inverter-Based Microgrids Using CNN and LightGBM

链接: https://arxiv.org/abs/2603.29255
作者: Osasumwen Cedric Ogiesoba-Eguakun,Kaveh Ashenayi,Suman Rath
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Real-time monitoring of inverter-based microgrids is essential for stability, fault response, and operational decision-making. However, electromagnetic transient (EMT) simulations, required to capture fast inverter dynamics, are computationally intensive and unsuitable for real-time applications. This paper presents a data-driven surrogate modeling framework for fast prediction of microgrid behavior using convolutional neural networks (CNN) and Light Gradient Boosting Machine (LightGBM). The models are trained on a high-fidelity EMT digital twin dataset of a microgrid with ten distributed generators under eleven operating and disturbance scenarios, including faults, noise, and communication delays. A sliding-window method is applied to predict important system variables, including voltage magnitude, frequency, total active power, and voltage dip. The results show that model performance changes depending on the type of variable being predicted. The CNN demonstrates high accuracy for time-dependent signals such as voltage, with an R^2 value of 0.84, whereas LightGBM shows better performance for structured and disturbance-related variables, achieving an R^2 of 0.999 for frequency and 0.75 for voltage dip. A combined CNN+LightGBM model delivers stable performance across all variables. Beyond accuracy, the surrogate models also provide major improvements in computational efficiency. LightGBM achieves more than 1000\times speedup and runs faster than real time, while the hybrid model achieves over 500\times speedup with near real-time performance. These findings show that data-driven surrogate models can effectively represent microgrid dynamics. They also support real-time and faster-than-real-time predictions. As a result, they are well-suited for applications such as monitoring, fault analysis, and control in inverter-based power systems.
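The sliding-window prediction setup mentioned in the abstract can be sketched in a few lines; the window width and the toy voltage trace below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sliding_windows(signal, width):
    """Turn a 1-D signal into (window -> next value) supervised pairs."""
    X = np.lib.stride_tricks.sliding_window_view(signal, width)[:-1]
    y = signal[width:]
    return X, y

# Toy stand-in for an EMT voltage trace (hypothetical, for illustration only)
t = np.linspace(0.0, 1.0, 1000)
v = 1.0 + 0.05 * np.sin(2 * np.pi * 60 * t)

X, y = sliding_windows(v, width=20)  # X[i] = v[i:i+20], y[i] = v[i+20]
```

Each row of `X` would feed a model such as the CNN or LightGBM regressor, with `y` holding the one-step-ahead target.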

[LG-36] Stochastic Dimension Implicit Functional Projections for Exact Integral Conservation in High-Dimensional PINNs

链接: https://arxiv.org/abs/2603.29237
作者: Zhangyong Liang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Enforcing exact macroscopic conservation laws, such as mass and energy, in neural partial differential equation (PDE) solvers is computationally challenging in high dimensions. Traditional discrete projections rely on deterministic quadrature that scales poorly and restricts mesh-free formulations like PINNs. Furthermore, high-order operators incur heavy memory overhead, and generic optimization often lacks convergence guarantees for non-convex conservation manifolds. To address this, we propose the Stochastic Dimension Implicit Functional Projection (SDIFP) framework. Instead of projecting discrete vectors, SDIFP applies a global affine transformation to the continuous network output. This yields closed-form solutions for integral constraints via detached Monte Carlo (MC) quadrature, bypassing spatial grid dependencies. For scalable training, we introduce a doubly-stochastic unbiased gradient estimator (DS-UGE). By decoupling spatial sampling from differential operator subsampling, the DS-UGE reduces memory complexity from \mathcal{O}(M \times N_{\mathcal{L}}) to \mathcal{O}(N \times |\mathcal{I}|). SDIFP mitigates sampling variance, preserves solution regularity, and maintains \mathcal{O}(1) inference efficiency, providing a scalable, mesh-free approach for solving conservative high-dimensional PDEs.
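The closed-form projection idea admits a very small sketch. The additive form and the toy integrand below are our own simplifications; only the use of detached MC quadrature and a closed-form correction follow the abstract:

```python
import numpy as np

def project_to_integral(u_vals, volume, target):
    """Shift sampled outputs u -> u + c so the detached Monte Carlo estimate
    of the integral over the domain equals `target` (no optimization needed)."""
    mc_integral = volume * u_vals.mean()      # plain MC quadrature
    c = (target - mc_integral) / volume       # exact closed-form correction
    return u_vals + c

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(10_000, 3))   # samples in [0,1]^3, volume 1
u = np.sin(x).sum(axis=1)                     # stand-in for a network output
u_proj = project_to_integral(u, volume=1.0, target=2.0)
```

After the shift, the MC estimate of the integral hits the conservation target exactly, independent of the sample count.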

[LG-37] Robust and Consistent Ski Rental with Distributional Advice

链接: https://arxiv.org/abs/2603.29233
作者: Jihwan Kim,Chenglin Fan
类目: Machine Learning (cs.LG)
*备注: 33 pages

点击查看摘要

Abstract:The ski rental problem is a canonical model for online decision-making under uncertainty, capturing the fundamental trade-off between repeated rental costs and a one-time purchase. While classical algorithms focus on worst-case competitive ratios and recent “learning-augmented” methods leverage point-estimate predictions, neither approach fully exploits the richness of full distributional predictions while maintaining rigorous robustness guarantees. We address this gap by establishing a systematic framework that integrates distributional advice of unknown quality into both deterministic and randomized algorithms. For the deterministic setting, we formalize the problem under perfect distributional prediction and derive an efficient algorithm to compute the optimal threshold-buy day. We provide a rigorous performance analysis, identifying sufficient conditions on the predicted distribution under which the expected competitive ratio (ECR) matches the classic optimal randomized bound. To handle imperfect predictions, we propose the Clamp Policy, which restricts the buying threshold to a safe range controlled by a tunable parameter. We show that this policy is both robust, maintaining good performance even with large prediction errors, and consistent, approaching the optimal performance as predictions become accurate. For the randomized setting, we characterize the stopping distribution via a Water-Filling Algorithm, which optimizes expected cost while strictly satisfying robustness constraints. Experimental results across diverse distributions (Gaussian, geometric, and bi-modal) demonstrate that our framework improves consistency significantly over existing point-prediction baselines while maintaining comparable robustness. 
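The deterministic setting admits a compact sketch: given a predicted pmf over the number of ski days and buy cost B, the optimal threshold-buy day minimizes expected cost. The prior and B = 10 below are illustrative, not the paper's choices; the Clamp Policy would additionally restrict the resulting threshold to a safe range:

```python
import numpy as np

def expected_cost(k, probs, B):
    """Expected cost of 'rent through day k-1, buy on day k' when the number
    of ski days n has pmf probs[n-1]: cost = n if n < k, else (k-1) + B."""
    n = np.arange(1, len(probs) + 1)
    cost = np.where(n >= k, (k - 1) + B, n)
    return float(probs @ cost)

def best_threshold(probs, B):
    ks = range(1, len(probs) + 2)       # k = len(probs)+1 means "never buy"
    return min(ks, key=lambda k: expected_cost(k, probs, B))

B = 10
p = 0.9 ** np.arange(1, 101)
p /= p.sum()                            # truncated geometric-style prior on n
k_star = best_threshold(p, B)           # optimal threshold-buy day under p
```

With a trustworthy prediction this threshold is used directly; with imperfect advice, clamping it toward the classical worst-case threshold recovers robustness.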

[LG-38] Biomimetic PINNs for Cell-Induced Phase Transitions: UQ-R3 Sampling with Causal Gating

链接: https://arxiv.org/abs/2603.29184
作者: Anci Lin,Xiaohong Liu,Zhiwen Zhang,Weidong Zhao,Wenju Zhao
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Nonconvex multi-well energies in cell-induced phase transitions give rise to sharp interfaces, fine-scale microstructures, and distance-dependent inter-cell coupling, all of which pose significant challenges for physics-informed learning. Existing methods often suffer from over-smoothing in near-field patterns. To address this, we propose biomimetic physics-informed neural networks (Bio-PINNs), a variational framework that encodes temporal causality into explicit spatial causality via a progressive distance gate. Furthermore, Bio-PINNs leverage a deformation-uncertainty proxy for the interfacial length scale to target microstructure-prone regions, providing a computationally efficient alternative to explicit second-derivative regularization. We provide theoretical guarantees for the resulting uncertainty-driven “retain-resample-release” adaptive collocation strategy, which ensures persistent coverage under gating and establishes a quantitative near-to-far growth bound. Across single- and multi-cell benchmarks, diverse separations, and various regularization regimes, Bio-PINNs consistently recover sharp transition layers and tether morphologies, significantly outperforming state-of-the-art adaptive and ungated baselines.

[LG-39] Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

链接: https://arxiv.org/abs/2603.29182
作者: Yunrui Yu,Xuxiang Feng,Pengda Qin,Pengyang Wang,Kafeng Wang,Cheng-zhong Xu,Hang Su,Jun Zhu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Adversarial robustness evaluation faces a critical challenge as new defense paradigms emerge that can exploit limitations in existing assessment methods. This paper reveals that Dummy Classes-based defenses, which introduce an additional “dummy” class as a safety sink for adversarial examples, achieve significantly overestimated robustness under conventional evaluation strategies like AutoAttack. The fundamental limitation stems from these attacks’ singular focus on misleading the true class label, which aligns perfectly with the defense mechanism: successful attacks are simply captured by the dummy class. To address this gap, we propose Dummy-Aware Weighted Attack (DAWA), a novel evaluation method that simultaneously targets both the true label and dummy label with adaptive weighting during adversarial example synthesis. Extensive experiments demonstrate that DAWA effectively breaks this defense paradigm, reducing the measured robustness of a leading Dummy Classes-based defense from 58.61% to 29.52% on CIFAR-10 under l_\infty perturbation (\epsilon = 8/255). Our work provides a more reliable benchmark for evaluating this emerging class of defenses and highlights the need for continuous evolution of robustness assessment methodologies.
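The dual-target objective can be illustrated on a toy linear classifier: instead of only pushing the true-class logit down (which lets the dummy class absorb the attack), jointly suppress the true and dummy logits with a weight w. The model, w, and all step sizes below are our own illustrative choices; the real attack operates on deep networks with adaptive weighting:

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, dim = 11, 5                  # class index 10 plays the dummy class
W = rng.normal(size=(n_classes, dim))   # toy linear classifier, logits z = W x
x = rng.normal(size=dim)
y_true, y_dummy = 3, 10
eps, alpha, w = 0.5, 0.1, 0.7

def objective(z):
    return z[y_true] + w * z[y_dummy]   # minimize both targeted logits

x_adv = x.copy()
for _ in range(10):                     # PGD-style signed gradient steps
    grad = W[y_true] + w * W[y_dummy]   # d(objective)/dx for a linear model
    x_adv = x_adv - alpha * np.sign(grad)
    x_adv = x + np.clip(x_adv - x, -eps, eps)   # project onto the l_inf ball

before, after = objective(W @ x), objective(W @ x_adv)
```

A conventional attack would drop the `w * z[y_dummy]` term, leaving the dummy logit free to rise and "catch" the adversarial example.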

[LG-40] Quality-Controlled Active Learning via Gaussian Processes for Robust Structure-Property Learning in Autonomous Microscopy

链接: https://arxiv.org/abs/2603.29135
作者: Jawad Chowdhury,Ganesh Narasimha,Jan-Chi Yang,Yongtao Liu,Rama Vasudevan
类目: Machine Learning (cs.LG)
*备注: 22 pages, 12 figures, 2 tables; submitted to npj Computational Materials

点击查看摘要

Abstract:Autonomous experimental systems are increasingly used in materials research to accelerate scientific discovery, but their performance is often limited by low-quality, noisy data. This issue is especially problematic in data-intensive structure-property learning tasks such as Image-to-Spectrum (Im2Spec) and Spectrum-to-Image (Spec2Im) translations, where standard active learning strategies can mistakenly prioritize poor-quality measurements. We introduce a gated active learning framework that combines curiosity-driven sampling with a physics-informed quality control filter based on the Simple Harmonic Oscillator model fits, allowing the system to automatically exclude low-fidelity data during acquisition. Evaluations on a pre-acquired dataset of band-excitation piezoresponse spectroscopy (BEPS) data from PbTiO3 thin films with spatially localized noise show that the proposed method outperforms random sampling, standard active learning, and multitask learning strategies. The gated approach enhances both Im2Spec and Spec2Im by handling noise during training and acquisition, leading to more reliable forward and inverse predictions. In contrast, standard active learners often misinterpret noise as uncertainty and end up acquiring bad samples that hurt performance. Given its promising applicability, we further deployed the framework in real-time experiments on BiFeO3 thin films, demonstrating its effectiveness in real autonomous microscopy experiments. Overall, this work supports a shift toward hybrid autonomy in self-driving labs, where physics-informed quality assessment and active decision-making work hand-in-hand for more reliable discovery.
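The physics-informed quality gate can be sketched as a goodness-of-fit check against the SHO model. The Lorentzian amplitude form, the 0.9 threshold, and using the true curve in place of a fitted one are our simplifications to keep the sketch self-contained:

```python
import numpy as np

def sho_amplitude(f, f0, Q, a):
    """Amplitude response of a simple harmonic oscillator (SHO) model."""
    return a * f0**2 / np.sqrt((f**2 - f0**2) ** 2 + (f * f0 / Q) ** 2)

def r_squared(y, y_fit):
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

f = np.linspace(300e3, 340e3, 200)                  # frequency sweep (Hz)
clean = sho_amplitude(f, f0=320e3, Q=100.0, a=1.0)  # ideal response
rng = np.random.default_rng(3)
good = clean + rng.normal(scale=0.01 * clean.std(), size=f.size)
bad = clean + rng.normal(scale=5.0 * clean.std(), size=f.size)

# Gate: acquire a point only if the SHO fit explains the spectrum well.
# (Here the true curve stands in for a curve_fit result.)
THRESHOLD = 0.9
accept_good = r_squared(good, clean) > THRESHOLD
accept_bad = r_squared(bad, clean) > THRESHOLD
```

Measurements failing the gate are excluded before they can distort the active learner's uncertainty estimates.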

[LG-41] Sampling-Horizon Neural Operator Predictors for Nonlinear Control under Delayed Inputs

链接: https://arxiv.org/abs/2603.29119
作者: Luke Bhan,Peter Quawas,Miroslav Krstic,Yuanyuan Shi
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 6 pages

点击查看摘要

Abstract:Modern control systems frequently operate under input delays and sampled state measurements. A common delay-compensation strategy is predictor feedback; however, practical implementations require solving an implicit ODE online, resulting in intractable computational cost. Moreover, predictor formulations typically assume continuously available state measurements, whereas in practice measurements may be sampled, irregular, or temporarily missing due to hardware faults. In this work, we develop two neural-operator predictor-feedback designs for nonlinear systems with delayed inputs and sampled measurements. In the first design, we introduce a sampling-horizon prediction operator that maps the current measurement and input history to the predicted state trajectory over the next sampling interval. In the second design, the neural operator approximates only the delay-compensating predictor, which is then composed with the closed-loop flow between measurements. The first approach requires uniform sampling but yields residual bounds that scale directly with the operator approximation error. In contrast, the second accommodates non-uniform, but bounded sampling schedules at the cost of amplified approximation error, revealing a practical tradeoff between sampling flexibility and approximation sensitivity for the control engineer. For both schemes, we establish semi-global practical stability with explicit neural operator error-dependent bounds. Numerical experiments on a 6-link nonlinear robotic manipulator demonstrate accurate tracking and substantial computational speedup of 25 \times over a baseline approach.

[LG-42] Predictor-Based Output-Feedback Control of Linear Systems with Time-Varying Input and Measurement Delays via Neural-Approximated Prediction Horizons

链接: https://arxiv.org/abs/2603.29117
作者: Luke Bhan,Miroslav Krstic,Yuanyuan Shi
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 11 Pages. Preprint

点击查看摘要

Abstract:Due to simplicity and strong stability guarantees, predictor feedback methods have stood as a popular approach for time delay systems since the 1950s. For time-varying delays, however, implementation requires computing a prediction horizon defined by the inverse of the delay function, which is rarely available in closed form and must be approximated. In this work, we formulate the inverse delay mapping as an operator learning problem and study predictor feedback under approximation of the prediction horizon. We propose two approaches: (i) a numerical method based on time integration of an equivalent ODE, and (ii) a data-driven method using neural operators to learn the inverse mapping. We show that both approaches achieve arbitrary approximation accuracy over compact sets, with complementary trade-offs in computational cost and scalability. Building on these approximations, we then develop an output-feedback predictor design for systems with delays in both the input and the measurement. We prove that the resulting closed-loop system is globally exponentially stable when the prediction horizon is approximated with sufficiently small error. Lastly, numerical experiments validate the proposed methods and illustrate their trade-offs between accuracy and computational efficiency.
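Approach (i) can be sketched concretely: the prediction horizon sigma(t) = phi^{-1}(t) for phi(t) = t - D(t) satisfies the ODE sigma' = 1 / phi'(sigma), which can be time-integrated. The delay function below and the RK4 step count are our illustrative assumptions:

```python
import numpy as np

def D(t):                       # illustrative time-varying input delay
    return 0.5 + 0.2 * np.sin(t)

def dphi(t):                    # phi'(t) = 1 - D'(t) > 0 since |D'| <= 0.2
    return 1.0 - 0.2 * np.cos(t)

def invert_phi(t_end, sigma0, t0, n=2000):
    """RK4-integrate sigma' = 1 / phi'(sigma) from (t0, sigma0) to t_end."""
    h = (t_end - t0) / n
    s = sigma0
    for _ in range(n):
        k1 = 1.0 / dphi(s)
        k2 = 1.0 / dphi(s + 0.5 * h * k1)
        k3 = 1.0 / dphi(s + 0.5 * h * k2)
        k4 = 1.0 / dphi(s + h * k3)
        s += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return s

# phi(0) = 0 - D(0) = -0.5, so (t0, sigma0) = (-0.5, 0.0) is consistent.
sigma = invert_phi(t_end=2.0, sigma0=0.0, t0=-0.5)
```

Approach (ii) would replace `invert_phi` with a neural operator trained on (delay function, inverse map) pairs, trading integration cost for amortized inference.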

[LG-43] Efficient Bilevel Optimization with KFAC-Based Hypergradients AISTATS2026

链接: https://arxiv.org/abs/2603.29108
作者: Disen Liao,Felix Dangel,Yaoliang Yu
类目: Machine Learning (cs.LG)
*备注: 25 pages, AISTATS 2026

点击查看摘要

Abstract:Bilevel optimization (BO) is widely applicable to many machine learning problems. Scaling BO, however, requires repeatedly computing hypergradients, which involves solving inverse Hessian-vector products (IHVPs). In practice, these operations are often approximated using crude surrogates such as one-step gradient unrolling or identity/short Neumann expansions, which discard curvature information. We build on implicit function theorem-based algorithms and propose to incorporate Kronecker-factored approximate curvature (KFAC), yielding curvature-aware hypergradients with a better performance efficiency trade-off than Conjugate Gradient (CG) or Neumann methods and consistently outperforming unrolling. We evaluate this approach across diverse tasks, including meta-learning and AI safety problems. On models up to BERT, we show that curvature information is valuable at scale, and KFAC can provide it with only modest memory and runtime overhead. Our implementation is available at this https URL.
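The IHVP bottleneck the abstract refers to can be made concrete with a minimal conjugate-gradient solver that touches the Hessian only through Hessian-vector products; KFAC would replace this loop with a Kronecker-factored inverse. The SPD toy matrix below is our stand-in for the inner-problem Hessian:

```python
import numpy as np

def cg_ihvp(hvp, g, iters=50, tol=1e-12):
    """Solve H v = g using only Hessian-vector products (conjugate gradient)."""
    v = np.zeros_like(g)
    r = g.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        a = rs / (p @ Hp)
        v = v + a * p
        r = r - a * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 20))
H = A @ A.T + 20.0 * np.eye(20)   # SPD stand-in for the inner Hessian
g = rng.normal(size=20)           # stand-in for the outer gradient
v = cg_ihvp(lambda p: H @ p, g)   # the H^{-1} g term in the hypergradient
```

Identity or short-Neumann surrogates truncate this solve after zero or a few terms, which is exactly the curvature information the paper argues should not be discarded.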

[LG-44] Realistic Market Impact Modeling for Reinforcement Learning Trading Environments

链接: https://arxiv.org/abs/2603.29086
作者: Lucas Riera Abbade,Anna Helena Reali Costa
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has shown promise for trading, yet most open-source backtesting environments assume negligible or fixed transaction costs, causing agents to learn trading behaviors that fail under realistic execution. We introduce three Gymnasium-compatible trading environments – MACE (Market-Adjusted Cost Execution) stock trading, margin trading, and portfolio optimization – that integrate nonlinear market impact models grounded in the Almgren-Chriss framework and the empirically validated square-root impact law. Each environment provides pluggable cost models, permanent impact tracking with exponential decay, and comprehensive trade-level logging. We evaluate five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ-100, comparing a fixed 10 bps baseline against the AC model with Optuna-tuned hyperparameters. Our results show that (i) the cost model materially changes both absolute performance and the relative ranking of algorithms across all three environments; (ii) the AC model produces dramatically different trading behavior, e.g., daily costs dropping from 200k to 8k with turnover falling from 19% to 1%; (iii) hyperparameter optimization is essential for constraining pathological trading, with costs dropping up to 82%; and (iv) algorithm-cost model interactions are strongly environment-specific, e.g., DDPG’s OOS Sharpe jumps from -2.1 to 0.3 under AC in margin trading while SAC’s drops from -0.5 to -1.2. We release the full suite as an open-source extension to FinRL-Meta.
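The square-root impact law the environments build on is easy to state in code; the parameter values below are illustrative, not the paper's calibration:

```python
import numpy as np

def sqrt_impact_cost(q, price, sigma, adv, k=1.0):
    """Total temporary-impact cost (in currency) of executing q shares under
    the square-root law: relative impact = k * sigma * sqrt(q / ADV)."""
    return q * price * k * sigma * np.sqrt(q / adv)

small = sqrt_impact_cost(q=1e4, price=100.0, sigma=0.02, adv=1e6)
large = sqrt_impact_cost(q=1e5, price=100.0, sigma=0.02, adv=1e6)
# Total cost grows super-linearly (~ q^1.5): a 10x larger order costs
# ~31.6x as much, which is what pushes agents toward lower turnover
# than a flat per-trade fee would.
```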

[LG-45] ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning

链接: https://arxiv.org/abs/2603.29068
作者: Tushar Dhananjay Pathak
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 15 pages, 5 figures, 10 tables. Code available at this https URL

点击查看摘要

Abstract:I present ARCS, a system for amortized analog circuit generation that produces complete, SPICE-simulatable designs (topology and component values) in milliseconds rather than the minutes required by search-based methods. A hybrid pipeline combining two learned generators (a graph VAE and a flow-matching model) with SPICE-based ranking achieves 99.9% simulation validity (reward 6.43/8.0) across 32 topologies using only 8 SPICE evaluations, 40x fewer than genetic algorithms. For single-model inference, a topology-aware Graph Transformer with Best-of-3 candidate selection reaches 85% simulation validity in 97ms, over 600x faster than random search. The key technical contribution is Group Relative Policy Optimization (GRPO): I identify a critical failure mode of REINFORCE (cross-topology reward distribution mismatch) and resolve it with per-topology advantage normalization, improving simulation validity by +9.6pp over REINFORCE in only 500 RL steps (10x fewer). Grammar-constrained decoding additionally guarantees 100% structural validity by construction via topology-aware token masking. ARCS does not yet match the per-design quality of search-based optimization (5.48 vs. 7.48 reward), but its 1000x speed advantage enables rapid prototyping, design-space exploration, and warm-starting search methods (recovering 96.6% of GA quality with 49% fewer simulations).

[LG-46] From Astronomy to Astrology: Testing the Illusion of Zodiac-Based Personality Prediction with Machine Learning

链接: https://arxiv.org/abs/2603.29033
作者: Abhinna Sundar Samantaray,Finnja Annika Fluhrer,Dhruv Saini,Omkar Charaple,Anish Kumar Singh,Dhruv Vansraj Rathore
类目: Machine Learning (cs.LG); Popular Physics (physics.pop-ph)
*备注: 6 pages, 3 figures, accepted to Acta Prima Aprilia journal

点击查看摘要

Abstract:Astrology has long been used to interpret human personality, estimate compatibility, and guide social decision-making. Zodiac-based systems in particular remain culturally influential across much of the world, including in South Asian societies where astrological reasoning can shape marriage matching, naming conventions, ritual timing, and broader life planning. Despite this persistence, astrology has never established either a physically plausible mechanism or a statistically reliable predictive foundation. In this work, we examine zodiac-based personality prediction using a controlled machine-learning framework. We construct a synthetic dataset in which individuals are assigned zodiac signs and personality labels drawn from a shared pool of 100 broadly human traits. Each sign is associated with a subset of 10 common descriptors, intentionally overlapping with those assigned to other signs, thereby reproducing the ambiguity characteristic of practical astrological systems. We then train Logistic Regression, Random Forest, and neural-network classifiers to infer personality labels from zodiac-based features and nuisance covariates. Across all experiments, predictive performance remains at or near random expectation, while shuffled-label controls yield comparable accuracies. We argue that the apparent success of astrology arises not from measurable predictive structure, but from trait universality, category overlap, cognitive biases such as the Barnum effect and confirmation bias, and the interpretive flexibility of astrologers and pundits. We conclude that zodiac-based systems do not provide reliable information for predicting human behavior and instead function as culturally durable narrative frameworks. This paper is intended as a humorous academic exercise.

[LG-47] An Explicit Surrogate for Gaussian Mixture Flow Matching with Wasserstein Gap Bounds

链接: https://arxiv.org/abs/2603.28992
作者: Elham Rostami,Taous-Meriem Laleg-Kirati,Hamidou Tembine
类目: Machine Learning (cs.LG)
*备注: 8 pages, 1 figures

点击查看摘要

Abstract:We study training-free flow matching between two Gaussian mixture models (GMMs) using explicit velocity fields that transport one mixture into the other over time. Our baseline approach constructs component-wise Gaussian paths with affine velocity fields satisfying the continuity equation, which yields a closed-form surrogate for the pairwise kinetic transport cost. In contrast to the exact Gaussian Wasserstein cost, which relies on matrix square-root computations, the surrogate admits a simple analytic expression derived from the kinetic energy of the induced flow. We then analyze how closely this surrogate approximates the exact cost. We prove second-order agreement in a local commuting regime and derive an explicit cubic error bound in the local commuting regime. To handle nonlocal regimes, we introduce a path-splitting strategy that localizes the covariance evolution and enables piecewise application of the bound. We finally compare the surrogate with an exact construction based on the Gaussian Wasserstein geodesic and summarize the results in a practical regime map showing when the surrogate is accurate and the exact method is preferable.

[LG-48] A Pontryagin Method of Model-based Reinforcement Learning via Hamiltonian Actor-Critic

链接: https://arxiv.org/abs/2603.28971
作者: Chengyang Gu,Yuxin Pan,Hui Xiong,Yize Chen
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures, in submission

点击查看摘要

Abstract:Model-based reinforcement learning (MBRL) improves sample efficiency by leveraging learned dynamics models for policy optimization. However, the effectiveness of methods such as actor-critic is often limited by compounding model errors, which degrade long-horizon value estimation. Existing approaches, such as Model-Based Value Expansion (MVE), partially mitigate this issue through multi-step rollouts, but remain sensitive to rollout horizon selection and residual model bias. Motivated by the Pontryagin Maximum Principle (PMP), we propose Hamiltonian Actor-Critic (HAC), a model-based approach that eliminates explicit value function learning by directly optimizing a Hamiltonian defined over the learned dynamics and reward for deterministic systems. By avoiding value approximation, HAC reduces sensitivity to model errors while admitting convergence guarantees. Extensive experiments on continuous control benchmarks, in both online and offline RL settings, demonstrate that HAC outperforms model-free and MVE-based baselines in control performance, convergence speed, and robustness to distributional shift, including out-of-distribution (OOD) scenarios. In offline settings with limited data, HAC matches or exceeds state-of-the-art methods, highlighting its strong sample efficiency.

[LG-49] ReproMIA: A Comprehensive Analysis of Model Reprogramming for Proactive Membership Inference Attacks

链接: https://arxiv.org/abs/2603.28942
作者: Chihan Huang,Huaijin Wang,Shuai Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The pervasive deployment of deep learning models across critical domains has concurrently intensified privacy concerns due to their inherent propensity for data memorization. While Membership Inference Attacks (MIAs) serve as the gold standard for auditing these privacy vulnerabilities, conventional MIA paradigms are increasingly constrained by the prohibitive computational costs of shadow model training and a precipitous performance degradation under low False Positive Rate constraints. To overcome these challenges, we introduce a novel perspective by leveraging the principles of model reprogramming as an active signal amplifier for privacy leakage. Building upon this insight, we present ReproMIA, a unified and efficient proactive framework for membership inference. We rigorously substantiate, both theoretically and empirically, how our methodology proactively induces and magnifies latent privacy footprints embedded within the model’s representations. We provide specialized instantiations of ReproMIA across diverse architectural paradigms, including LLMs, Diffusion Models, and Classification Models. Comprehensive experimental evaluations across more than ten benchmarks and a variety of model architectures demonstrate that ReproMIA consistently and substantially outperforms existing state-of-the-art baselines, achieving a transformative leap in performance specifically within low-FPR regimes, with average gains over the runner-up of 5.25% AUC and 10.68% TPR@1%FPR for LLMs, and 3.70% and 12.40% respectively for Diffusion Models.

[LG-50] Foundations of Polar Linear Algebra

链接: https://arxiv.org/abs/2603.28939
作者: Giovanni Guasti
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 59 pages, 4 figures, including appendices

点击查看摘要

Abstract:This work revisits operator learning from a spectral perspective by introducing Polar Linear Algebra, a structured framework based on polar geometry that combines a linear radial component with a periodic angular component. Starting from this formulation, we define the associated operators and analyze their spectral properties. As a proof of feasibility, the framework is evaluated on a canonical benchmark (MNIST). Despite the simplicity of the task, the results demonstrate that polar and fully spectral operators can be trained reliably, and that imposing self-adjoint-inspired spectral constraints improves stability and convergence. Beyond accuracy, the proposed formulation leads to a reduction in parameter count and computational complexity, while providing a more interpretable representation in terms of decoupled spectral modes. By moving from a spatial to a spectral domain, the problem decomposes into orthogonal eigenmodes that can be treated as independent computational pipelines. This structure naturally exposes an additional dimension of model parallelization, complementing existing parallel strategies without relying on ad-hoc partitioning. Overall, the work offers a different conceptual lens for operator learning, particularly suited to problems where spectral structure and parallel execution are central.

[LG-51] Optimistic Online LQR via Intrinsic Rewards

链接: https://arxiv.org/abs/2603.28938
作者: Marcell Bartos,Bruce D. Lee,Lenart Treven,Andreas Krause,Florian Dörfler,Melanie N. Zeilinger
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Optimism in the face of uncertainty is a popular approach to balance exploration and exploitation in reinforcement learning. Here, we consider the online linear quadratic regulator (LQR) problem, i.e., to learn the LQR corresponding to an unknown linear dynamical system by adapting the control policy online based on closed-loop data collected during operation. In this work, we propose Intrinsic Rewards LQR (IR-LQR), an optimistic online LQR algorithm that applies the idea of intrinsic rewards originating from reinforcement learning and the concept of variance regularization to promote uncertainty-driven exploration. IR-LQR retains the structure of a standard LQR synthesis problem by only modifying the cost function, resulting in an intuitively pleasing, simple, computationally cheap, and efficient algorithm. This is in contrast to existing optimistic online LQR formulations that rely on more complicated iterative search algorithms or solve computationally demanding optimization problems. We show that IR-LQR achieves the optimal worst-case regret rate of \sqrt{T}, and compare it to various state-of-the-art online LQR algorithms via numerical experiments carried out on an aircraft pitch angle control and an unmanned aerial vehicle example.
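The cost-modification idea can be illustrated with a toy discrete-time LQR: subtract an uncertainty-weighted term from the state cost (our caricature of the intrinsic reward; the system, the uncertainty proxy Sigma, and beta are all assumptions), then run an entirely standard Riccati solver:

```python
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
Sigma = 0.05 * np.eye(2)        # toy proxy for model uncertainty (assumed)
beta = 1.0
Q_opt = Q - beta * Sigma        # "optimistic" state cost: cheaper where unsure

def dlqr_gain(A, B, Q, R, iters=500):
    """Discrete-time Riccati value iteration; returns the feedback gain K."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

K = dlqr_gain(A, B, Q_opt, R)   # u = -K x, same solver as the nominal LQR
```

Because only the cost matrices change, the synthesis step keeps the cost and structure of a standard LQR solve, which is the computational point the abstract emphasizes.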

[LG-52] A Neural Tension Operator for Curve Subdivision across Constant Curvature Geometries

链接: https://arxiv.org/abs/2603.28937
作者: Hassan Ugail,Newton Howard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Interpolatory subdivision schemes generate smooth curves from piecewise-linear control polygons by repeatedly inserting new vertices. Classical schemes rely on a single global tension parameter and typically require separate formulations in Euclidean, spherical, and hyperbolic geometries. We introduce a shared learned tension predictor that replaces the global parameter with per-edge insertion angles predicted by a single 140K-parameter network. The network takes local intrinsic features and a trainable geometry embedding as input, and the predicted angles drive geometry-specific insertion operators across all three spaces without architectural modification. A constrained sigmoid output head enforces a structural safety bound, guaranteeing that every inserted vertex lies within a valid angular range for any finite weight configuration. Three theoretical results accompany the method: a structural guarantee of tangent-safe insertions; a heuristic motivation for per-edge adaptivity; and a conditional convergence certificate for continuously differentiable limit curves, subject to an explicit Lipschitz constraint verified post hoc. On 240 held-out validation curves, the learned predictor occupies a distinct position on the fidelity–smoothness Pareto frontier, achieving markedly lower bending energy and angular roughness than all fixed-tension and manifold-lift baselines. Riemannian manifold lifts retain a pointwise-fidelity advantage, which this study quantifies directly. On the out-of-distribution ISS orbital ground-track example, bending energy falls by 41% and angular roughness by 68% with only a modest increase in Hausdorff distance, suggesting that the predictor generalises beyond its synthetic training distribution.

[LG-53] Structural Pass Analysis in Football: Learning Pass Archetypes and Tactical Impact from Spatio-Temporal Tracking Data

链接: https://arxiv.org/abs/2603.28916
作者: Oktay Karakuş,Hasan Arkadaş
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The increasing availability of spatio-temporal tracking data has created new opportunities for analysing tactical behaviour in football. However, many existing approaches evaluate passes primarily through outcome-based metrics such as scoring probability or possession value, providing limited insight into how passes influence the defensive organisation of the opponent. This paper introduces a structural framework for analysing football passes based on their interaction with defensive structure. Using synchronised tracking/event data, we derive three complementary structural metrics, Line Bypass Score, Space Gain Metric, and Structural Disruption Index, that quantify how passes alter the spatial configuration of defenders. These metrics are combined into a composite measure termed Tactical Impact Value (TIV), which captures the structural influence of individual passes. Using tracking and event data from the 2022 FIFA World Cup, we analyse structural passing behaviour across multiple tactical levels. Unsupervised clustering of structural features reveals four interpretable pass archetypes: circulatory, destabilising, line-breaking, and space-expanding passes. Empirical results show that passes with higher TIV are significantly more likely to lead to territorial progression, particularly entries into the final third and penalty box. Spatial, team-level analyses further reveal distinctive structural passing styles across teams, while player-level analysis highlights the role of build-up defenders as key drivers of structural progression. In addition, analysing passer-receiver interactions identifies structurally impactful passing partnerships that amplify tactical progression within teams. Overall, the proposed framework demonstrates how structural representations derived from tracking data can reveal interpretable tactical patterns in football.
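
As a toy illustration of how per-pass structural metrics combine into a single Tactical Impact Value, consider the sketch below. The metric definitions and weights are simplified stand-ins: the paper's metrics are computed from full spatio-temporal tracking data, not just defender x-positions.

```python
# Toy structural pass metrics on a pitch with attack along the x-axis.
# Definitions and weights are illustrative stand-ins for the paper's
# Line Bypass Score, Space Gain Metric, and Structural Disruption Index.
def line_bypass_score(start_x, end_x, defenders_x):
    bypassed = sum(1 for d in defenders_x if start_x < d < end_x)
    return bypassed / max(len(defenders_x), 1)

def space_gain(start_x, end_x, pitch_length=105.0):
    return max(end_x - start_x, 0.0) / pitch_length

def tactical_impact_value(lbs, sgm, sdi, w=(0.4, 0.3, 0.3)):
    return w[0] * lbs + w[1] * sgm + w[2] * sdi

lbs = line_bypass_score(20.0, 60.0, defenders_x=[30.0, 45.0, 80.0])
tiv = tactical_impact_value(lbs, space_gain(20.0, 60.0), sdi=0.1)
```

A pass from x=20 to x=60 bypasses two of three defenders here, and the composite score summarizes that structural effect in one number, the role TIV plays in the paper.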

[LG-54] Mitigating Temporal Blindness in Kubernetes Autoscaling: An Attention-Double-LSTM Framework

链接: https://arxiv.org/abs/2603.28790
作者: Faraz Shaikh,Gianluca Reali,Mauro Femminella
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Submitted for journal publication

点击查看摘要

Abstract:In the emerging landscape of edge computing, the stochastic and bursty nature of serverless workloads presents a critical challenge for autonomous resource orchestration. Traditional reactive controllers, such as the Kubernetes Horizontal Pod Autoscaler (HPA), suffer from inherent reaction latency, leading to Service Level Objective (SLO) violations during traffic spikes and resource flapping during ramp-downs. While Deep Reinforcement Learning (DRL) offers a pathway toward proactive management, standard agents suffer from temporal blindness, an inability to effectively capture long-term dependencies in non-Markovian edge environments. To bridge this gap, we propose a novel stability-aware autoscaling framework unifying workload forecasting and control via an Attention-Enhanced Double-Stacked LSTM architecture integrated within a Proximal Policy Optimization (PPO) agent. Unlike shallow recurrent models, our approach employs a deep temporal attention mechanism to selectively weight historical states, effectively filtering high-frequency noise while retaining critical precursors of demand shifts. We validate the framework on a heterogeneous cluster using real-world Azure Functions traces. Comparative analysis against industry-standard HPA, stateless Double DQN, and a single-layer LSTM ablation demonstrates that our approach reduces 90th percentile latency by approximately 29% while simultaneously decreasing replica churn by 39%, relative to the single-layer LSTM baseline. These results confirm that mitigating temporal blindness through deep attentive memory is a prerequisite for reliable, low-jitter autoscaling in production edge environments.
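
The role of the attention mechanism, selectively weighting historical states, can be sketched in a few lines. The similarity-to-latest scoring rule below is an illustrative stand-in for the learned deep temporal attention in the paper.

```python
import math

# Temporal attention over a window of past load samples: each sample is
# scored against the most recent state, scores are softmax-normalised,
# and the forecast is the attention-weighted history. The scoring rule
# is an illustrative stand-in for learned attention weights.
def attention_forecast(history):
    query = history[-1]
    scores = [-abs(h - query) for h in history]
    m = max(scores)
    exp_s = [math.exp(s - m) for s in scores]     # numerically stable softmax
    total = sum(exp_s)
    weights = [e / total for e in exp_s]
    forecast = sum(w * h for w, h in zip(weights, history))
    return forecast, weights

forecast, weights = attention_forecast([10.0, 12.0, 55.0, 54.0, 56.0])
```

Note how the early low-load samples receive almost no weight: this is the filtering behaviour (retaining precursors of the demand shift, discounting stale history) that the paper argues mitigates temporal blindness.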

[LG-55] When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry

链接: https://arxiv.org/abs/2603.28781
作者: Michael Bidollahkhani,Freja Nordsiek,Julian M. Kunkel
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 12 pages, 6 figures. Includes public dataset: this https URL

点击查看摘要

Abstract:GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node, monitoring, and scheduler signals can be correlated. Results show that detachment failures exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse, while joint modeling increases early-warning lead time compared to GPU-only detection. The dataset used in this study is publicly available at this https URL.
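
A minimal sketch of the structural side of the idea, flagging device-metric disappearance rather than numeric drift, could look like the following. The metric names and sample format are assumptions for illustration, not the paper's telemetry schema.

```python
# Each telemetry sample is a dict of metric name -> value; None models a
# missing or dropped scrape. A detachment-style structural collapse is
# flagged when required device metrics disappear, independent of any
# numeric threshold. Field names are illustrative assumptions.
def structural_collapse(samples, required=("gpu_util", "gpu_temp")):
    alerts = []
    for t, sample in enumerate(samples):
        missing = [m for m in required if sample.get(m) is None]
        if missing:
            alerts.append((t, missing))
    return alerts

stream = [
    {"gpu_util": 0.8, "gpu_temp": 65.0},
    {"gpu_util": 0.9, "gpu_temp": 67.0},
    {"gpu_util": None, "gpu_temp": None},   # device metrics vanish
]
alerts = structural_collapse(stream)
```

The check fires on the third sample purely because the metrics disappeared, the kind of structural signal the paper argues is the dominant precursor for detachment-class failures.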

[LG-56] CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

链接: https://arxiv.org/abs/2603.28768
作者: Adrian Zhao,Zhenkun Cai,Zhenyu Song,Lingfan Yu,Haozheng Fan,Jun Wu,Yida Wang,Nandita Vijaykumar
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 16 pages, 11 figures

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by 1.14× on average (up to 1.2×) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.
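
The core load-balancing intuition behind expert replication can be sketched as a greedy allocation under a replica budget. This is a simplified illustration of the problem CRAFT addresses, not its per-layer benefit estimator.

```python
# Greedy budgeted replication: each extra replica goes to the expert with
# the highest per-replica load. A simplified illustration of the
# replication problem, not CRAFT's fine-grained layerwise estimator.
def allocate_replicas(loads, extra_budget):
    replicas = [1] * len(loads)
    for _ in range(extra_budget):
        per_replica = [l / r for l, r in zip(loads, replicas)]
        replicas[per_replica.index(max(per_replica))] += 1
    return replicas

loads = [100.0, 10.0, 10.0, 10.0]        # one hot expert, three cold ones
replicas = allocate_replicas(loads, extra_budget=3)
max_per_replica_load = max(l / r for l, r in zip(loads, replicas))
```

All three extra replicas go to the hot expert, cutting the peak per-replica load from 100 to 25; replicating the cold experts would have provided only marginal benefit, which is exactly the over-replication the paper argues existing schemes suffer from.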

[LG-57] Do covariates explain why these groups differ? The choice of reference group can reverse conclusions in the Oaxaca-Blinder decomposition

链接: https://arxiv.org/abs/2603.29972
作者: Manuel Quintero,Advik Shreekumar,William T. Stephenson,Tamara Broderick
类目: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:Scientists often want to explain why an outcome is different in two groups. For instance, differences in patient mortality rates across two hospitals could be due to differences in the patients themselves (covariates) or differences in medical care (outcomes given covariates). The Oaxaca–Blinder decomposition (OBD) is a standard tool to tease apart these factors. It is well known that the OBD requires choosing one of the groups as a reference, and the numerical answer can vary with the reference. To the best of our knowledge, there has not been a systematic investigation into whether the choice of OBD reference can yield different substantive conclusions and how common this issue is. In the present paper, we give existence proofs in real and simulated data that the OBD references can yield substantively different conclusions and that these differences are not entirely driven by model misspecification or small data. We prove that substantively different conclusions occur in up to half of the parameter space, but find these discrepancies rare in the real-data analyses we study. We explain this empirical rarity by examining how realistic data-generating processes can be biased towards parameters that do not change conclusions under the OBD.
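
The reference-group sensitivity is easy to reproduce on toy data. The sketch below runs a one-covariate OBD with both references; the data are constructed so that the "explained" component even flips sign between references, the kind of reversal the paper studies, while the total gap is unchanged.

```python
# Toy two-group Oaxaca-Blinder decomposition with one covariate, fit by
# simple least squares. The mean outcome gap splits into an "explained"
# (covariate) part and an "unexplained" (coefficient) part; the split
# depends on which group's coefficients serve as the reference.
def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
            sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx, slope

def oaxaca(xa, ya, xb, yb, reference="B"):
    coef_a, coef_b = ols(xa, ya), ols(xb, yb)
    mxa, mxb = sum(xa) / len(xa), sum(xb) / len(xb)
    ref_slope = coef_b[1] if reference == "B" else coef_a[1]
    explained = ref_slope * (mxa - mxb)
    total = sum(ya) / len(ya) - sum(yb) / len(yb)
    return explained, total - explained

xa, ya = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # group A: slope +2
xb, yb = [2.0, 3.0, 4.0], [5.0, 4.0, 3.0]   # group B: slope -1
exp_b, unexp_b = oaxaca(xa, ya, xb, yb, reference="B")
exp_a, unexp_a = oaxaca(xa, ya, xb, yb, reference="A")
```

With reference B the covariate gap "explains" a positive share; with reference A the same gap "explains" a negative share, so a substantive conclusion ("covariates favour group A") reverses with the reference even though explained + unexplained equals the same total under both.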

[LG-58] p-adic Character Neural Network

链接: https://arxiv.org/abs/2603.29905
作者: Tomoki Mihara
类目: Number Theory (math.NT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a new framework of p-adic neural network. Unlike the original p-adic neural network by S. Albeverio, A. Khrennikov, and B. Tirrozi using a family of characteristic functions indexed by hyperparameters of precision as activation functions, we use a single injective p-adic character on the topological Abelian group $\mathbb{Z}_p$ of p-adic integers as an activation function. We prove the p-adic universal approximation theorem for this formulation of p-adic neural network, and reduce it to the feasibility problem of polynomial equations over the finite ring of integers modulo a power of p.
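
As a numerical illustration (an assumption-laden sketch, since the paper works with an exact injective character on Z_p), one can evaluate a truncated additive character x -> exp(2*pi*i*(x mod p^k)/p^k) on ordinary integers. The truncation depth k is purely a numerical device and breaks the injectivity the paper relies on.

```python
import cmath

# Truncated additive character on the p-adic integers, evaluated on
# ordinary integers via x mod p^k. The depth k is a numerical device;
# the paper's character is exact and injective on Z_p.
def p_adic_character(x, p=3, k=5):
    modulus = p ** k
    return cmath.exp(2j * cmath.pi * (x % modulus) / modulus)

z = p_adic_character(7)
```

Characters turn addition into multiplication, so chi(a + b) = chi(a) * chi(b) up to floating-point error, and every value lies on the unit circle.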

[LG-59] mlr3mbo: Bayesian Optimization in R

链接: https://arxiv.org/abs/2603.29730
作者: Marc Becker,Lennart Schneider,Martin Binder,Lars Kotthoff,Bernd Bischl
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present mlr3mbo, a comprehensive and modular toolbox for Bayesian optimization in R. mlr3mbo supports single- and multi-objective optimization, multi-point proposals, batch and asynchronous parallelization, input and output transformations, and robust error handling. While it can be used for many standard Bayesian optimization variants in applied settings, researchers can also construct custom BO algorithms from its flexible building blocks. In addition to an introduction to the software, its design principles, and its building blocks, the paper presents two extensive empirical evaluations of the software on the surrogate-based benchmark suite YAHPO Gym. To identify robust default configurations for both numeric and mixed-hierarchical optimization regimes, and to gain further insights into the respective impacts of individual settings, we run a coordinate descent search over the mlr3mbo configuration space and analyze its results. Furthermore, we demonstrate that mlr3mbo achieves state-of-the-art performance by benchmarking it against a wide range of optimizers, including HEBO, SMAC3, Ax, and Optuna.

[LG-60] Unbounded Density Ratio Estimation and Its Application to Covariate Shift Adaptation

链接: https://arxiv.org/abs/2603.29725
作者: Ren-Rui Liu,Jun Fan,Lei Shi,Zheng-Chu Guo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 48 pages, 1 figure, 1 table

点击查看摘要

Abstract:This paper focuses on the problem of unbounded density ratio estimation – an understudied yet critical challenge in statistical learning – and its application to covariate shift adaptation. Much of the existing literature assumes that the density ratio is either uniformly bounded or unbounded but known exactly. These conditions are often violated in practice, creating a gap between theoretical guarantees and real-world applicability. In contrast, this work directly addresses unbounded density ratios and integrates them into importance weighting for effective covariate shift adaptation. We propose a three-step estimation method that leverages unlabeled data from both the source and target distributions: (1) estimating a relative density ratio; (2) applying a truncation operation to control its unboundedness; and (3) transforming the truncated estimate back into the standard density ratio. The estimated density ratio is then employed as importance weights for regression under covariate shift. We establish rigorous, non-asymptotic convergence guarantees for both the proposed density ratio estimator and the resulting regression function estimator, demonstrating optimal or near-optimal convergence rates. Our findings offer new theoretical insights into density ratio estimation and learning under covariate shift, extending classical learning theory to more practical and challenging scenarios.
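
The exact identities behind the three-step pipeline can be written down directly. In the sketch below, the alpha-relative ratio is bounded by 1/alpha, a truncation keeps the estimate strictly below that bound, and the closed-form inverse recovers the standard ratio; the cap fraction is an illustrative choice, and real use would estimate r_alpha from samples rather than from a known r.

```python
# Step (1): alpha-relative density ratio r_alpha = p / (alpha*p + (1-alpha)*q),
# written here in terms of the standard ratio r = p/q. It is bounded by 1/alpha.
def relative_ratio(r, alpha):
    return r / (alpha * r + (1.0 - alpha))

# Step (2): truncation keeps the estimate strictly below the 1/alpha bound.
def truncate(r_alpha, alpha, cap_frac=0.99):
    return min(r_alpha, cap_frac / alpha)

# Step (3): invert the relative ratio back to the standard density ratio,
# r = (1-alpha) * r_alpha / (1 - alpha * r_alpha).
def to_standard_ratio(r_alpha, alpha):
    return (1.0 - alpha) * r_alpha / (1.0 - alpha * r_alpha)

alpha = 0.1
round_trip = to_standard_ratio(truncate(relative_ratio(5.0, alpha), alpha), alpha)
```

The relative ratio stays bounded even when the standard ratio is unbounded, which is what makes the truncate-then-transform route viable for importance weighting under covariate shift.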

[LG-61] Central limit theorems for the outputs of fully convolutional neural networks with time series input

链接: https://arxiv.org/abs/2603.29612
作者: Annika Betken,Giorgio Micali,Johannes Schmidt-Hieber
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning is widely deployed for time series learning tasks such as classification and forecasting. Despite the empirical successes, only little theory has been developed so far in the time series context. In this work, we prove that if the network inputs are generated from short-range dependent linear processes, the outputs of fully convolutional neural networks (FCNs) with global average pooling (GAP) are asymptotically Gaussian and the limit is attained if the length of the observed time series tends to infinity. The proof leverages existing tools from the theoretical time series literature. Based on our theory, we propose a generalization of the GAP layer by considering a global weighted pooling step with slowly varying, learnable coefficients.
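
The proposed generalisation of the GAP layer is simple to state: replace the uniform average with a learnable weighted average. A minimal sketch, with illustrative weight values:

```python
# Global weighted pooling over a feature sequence; uniform weights
# recover plain global average pooling (GAP). The weight values are
# illustrative; in the paper they are slowly varying and learnable.
def weighted_pool(features, weights):
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, features)) / total

feats = [1.0, 2.0, 3.0, 4.0]
gap = weighted_pool(feats, [1.0] * len(feats))       # plain GAP
tilted = weighted_pool(feats, [0.1, 0.2, 0.3, 0.4])  # emphasises the tail
```

GAP is recovered exactly as the uniform-weights special case, so the generalisation strictly contains the setting covered by the central limit theorems.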

[LG-62] Sampling at intermediate temperatures is optimal for training large language models in protein structure prediction

链接: https://arxiv.org/abs/2603.29529
作者: L. Ghiringhelli,A. Zambon,G. Tiana
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:We investigate the parameter space of transformer models trained on protein sequence data using a statistical mechanics framework, sampling the loss landscape at varying temperatures by Langevin dynamics to characterize the low-loss manifold and understand the mechanisms underlying the superior performance of transformers in protein structure prediction. We find that, at variance with feedforward networks, the lack of a first–order–like transition in the loss of the transformer produces a range of intermediate temperatures with good learning properties. We show that the parameters of most layers are highly conserved at these temperatures if the dimension of the embedding is optimal, and we provide an operative way to find this dimension. Finally, we show that the attention matrix is more predictive of the contact maps of the protein at higher temperatures and for higher dimensions of the embedding than those optimal for learning.

[LG-63] Adaptive Delayed-Update Cyclic Algorithm for Variational Inequalities

链接: https://arxiv.org/abs/2603.29128
作者: Yi Wei,Xufeng Cai,Jelena Diakonikolas
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cyclic block coordinate methods are a fundamental class of first-order algorithms, widely used in practice for their simplicity and strong empirical performance. Yet, their theoretical behavior remains challenging to explain, and setting their step sizes – beyond classical coordinate descent for minimization – typically requires careful tuning or line-search machinery. In this work, we develop ADUCA (Adaptive Delayed-Update Cyclic Algorithm), a cyclic algorithm addressing a broad class of Minty variational inequalities with monotone Lipschitz operators. ADUCA is parameter-free: it requires no global or block-wise Lipschitz constants and uses no per-epoch line search, except at initialization. A key feature of the algorithm is using operator information delayed by a full cycle, which makes the algorithm compatible with parallel and distributed implementations, and attractive due to weakened synchronization requirements across blocks. We prove that ADUCA attains (near) optimal global oracle complexity as a function of target error \epsilon > 0, scaling with 1/\epsilon for monotone operators, or with \log^2(1/\epsilon) for operators that are strongly monotone.

[LG-64] How much of persistent homology is topology? A quantitative decomposition for spin model phase transitions

链接: https://arxiv.org/abs/2603.29072
作者: Matthew Loftus
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注: 7 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Point-cloud persistent homology (PH) – computing alpha or Rips complexes on spin-position point clouds – has been widely applied to detect phase transitions in classical spin models since Donato et al. (2016), with subsequent studies attributing the detection to the topological content of the persistence diagram. We ask a simple question that has not been posed: what fraction of the PH signal is genuinely topological? We introduce f_topo, a quantitative decomposition that separates the density-driven and topological contributions to any PH statistic by comparing real spin configurations against density-matched shuffled null models. Across the 2D Ising model (system sizes L = 16-128, ten temperatures) and Potts models (q = 3, 5), we find that H_0 statistics – total persistence, persistence entropy, feature count – are 94-100% density-driven (f_topo < 0.07). The density-matched shuffled null detects T_c at the identical location and with comparable peak height as real configurations, showing that density alone is sufficient for phase transition detection. However, H_1 statistics are partially topological: the topological fraction grows with system size as delta(T; P_{H_1}) ~ L^{0.53} and follows a finite-size scaling collapse delta(T, L) = L^{0.53} g(tL^{1/nu}) with collapse quality CV = 0.27. The longest persistence bar is strongly topological (f_topo ≈ 1) and scales with the correlation length. A scale-resolved analysis reveals that the topological excess shifts from large-scale to small-scale features as L increases. We propose that the TDA-for-phase-transitions community adopt shuffled null models as standard practice, and that H_1 rather than H_0 statistics be used when genuine topological information is sought.
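
The decomposition itself is a generic recipe: compute a statistic on real configurations and on density-matched shuffled nulls, then take the relative excess. The sketch below applies that recipe with a stand-in statistic (mean nearest-neighbour distance) instead of an actual persistence computation, which is an assumption purely for illustration.

```python
import random

# f_topo-style decomposition: a statistic S is computed on the real point
# configuration and on shuffled null models that preserve the marginal
# coordinate densities, and f_topo = (S_real - S_null) / S_real. The
# statistic here (mean nearest-neighbour distance) is a stand-in for a
# real persistent-homology statistic.
def nn_statistic(points):
    total = 0.0
    for i, (x, y) in enumerate(points):
        d = min((x - u) ** 2 + (y - v) ** 2
                for j, (u, v) in enumerate(points) if j != i)
        total += d ** 0.5
    return total / len(points)

def f_topo(points, n_null=20, seed=0):
    rng = random.Random(seed)
    s_real = nn_statistic(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    nulls = []
    for _ in range(n_null):
        rng.shuffle(ys)                    # break x-y structure only
        nulls.append(nn_statistic(list(zip(xs, ys))))
    s_null = sum(nulls) / n_null
    return (s_real - s_null) / s_real

pts = [(float(i % 5), float(i // 5)) for i in range(25)]   # rigid 5x5 grid
score = f_topo(pts)
```

A value near zero says the statistic is density-driven (the shuffled null reproduces it); a value near one says it carries genuinely structural information beyond the coordinate densities.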

[LG-65] Data-informed lifting line theory

链接: https://arxiv.org/abs/2603.29051
作者: Arjun Sharma,Jonas A. Actor,Peter A. Bosler
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 22 pages, 6 figures

点击查看摘要

Abstract:We present a data-driven framework that extends the predictive capability of classical lifting-line theory (LLT) to a wider aerodynamic regime by incorporating higher-fidelity aerodynamic data from panel method simulations. A neural network architecture with a convolutional layer followed by fully connected layers is developed, comprising two parallel subnetworks to separately process spanwise collocation points and global geometric/aerodynamic inputs such as angle of attack, chord, twist, airfoil distribution, and sweep. Among several configurations tested, this architecture is most effective in learning corrections to LLT outputs. The trained model captures higher-order three-dimensional effects in spanwise lift and drag distributions in regimes where LLT is inaccurate, such as low aspect ratios and high sweep, and generalizes well to wing configurations outside both the LLT regime and the training data range. The method retains LLT’s computational efficiency, enabling integration into aerodynamic optimization loops and early-stage aircraft design studies. This approach offers a practical path for embedding high-fidelity corrections into low-order methods and may be extended to other aerodynamic prediction tasks, such as propeller performance.

[LG-66] Transfer Learning in Bayesian Optimization for Aircraft Design

链接: https://arxiv.org/abs/2603.28999
作者: Ali Tfaily,Youssef Diouane,Nathalie Bartoli,Michael Kokkolaras
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The use of transfer learning within Bayesian optimization addresses the disadvantages of the so-called cold start problem by using source data to aid in the optimization of a target problem. We present a method that leverages an ensemble of surrogate models using transfer learning and integrates it in a constrained Bayesian optimization framework. We identify challenges particular to aircraft design optimization related to heterogeneous design variables and constraints. We propose the use of a partial-least-squares dimension reduction algorithm to address design space heterogeneity, and a meta data surrogate selection method to address constraint heterogeneity. Numerical benchmark problems and an aircraft conceptual design optimization problem are used to demonstrate the proposed methods. Results show significant improvement in convergence in early optimization iterations compared to standard Bayesian optimization, with improved prediction accuracy for both objective and constraint surrogate models.

[LG-67] Minimum Norm Interpolation via The Local Theory of Banach Spaces: The Role of 2-Uniform Convexity

链接: https://arxiv.org/abs/2603.28956
作者: Gil Kur,Pierre Bizeul
类目: Functional Analysis (math.FA); Machine Learning (cs.LG); Metric Geometry (math.MG); Probability (math.PR); Statistics Theory (math.ST)
*备注: A Preliminary work of this work "Minimum Norm Interpolation Meets The Local Theory of Banach Spaces’’ appeared at the International Conference of Machine Learning 2024 (consider this info for citations)

点击查看摘要

Abstract:The minimum-norm interpolator (MNI) framework has recently attracted considerable attention as a tool for understanding generalization in overparameterized models, such as neural networks. In this work, we study the MNI under a 2-uniform convexity assumption, which is weaker than requiring the norm to be induced by an inner product, and it typically does not admit a closed-form solution. At a high level, we show that this condition yields an upper bound on the MNI bias in both linear and nonlinear models. We further show that this bound is sharp for overparameterized linear regression when the unit ball of the norm is in isotropic (or John's) position, and the covariates are isotropic, symmetric, i.i.d. sub-Gaussian, such as vectors with i.i.d. Bernoulli entries. Finally, under the same assumption on the covariates, we prove sharp generalization bounds for the \ell_p-MNI when p \in (1 + C/\log d, 2]. To the best of our knowledge, this is the first work to establish sharp bounds for non-Gaussian covariates in linear models when the norm is not induced by an inner product. This work is deeply inspired by classical works on K-convexity, and more modern work on the geometry of 2-uniform and isotropic convex bodies.
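
For intuition about what a minimum-norm interpolator is, the inner-product special case (p = 2) has a closed form, which is exactly the situation the paper generalizes away from. A minimal sketch of that special case, not of the paper's setting:

```python
# Closed-form minimum-norm interpolation in the Euclidean (p = 2) case:
# for a single observation <x, w> = y in d dimensions, the minimum
# l2-norm interpolator is w = y * x / ||x||^2. This inner-product special
# case is for intuition only; the 2-uniformly convex norms studied in
# the paper generally admit no closed form.
def min_norm_interpolator(x, y):
    sq_norm = sum(xi * xi for xi in x)
    return [y * xi / sq_norm for xi in x]

x = [3.0, 4.0]
w = min_norm_interpolator(x, 10.0)
fit = sum(wi * xi for wi, xi in zip(w, x))        # interpolates exactly
norm = sum(wi * wi for wi in w) ** 0.5
```

Any other interpolating solution, e.g. w' = (10/3, 0), has strictly larger norm; the overparameterized system has infinitely many interpolators and the MNI picks the smallest one.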

[LG-68] Symmetrizing Bregman Divergence on the Cone of Positive Definite Matrices: Which Mean to Use and Why

链接: https://arxiv.org/abs/2603.28917
作者: Tushar Sial,Abhishek Halder
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work uncovers variational principles behind symmetrizing the Bregman divergences induced by generic mirror maps over the cone of positive definite matrices. We show that computing the canonical means for this symmetrization can be posed as minimizing the desired symmetrized divergences over a set of mean functionals defined axiomatically to satisfy certain properties. For the forward symmetrization, we prove that the arithmetic mean over the primal space is canonical for any mirror map over the positive definite cone. For the reverse symmetrization, we show that the canonical mean is the arithmetic mean over the dual space, pulled back to the primal space. Applying this result to three common mirror maps used in practice, we show that the canonical means for reverse symmetrization, in those cases, turn out to be the arithmetic, log-Euclidean and harmonic means. Our results improve understanding of existing symmetrization practices in the literature, and can be seen as a navigational chart to help decide which mean to use when.
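
For scalars (1x1 positive definite matrices) the three canonical means named in the abstract reduce to familiar quantities. This sketch just evaluates them and checks the classical pointwise ordering, without any of the paper's variational machinery.

```python
import math

# The three canonical means from the paper's examples, in the scalar
# (1x1 positive definite) case: arithmetic, log-Euclidean (= geometric
# for scalars), and harmonic. The classical ordering
# harmonic <= log-Euclidean <= arithmetic holds pointwise.
def arithmetic_mean(a, b):
    return (a + b) / 2.0

def log_euclidean_mean(a, b):
    return math.exp((math.log(a) + math.log(b)) / 2.0)

def harmonic_mean(a, b):
    return 2.0 / (1.0 / a + 1.0 / b)

a, b = 1.0, 4.0
means = (harmonic_mean(a, b), log_euclidean_mean(a, b), arithmetic_mean(a, b))
```

In the matrix case the log-Euclidean mean exponentiates the averaged matrix logarithms, and the choice among these three is exactly what the paper's variational principles decide for each mirror map.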

[LG-69] Data-Driven Estimation of the interfacial Dzyaloshinskii-Moriya Interaction with Machine Learning

链接: https://arxiv.org/abs/2603.28812
作者: Davi Rodrigues,Andrea Meo,Ali Hasan,Edoardo Piccolo,Adriano Di Pietro,Alessandro Magni,Marco Madami,Giovanni Finocchio,Mario Carpentieri,Michaela Kuepferling,Vito Puliafito
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Machine learning offers powerful tools to support experimental techniques, particularly for extracting latent features from large datasets. In magnetic materials, accurately estimating the interfacial Dzyaloshinskii-Moriya interaction strength remains challenging, as existing experimental methods often rely on indirect measurements and can yield inconsistent results across techniques. Because this interaction is often extracted experimentally from bubble domain expansion, we investigate whether bubble textures alone contain sufficient and reliable information for data-driven DMI inference. We therefore develop a compact convolutional neural network trained on a comprehensive micromagnetic dataset of magnetic bubble domains designed to emulate magneto-optical Kerr effect imaging, including structural non-uniformity, additive noise, and image pixelation. The proposed network demonstrates strong robustness against sample inhomogeneities, noise, and reduced spatial resolution. Furthermore, it exhibits reliable generalization by accurately predicting DMI values outside the trained interval. These results support the use of machine learning as a fast and quantitative tool to characterize magnetic textures with interfacial DMI.

[LG-70] Generalizable Foundation Models for Calorimetry via Mixtures-of-Experts and Parameter Efficient Fine Tuning

链接: https://arxiv.org/abs/2603.28804
作者: Carlos Cardona-Giraldo,Cristiano Fanelli,James Giroux,Cole Granger,Benjamin Nachman,Gerald Sabin
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex)
*备注: 18 pages, 11 figures, 1 table

点击查看摘要

Abstract:Modern particle physics experiments face an increasing demand for high-fidelity detector simulation as luminosities rise and computational requirements approach the limits of available resources. Deep generative models have emerged as promising surrogates for traditional Monte Carlo simulation, with recent advances drawing inspiration from large language models (LLM) and next-token prediction paradigms. In this work, we introduce a generalizable foundation model for calorimetry built on next-token transformer backbones, designed to support modular adaptation across materials, particle species, and detector configurations. Our approach combines Mixture-of-Experts pre-training with parameter-efficient fine-tuning strategies to enable controlled, additive model expansion without catastrophic forgetting. A pre-trained backbone is trained to generate electromagnetic showers across multiple absorber materials, while new materials are incorporated through the addition and tuning of lightweight expert modules. Extensions to new particle types are achieved via parameter-efficient fine-tuning and modular vocabularies, preserving the integrity of the base model. This design enables efficient, incremental knowledge integration as new simulation datasets become available, a critical requirement in realistic detector-development workflows. In addition, we demonstrate that next-token calorimeter models are computationally competitive with standard generative approaches under established LLM optimization procedures. These results establish next-token architectures as a viable path toward extensible, physics-aware foundation models for calorimetry and future high-energy physics experiments.
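
The "controlled, additive model expansion" idea can be caricatured with toy experts: new modules are appended while the base parameters stay untouched. The gating and expert functions below are illustrative stand-ins for the paper's transformer components, not its architecture.

```python
import math

# Mixture-of-experts forward pass with softmax gating; the experts are
# toy affine functions (a, b) -> a*x + b, illustrative stand-ins for the
# transformer expert modules in the paper.
def moe_forward(x, experts, gate_scores):
    m = max(gate_scores)
    exp_g = [math.exp(g - m) for g in gate_scores]   # stable softmax
    z = sum(exp_g)
    return sum((e / z) * (a * x + b) for e, (a, b) in zip(exp_g, experts))

base_experts = [(1.0, 0.0), (2.0, 1.0)]
frozen = list(base_experts)                 # snapshot of the base model
expanded = base_experts + [(0.5, -1.0)]     # new expert appended only
y = moe_forward(2.0, expanded, gate_scores=[0.0, 0.0, 5.0])
```

After expansion the base experts are byte-for-byte unchanged, the mechanical analogue of adding lightweight expert modules for a new material without risking catastrophic forgetting in the pre-trained backbone.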

附件下载

点击下载今日全部论文列表